Csc 220 - Data Structures
Project
Due: Thursday, September 21, 2006
 

Interactive Word Indexer


Assignment

Create a program that reads in any text file, and creates an interactive index of each word in the file. Your program must input the name of the file interactively from the user. It must then read in each word from the file, storing them in sorted order in a tree, without duplicates. Each word in the tree must contain a reference to every occurrence in the original line of text that contains that word. >>>> The following requirements removed 8/12/06:  After constructing the tree from the entire file, your program must print the height of the tree, the the total number of unique words in the file, the total number of lines in the file, and the 16 most common words and how many times each appears.

Then your program must go into a loop, asking the user for a word and printing out all lines from the original file that contain the word, with their line numbers.  If a word appears twice in the same line, your program may just print the same line twice, rather than adding any special output.  (i.e. the data structure that holds the references to the lines can have duplicates).  Your program does not have to indicate where in the line the word is, but it may if you want to add that feature.  If the user types in a word that is not in the file, print an appropriate message and ask for another word. Continue the loop, asking for words and printing the proper lines, until the user enters a null string by pressing enter without typing any characters.

Requirements

Your program must process the words without regard to case.  You may ignore all numbers and symbols, that is all characters except ASCII 'A' through 'Z' and 'a' through 'z'.  Within reason, you may decide exactly what a "word" is.  If you want, a "word" may be any contiguous series of alphabetic characters in the file, whether they form a recognized English word or not.  In that case, "don't" would be considered two words, "don" and "t". The lines shown containing the word entered by the user must contain the line number, and must otherwise be exactly as it is in the file, with all capitalization and punctuation. 

You may write your program either in C or C++, your choice.  If you use C, it must be written as well designed procedural code with a short, simple int main() and well designed functions, each with a specific, clearly defined purpose.  If you use C++, it must be written using well designed object oriented code, with no global functions except a short, simple int main() and "bomb proof" classes that implement data encapsulation and are never left in an inconsistent state when a method ends.  You may not use any global variables, for any reason except as debugging aids.  Any debugging globals should be commented out in the program you submit for grading.  You may mix C++ code (but not design) into your C program, especially the <iostream> input and output such as  cout and cin.  (Never use <iostream> and <stdio> in the same program -- you are just asking for problems if you do).  You may want to use printf and scanf  exclusively in a C++ program.  Don't use  #define for constants in a C++ program.  Use global const statements instead, which is now the recommended style for both C++ and C.

For this project, you may not use the Standard Template Library, except for class <string>.  If you did, the project would be about 30 lines long

The tree, all lines from the file, and the references to those lines must be stored on the heap using new (C++) or malloc (C).  Your program must not use any arrays, except for simple character arrays used to store an individual word or line of text.  You don't even need character arrays if you use class <string>.  For space efficiency, your program must not duplicate any lines of text or tree nodes in the heap memory.  When your program ends, it must properly delete (C++) or free (C) all dynamic memory.  (Yes, do this even though all the memory is released anyway when a C or C++ program ends).   

To get full credit, your program must use an AVL Height Balanced Tree.  As a programming development strategy, I suggest you complete the program with a simple binary search tree, and add the AVL height balancing last.  With the type of data we are using, a properly balanced AVL tree is about half the height of a simple binary search tree.

Your program must also be properly indented and commented as shown in class, including function header comments.  It must be written in .c  or .cpp  and  .h  files according to the standard practice for C or C++. 

You must use a system and an environment so that viewing the source code, compiling and linking, and running your program can be done from my office during grading. 


Grading

Project 1 is worth 22 points toward the final grade.  It will be graded according to the criteria on the Project 1 Grade Report. 

You will be given a hard-copy of your Grade Report during the Interactive Grading (see Date immediately following).  You are required to keep the Grade Reports for all your projects until you have your final grade at the end of the semester, in case there is a question about the grade of any of your projects.  

Date

Your project will be interactively graded on Thursday, September 21st.  On Tuesday, September 19th each student will sign up for a specific 20 minute period for grading on the following Thursday.  You may have your project graded before September 21st, if you wish, by making arrangements with the instructor

Back to Csc 220 - Data Structures
  Scott Badman   Office: B132   Phone: 353-2250   sbadman@parkland.edu  

Parkland College, 2400 W. Bradley Avenue, Champaign, IL 61821