Question: For this project, you will: Be introduced to DataCount stemming such as removing 's' from the end of a word can lead to erroneous results
For this project, you will: Be introduced to DataCount stemming such as removing 's' from the end of a word can lead to erroneous results (such as "bu" from "bus") and require special logic. Even our simple rules cause problems; for instance, "it's" and "its" are now the same word. Implementing a better stemming algorithm (like Porter Stemming) is above and beyond work. Signature Generation A fundamental part of your work lies in computing the "signature" of a document. The professor has provided you with a sample WordCount program that reads in a document and counts the number of times that a stemmed word appears, assuming that the document's words are already stemmed. The output of this program looks like this: 970 the 708 and 666 of 632 to 521 at 521 i 521 into 466 a 444 my 391 in 383 buffalo ... where the number in column 1 is the frequency that the corresponding string in column 2 occurs in the text. Note that the WordCount program sorts its output primarily in decreasing order by frequency count, secondarily by alphabetical order. The ordering would be identical if it had sorted by frequency fraction first (i.e. frequency_count/num_total_words). Document Correlation Document correlation is a reasonably large area of study. Perhaps its most visible application is in search engines which rank the correlation of webpages to a set of keywords that you provide. One model often used for correlating documents is Latent Semantic Indexing (LSI) where a collection of documents is considered together (rather than independently) and a word's usefulness is determined by how frequently it appears in the documents (for instance, "the" isn't very useful because it appears in most documents). We will not be doing LSI. We will do a simpler document comparison: Calculate word counts for the two documents and normalize the frequencies so that they can be meaningfully compared between different documents (hint: use frequency percentages or fractions.) As in LSI, remove words whose relative frequencies are too high or too low to be useful to your study. A good starting point is to remove words with normalized frequencies above 0.01 (1%) and below 0.0001 (0.01%), but feel free to play around with these numbers. You should, however, report results with these frequencies so we can compare outputs with a standard set of numbers. For every word that occurs in both documents, take the difference between the normalized frequencies, square that difference, and add the result to a running sum. The final value of this running sum will be your difference metric. This metric corresponds to the square of the Euclidean distance between the two vectors in the space of shared words in the document. Note that this metric assumes that words not appearing in both documents do not affect the correlation. Partners You are encouraged (although not required) to work with a partner of your own choosing for this project. No more than two students total can be on a team together. Requirements There are five steps in this project: 1. Write two DataCounter dictionary implementations (AVL, Hash) 2. Modify WordCount to be able use your DataCounter implementations, and to select the implementation at runtime. 3. Write a document correlator that will print a difference score between two documents 4. Benchmark your data structures and correlator 5. Analyze and write up the results of your performance tests The analysis and writeup will be significantly longer in this project. Be sure to allocate time for it. It is worth 1/3 of your grade, and you will not be able to do it in an hour or two. DataCounter Implementations For this assignment, you must implement two data structures (you may reuse previous implementations): AVL tree Your implementation MUST extend BinarySearchTree included in the zip files regardless if you choose to modify the existing code for BinarySearchTree. Hash table You must implement your own hash code for the hash table. Do not use String.hashCode() Bothof these data structures must implement the DataCounter interface, which is a specialized version of DictionaryADT. You do not need to implement remove in any of these structures. You can implement any hash tables discussed in class; the only restriction is that it should not restrict the size of the input domain or the number of inputs (i.e. your hash table must grow). WordCount The WordCount program will read in a text file and tally up all the words that appear in it. The WordCount program given to you currently uses an unbalanced binary search tree as its backing DataCounter implementation and contains an insertion sort. You may base your WordCount program on it, or write your own. You need to add additional DataCounter implementations. The commandline form for WordCount will be as follows: java WordCount [ -b | -a | -h ] [ -frequency | -num_unique ]
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
