Question: In Java: The CiteSeer UMD collection is a standard text document collection, consisting of abstracts of research articles from Computer Science, which are sampled from

The CiteSeer UMD collection is a standard text document collection consisting of abstracts of research articles from Computer Science, sampled from the CiteSeer digital library. The dataset is available for download from Blackboard.

Tasks:

1. Write a program that preprocesses the collection. The preprocessing stage should include a function that tokenizes the text: tokenize on whitespace and remove punctuation. For this task, use your own implementation of a tokenizer.
2. Determine the frequency of occurrence of every word in the collection.

Answer the following questions:

1. What is the total number of words in the collection?
2. What is the vocabulary size (i.e., the number of unique terms)?
3. What are the top 20 words in the ranking (i.e., the words with the highest frequencies)?
4. From these top 20 words, which ones are stop-words?
5. What is the minimum number of unique words accounting for 15% of the total number of words in the collection?

Example: if the total number of words in the collection is 100, and we have the following word-frequency pairs:

Word     Frequency
the      20
of       10
data     10
mining   8

the answer to this question will be 1 (1 word accounts for 15% of the total 100 words).
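
One possible way to approach the tasks above is sketched below. It is a minimal outline under several assumptions, not a reference solution: the class name CorpusStats and its method names are made up for illustration, the dataset is assumed to be unpacked as plain-text files under a directory passed on the command line, tokens are lowercased (the assignment does not say whether case matters), and any character that is neither whitespace nor a letter or digit is treated as punctuation and dropped.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CorpusStats {

    // Hand-written tokenizer: split on whitespace, drop punctuation, lowercase (assumption).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(current, tokens);
            } else if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            }
            // Anything else is treated as punctuation and skipped.
        }
        flush(current, tokens);
        return tokens;
    }

    // Emit the token built so far, if it is non-empty, and reset the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    // Word -> number of occurrences.
    static Map<String, Integer> countFrequencies(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        return freq;
    }

    // Entries sorted by descending frequency; ties broken alphabetically.
    static List<Map.Entry<String, Integer>> rank(Map<String, Integer> freq) {
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(freq.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    // Smallest number of top-ranked words whose counts cover `percent`% of `total`.
    static int minWordsForShare(List<Map.Entry<String, Integer>> ranked, long total, int percent) {
        long covered = 0;
        int words = 0;
        for (Map.Entry<String, Integer> e : ranked) {
            if (covered * 100 >= (long) percent * total) {
                break;
            }
            covered += e.getValue();
            words++;
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java CorpusStats.java <dataset-directory>");
            return;
        }
        Path root = Paths.get(args[0]);
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<String> tokens = new ArrayList<>();
        for (Path p : files) {
            tokens.addAll(tokenize(new String(Files.readAllBytes(p), StandardCharsets.UTF_8)));
        }
        Map<String, Integer> freq = countFrequencies(tokens);
        List<Map.Entry<String, Integer>> ranked = rank(freq);

        System.out.println("Total words: " + tokens.size());
        System.out.println("Vocabulary size: " + freq.size());
        System.out.println("Top 20 words:");
        for (Map.Entry<String, Integer> e : ranked.subList(0, Math.min(20, ranked.size()))) {
            System.out.println("  " + e.getKey() + "\t" + e.getValue());
        }
        System.out.println("Words covering 15%: " + minWordsForShare(ranked, tokens.size(), 15));
    }
}
```

Because punctuation is removed inside a token rather than used as a split point, a hyphenated term such as "state-of-the-art" becomes the single token "stateoftheart"; splitting on hyphens instead would change the counts. Question 4 is not automated here, since the assignment names no stop-word list: the top 20 words printed by the program would need to be compared against whichever list the course uses.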
