INFO 153 HW5
This assignment is to process a set of text files and compute related TF and IDF statistics.
Data Files
Please collect about 20 text data instances (e.g. brief news reports or research abstracts) and save them as individual .txt files. File names should follow a sequential order, such as 1.txt, 2.txt, 3.txt, ... and 20.txt.
Create Document abstract data type/class
The Document class should have:
A dictionary variable to keep track of all unique words and their frequency in the document;
A tokenize(text) method that:
o Splits text into single words, using spaces and punctuation as delimiters;
o Uses a loop to go through all the words, and for each word:
If it does not appear in the dictionary, add it to the dictionary and set its count/frequency to 1;
If it is already in the dictionary, increment its count/frequency by 1;
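The Document requirements above could be sketched as follows. This is a minimal illustration, not the required solution: the attribute name word_counts, the regular expression used for splitting, and the lowercasing of words are all assumptions that the assignment does not specify.

```python
import re


class Document:
    """Holds the unique words of one document and their frequencies."""

    def __init__(self):
        # Maps each unique word to the number of times it occurs.
        # The name word_counts is an assumption for this sketch.
        self.word_counts = {}

    def tokenize(self, text):
        # Split on any run of characters that is not a letter, digit, or
        # underscore, so whitespace and punctuation both act as delimiters.
        for word in re.split(r"\W+", text):
            if not word:
                continue  # re.split can yield empty strings at the edges
            word = word.lower()  # assumption: counting is case-insensitive
            if word not in self.word_counts:
                self.word_counts[word] = 1
            else:
                self.word_counts[word] += 1


doc = Document()
doc.tokenize("The cat saw the dog. The dog ran!")
print(doc.word_counts)
# {'the': 3, 'cat': 1, 'saw': 1, 'dog': 2, 'ran': 1}
```

Note that splitting on punctuation also breaks contractions apart (for example, "don't" becomes "don" and "t"); whether that is acceptable depends on how the assignment is graded.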
Create save_dictionary function
The function should accept two arguments:
One argument for the dictionary with data to be saved;
A second argument for the pathname of the file to save the data to;
The function saves all data/statistics in the dictionary to a text file, with each key-value pair on one text line, separated by a tab ("\t"). The output file should look like:
Key1	value1
Key2	value2
Key3	value3
...
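One possible shape for save_dictionary is sketched below. The parameter names and the UTF-8 encoding are assumptions; the assignment only fixes the two arguments and the tab-separated line format.

```python
import os
import tempfile


def save_dictionary(dictionary, pathname):
    # Write one "key<TAB>value" line per dictionary entry.
    # Assumption: UTF-8 output; the assignment does not name an encoding.
    with open(pathname, "w", encoding="utf-8") as out_file:
        for key, value in dictionary.items():
            out_file.write(f"{key}\t{value}\n")


# Usage: save a small dictionary to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "tf_example.txt")
    save_dictionary({"apple": 3, "pear": 1}, out_path)
    with open(out_path, encoding="utf-8") as in_file:
        saved_text = in_file.read()

print(saved_text)
```

Entries are written in the dictionary's insertion order. If a sorted output file is preferred, iterating over sorted(dictionary.items()) would be a small change.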
Create vectorize function
The function should:
Take a string argument as the path to where the text data files are;
Process all data files in the path and produce TF and IDF statistics;
Here are steps in the function:
Create a dictionary variable to keep track of all unique words and their DF (document frequency);
List all .txt files in the path argument;
For each file:
o Create a Document object (based on the Document class);
o Read the content (text lines) from the text file;
o Call the document object's tokenize function to process the text content;
o Call the save_dictionary function to save the document's dictionary with TF (term frequencies) to a file, where the filename should be tf_DOCID.txt in the same path.
For example, after processing the 1.txt file, the data should be saved to the tf_1.txt file in the same directory.
o Create a nested loop, and for each word in the document's dictionary:
If it does not appear in the DF dictionary, add the word to the DF dictionary with a DF value of 1;
If it is already in the DF dictionary, increment its DF value by 1;
After all files are processed, call the save_dictionary function again to save the DF dictionary to a file named df.txt in the same path as the input text files.
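The vectorize steps above could be put together roughly as follows. This is a sketch under stated assumptions, not the required solution: compact versions of Document and save_dictionary are repeated so the example runs on its own, and three details are assumptions that the assignment does not mention. They are the word_counts attribute name, skipping tf_*.txt and df.txt files left by an earlier run, and returning the DF dictionary.

```python
import glob
import os
import re
import tempfile


class Document:
    def __init__(self):
        self.word_counts = {}  # word -> term frequency (name is an assumption)

    def tokenize(self, text):
        # Lowercasing is an assumption; split on whitespace and punctuation.
        for word in re.split(r"\W+", text.lower()):
            if word:
                self.word_counts[word] = self.word_counts.get(word, 0) + 1


def save_dictionary(dictionary, pathname):
    with open(pathname, "w", encoding="utf-8") as out_file:
        for key, value in dictionary.items():
            out_file.write(f"{key}\t{value}\n")


def vectorize(path):
    df = {}  # word -> number of documents that contain it
    for file_path in sorted(glob.glob(os.path.join(path, "*.txt"))):
        file_name = os.path.basename(file_path)
        # Assumption: skip output files left over from an earlier run.
        if file_name.startswith("tf_") or file_name == "df.txt":
            continue
        doc_id = os.path.splitext(file_name)[0]  # "1.txt" -> "1"
        doc = Document()
        with open(file_path, encoding="utf-8") as in_file:
            doc.tokenize(in_file.read())
        save_dictionary(doc.word_counts, os.path.join(path, f"tf_{doc_id}.txt"))
        # Each unique word in this document adds 1 to its DF.
        for word in doc.word_counts:
            if word not in df:
                df[word] = 1
            else:
                df[word] += 1
    save_dictionary(df, os.path.join(path, "df.txt"))
    return df  # assumption: returned for convenience; the spec only saves it


# Usage on two tiny made-up documents in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [("1.txt", "red fish blue fish"), ("2.txt", "red bird")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    df_counts = vectorize(tmp_dir)
    with open(os.path.join(tmp_dir, "tf_1.txt"), encoding="utf-8") as f:
        tf_1_text = f.read()

print(df_counts)  # {'red': 2, 'fish': 1, 'blue': 1, 'bird': 1}
```

Because the inner loop runs over the keys of the document's dictionary rather than over every token, a word that occurs several times in one file still adds only 1 to its DF.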
