Question:

Create a function that processes a document in the following steps:
1) Tokenize the words using NLTK.
2) Stem each token with the Porter stemmer.
3) Count the term frequency tf for each term.
4) Calculate the weighted term frequency wf for each term:
   wf = 0 if tf = 0
   wf = 1 + ln(tf) otherwise

Apply this function to every document in the collection. Generate an index for the collection by merging the terms from all the documents. Then calculate the document frequency df of each index term, that is, the number of documents in the collection that contain it. Next, calculate the inverse document frequency idf for each term in the index, where idf = ln(n/df) and n is the number of documents. Finally, assign a wf.idf weight to each index term i in each document d: w = wf x idf. The result is the term x document matrix, with rows indexed by the terms in the index and columns indexed by the documents.
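
One possible way to approach the question is sketched below in Python. It is a minimal illustration, not a verified solution. The function names (nltk_tools, process_document, build_matrix) are hypothetical, and lowercasing tokens and dropping non-alphanumeric tokens are assumptions the question does not state. The tokenizer and stemmer are passed in as callables so the weighting logic can be checked without NLTK installed; nltk_tools shows how the NLTK versions would be obtained.

```python
import math
from collections import Counter


def nltk_tools():
    """Return (tokenize, stem) callables backed by NLTK.

    Imported lazily so the rest of the sketch runs without NLTK.
    word_tokenize needs the Punkt tokenizer data, e.g. nltk.download("punkt")
    (newer NLTK releases ask for "punkt_tab" instead).
    """
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize
    return word_tokenize, PorterStemmer().stem


def process_document(text, tokenize, stem):
    """Steps 1-4: tokenize, stem, count tf, and return wf = 1 + ln(tf)."""
    # Assumption: lowercase and keep only alphanumeric tokens.
    stems = [stem(tok.lower()) for tok in tokenize(text) if tok.isalnum()]
    tf = Counter(stems)
    # Terms with tf = 0 have wf = 0; they are simply not stored here.
    return {term: 1.0 + math.log(count) for term, count in tf.items()}


def build_matrix(docs, tokenize, stem):
    """Return (index, df, idf, matrix) for a list of document strings."""
    wf_per_doc = [process_document(d, tokenize, stem) for d in docs]
    # Index: the sorted union of the terms of all documents.
    index = sorted(set().union(*wf_per_doc))
    n = len(docs)
    # df: number of documents that contain each index term.
    df = {t: sum(t in wf for wf in wf_per_doc) for t in index}
    # idf = ln(n / df); a term found in every document gets idf = 0.
    idf = {t: math.log(n / df[t]) for t in index}
    # Rows are index terms, columns are documents, entries are w = wf x idf.
    matrix = [[wf.get(t, 0.0) * idf[t] for wf in wf_per_doc] for t in index]
    return index, df, idf, matrix
```

With NLTK and its tokenizer data available, the call would be build_matrix(docs, *nltk_tools()), where docs is a list of document strings. Note that a term appearing in every document has idf = ln(1) = 0, so its whole row in the matrix is zero.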
