Write a function called tfidf_matrix(docs, stopwords). Given a list of documents and a list of stopwords, it should construct the TF-IDF matrix for the corpus. Use the NLTK stopword list, and the Brown corpus from NLTK as the documents.

Documents should first be pre-processed as follows: 1) lowercase all words, 2) exclude stopwords, and 3) keep alphanumeric strings only. The words that remain in the documents after pre-processing constitute the vocabulary.
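The three pre-processing steps can be sketched as below. This is a minimal illustration, not the reference solution: the helper name preprocess is hypothetical, "alphanumeric" is assumed to mean Python's str.isalnum() (so hyphenated tokens such as "well-known" are dropped), and lowercasing is assumed to happen before the stopword check.

```python
def preprocess(doc, stopwords):
    """Lowercase tokens, drop stopwords, keep alphanumeric strings only.

    Hypothetical helper; `doc` is any iterable of string tokens.
    """
    stop = set(stopwords)  # set membership is faster than list membership
    lowered = (word.lower() for word in doc)
    # Assumption: "alphanumeric" means str.isalnum() is True for the token.
    return [t for t in lowered if t.isalnum() and t not in stop]


tokens = ["The", "Fulton", "County", "Grand", "Jury", "said", ",", "no", "evidence", "."]
print(preprocess(tokens, ["the", "no"]))
# ['fulton', 'county', 'grand', 'jury', 'said', 'evidence']
```

Converting the stopword list to a set matters on the Brown corpus, which has over a million tokens to check.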

The function should return the TF-IDF matrix of dimension (num_docs, num_words) and the vocabulary inferred from the corpus. The returned vocabulary should be sorted in alphabetical order. The rows of the TF-IDF matrix should follow the order of the documents, and the columns should follow the order of the sorted vocabulary, i.e. cell (i, j) of the matrix should hold the TF-IDF score of the jth word in the sorted vocabulary for the ith document.
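The row and column layout described above can be illustrated with a small raw-count matrix. This is a sketch on already pre-processed toy documents; the helper name count_matrix is hypothetical, and "alphabetical order" is assumed to mean Python's default string sort (which places digits before letters).

```python
import numpy as np


def count_matrix(processed_docs):
    """Build a (num_docs, num_words) raw-count matrix and the sorted vocabulary.

    Hypothetical helper; `processed_docs` is a list of token lists.
    """
    vocabulary = sorted({word for doc in processed_docs for word in doc})
    column = {word: j for j, word in enumerate(vocabulary)}
    counts = np.zeros((len(processed_docs), len(vocabulary)))
    for i, doc in enumerate(processed_docs):
        for word in doc:
            counts[i, column[word]] += 1  # row i = document, column j = word
    return counts, vocabulary


counts, vocab = count_matrix([["b", "a", "b"], ["c", "a"]])
print(vocab)   # ['a', 'b', 'c']
print(counts)  # row 0 -> [1, 2, 0], row 1 -> [1, 0, 1]
```

The dictionary lookup for the column index avoids calling vocabulary.index(word) inside the loop, which would be slow for a vocabulary of tens of thousands of words.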

The TF-IDF scoring function has several forms. To ensure reproducibility, implement raw-count term frequency and smoothed inverse document frequency.
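The question does not spell out the formulas. One reading that is consistent with the sample values in the expected output (an inference from those numbers, not something the question states) is:

```latex
\mathrm{tf}(t, d) = \text{number of occurrences of } t \text{ in } d, \qquad
\mathrm{idf}(t) = \ln\frac{N}{1 + \mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Here N is the number of documents (500 for the Brown corpus) and df(t) is the number of documents that contain t after pre-processing. Under this reading, a word that occurs once in a document and appears in 4 documents scores ln(500/5) = ln 100 = 4.60517019, and one that appears in 9 documents scores ln(500/10) = ln 50 = 3.91202301; both values appear in the sample output. Other smoothed variants exist (for example, adding 1 to the logarithm), so this should be checked against the grader.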

 

Then write a function get_idf_values(documents, stopwords) that returns the IDF values.
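A possible shape for both functions is sketched below on a toy corpus. Several things here are assumptions rather than part of the question: the smoothed IDF is taken to be log(N / (1 + df)), get_idf_values is assumed to return a dict mapping each word to its IDF, the helper preprocess is hypothetical, and the TF-IDF function is named create_tfidf_matrix to match the sample call in the question.

```python
import math
from collections import Counter

import numpy as np


def preprocess(doc, stop):
    """Lowercase, drop stopwords, keep str.isalnum() tokens (assumed reading)."""
    return [t for t in (w.lower() for w in doc) if t.isalnum() and t not in stop]


def get_idf_values(documents, stopwords):
    """Return {word: log(N / (1 + df))}; the smoothing form is an assumption."""
    stop = set(stopwords)
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(preprocess(doc, stop)))  # count each word once per doc
    return {word: math.log(n_docs / (1 + doc_freq[word])) for word in sorted(doc_freq)}


def create_tfidf_matrix(documents, stopwords):
    """Return the (num_docs, num_words) TF-IDF matrix and the sorted vocabulary."""
    stop = set(stopwords)
    idf = get_idf_values(documents, stopwords)
    vocabulary = sorted(idf)
    column = {word: j for j, word in enumerate(vocabulary)}
    tf = np.zeros((len(documents), len(vocabulary)))
    for i, doc in enumerate(documents):
        for word, count in Counter(preprocess(doc, stop)).items():
            tf[i, column[word]] = count  # raw-count term frequency
    idf_vector = np.array([idf[word] for word in vocabulary])
    return tf * idf_vector, vocabulary  # broadcasts the IDF across rows


toy_docs = [["The", "cat", "sat"], ["The", "dog", "sat", "sat"], ["A", "bird"], ["Fish", "!"]]
matrix, vocab = create_tfidf_matrix(toy_docs, ["the", "a"])
print(vocab)         # ['bird', 'cat', 'dog', 'fish', 'sat']
print(matrix.shape)  # (4, 5)
```

With this form of smoothing, a word that appears in every document gets a slightly negative IDF, and a word that appears in all but one document gets an IDF of exactly zero.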

 

The expected output is as follows:

import numpy as np
from nltk.corpus import brown
from nltk.corpus import stopwords

# Each Brown corpus file is treated as one document (a sequence of word tokens)
documents = [brown.words(fileid) for fileid in brown.fileids()]
stopwords_list = stopwords.words('english')

tfidf_matrix, vocabulary = create_tfidf_matrix(documents, stopwords_list)
idf_values = get_idf_values(documents, stopwords_list)

print(tfidf_matrix.shape)
# (500, 40881)

print(tfidf_matrix[np.nonzero(tfidf_matrix)][:5])
# [4.53281493 2.55104645 4.60517019 2.47693848 3.91202301]

print(vocabulary[2000:2010])
# ['amoral', 'amorality', 'amorist', 'amorous', 'amorphous', 'amorphously', 'amortization', 'amortize', 'amory', 'amos']

Step by Step Solution

import numpy as np
from collections import Counter
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk import word_tokenize
from nl...
