Question:

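Write a function called tfidf_matrix(docs, stopwords) (invoked as create_tfidf_matrix in the sample call below). Given a list of documents and a list of stopwords, it should construct the TF-IDF matrix for the corpus. Use the NLTK English stopword list as the stopwords and the files of the NLTK Brown corpus as the documents.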

Documents should first be pre-processed as follows: 1) lowercase all words, 2) exclude stopwords, and 3) keep only alphanumeric strings. The words that remain in the documents after pre-processing constitute the vocabulary.
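The three pre-processing steps above can be sketched as a small helper. This is only an illustration: the name preprocess is hypothetical, and "alphanumeric" is assumed to mean that Python's str.isalnum() returns True for the token, so punctuation and hyphenated tokens are dropped. A different reading of "alphanumeric" would change the vocabulary.

```python
def preprocess(doc, stopwords):
    """Lowercase tokens, drop stopwords, keep alphanumeric tokens only.

    `preprocess` is a hypothetical helper name, not part of the assignment.
    ASSUMPTION: "alphanumeric" is taken to mean str.isalnum() is True.
    """
    stop = set(stopwords)  # set membership is O(1) per token
    tokens = (w.lower() for w in doc)
    return [w for w in tokens if w.isalnum() and w not in stop]


doc = ["The", "Fulton", "County", "Grand", "Jury", "said", ",",
       "``", "no", "evidence", "''", "in", "1961", "."]
print(preprocess(doc, ["the", "no", "in"]))
# ['fulton', 'county', 'grand', 'jury', 'said', 'evidence', '1961']
```

Lowercasing happens before the stopword check because the NLTK English stopword list is lowercase.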

The function should return the TF-IDF matrix of dimension (num_docs, num_words) and the vocabulary inferred from the corpus. The returned vocabulary should be sorted in alphabetical order. The rows of the matrix should follow the order of the documents and the columns should follow the order of the sorted vocabulary, i.e. cell (i, j) of the matrix should hold the TF-IDF score of the jth word in the sorted vocabulary for the ith document.

The TF-IDF scoring function has several forms. To ensure reproducibility, implement raw-count term frequency and smoothed inverse document frequency.
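The question does not say which smoothed IDF form is intended. The sketch below assumes idf(w) = ln(N / (1 + df(w))), where N is the number of documents and df(w) is the number of documents containing w. That choice is an assumption, though it is at least consistent with the sample output: with N = 500, a word that occurs once in a document and appears in 4 documents scores ln(100) = 4.60517019. The function name tfidf_rows is hypothetical, the input is assumed to be already pre-processed, and plain nested lists are used to keep the toy example dependency-free; for the Brown corpus a NumPy array would be the natural container.

```python
import math
from collections import Counter


def tfidf_rows(processed_docs):
    """Return (matrix, vocabulary) for already pre-processed token lists.

    Hypothetical helper for illustration only.
    """
    vocab = sorted({w for doc in processed_docs for w in doc})
    col = {w: j for j, w in enumerate(vocab)}  # word -> column index
    n_docs = len(processed_docs)

    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in processed_docs for w in set(doc))
    # ASSUMED smoothing: idf = ln(N / (1 + df)).
    idf = [math.log(n_docs / (1 + df[w])) for w in vocab]

    matrix = [[0.0] * len(vocab) for _ in processed_docs]
    for i, doc in enumerate(processed_docs):
        for w, count in Counter(doc).items():  # raw-count term frequency
            matrix[i][col[w]] = count * idf[col[w]]
    return matrix, vocab


toy = [["apple", "apple", "pear"], ["pear"], ["fig"], ["fig", "pear"]]
matrix, vocab = tfidf_rows(toy)
print(vocab)                    # ['apple', 'fig', 'pear']
print(round(matrix[0][0], 4))   # 1.3863, i.e. 2 * ln(4 / 2)
```

Under this form a word that occurs in N - 1 documents gets an IDF of exactly zero, and a word that occurs in every document gets a slightly negative one, which is why "pear" contributes nothing in the toy corpus.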

Then write a function get_idf_values(documents, stopwords) that returns the IDF values.
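A possible shape for get_idf_values is sketched here. The return type is not specified in the question, so returning a dict that maps each vocabulary word to its IDF is an assumption, as is the smoothing form ln(N / (1 + df)) and the use of str.isalnum() for the alphanumeric test.

```python
import math
from collections import Counter


def get_idf_values(documents, stopwords):
    """Map each vocabulary word to its smoothed IDF.

    ASSUMPTIONS: dict return type, idf = ln(N / (1 + df)),
    and "alphanumeric" meaning str.isalnum() is True.
    """
    stop = set(stopwords)
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        # A set counts each word at most once per document.
        words = {w.lower() for w in doc}
        df.update(w for w in words if w.isalnum() and w not in stop)
    # Insert keys in alphabetical order to mirror the sorted vocabulary.
    return {w: math.log(n_docs / (1 + df[w])) for w in sorted(df)}


toy_docs = [["The", "Cat"], ["the", "cat", "sat"], ["A", "dog", "!"]]
idf = get_idf_values(toy_docs, ["the", "a"])
print(list(idf))   # ['cat', 'dog', 'sat']
print(idf["cat"])  # 0.0, since ln(3 / (1 + 2)) = ln(1)
```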

The functions should produce the following output:

import numpy as np
from nltk.corpus import brown, stopwords
documents = [brown.words(fileid) for fileid in brown.fileids()]

stopwords_list = stopwords.words('english')

tfidf_matrix, vocabulary = create_tfidf_matrix(documents, stopwords_list)
idf_values = get_idf_values(documents, stopwords_list)
print(tfidf_matrix.shape)
# (500, 40881)

print(tfidf_matrix[np.nonzero(tfidf_matrix)][:5])
# [4.53281493 2.55104645 4.60517019 2.47693848 3.91202301]

print(vocabulary[2000:2010])

# ['amoral', 'amorality', 'amorist', 'amorous', 'amorphous', 'amorphously', 'amortization', 'amortize', 'amory', 'amos']