Question: Write a function called create_tfidf_matrix(docs, stopwords). Given a list of documents and a list of stopwords, construct the TF-IDF matrix for the corpus. Use the NLTK stopword list.
Documents should first be pre-processed as follows: 1) lowercase all words, 2) exclude stopwords, and 3) keep alphanumeric strings only. The words remaining in the documents after pre-processing constitute the vocabulary.
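The three pre-processing steps can be sketched as a small helper. This is only an illustration: the name preprocess is hypothetical, each document is assumed to already be a sequence of tokens (as brown.words returns), and "alphanumeric" is assumed to mean Python's str.isalnum().

```python
def preprocess(doc, stopwords):
    """Return the tokens of one document after the three pre-processing steps."""
    # Lowercase the stopwords too, so the comparison is case-insensitive.
    stop = {w.lower() for w in stopwords}
    # 1) lowercase every token
    tokens = (tok.lower() for tok in doc)
    # 2) drop stopwords, 3) keep alphanumeric strings only
    # ASSUMPTION: "alphanumeric" is interpreted as str.isalnum().
    return [t for t in tokens if t.isalnum() and t not in stop]


tokens = preprocess(["The", "Cat", "sat", ",", "on", "mat-1", "x2"], ["the", "on"])
print(tokens)
```

Lowercasing before the stopword check matters here, because the NLTK English stopword list is lowercase while corpus tokens such as "The" are not.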
The function should return the TF-IDF matrix of dimension (num_docs, num_words) and the vocabulary as inferred from the corpus. The vocabulary returned should be sorted in alphabetical order. The rows of the TF-IDF matrix should follow the order of the documents and the columns the order of the sorted vocabulary, i.e. the (i, j) cell of the matrix should hold the TF-IDF score of the jth word in the sorted vocabulary for the ith document.
The TF-IDF scoring function has several forms. To ensure reproducibility, you have to implement raw-count term frequency and smoothed inverse document frequency.
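A minimal sketch of one way to build the matrix follows. The question does not spell out the smoothing formula, so the code assumes idf(t) = ln(N / (1 + df(t))), where N is the number of documents and df(t) the number of documents containing t. Other smoothed variants exist and would give different numbers. The sample value 4.60517019 in the expected output equals ln 100, which suggests a natural logarithm, but the exact form intended by the exercise is not stated.

```python
from collections import Counter

import numpy as np


def create_tfidf_matrix(docs, stopwords):
    """Return (tfidf, vocabulary) for a list of tokenised documents."""
    stop = {w.lower() for w in stopwords}
    # Pre-process: lowercase, drop stopwords, keep alphanumeric tokens.
    cleaned = [
        [t for t in (w.lower() for w in doc) if t.isalnum() and t not in stop]
        for doc in docs
    ]
    vocabulary = sorted({t for doc in cleaned for t in doc})
    index = {word: j for j, word in enumerate(vocabulary)}
    n_docs = len(cleaned)

    # Raw-count term frequency: tf[i, j] = occurrences of word j in document i.
    tf = np.zeros((n_docs, len(vocabulary)))
    for i, doc in enumerate(cleaned):
        for word, count in Counter(doc).items():
            tf[i, index[word]] = count

    # Document frequency: number of documents in which each word appears.
    df = np.count_nonzero(tf, axis=0)
    # ASSUMPTION: smoothed IDF taken as ln(N / (1 + df)); not confirmed by the source.
    idf = np.log(n_docs / (1 + df))
    return tf * idf, vocabulary


docs = [["The", "cat", "sat"], ["The", "dog", "sat", "sat"], ["A", "bird"]]
matrix, vocab = create_tfidf_matrix(docs, ["the", "a"])
print(matrix.shape, vocab)
```

Under this assumed formula, a word that appears in every document gets a slightly negative IDF, and a word missing from exactly one document gets an IDF of zero, so such cells will not show up in np.nonzero.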
Then write a function get_idf_values(documents, stopwords) that returns the IDF values.
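The question does not say what shape the IDF values should take, so the sketch below makes two assumptions: it returns a dict mapping each vocabulary word to its IDF, and it uses the smoothed form ln(N / (1 + df)). Both are illustrative choices rather than the exercise's confirmed specification.

```python
import math


def get_idf_values(documents, stopwords):
    """Return {word: idf} for every word left after pre-processing."""
    stop = {w.lower() for w in stopwords}
    n_docs = len(documents)
    df = {}
    for doc in documents:
        # Use a set so each word is counted at most once per document.
        for t in {w.lower() for w in doc}:
            if t.isalnum() and t not in stop:
                df[t] = df.get(t, 0) + 1
    # ASSUMPTION: smoothed IDF taken as ln(N / (1 + df)); keys in alphabetical order.
    return {w: math.log(n_docs / (1 + df[w])) for w in sorted(df)}


sample_docs = [["The", "cat", "sat"], ["The", "dog", "sat", "sat"], ["A", "bird"]]
idf = get_idf_values(sample_docs, ["the", "a"])
print(idf)
```

Returning the words in sorted order keeps the values aligned with the columns of the TF-IDF matrix, which is convenient if the result is later converted to an array.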
The functions should produce the following output:
import numpy as np
from nltk.corpus import brown, stopwords
documents = [brown.words(fileid) for fileid in brown.fileids()]
stopwords_list = stopwords.words('english')
tfidf_matrix, vocabulary = create_tfidf_matrix(documents, stopwords_list)
idf_values = get_idf_values(documents, stopwords_list)
print(tfidf_matrix.shape)
# (500, 40881)
print(tfidf_matrix[np.nonzero(tfidf_matrix)][:5])
# [4.53281493 2.55104645 4.60517019 2.47693848 3.91202301]
print(vocabulary[2000:2010])
# ['amoral', 'amorality', 'amorist', 'amorous', 'amorphous', 'amorphously', 'amortization', 'amortize', 'amory', 'amos']
Step by Step Solution
There are 3 Steps involved in it
import numpy as np
from collections import Counter
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk import word_tokenize
from nl...
