Write a function called create_tfidf_matrix(docs, stopwords). Given a list of documents and a list of stopwords, construct the
Question:
Documents should first be pre-processed as follows: 1) lowercase all words, 2) exclude stopwords, and 3) keep alphanumeric strings only. The words remaining in the documents after pre-processing constitute the vocabulary.
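The three pre-processing steps can be sketched as below. This is a minimal illustration, not a reference solution: the helper name preprocess and the toy documents are assumptions, and "alphanumeric" is taken to mean whatever Python's str.isalnum accepts.

```python
def preprocess(doc, stopwords):
    """Lowercase tokens, drop stopwords, keep alphanumeric tokens only."""
    stop = set(stopwords)  # set membership is cheap for large stopword lists
    tokens = (word.lower() for word in doc)
    return [t for t in tokens if t.isalnum() and t not in stop]


# Toy corpus (hypothetical): each document is a list of word tokens.
docs = [["The", "Cat", "sat", "."], ["A", "cat", "can't", "fly", "2day"]]
cleaned = [preprocess(d, ["the", "a"]) for d in docs]

# The vocabulary is every surviving token, sorted alphabetically.
vocabulary = sorted({t for d in cleaned for t in d})
```

Note that tokens such as "can't" are dropped entirely because they contain a non-alphanumeric character, and that digits sort before letters in the vocabulary.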
The function should return the TF-IDF matrix of dimension (num_docs, num_words) and the vocabulary inferred from the corpus. The returned vocabulary should be sorted in alphabetical order. The rows of the TF-IDF matrix should follow the order of the documents, and the columns should follow the order of the sorted vocabulary, i.e. cell (i, j) of the matrix should hold the TF-IDF score of the jth word of the sorted vocabulary in the ith document.
The TF-IDF scoring function has several forms. To ensure reproducibility, implement raw-count term frequency and smoothed inverse document frequency.
Then write a function get_idf_values(documents, stopwords) that returns the IDF values.
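The question does not state which smoothed IDF formula is intended, so the sketch below assumes idf(t) = ln(N / (1 + df(t))), where N is the number of documents and df(t) is the number of documents containing t. That choice would be consistent with sample values such as 4.60517019 = ln(500 / 5) and 3.91202301 = ln(500 / 10) for the 500-document corpus, but it remains an assumption. The helper preprocess, the dictionary return type of get_idf_values, and the toy corpus are also illustrative choices rather than part of the question.

```python
import math
from collections import Counter

import numpy as np


def preprocess(doc, stopwords):
    """Lowercase tokens, drop stopwords, keep alphanumeric tokens only."""
    stop = set(stopwords)
    lowered = (w.lower() for w in doc)
    return [t for t in lowered if t.isalnum() and t not in stop]


def get_idf_values(documents, stopwords):
    """Return {term: idf} using the assumed form ln(N / (1 + df))."""
    cleaned = [preprocess(d, stopwords) for d in documents]
    n_docs = len(cleaned)
    # Document frequency: count each term at most once per document.
    df = Counter(t for doc in cleaned for t in set(doc))
    return {t: math.log(n_docs / (1 + df[t])) for t in df}


def create_tfidf_matrix(documents, stopwords):
    """Return (matrix, vocabulary) with raw-count TF times the assumed IDF."""
    cleaned = [preprocess(d, stopwords) for d in documents]
    idf = get_idf_values(documents, stopwords)
    vocabulary = sorted(idf)
    column = {t: j for j, t in enumerate(vocabulary)}
    matrix = np.zeros((len(cleaned), len(vocabulary)))
    for i, doc in enumerate(cleaned):
        for term, count in Counter(doc).items():
            matrix[i, column[term]] = count * idf[term]
    return matrix, vocabulary


# Toy corpus (hypothetical) with N = 4 documents.
toy_docs = [
    ["Apple", "apple", "Banana"],
    ["banana", "the", "cherry"],
    ["Cherry", "!", "date"],
    ["date", "egg"],
]
toy_matrix, toy_vocab = create_tfidf_matrix(toy_docs, ["the"])
toy_idf = get_idf_values(toy_docs, ["the"])
```

Under this formula a term that occurs in N - 1 documents scores exactly 0, and a term that occurs in every document gets a negative IDF, so np.nonzero on the resulting matrix will skip the former. If the intended smoothing differs (for example, adding 1 to the logarithm), only the single expression inside get_idf_values needs to change.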
The expected output is as follows:
import numpy as np
from nltk.corpus import brown, stopwords

# Each file in the Brown corpus is treated as one document (a list of word tokens).
documents = [brown.words(fileid) for fileid in brown.fileids()]
stopwords_list = stopwords.words('english')
tfidf_matrix, vocabulary = create_tfidf_matrix(documents, stopwords_list)
idf_values = get_idf_values(documents, stopwords_list)
print(tfidf_matrix.shape)
# (500, 40881)
print(tfidf_matrix[np.nonzero(tfidf_matrix)][:5])
# [4.53281493 2.55104645 4.60517019 2.47693848 3.91202301]
print(vocabulary[2000:2010])
# ['amoral', 'amorality', 'amorist', 'amorous', 'amorphous', 'amorphously', 'amortization', 'amortize', 'amory', 'amos']