Question:
# TF-IDF and PPMI Code Template
This template will help you implement TF-IDF and PPMI calculations using the NLTK library and the Brown corpus.
You will preprocess the corpus, compute term frequencies, document frequencies, and TF-IDF scores, and create
a word co-occurrence matrix to compute Positive Pointwise Mutual Information (PPMI) scores.
import nltk
from nltk.corpus import brown
from collections import defaultdict, Counter
import math
import numpy as np

# Download the Brown corpus if not already downloaded
nltk.download('brown')

# Preprocess the corpus: tokenize, lowercase, and add start/end tokens
def preprocess_corpus(corpus):
    """
    Preprocess the corpus by tokenizing, converting to lowercase, and adding start and end tokens.

    Args:
        corpus (list): List of sentences from the corpus.

    Returns:
        list: Preprocessed and tokenized corpus.
    """
    tokenized_corpus = []
    for sentence in corpus:
        # TODO: Implement tokenization and lowercasing
        # HINT: Use a list comprehension and str.lower()
        # TODO: Add a start token at the start and an end token at the end of the sentence
        pass  # Remove this line after implementing
    return tokenized_corpus
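
One possible way to fill in the preprocessing step is sketched below. The marker strings `<s>` and `</s>` and the `start_token`/`end_token` parameters are assumptions, since the template does not show which tokens it expects. The toy input mirrors the shape of `brown.sents()`, which already yields each sentence as a list of word strings.

```python
def preprocess_corpus(corpus, start_token="<s>", end_token="</s>"):
    """Lowercase every token and wrap each sentence in boundary markers.

    The default marker strings are assumptions, not taken from the template.
    """
    tokenized_corpus = []
    for sentence in corpus:
        # Each sentence is expected to be a list of word strings already.
        tokens = [word.lower() for word in sentence]
        tokenized_corpus.append([start_token] + tokens + [end_token])
    return tokenized_corpus


# Toy input standing in for a couple of Brown corpus sentences.
toy_sentences = [["The", "Fulton", "County"], ["Grand", "Jury", "said"]]
processed = preprocess_corpus(toy_sentences)
```

Passing the markers as parameters keeps the sketch usable with whatever tokens the assignment actually requires.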

# Calculate Term Frequency (TF)
def compute_tf(corpus):
    """
    Calculate the term frequency for each word in each document.

    Args:
        corpus (list): Preprocessed corpus where each document is a list of words.

    Returns:
        dict: Term frequencies for each document.
    """
    tf = defaultdict(Counter)
    # TODO: For each document, count the occurrences of each word
    # HINT: Use enumerate to get the document index and Counter to count words
    pass  # Remove this line after implementing
    return tf
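
A minimal sketch of the term-frequency step, following the hint about `enumerate` and `Counter`. It assumes that raw counts (rather than length-normalized frequencies) are what the template intends, and it keys each document by its position in the corpus.

```python
from collections import Counter, defaultdict


def compute_tf(corpus):
    """Count how often each word occurs in each document (raw counts)."""
    tf = defaultdict(Counter)
    for doc_id, document in enumerate(corpus):
        # Counter.update adds one count per occurrence in the iterable.
        tf[doc_id].update(document)
    return tf


# Toy corpus of two short "documents".
toy_corpus = [["a", "b", "a"], ["b", "c"]]
tf = compute_tf(toy_corpus)
```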

# Calculate Document Frequency (DF)
def compute_df(tf):
    """
    Calculate the document frequency for each word across all documents.

    Args:
        tf (dict): Term frequencies for each document.

    Returns:
        Counter: Document frequencies for each word.
    """
    df = Counter()
    # TODO: For each word, count the number of documents it appears in
    # HINT: Use a set of words for each document to avoid counting duplicates
    pass  # Remove this line after implementing
    return df
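
The document-frequency step can be sketched as follows. The approach is to take the set of distinct words in each document and add one count per word, as the hint suggests; the toy `tf` dictionary is made up for illustration.

```python
from collections import Counter


def compute_df(tf):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for word_counts in tf.values():
        # set(...) keeps each word once per document, so a word that
        # occurs many times in one document still contributes only 1.
        df.update(set(word_counts))
    return df


# Hypothetical term frequencies for two documents.
toy_tf = {0: Counter({"a": 2, "b": 1}), 1: Counter({"b": 1, "c": 1})}
df = compute_df(toy_tf)
```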

# Calculate TF-IDF for each word
def compute_tfidf(tf, df, num_docs):
    """
    Calculate the TF-IDF score for each word in each document.

    Args:
        tf (dict): Term frequencies for each document.
        df (Counter): Document frequencies for each word.
        num_docs (int): Total number of documents.

    Returns:
        dict: TF-IDF scores for each word in each document.
    """
    tfidf = defaultdict(dict)
    # TODO: For each document and word, calculate the TF-IDF score
    # TF-IDF formula: TF(word) * log(N / DF(word))
    # HINT: Use math.log for the logarithm
    pass  # Remove this line after implementing
    return tfidf
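
A sketch of the scoring step using the formula from the template, TF(word) * log(N / DF(word)). It assumes the natural logarithm (what `math.log` computes with one argument) and raw counts as TF; the toy inputs are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def compute_tfidf(tf, df, num_docs):
    """Score each (document, word) pair as TF * log(N / DF)."""
    tfidf = defaultdict(dict)
    for doc_id, word_counts in tf.items():
        for word, count in word_counts.items():
            # df[word] >= 1 for every word that appears in some document.
            tfidf[doc_id][word] = count * math.log(num_docs / df[word])
    return tfidf


# Hypothetical inputs: "b" appears in both documents, "a" and "c" in one each.
toy_tf = {0: Counter({"a": 2, "b": 1}), 1: Counter({"b": 1, "c": 1})}
toy_df = Counter({"a": 1, "b": 2, "c": 1})
scores = compute_tfidf(toy_tf, toy_df, num_docs=2)
```

A word that occurs in every document, such as "b" here, gets a score of zero because log(N / N) = 0.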

# Create a word co-occurrence matrix
def create_cooccurrence_matrix(corpus, window_size):
    """
    Create a word co-occurrence matrix from the corpus.

    Args:
        corpus (list): Preprocessed corpus where each document is a list of words.
        window_size (int): The size of the context window.

    Returns:
        tuple: Co-occurrence matrix, word-to-index mapping, and index-to-word mapping.
    """
    # TODO: Build the vocabulary of unique words
    # HINT: Use a set to store unique words
    pass  # Remove this line after implementing

    # TODO: Initialize the co-occurrence matrix
    # HINT: Use numpy to create a zero matrix of size vocab_size x vocab_size
    pass  # Remove this line after implementing

    # TODO: Fill in the co-occurrence matrix
    # HINT: For each word, consider a window of words around it
    pass  # Remove this line after implementing

    return cooccurrence_matrix, word_to_id, id_to_word
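
The three TODO steps could be filled in along the following lines. This sketch assumes a symmetric window (`window_size` words on each side of the target), does not count a word as its own context, and sorts the vocabulary so that indices are reproducible; none of these choices is fixed by the template.

```python
import numpy as np


def create_cooccurrence_matrix(corpus, window_size):
    """Count how often each word pair co-occurs within a symmetric window."""
    # Build the vocabulary; sorting makes the index assignment deterministic.
    vocab = sorted({word for document in corpus for word in document})
    word_to_id = {word: i for i, word in enumerate(vocab)}
    id_to_word = {i: word for word, i in word_to_id.items()}

    matrix = np.zeros((len(vocab), len(vocab)))
    for document in corpus:
        for pos, word in enumerate(document):
            # Clip the window at the document boundaries.
            lo = max(0, pos - window_size)
            hi = min(len(document), pos + window_size + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    matrix[word_to_id[word], word_to_id[document[ctx_pos]]] += 1
    return matrix, word_to_id, id_to_word


toy_corpus = [["a", "b", "c"], ["a", "b"]]
matrix, word_to_id, id_to_word = create_cooccurrence_matrix(toy_corpus, window_size=1)
```

A dense `vocab_size x vocab_size` array is simple but grows quadratically with the vocabulary, so for the full Brown corpus it may be worth restricting the vocabulary first.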

# Calculate PPMI from the co-occurrence matrix
def compute_ppmi(cooccurrence_matrix):
    """
    Compute the Positive Pointwise Mutual Information (PPMI) matrix from the co-occurrence matrix.

    Args:
        cooccurrence_matrix (numpy.ndarray): Co-occurrence matrix of word counts.

    Returns:
        numpy.ndarray: PPMI matrix.
    """
    # TODO: Calculate the total sum of all co-occurrences
    # HINT: Use numpy.sum
    pass  # Remove this line after implementing

    # TODO: Calculate the sum over rows (word occurrence counts)
    # HINT: Use numpy.sum with the axis argument
    pass  # Remove this line after implementing

    # Initialize the PPMI matrix with zeros
    ppmi_matrix = np.zeros(cooccurrence_matrix.shape)

    # TODO: Compute PPMI for each cell in the matrix
    # HINT: Use nested loops to iterate over the matrix indices
    # Remember to check that p_ij > 0 before computing PMI
    pass  # Remove this line after implementing

    return ppmi_matrix
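
A sketch of the PPMI computation with nested loops, as the hint suggests. It takes PMI(i, j) = log2(p_ij / (p_i * p_j)) and PPMI = max(0, PMI); the base-2 logarithm and the use of column sums for the context probability are assumptions, and the matrix is assumed to contain at least one nonzero count.

```python
import math

import numpy as np


def compute_ppmi(cooccurrence_matrix):
    """Return max(0, log2(p_ij / (p_i * p_j))) for every cell."""
    total = np.sum(cooccurrence_matrix)              # all co-occurrence counts
    row_sums = np.sum(cooccurrence_matrix, axis=1)   # counts per target word
    col_sums = np.sum(cooccurrence_matrix, axis=0)   # counts per context word

    ppmi_matrix = np.zeros(cooccurrence_matrix.shape)
    n_rows, n_cols = cooccurrence_matrix.shape
    for i in range(n_rows):
        for j in range(n_cols):
            p_ij = cooccurrence_matrix[i, j] / total
            # Cells with zero counts stay at 0; log(0) is undefined.
            if p_ij > 0:
                p_i = row_sums[i] / total
                p_j = col_sums[j] / total
                ppmi_matrix[i, j] = max(0.0, math.log2(p_ij / (p_i * p_j)))
    return ppmi_matrix


# Hypothetical symmetric count matrix for a three-word vocabulary.
toy_counts = np.array([[0.0, 2.0, 0.0],
                       [2.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0]])
ppmi = compute_ppmi(toy_counts)
```

For a symmetric co-occurrence matrix the row and column sums coincide, so a single marginal would suffice; keeping both makes the sketch valid for asymmetric windows as well.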

# Main execution
if __name__ == "__main__":
