TF-IDF and PPMI Code Template

This template will help you implement TF-IDF and PPMI calculations using the NLTK library and the Brown corpus. You will preprocess the corpus, compute term frequencies, document frequencies, and TF-IDF scores, and create a word co-occurrence matrix to compute Positive Pointwise Mutual Information (PPMI) scores. The template defines six functions whose bodies are left as TODOs: preprocess, compute_tf, compute_df, compute_tfidf, create_cooccurrence_matrix, and compute_ppmi.


# TF-IDF and PPMI Code Template
"""
This template will help you implement TF-IDF and PPMI calculations using the NLTK library and the Brown corpus.
You will preprocess the corpus, compute term frequencies, document frequencies, TF-IDF scores, and create
a word co-occurrence matrix to compute Positive Pointwise Mutual Information (PPMI) scores.
"""

import nltk
from nltk.corpus import brown
from collections import defaultdict, Counter
import math
import numpy as np

# Download the Brown corpus if not already downloaded
nltk.download('brown')

# Preprocess the corpus: Tokenize, lowercase, and add start/end tokens
def preprocess(corpus):
    """
    Preprocess the corpus by tokenizing, converting to lowercase, and adding start and end tokens.

    Args:
        corpus (list): List of sentences from the corpus.

    Returns:
        list: Preprocessed and tokenized corpus.
    """
    tokenized_corpus = []
    for sentence in corpus:
        # TODO: Implement tokenization and lowercasing
        # HINT: Use list comprehension and str.lower()
        # TODO: Add the start token at the start and the end token at the end of the sentence
        pass  # Remove this line after implementing
    return tokenized_corpus
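One way the TODOs in preprocess could be filled in is sketched below. It assumes each sentence is already a list of word strings (the form returned by brown.sents()), and it uses `<s>` and `</s>` as the start and end markers. Those two strings are an assumption, since the template does not show the actual tokens.

```python
def preprocess(corpus, start_token="<s>", end_token="</s>"):
    """Lowercase every token and wrap each sentence in boundary markers.

    The default marker strings are assumptions, not taken from the template.
    """
    tokenized_corpus = []
    for sentence in corpus:
        # Each sentence is assumed to be an iterable of word strings.
        tokens = [word.lower() for word in sentence]
        tokenized_corpus.append([start_token] + tokens + [end_token])
    return tokenized_corpus


sample = [["The", "Fulton", "County"], ["Grand", "Jury"]]
print(preprocess(sample))
# [['<s>', 'the', 'fulton', 'county', '</s>'], ['<s>', 'grand', 'jury', '</s>']]
```

Passing the markers as keyword arguments keeps the sketch usable if the assignment expects different token strings.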

# Calculate Term Frequency (TF)
def compute_tf(corpus):
    """
    Calculate the term frequency for each word in each document.

    Args:
        corpus (list): Preprocessed corpus where each document is a list of words.

    Returns:
        dict: Term frequencies for each document.
    """
    tf = defaultdict(Counter)
    # TODO: For each document, count the occurrences of each word
    # HINT: Use enumerate to get document index and Counter to count words
    pass  # Remove this line after implementing
    return tf
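A minimal sketch of compute_tf following the hint (enumerate plus Counter) might look like this. It treats term frequency as the raw count of a word in a document, which is an assumption; some courses normalize by document length instead.

```python
from collections import Counter, defaultdict


def compute_tf(corpus):
    """Map each document index to a Counter of raw word counts."""
    tf = defaultdict(Counter)
    for doc_id, document in enumerate(corpus):
        # Counter.update adds one count per occurrence of each word.
        tf[doc_id].update(document)
    return tf


docs = [["<s>", "the", "cat", "the", "</s>"], ["<s>", "a", "dog", "</s>"]]
tf = compute_tf(docs)
print(tf[0]["the"])  # 2
print(tf[1]["dog"])  # 1
```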

# Calculate Document Frequency (DF)
def compute_df(tf):
    """
    Calculate the document frequency for each word across all documents.

    Args:
        tf (dict): Term frequencies for each document.

    Returns:
        Counter: Document frequencies for each word.
    """
    df = Counter()
    # TODO: For each word, count the number of documents it appears in
    # HINT: Use a set of words for each document to avoid counting duplicates
    pass  # Remove this line after implementing
    return df
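The document-frequency step can be sketched as follows, assuming tf maps each document index to a Counter of word counts as in the template's compute_tf docstring. Each document contributes at most one count per word.

```python
from collections import Counter


def compute_df(tf):
    """Count, for every word, how many documents contain it at least once."""
    df = Counter()
    for word_counts in tf.values():
        # A set of the document's words ensures each word is counted once per document.
        df.update(set(word_counts))
    return df


tf = {0: Counter({"the": 2, "cat": 1}), 1: Counter({"the": 1, "dog": 1})}
df = compute_df(tf)
print(df["the"], df["cat"])  # 2 1
```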

# Calculate TF-IDF for each word
def compute_tfidf(tf, df, num_docs):
    """
    Calculate the TF-IDF score for each word in each document.

    Args:
        tf (dict): Term frequencies for each document.
        df (Counter): Document frequencies for each word.
        num_docs (int): Total number of documents.

    Returns:
        dict: TF-IDF scores for each word in each document.
    """
    tfidf = defaultdict(dict)
    # TODO: For each document and word, calculate TF-IDF score
    # TF-IDF formula: TF(word) * log(N / (1 + DF(word)))
    # HINT: Use math.log() for logarithm
    pass  # Remove this line after implementing
    return tfidf
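The formula in the template, TF(word) * log(N / (1 + DF(word))), could be applied as in the sketch below. It uses the natural logarithm via math.log, as the hint suggests, and assumes tf and df have the shapes described in the docstring.

```python
import math
from collections import Counter, defaultdict


def compute_tfidf(tf, df, num_docs):
    """Score each (document, word) pair as TF * log(N / (1 + DF))."""
    tfidf = defaultdict(dict)
    for doc_id, word_counts in tf.items():
        for word, count in word_counts.items():
            idf = math.log(num_docs / (1 + df[word]))
            tfidf[doc_id][word] = count * idf
    return tfidf


tf = {
    0: Counter({"the": 2, "cat": 1}),
    1: Counter({"the": 1, "dog": 1}),
    2: Counter({"bird": 1}),
}
df = Counter({"the": 2, "cat": 1, "dog": 1, "bird": 1})
scores = compute_tfidf(tf, df, num_docs=3)
print(scores[0]["the"])            # 0.0
print(round(scores[0]["cat"], 3))  # 0.405
```

With this smoothing, a word that appears in all but one document scores exactly zero, and a word that appears in every document gets a slightly negative score, because N / (1 + DF) drops below 1.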

# Create a word co-occurrence matrix
def create_cooccurrence_matrix(corpus, window_size=5):
    """
    Create a word co-occurrence matrix from the corpus.

    Args:
        corpus (list): Preprocessed corpus where each document is a list of words.
        window_size (int): The size of the context window.

    Returns:
        tuple: Co-occurrence matrix, word to index mapping, and index to word mapping.
    """
    # TODO: Build the vocabulary of unique words
    # HINT: Use a set to store unique words
    pass  # Remove this line after implementing

    # TODO: Initialize co-occurrence matrix
    # HINT: Use numpy to create a zero matrix of size vocab_size x vocab_size
    pass  # Remove this line after implementing

    # TODO: Fill in the co-occurrence matrix
    # HINT: For each word, consider a window of words around it
    pass  # Remove this line after implementing

    return cooccurrence_matrix, word_to_id, id_to_word
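A possible implementation of the three co-occurrence TODOs is sketched here. It assumes the window is symmetric, covering window_size tokens on each side of the target word; the template does not say whether window_size is per side or in total, so treat that as an assumption. Sorting the vocabulary is also a choice made here for reproducible indices.

```python
import numpy as np


def create_cooccurrence_matrix(corpus, window_size=5):
    """Count how often each word pair appears within window_size tokens."""
    # Sorting makes the word-to-index assignment reproducible across runs.
    vocab = sorted({word for document in corpus for word in document})
    word_to_id = {word: i for i, word in enumerate(vocab)}
    id_to_word = {i: word for word, i in word_to_id.items()}

    cooccurrence_matrix = np.zeros((len(vocab), len(vocab)))
    for document in corpus:
        for pos, word in enumerate(document):
            start = max(0, pos - window_size)
            end = min(len(document), pos + window_size + 1)
            row = word_to_id[word]
            for ctx_pos in range(start, end):
                if ctx_pos != pos:  # skip the target word itself
                    col = word_to_id[document[ctx_pos]]
                    cooccurrence_matrix[row, col] += 1
    return cooccurrence_matrix, word_to_id, id_to_word


docs = [["<s>", "the", "cat", "sat", "</s>"]]
matrix, word_to_id, id_to_word = create_cooccurrence_matrix(docs, window_size=1)
print(matrix[word_to_id["the"], word_to_id["cat"]])  # 1.0
print(matrix[word_to_id["the"], word_to_id["sat"]])  # 0.0
```

A dense matrix grows quadratically with vocabulary size, so this sketch is only practical on a subset of the corpus or a restricted vocabulary.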

# Calculate PPMI from co-occurrence matrix
def compute_ppmi(cooccurrence_matrix):
    """
    Compute the Positive Pointwise Mutual Information (PPMI) matrix from the co-occurrence matrix.

    Args:
        cooccurrence_matrix (numpy.ndarray): Co-occurrence matrix of word counts.

    Returns:
        numpy.ndarray: PPMI matrix.
    """
    # TODO: Calculate total sum of all co-occurrences
    # HINT: Use numpy.sum()
    pass  # Remove this line after implementing

    # TODO: Calculate sum over rows (word occurrence counts)
    # HINT: Use numpy.sum() with axis=1
    pass  # Remove this line after implementing

    # Initialize PPMI matrix with zeros
    ppmi_matrix = np.zeros(cooccurrence_matrix.shape)

    # TODO: Compute PPMI for each cell in the matrix
    # HINT: Use nested loops to iterate over the matrix indices
    # Remember to check if pij > 0 before computing PMI
    pass  # Remove this line after implementing

    return ppmi_matrix
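The PPMI step can be sketched with the nested loops the hint describes. PMI for a cell is log(p_ij / (p_i * p_j)), and PPMI clips negative values to zero. The base-2 logarithm is an assumption, since the template does not name a base. The sketch also uses column sums for the context word, which equal the row sums whenever the co-occurrence matrix is symmetric.

```python
import math

import numpy as np


def compute_ppmi(cooccurrence_matrix):
    """Return max(PMI, 0) for every cell of a 2-D count matrix."""
    total = np.sum(cooccurrence_matrix)
    row_sums = np.sum(cooccurrence_matrix, axis=1)
    col_sums = np.sum(cooccurrence_matrix, axis=0)

    ppmi_matrix = np.zeros(cooccurrence_matrix.shape)
    if total == 0:
        return ppmi_matrix  # no co-occurrences, nothing to score

    rows, cols = cooccurrence_matrix.shape
    for i in range(rows):
        for j in range(cols):
            p_ij = cooccurrence_matrix[i, j] / total
            if p_ij > 0:  # log(0) is undefined, so empty cells stay at 0
                p_i = row_sums[i] / total
                p_j = col_sums[j] / total
                pmi = math.log2(p_ij / (p_i * p_j))
                ppmi_matrix[i, j] = max(pmi, 0.0)
    return ppmi_matrix


counts = np.array([[4.0, 1.0], [1.0, 0.0]])
ppmi = compute_ppmi(counts)
print(ppmi[0, 0])            # 0.0 (negative PMI is clipped)
print(round(ppmi[0, 1], 3))  # 0.263
```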

# Main execution
if __name__ == "__main__":
    ...
