PROBLEM 2: Vector semantics (TF-IDF and PPMI vectors) (50 points)
1. Considering the first 1000 sentences of the Brown corpus (corpus[0:1000]), regard each sentence as a document and write a program to compute the TF-IDF for each word (20 points). Compute the TF-IDF for the words county, investigation, and produced in the first document of the corpus (5 points).
2. Considering the first 1000 sentences of the Brown corpus (corpus[0:1000]), regard each sentence as a document and write a program to compute the PPMI for each [word, context-word] pair (20 points). The context of a word is the window of words consisting of (i) five words (if available) to the left of the word and (ii) five words (if available) to the right of the word. Compute the PPMI for the three [word, context-word] pairs [expected, approve], [mentally, in], and [send, bed] (5 points).
# TF-IDF and PPMI Code Template
"""
This template will help you implement TF-IDF and PPMI calculations using the NLTK library and the Brown corpus.
You will preprocess the corpus, compute term frequencies, document frequencies, TF-IDF scores, and create
a word co-occurrence matrix to compute Positive Pointwise Mutual Information (PPMI) scores.
"""
import nltk
from nltk.corpus import brown
from collections import defaultdict, Counter
import math
import numpy as np
# Download the Brown corpus if not already downloaded
nltk.download('brown')
# Preprocess the corpus: Tokenize, lowercase, and add start/end tokens
def preprocess(corpus):
    """
    Preprocess the corpus by tokenizing, converting to lowercase, and adding
    sentence-start and sentence-end tokens.
    Args:
        corpus (list): List of sentences from the corpus.
    Returns:
        list: Preprocessed and tokenized corpus.
    """
    tokenized_corpus = []
    for sentence in corpus:
        # TODO: Implement tokenization and lowercasing
        # HINT: Use list comprehension and str.lower()
        # TODO: Add '<s>' at the start and '</s>' at the end of the sentence
        pass  # Remove this line after implementing
    return tokenized_corpus
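One possible way to fill in the TODOs above is sketched below. The `'<s>'` and `'</s>'` marker strings are an assumption (the template's docstring lost its markup); substitute whatever start/end tokens your assignment specifies. Brown corpus sentences are already lists of word strings, so "tokenization" here reduces to lowercasing each token.

```python
def preprocess(corpus):
    """Lowercase each sentence and wrap it with start/end markers."""
    tokenized_corpus = []
    for sentence in corpus:
        # Lowercase every token in the (already tokenized) sentence.
        tokens = [word.lower() for word in sentence]
        # Assumed start/end markers; adjust to your assignment's convention.
        tokenized_corpus.append(['<s>'] + tokens + ['</s>'])
    return tokenized_corpus
```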
# Calculate Term Frequency (TF)
def compute_tf(corpus):
    """
    Calculate the term frequency for each word in each document.
    Args:
        corpus (list): Preprocessed corpus where each document is a list of words.
    Returns:
        dict: Term frequencies for each document.
    """
    tf = defaultdict(Counter)
    # TODO: For each document, count the occurrences of each word
    # HINT: Use enumerate to get document index and Counter to count words
    pass  # Remove this line after implementing
    return tf
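Following the template's hint, one sketch of this function uses `enumerate` for document indices and `Counter` for raw per-document counts (raw counts are one common choice of TF; some formulations normalize by document length instead):

```python
from collections import defaultdict, Counter

def compute_tf(corpus):
    """Map each document index to a Counter of raw word counts."""
    tf = defaultdict(Counter)
    for doc_id, document in enumerate(corpus):
        # Counter tallies each word's occurrences in this document.
        tf[doc_id] = Counter(document)
    return tf
```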
# Calculate Document Frequency (DF)
def compute_df(tf):
    """
    Calculate the document frequency for each word across all documents.
    Args:
        tf (dict): Term frequencies for each document.
    Returns:
        Counter: Document frequencies for each word.
    """
    df = Counter()
    # TODO: For each word, count the number of documents it appears in
    # HINT: Use a set of words for each document to avoid counting duplicates
    pass  # Remove this line after implementing
    return df
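A sketch of the document-frequency step, assuming `tf` has the shape produced by `compute_tf` above (document index → per-word counts). Iterating over the distinct words of each document ensures a word is counted at most once per document:

```python
from collections import Counter

def compute_df(tf):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for doc_counts in tf.values():
        # The keys of a Counter are already the document's distinct words.
        for word in set(doc_counts):
            df[word] += 1
    return df
```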
# Calculate TF-IDF for each word
def compute_tfidf(tf, df, num_docs):
    """
    Calculate the TF-IDF score for each word in each document.
    Args:
        tf (dict): Term frequencies for each document.
        df (Counter): Document frequencies for each word.
        num_docs (int): Total number of documents.
    Returns:
        dict: TF-IDF scores for each word in each document.
    """
    tfidf = defaultdict(dict)
    # TODO: For each document and word, calculate TF-IDF score
    # TF-IDF formula: TF(word) * log(N / (1 + DF(word)))
    # HINT: Use math.log() for logarithm
    pass  # Remove this line after implementing
    return tfidf
# Create a word co-occurrence matrix
def create_cooccurrence_matrix(corpus, window_size=5):
    """
    Create a word co-occurrence matrix from the corpus.
    Args:
        corpus (list): Preprocessed corpus where each document is a list of words.
        window_size (int): The size of the context window.
    Returns:
        tuple: Co-occurrence matrix, word-to-index mapping, and index-to-word mapping.
    """
    # TODO: Build the vocabulary of unique words
    # HINT: Use a set to store unique words
    pass  # Remove this line after implementing
    # TODO: Initialize co-occurrence matrix
    # HINT: Use numpy to create a zero matrix of size vocab_size x vocab_size
    pass  # Remove this line after implementing
    # TODO: Fill in the co-occurrence matrix
    # HINT: For each word, consider a window of words around it
    pass  # Remove this line after implementing
    return cooccurrence_matrix, word_to_id, id_to_word
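The three TODOs above (vocabulary, zero matrix, window filling) can be sketched as follows. The window matches the problem statement: up to `window_size` words on each side, when available; boundary sentences simply get truncated windows.

```python
import numpy as np

def create_cooccurrence_matrix(corpus, window_size=5):
    """Count co-occurrences within +/- window_size positions of each word."""
    # Vocabulary of unique words across all documents (sorted for stable ids).
    vocab = sorted({word for doc in corpus for word in doc})
    word_to_id = {w: i for i, w in enumerate(vocab)}
    id_to_word = {i: w for w, i in word_to_id.items()}
    # Zero matrix of size vocab_size x vocab_size.
    cooccurrence_matrix = np.zeros((len(vocab), len(vocab)))
    for doc in corpus:
        for i, word in enumerate(doc):
            # Clamp the window to the document boundaries ("if available").
            start = max(0, i - window_size)
            end = min(len(doc), i + window_size + 1)
            for j in range(start, end):
                if j != i:
                    cooccurrence_matrix[word_to_id[word], word_to_id[doc[j]]] += 1
    return cooccurrence_matrix, word_to_id, id_to_word
```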
# Calculate PPMI from co-occurrence matrix
def compute_ppmi(cooccurrence_matrix):
    """
    Compute the Positive Pointwise Mutual Information (PPMI) matrix from the
    co-occurrence matrix.
    Args:
        cooccurrence_matrix (np.ndarray): Word co-occurrence matrix.
    Returns:
        np.ndarray: PPMI matrix.
    """
    # TODO: Estimate joint and marginal probabilities from the co-occurrence counts
    # TODO: Compute PMI = log(P(word, context) / (P(word) * P(context)))
    #       and clip negative values to zero to obtain PPMI
    pass  # Remove this line after implementing
    return ppmi_matrix
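The template cuts off here; a sketch of the PPMI computation, using the standard definition PPMI(w, c) = max(0, log(P(w, c) / (P(w) P(c)))) with probabilities estimated from the co-occurrence counts, could look like this:

```python
import numpy as np

def compute_ppmi(cooccurrence_matrix):
    """PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P(c)))) from raw counts."""
    total = cooccurrence_matrix.sum()
    # Marginal probability estimates for target and context words.
    word_probs = cooccurrence_matrix.sum(axis=1) / total
    context_probs = cooccurrence_matrix.sum(axis=0) / total
    joint = cooccurrence_matrix / total
    # Suppress log(0) warnings; zero-count cells become -inf / nan below.
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(joint / np.outer(word_probs, context_probs))
    # Zero out undefined entries, then clip negatives to get PPMI.
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)
```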
