PROBLEM 1: N-gram language models (35 points)
1. Build a bigram language model on the whole Brown corpus and calculate the probability of the sentence "The dog barked at the cat." (15 points)
Note: To calculate the probability of a sentence, the start token ('<s>') and end token ('</s>') must be included.
2. Apply Laplace smoothing (add-one smoothing) to the bigram language model and calculate the probability of the sentence "The dog barked at the cat." (10 points)
3. Predict the most probable next 5 words of the sentence prefix "I won 200" using the bigram model. (10 points)
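
For reference, these are the standard quantities involved. A bigram model scores a sentence w_1 ... w_n as

P(<s> w_1 ... w_n </s>) = P(w_1 | <s>) * P(w_2 | w_1) * ... * P(</s> | w_n)

with each factor estimated from counts as P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}). Laplace (add-one) smoothing changes the estimate to P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V), where V is the vocabulary size, so unseen bigrams receive a small non-zero probability instead of driving the whole product to zero.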
Template code:
# Bigram Language Model Template
"""
This template will help you build a bigram language model using the NLTK library.
You will preprocess the corpus, build the bigram model, calculate probabilities,
and predict the next words given a sentence prefix.
"""
import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize
from nltk.corpus import brown
from collections import defaultdict, Counter
# Download required NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('brown')
# Preprocess the corpus: Tokenize, lowercase, and add start/end tokens
def preprocess(corpus):
    """
    Preprocess the corpus by tokenizing, converting to lowercase, and adding
    '<s>' and '</s>' tokens.
    Args:
        corpus (list): List of sentences from the corpus.
    Returns:
        list: Preprocessed and tokenized corpus.
    """
    tokenized_corpus = []
    for sentence in corpus:
        # TODO: Tokenize and lowercase the sentence
        # HINT: Use a list comprehension and str.lower()
        # TODO: Add '<s>' at the start and '</s>' at the end of the sentence
        pass  # Remove this line after implementing
    return tokenized_corpus
# Build the bigram model: Create frequency distributions for unigrams and bigrams
def build_bigram_model(tokenized_corpus):
    """
    Build bigram and unigram frequency distributions.
    Args:
        tokenized_corpus (list): Preprocessed and tokenized corpus.
    Returns:
        tuple: Bigram frequencies and unigram frequencies.
    """
    bigram_freq = defaultdict(Counter)
    unigram_freq = Counter()
    for document in tokenized_corpus:
        # TODO: Update unigram frequencies
        # HINT: Use unigram_freq.update()
        # TODO: Update bigram frequencies
        # HINT: Use bigrams from nltk and update bigram_freq
        pass  # Remove this line after implementing
    return bigram_freq, unigram_freq
# Calculate bigram probability with optional smoothing
def bigram_probability(bigram_freq, unigram_freq, word1, word2, smoothing=False):
    """
    Calculate the probability of word2 given word1 using bigram frequencies.
    If smoothing is True, apply Laplace smoothing.
    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        word1 (str): The preceding word.
        word2 (str): The current word.
        smoothing (bool): Whether to apply Laplace smoothing.
    Returns:
        float: Probability of word2 given word1.
    """
    # TODO: Implement this function
    # HINT:
    # - If smoothing is True, add 1 to the bigram count and add the
    #   vocabulary size V to the unigram count in the denominator
    # - Vocabulary size V is len(unigram_freq)
    # - Handle cases where counts might be zero to avoid division by zero
    pass  # Remove this line after implementing
# Compute the probability of a sentence
def sentence_probability(bigram_freq, unigram_freq, sentence, smoothing=False):
    """
    Compute the probability of a sentence using the bigram model.
    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        sentence (str): The sentence to compute the probability for.
        smoothing (bool): Whether to apply Laplace smoothing.
    Returns:
        float: Probability of the sentence.
    """
    # TODO: Tokenize and lowercase the sentence, add start/end tokens
    # HINT: Use word_tokenize and add '<s>' and '</s>' tokens
    # TODO: Initialize the probability to 1.0, then iterate over the bigrams
    #       in the sentence and multiply their probabilities
    pass  # Remove this line after implementing
# Predict the next N words given a sentence prefix
def predict_next_words(bigram_freq, unigram_freq, sentence_prefix, N, smoothing=False):
    """
    Predict the next N words given a sentence prefix using the bigram model.
    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        sentence_prefix (str): The sentence prefix.
        N (int): Number of words to predict.
        smoothing (bool): Whether to apply Laplace smoothing.
    Returns:
        str: The predicted next N words.
    """
    # TODO: Tokenize and lowercase the sentence prefix
    # HINT: Starting from the last word of the prefix, repeatedly pick the
    #       most probable next word from bigram_freq, N times
    pass  # Remove this line after implementing
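
For reference, below is one way the template might be completed: a minimal sketch, not the only valid answer. It assumes the start/end markers are the literal strings '<s>' and '</s>', and it uses greedy decoding (always take the most frequent observed continuation) for part 3, since the problem statement does not fix a decoding strategy. Note that the unsmoothed probability in part 1 will be 0.0 if any bigram of the test sentence never occurs in Brown; that is expected, and is exactly what part 2's smoothing repairs.

import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize
from nltk.corpus import brown
from collections import defaultdict, Counter

nltk.download('punkt')
nltk.download('brown')

def preprocess(corpus):
    # Lowercase every token and wrap each sentence in '<s>' ... '</s>'.
    # brown.sents() already yields lists of tokens, so no re-tokenization
    # is needed here.
    tokenized_corpus = []
    for sentence in corpus:
        tokenized_corpus.append(['<s>'] + [w.lower() for w in sentence] + ['</s>'])
    return tokenized_corpus

def build_bigram_model(tokenized_corpus):
    # Count unigrams with a Counter, and bigrams with one Counter per first word.
    bigram_freq = defaultdict(Counter)
    unigram_freq = Counter()
    for document in tokenized_corpus:
        unigram_freq.update(document)
        for w1, w2 in bigrams(document):
            bigram_freq[w1][w2] += 1
    return bigram_freq, unigram_freq

def bigram_probability(bigram_freq, unigram_freq, word1, word2, smoothing=False):
    # Unsmoothed: C(w1 w2) / C(w1); Laplace: (C(w1 w2) + 1) / (C(w1) + V).
    bigram_count = bigram_freq[word1][word2]
    unigram_count = unigram_freq[word1]
    if smoothing:
        return (bigram_count + 1) / (unigram_count + len(unigram_freq))
    if unigram_count == 0:
        return 0.0  # avoid division by zero for an unseen history word
    return bigram_count / unigram_count

def sentence_probability(bigram_freq, unigram_freq, sentence, smoothing=False):
    # Preprocess the sentence the same way as the corpus, then multiply
    # the conditional probability of every adjacent token pair.
    tokens = ['<s>'] + [w.lower() for w in word_tokenize(sentence)] + ['</s>']
    probability = 1.0
    for w1, w2 in bigrams(tokens):
        probability *= bigram_probability(bigram_freq, unigram_freq, w1, w2, smoothing)
    return probability

def predict_next_words(bigram_freq, unigram_freq, sentence_prefix, N, smoothing=False):
    # Greedy decoding (an assumption, see the note above): starting from the
    # last word of the prefix, repeatedly take the most frequent continuation.
    current = [w.lower() for w in word_tokenize(sentence_prefix)][-1]
    predicted = []
    for _ in range(N):
        if not bigram_freq[current]:
            break  # no continuation was ever observed for this word
        current = bigram_freq[current].most_common(1)[0][0]
        if current == '</s>':
            break  # the model predicts the sentence ends here
        predicted.append(current)
    return ' '.join(predicted)

tokenized = preprocess(brown.sents())
bigram_freq, unigram_freq = build_bigram_model(tokenized)
sentence = "The dog barked at the cat."
print(sentence_probability(bigram_freq, unigram_freq, sentence))                  # part 1
print(sentence_probability(bigram_freq, unigram_freq, sentence, smoothing=True))  # part 2
print(predict_next_words(bigram_freq, unigram_freq, "I won 200", 5))              # part 3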
