Question: PROBLEM 1: N-gram language models (35 points)
1. Build a bigram language model on the whole Brown corpus and calculate the probability of the sentence "The dog barked at the cat.".
Note: To calculate the probability of a sentence, the start token and end token should be considered.
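For illustration of that note, with the start token <s> and end token </s> included (and tokens lowercased, as the template below does), the bigram model factors the sentence probability as
P(<s> the dog barked at the cat . </s>) = P(the | <s>) * P(dog | the) * P(barked | dog) * P(at | barked) * P(the | at) * P(cat | the) * P(. | cat) * P(</s> | .)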
2. Apply Laplace (add-one) smoothing to the bigram language model and calculate the probability of the sentence "The dog barked at the cat.".
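Recall that under Laplace (add-one) smoothing each bigram count is incremented by 1 and the denominator grows by the vocabulary size, so the smoothed bigram probability is
P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V),
where V is the number of distinct word types in the corpus (in the template below, V = len(unigram_freq), which includes the start and end tokens).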
3. Predict the most probable next words of the sentence prefix "I won" using the bigram model.
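As a rough sketch of what part 3 asks for (assuming the bigram_freq structure built in the template below, a defaultdict mapping each word to a Counter of the words observed after it), the most probable continuations can be read directly off the counts for the last token of the prefix; the helper name and the choice of k here are hypothetical:

def most_probable_next_words(bigram_freq, prefix_last_word, k=3):
    # bigram_freq[prefix_last_word] is a Counter of words seen after prefix_last_word;
    # most_common(k) returns the k continuations with the highest counts.
    return [word for word, count in bigram_freq[prefix_last_word].most_common(k)]

# e.g. most_probable_next_words(bigram_freq, "won", k=3) for the prefix "i won"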
Template code:
# Bigram Language Model Template
# This template will help you build a bigram language model using the NLTK library.
# You will preprocess the corpus, build the bigram model, calculate probabilities,
# and predict the next words given a sentence prefix.

import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize
from nltk.corpus import brown
from collections import defaultdict, Counter

# Download required NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('brown')
# Preprocess the corpus: tokenize, lowercase, and add start/end tokens
def preprocess_corpus(corpus):
    """
    Preprocess the corpus by tokenizing, converting to lowercase,
    and adding <s> and </s> tokens.

    Args:
        corpus (list): List of sentences from the corpus.

    Returns:
        list: Preprocessed and tokenized corpus.
    """
    tokenized_corpus = []
    for sentence in corpus:
        # TODO: Tokenize and lowercase the sentence
        # HINT: Use a list comprehension and str.lower()
        # TODO: Add <s> at the start and </s> at the end of the sentence
        pass  # Remove this line after implementing
    return tokenized_corpus
# Build the bigram model: create frequency distributions for unigrams and bigrams
def build_bigram_model(tokenized_corpus):
    """
    Build bigram and unigram frequency distributions.

    Args:
        tokenized_corpus (list): Preprocessed and tokenized corpus.

    Returns:
        tuple: Bigram frequencies and unigram frequencies.
    """
    bigram_freq = defaultdict(Counter)
    unigram_freq = Counter()
    for document in tokenized_corpus:
        # TODO: Update unigram frequencies
        # HINT: Use unigram_freq.update()
        # TODO: Update bigram frequencies
        # HINT: Use bigrams() from nltk and update bigram_freq
        pass  # Remove this line after implementing
    return bigram_freq, unigram_freq
# Calculate bigram probability with optional smoothing
def bigram_probability(bigram_freq, unigram_freq, word1, word2, smoothing=False):
    """
    Calculate the probability of word2 given word1 using bigram frequencies.
    If smoothing is True, apply Laplace smoothing.

    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        word1 (str): The preceding word.
        word2 (str): The current word.
        smoothing (bool): Whether to apply Laplace smoothing.

    Returns:
        float: Probability of word2 given word1.
    """
    # TODO: Implement this function
    # HINT:
    # If smoothing is True, add 1 to the bigram count and adjust the unigram count (add V)
    # Vocabulary size V is len(unigram_freq)
    # Handle cases where counts might be zero to avoid division by zero
    pass  # Remove this line after implementing
# Compute the probability of a sentence
def sentence_probability(bigram_freq, unigram_freq, sentence, smoothing=False):
    """
    Compute the probability of a sentence using the bigram model.

    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        sentence (str): The sentence to compute the probability for.
        smoothing (bool): Whether to apply Laplace smoothing.

    Returns:
        float: Probability of the sentence.
    """
    # TODO: Tokenize and lowercase the sentence, add start/end tokens
    # HINT: Use word_tokenize and add <s> and </s> tokens
    # Initialize the probability to 1.0
    # Iterate over the bigrams in the sentence and multiply their probabilities
    pass  # Remove this line after implementing
# Predict the next N words given a sentence prefix
def predict_next_words(bigram_freq, unigram_freq, sentence_prefix, N, smoothing=False):
    """
    Predict the next N words given a sentence prefix using the bigram model.

    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        sentence_prefix (str): The sentence prefix.
        N (int): Number of words to predict.
        smoothing (bool): Whether to apply Laplace smoothing.

    Returns:
        str: The predicted next N words.
    """
    # TODO: Tokenize and lowercase the sentence prefix
    pass  # Remove this line after implementing
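Once the TODOs above are filled in, a minimal driver for the three parts could look like the following sketch (it reuses the template's function names; the exact numbers depend on your implementation):

# Hypothetical driver, assuming the template functions above are implemented
corpus = brown.sents()  # the whole Brown corpus as lists of tokens
tokenized_corpus = preprocess_corpus(corpus)
bigram_freq, unigram_freq = build_bigram_model(tokenized_corpus)

# Part 1: unsmoothed probability of the sentence
print(sentence_probability(bigram_freq, unigram_freq, "The dog barked at the cat.", smoothing=False))

# Part 2: probability with Laplace (add-one) smoothing
print(sentence_probability(bigram_freq, unigram_freq, "The dog barked at the cat.", smoothing=True))

# Part 3: predict the next words after the prefix "I won" (N=3 is an arbitrary choice)
print(predict_next_words(bigram_freq, unigram_freq, "I won", N=3))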