PROBLEM 1: N-gram language models (35 points)
1. Build a bigram language model on the whole Brown corpus and calculate the probability of the sentence "The dog barked at the cat." (15 points)
Note: To calculate the probability of a sentence, the start token ('<s>') and end token ('</s>') must be included.
2. Apply Laplace smoothing (add-one smoothing) to the bigram language model and calculate the probability of the sentence "The dog barked at the cat." (10 points)
3. Predict the most probable next 5 words of the sentence prefix "I won 200" using the bigram model. (10 points)
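
For reference, these are the standard quantities involved. A bigram model scores a sentence w_1 ... w_n as

P(<s> w_1 ... w_n </s>) = P(w_1 | <s>) * P(w_2 | w_1) * ... * P(</s> | w_n)

with each factor estimated from counts as P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}). Laplace (add-one) smoothing changes the estimate to P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V), where V is the vocabulary size, so unseen bigrams receive a small non-zero probability instead of driving the whole product to zero.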
Template code:
# Bigram Language Model Template
"""
This template will help you build a bigram language model using the NLTK library.
You will preprocess the corpus, build the bigram model, calculate probabilities,
and predict the next words given a sentence prefix.
"""
import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize
from nltk.corpus import brown
from collections import defaultdict, Counter
# Download required NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('brown')
# Preprocess the corpus: Tokenize, lowercase, and add start/end tokens
def preprocess(corpus):
    """
    Preprocess the corpus by tokenizing, converting to lowercase, and adding
    '<s>' and '</s>' tokens.
    Args:
        corpus (list): List of sentences from the corpus.
    Returns:
        list: Preprocessed and tokenized corpus.
    """
    tokenized_corpus = []
    for sentence in corpus:
        # TODO: Tokenize and lowercase the sentence
        # HINT: Use a list comprehension and str.lower()
        # TODO: Add '<s>' at the start and '</s>' at the end of the sentence
        pass  # Remove this line after implementing
    return tokenized_corpus
# Build the bigram model: Create frequency distributions for unigrams and bigrams
def build_bigram_model(tokenized_corpus):
    """
    Build bigram and unigram frequency distributions.
    Args:
        tokenized_corpus (list): Preprocessed and tokenized corpus.
    Returns:
        tuple: Bigram frequencies and unigram frequencies.
    """
    bigram_freq = defaultdict(Counter)
    unigram_freq = Counter()
    for document in tokenized_corpus:
        # TODO: Update unigram frequencies
        # HINT: Use unigram_freq.update()
        # TODO: Update bigram frequencies
        # HINT: Use bigrams from nltk and update bigram_freq
        pass  # Remove this line after implementing
    return bigram_freq, unigram_freq
# Calculate bigram probability with optional smoothing
def bigram_probability(bigram_freq, unigram_freq, word1, word2, smoothing=False):
    """
    Calculate the probability of word2 given word1 using bigram frequencies.
    If smoothing is True, apply Laplace smoothing.
    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        word1 (str): The preceding word.
        word2 (str): The current word.
        smoothing (bool): Whether to apply Laplace smoothing.
    Returns:
        float: Probability of word2 given word1.
    """
    # TODO: Implement this function
    # HINT:
    # - If smoothing is True, add 1 to the bigram count and add the
    #   vocabulary size V to the unigram count in the denominator
    # - Vocabulary size V is len(unigram_freq)
    # - Handle cases where counts might be zero to avoid division by zero
    pass  # Remove this line after implementing
# Compute the probability of a sentence
def sentence_probability(bigram_freq, unigram_freq, sentence, smoothing=False):
    """
    Compute the probability of a sentence using the bigram model.
    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        sentence (str): The sentence to compute the probability for.
        smoothing (bool): Whether to apply Laplace smoothing.
    Returns:
        float: Probability of the sentence.
    """
    # TODO: Tokenize and lowercase the sentence, add start/end tokens
    # HINT: Use word_tokenize and add '<s>' and '</s>' tokens
    # TODO: Initialize the probability to 1.0, then iterate over the bigrams
    #       in the sentence and multiply their probabilities
    pass  # Remove this line after implementing
# Predict the next N words given a sentence prefix
def predict_next_words(bigram_freq, unigram_freq, sentence_prefix, N, smoothing=False):
    """
    Predict the next N words given a sentence prefix using the bigram model.
    Args:
        bigram_freq (dict): Bigram frequency distribution.
        unigram_freq (dict): Unigram frequency distribution.
        sentence_prefix (str): The sentence prefix.
        N (int): Number of words to predict.
        smoothing (bool): Whether to apply Laplace smoothing.
    Returns:
        str: The predicted next N words.
    """
    # TODO: Tokenize and lowercase the sentence prefix
    # HINT: Starting from the last word of the prefix, repeatedly pick the
    #       most probable next word from bigram_freq, N times
    pass  # Remove this line after implementing
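
For reference, below is one way the template might be completed: a minimal sketch, not the only valid answer. It assumes the start/end markers are the literal strings '<s>' and '</s>', and it uses greedy decoding (always take the most frequent observed continuation) for part 3, since the problem statement does not fix a decoding strategy. Note that the unsmoothed probability in part 1 will be 0.0 if any bigram of the test sentence never occurs in Brown; that is expected, and is exactly what part 2's smoothing repairs.

import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize
from nltk.corpus import brown
from collections import defaultdict, Counter

nltk.download('punkt')
nltk.download('brown')

def preprocess(corpus):
    # Lowercase every token and wrap each sentence in '<s>' ... '</s>'.
    # brown.sents() already yields lists of tokens, so no re-tokenization
    # is needed here.
    tokenized_corpus = []
    for sentence in corpus:
        tokenized_corpus.append(['<s>'] + [w.lower() for w in sentence] + ['</s>'])
    return tokenized_corpus

def build_bigram_model(tokenized_corpus):
    # Count unigrams with a Counter, and bigrams with one Counter per first word.
    bigram_freq = defaultdict(Counter)
    unigram_freq = Counter()
    for document in tokenized_corpus:
        unigram_freq.update(document)
        for w1, w2 in bigrams(document):
            bigram_freq[w1][w2] += 1
    return bigram_freq, unigram_freq

def bigram_probability(bigram_freq, unigram_freq, word1, word2, smoothing=False):
    # Unsmoothed: C(w1 w2) / C(w1); Laplace: (C(w1 w2) + 1) / (C(w1) + V).
    bigram_count = bigram_freq[word1][word2]
    unigram_count = unigram_freq[word1]
    if smoothing:
        return (bigram_count + 1) / (unigram_count + len(unigram_freq))
    if unigram_count == 0:
        return 0.0  # avoid division by zero for an unseen history word
    return bigram_count / unigram_count

def sentence_probability(bigram_freq, unigram_freq, sentence, smoothing=False):
    # Preprocess the sentence the same way as the corpus, then multiply
    # the conditional probability of every adjacent token pair.
    tokens = ['<s>'] + [w.lower() for w in word_tokenize(sentence)] + ['</s>']
    probability = 1.0
    for w1, w2 in bigrams(tokens):
        probability *= bigram_probability(bigram_freq, unigram_freq, w1, w2, smoothing)
    return probability

def predict_next_words(bigram_freq, unigram_freq, sentence_prefix, N, smoothing=False):
    # Greedy decoding (an assumption, see the note above): starting from the
    # last word of the prefix, repeatedly take the most frequent continuation.
    current = [w.lower() for w in word_tokenize(sentence_prefix)][-1]
    predicted = []
    for _ in range(N):
        if not bigram_freq[current]:
            break  # no continuation was ever observed for this word
        current = bigram_freq[current].most_common(1)[0][0]
        if current == '</s>':
            break  # the model predicts the sentence ends here
        predicted.append(current)
    return ' '.join(predicted)

tokenized = preprocess(brown.sents())
bigram_freq, unigram_freq = build_bigram_model(tokenized)
sentence = "The dog barked at the cat."
print(sentence_probability(bigram_freq, unigram_freq, sentence))                  # part 1
print(sentence_probability(bigram_freq, unigram_freq, sentence, smoothing=True))  # part 2
print(predict_next_words(bigram_freq, unigram_freq, "I won 200", 5))              # part 3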
