Q2.1: Train N-gram language model (20 pts)
Complete the following train_ngram_lm function based on the following input/output specifications. If you've done it right, you should pass the tests in the cell below.
Input:
data: the data object created in the cell above that holds the tokenized Wikitext data
order: the order of the model (i.e., the "n" in "n-gram" model). If order=3, we compute $P(c_i \mid c_{i-2} c_{i-1})$.
Output:
lm: A dictionary where the key is the history and the value is a probability distribution over the next character computed using the maximum likelihood estimate from the training data. Importantly, this dictionary should include backoff probabilities as well; e.g., for order=4, we want to store $P(c_i \mid c_{i-3} c_{i-2} c_{i-1})$ as well as $P(c_i \mid c_{i-2} c_{i-1})$, $P(c_i \mid c_{i-1})$, and $P(c_i)$.
Each key should be a single string where the characters that form the history have been concatenated. Given a key, its corresponding value should be a dictionary where each character in the vocabulary is associated with its probability of appearing after the key. For example, the entry for the history 'c1c2' should look like:
lm['c1c2'] = {'c0': 0.001, 'c1': 1e-6, 'c2': 1e-6, 'c3': 0.003, ...}
In this example, we also want to store lm['c2'] and lm[''], which contain the bigram and unigram distributions respectively.
Hint: You might find the defaultdict and Counter classes in the collections module to be helpful.
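One way to approach the function, sketched below: for each position in the data, count the next character under every history of length 0 through order-1, then normalize each history's counts into an MLE distribution. This is an illustrative sketch, not the official solution; it assumes `data` is a flat list of character tokens, which may differ from the exact data object created in the cell above.

```python
from collections import Counter, defaultdict

def train_ngram_lm(data, order):
    """Train a character-level n-gram LM with backoff distributions.

    Assumes `data` is a list of characters (tokenized Wikitext).
    Returns a dict mapping each history string (length 0..order-1)
    to a dict of next-character MLE probabilities.
    """
    counts = defaultdict(Counter)
    for i in range(len(data)):
        # Count data[i] after every history length from 0 (unigram)
        # up to order-1, so backoff distributions are stored too.
        for h in range(order):
            if i - h < 0:
                continue
            history = ''.join(data[i - h:i])
            counts[history][data[i]] += 1
    # Normalize counts into maximum likelihood probability distributions.
    lm = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        lm[history] = {ch: c / total for ch, c in counter.items()}
    return lm
```

For example, with `data = list("abab")` and `order=2`, `lm['']` is the unigram distribution (`'a'` and `'b'` each 0.5) and `lm['a']` assigns probability 1.0 to `'b'`.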
