Question (Python): I am trying to build a bigram model and to calculate the probability of word occurrence. I should:
1. Select an appropriate data structure to store bigrams.
2. Increment counts for each combination of word and previous word. This means I need to keep track of what the previous word was.
3. Compute the probability of the current word based on the previous word's count.
P(curr word | prev word) = count(prev word, curr word) / count(prev word)
Consider we observed the following word sequences:
finger remarked
finger on
finger on
finger in
finger .
Notice that "finger on" was observed twice. Also, notice that the period is treated as a separate word. Given the information in this data structure, we can compute the probability P(on | finger) as 2/5 = 0.4.
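The worked example above can be reproduced with a small sketch. It assumes a nested mapping (previous word to a Counter of the words that followed it) as the data structure; the names `observations`, `follow_counts`, and `bigram_prob` are illustrative and not part of the assignment.

```python
from collections import Counter, defaultdict

# The five observed (previous word, current word) pairs from the example.
observations = [
    ("finger", "remarked"),
    ("finger", "on"),
    ("finger", "on"),
    ("finger", "in"),
    ("finger", "."),
]

# follow_counts[prev][curr] = number of times curr followed prev.
follow_counts = defaultdict(Counter)
for prev_word, curr_word in observations:
    follow_counts[prev_word][curr_word] += 1


def bigram_prob(prev_word, curr_word):
    """Return count(prev, curr) / count(prev), or 0.0 if prev was never seen."""
    if prev_word not in follow_counts:
        return 0.0
    total = sum(follow_counts[prev_word].values())
    return follow_counts[prev_word][curr_word] / total


print(bigram_prob("finger", "on"))  # 2 / 5 = 0.4
```

Keeping the counts grouped by previous word means the denominator is just the sum of one inner Counter, so no separate unigram table is needed for this calculation.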
Here is what I got so far:
filename = 'blah-blah.txt'
bigrams = {}
unigrams = {"START": 1}  # count the start marker so it can be used as a denominator
prev_word = "START"

# opening the filename in read mode
with open(filename, "r") as fp:
    for line in fp:
        words = line.split()
        for word in words:
            word = word.lower()
            bigram = prev_word + ' ' + word
            #print(bigram)
            if word in unigrams:
                unigrams[word] += 1
            else:
                unigrams[word] = 1
            #print(unigrams[word])
            if bigram in bigrams:
                bigrams[bigram] += 1
            else:
                bigrams[bigram] = 1
            prev_word = word

output_file = 'bigram_probs.txt'
with open(output_file, "w") as fs:
    for key, value in sorted(bigrams.items()):
        # divide by the count of the bigram's first (previous) word,
        # not by the last word read from the input
        prev = key.split(' ')[0]
        prob = value / unigrams[prev]
        fs.write(key + ": " + str(prob) + "\n")
My program works, but I am not sure if it does what it should do. I appreciate any help!
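One way to check whether the output is what it should be is to run the same logic on a tiny in-memory text where the answer can be worked out by hand. The sketch below is only one possible approach: `tokenize`, `bigram_probs`, and the regular expression are hypothetical helpers, and the "START" marker follows the code above. It assumes that punctuation should be split off as separate tokens, and that each bigram count is divided by the number of times its first word appeared as a previous word.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def bigram_probs(text, start="START"):
    """Map (prev, curr) to count(prev, curr) / count(prev as a previous word)."""
    bigram_counts = Counter()
    prev_counts = Counter()
    prev_word = start
    for word in tokenize(text):
        bigram_counts[(prev_word, word)] += 1
        prev_counts[prev_word] += 1
        prev_word = word
    return {
        pair: count / prev_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


probs = bigram_probs("finger on finger on finger in finger.")
print(probs[("finger", "on")])  # 2 / 4 = 0.5
```

With this definition, the probabilities of all words that follow a given previous word sum to 1, which is a quick sanity test to run against the real output. Note that this simple pattern also splits contractions such as "don't" into several tokens.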
