Question: Python I am trying to build a bigram model and to calculate the probability of word occurrence . I should: Select an appropriate data structure

Python

I am trying to build a bigram model and to calculate the probability of word occurrence

. I should: Select an appropriate data structure to store bigrams. Increment counts for a combination of word and previous word. This means I need to keep track of what the previous word was. Compute the probability of the current word based on the previous word count.

Prob of curr word = count(prev word, curr word) / count(previous word)

Consider we observed the following word sequences:

finger remarked

finger on

finger in

finger .

Notice that "finger on " was observed twice. Also, notice that the period is treated as a separate word. Given the information in this data structure, we can compute the probability (on|finger) as 2/5 = 0.4.

Here is what I got so far:

filename = 'blah-blah.txt'

bigrams ={}

unigrams = {}

prev_word = "START"

# opening the filename in read mode

for line in fp:

words = line.split() for word in words: word = word.lower() bigram = prev_word + ' ' + word #print(bigram) if word in unigrams: unigrams[word] +=1 else: unigrams[word] =1 #print(unigrams[word]) if bigram in bigrams: bigrams[bigram] += 1 else: bigrams[bigram] = 1 prev_word = word

output_file = 'bigram_probs.txt' with open(output_file, "w") as fs: for key, value in sorted(bigrams.items()): prob = value / unigrams[word] fs.write(key + ": " + str(prob) + " ")

My program works, but I am not sure if it does what it should do. I appreciate any help!

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!

Python Step 1: Create a Unigram Model A unigram model of English consists of a single probability distribution P(W) over the set of all words. 1 . Creating the word dictionary [Coding only: save code...

Can anyone help me with this problem? Python. I have to create a dictionary of bigrams which means it should be in form [previous word, current word] Building an MLE bigram model [Coding only: save...

Python program with NLTK Objective: Use n-gram models for text analysis. Turn in: your Python programs, zipped (just your 2 programs) - There is a hm_files, a ZIP file with files that includes files...

Please use Python NLTK. Use screenshots please. Thanks in advance. Objective : Use n-gram mo dels for text analysis . Turn in: your Python programs, zipped (just your 2 programs) In this homework you...

PYTHON QUESTION...... Overview and Requirements Natural language processing (NLP) refers to computational technique involving language. It is a broad field. For this assignment, we will learn a bit...

A discrete sequence {xn} can be converted into a continuous representation x(t) = ts X n= (t n ts) xn, where ts is the sampling period. (a) State two characteristic properties of Dirac's function. [2...

A creative engineer suggests structuring the TLB so that not all the bits of the presented address need match to result in a hit. Suggest how this might be achieved, and what might be the costs and...

Briefly describe ASCII and Unicode and draw attention to any relationship between them. [3 marks] (b) Briefly explain what a Reader is in the context of reading characters from data. [3 marks] A...

Suppose that R(A, B, C) is a relational schema with functional dependencies F = {A, B C, C B}. (i) Is this schema in 3NF? Explain. [2 marks] (ii) Is this schema in BCNF? Explain. [2 marks] (b)...

Portray in words what transforms you would have to make to your execution to some degree (a) to accomplish this and remark on the benefits and detriments of this thought.You are approached to compose...

On January 1, 2016, you win $3,120,000 in the state lottery. The $3,120,000 prize will be paid in equal installments of $390,000 over 8 years. The payments will be made on December 31 of each year,...

Americans with a Better Cause (ABC), a nonprofit organization, files a suit against the U.S. Department of Justice (DOJ), claiming that a certain federal statute the DOJ is empowered to enforce,...

Which of the following is NOT a method of forecasting exchange rates? Group of answer choices using the volatility of future exchange rate movements examining current values for economic variables...

Compared with half a century ago, adoption has become _ _ _ _ _ _ _ _ _ common, but it is more open and acceptabl e , so we probably discuss it _ _ _ _ _ _ _ . fill in the blanks more or much less or...

What are Measures in OLAP Cubes?

How do OLAP Databases provide for Drilling Down into data?

How are OLAP Cubes different from Production Relational Databases?