Question: Python

Step 1: Create a Unigram Model

A unigram model of English consists of a single probability distribution P(W) over the set of all words.
1. Creating the word dictionary [Coding only: save code as problem1.py or problem1.java]

The first step in building an n-gram model is to create a dictionary (a Java Map or a Python dict) that maps each word to an index, which we'll use to access the elements corresponding to that word in a vector or matrix of counts or probabilities. You'll create this dictionary of all unique words from the given data files (select one file for training). You'll need to split each sentence (treat each line as one sentence) into a list of words and convert each word to lowercase before storing it in the dictionary.
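A minimal sketch of this step, assuming the training data is a plain-text file named train.txt (a hypothetical name; substitute your actual data file) with one sentence per line:

    # problem1.py -- build a dictionary mapping each unique word to an index.
    def build_vocab(path):
        word_to_id = {}
        with open(path, encoding="utf-8") as f:
            for line in f:                       # each line is one sentence
                for word in line.lower().split():
                    if word not in word_to_id:   # assign the next free index
                        word_to_id[word] = len(word_to_id)
        return word_to_id

    vocab = build_vocab("train.txt")
    print(len(vocab), "unique words")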
2. Building an MLE unigram model [Update problem1.py or problem1.java]

Now you'll build a simple MLE unigram model. For each word in your dictionary, assign a counter and initialize it to zero. Iterate through the sentences and increment the counter for each word they contain. Finally, to normalize your counts vector into probabilities, simply divide each word's count by the sum of all counts:

P(word) = count(word) / sum(counts of all words in the dictionary)

Write your new probabilities to a file called unigram_probs.txt.
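Continuing the sketch over the same hypothetical train.txt, the counting and normalization might look like this (collections.Counter stands in for the explicit zero-initialized counters described above):

    # problem1.py (continued) -- MLE unigram model.
    from collections import Counter

    counts = Counter()
    with open("train.txt", encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())  # increment count of each word

    total = sum(counts.values())                 # sum of all word counts
    with open("unigram_probs.txt", "w") as out:
        for word, count in counts.items():
            out.write(f"{word} {count / total}\n")  # P(word) = count / total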
Step 2: Create a Bigram Model

A bigram model of English consists of two probability distributions: P(W0) and P(Wi | Wi-1). The first distribution is the probability of the first word in a document. The second distribution is the probability of seeing word Wi given that the previous word was Wi-1.
3. Building an MLE bigram model [Coding only: save code as problem2.py or problem2.java]
Now you'll create an MLE bigram model, in much the same way as you created the MLE unigram model. I recommend writing the code again from scratch, however (except for the code initializing the mapping dictionary), so that you can test things as you go. The main differences between coding an MLE bigram model and a unigram model are:

- Select an appropriate data structure to store bigrams.
- You'll increment counts for a combination of word and previous word, so you'll need to keep track of what the previous word was.
- You'll compute the probability of the current word conditioned on the previous word:

P(curr word | prev word) = count(prev word, curr word) / count(prev word)

Suppose we observed the following bigrams: "finger remarked", "finger on", "finger on", "finger in", "finger .". Notice that "finger on" was observed twice. Also notice that the period is treated as a separate word. Given the counts in this data structure, we can compute the probability p(on|finger) as 2/5 = 0.4. Similarly, we can compute the probability p(.|finger) as 1/5 = 0.2.

When complete, add code to write 100 random probabilities (you can select the word and its context randomly) to bigram_probs.txt, one per line:

p(on|finger) = 0.4
p(.|finger) = 0.2
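A sketch of this step under the same assumptions (hypothetical train.txt, one sentence per line); a dictionary of Counters stores count(prev, curr), and count(prev) is accumulated alongside it:

    # problem2.py -- MLE bigram model (sketch).
    import random
    from collections import Counter, defaultdict

    bigram_counts = defaultdict(Counter)   # bigram_counts[prev][curr]
    prev_counts = Counter()                # count(prev) over bigram contexts

    with open("train.txt", encoding="utf-8") as f:
        for line in f:
            words = line.lower().split()
            for prev, curr in zip(words, words[1:]):
                bigram_counts[prev][curr] += 1
                prev_counts[prev] += 1

    def bigram_prob(prev, curr):
        # P(curr | prev) = count(prev, curr) / count(prev)
        return bigram_counts[prev][curr] / prev_counts[prev]

    # Write 100 randomly selected bigram probabilities, one per line.
    pairs = [(p, c) for p in bigram_counts for c in bigram_counts[p]]
    with open("bigram_probs.txt", "w") as out:
        for prev, curr in random.sample(pairs, min(100, len(pairs))):
            out.write(f"p({curr}|{prev}) = {bigram_prob(prev, curr)}\n")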
4. Add-λ smoothing the bigram model [Coding and written answer: save code as problem3.py or problem3.java]
This time, copy problem2 to problem3; we'll just be making a very small modification to the program to add smoothing. In class, we discussed smoothing in detail: add-λ smoothing, in which some amount λ is added to every bigram count. You should modify your program to use add-λ smoothing with λ = 0.1, i.e., pretending that we saw an extra one-tenth of an instance of each bigram:

counts += 0.1

Now change the program to write the same 100 probabilities as before (i.e., the bigrams listed in bigram_probs.txt) to a file called smooth_probs.txt.
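A sketch of the smoothed probability, reusing the bigram_counts and prev_counts built in the problem2 sketch above; λ appears in both the numerator and the denominator (once per possible following word) so each conditional distribution still sums to one:

    # problem3.py -- add-lambda smoothing (sketch; builds on the problem2 code).
    LAMBDA = 0.1

    vocab = set(prev_counts)
    for counter in bigram_counts.values():
        vocab.update(counter)
    V = len(vocab)                         # vocabulary size

    def smoothed_prob(prev, curr):
        # Every bigram gains an extra lambda, so the denominator grows by lambda * V.
        return (bigram_counts[prev][curr] + LAMBDA) / (prev_counts[prev] + LAMBDA * V)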
Step 3: Using the n-gram models
5. Calculating sentence probabilities

For this problem, you will use each of the three models you've constructed in problems 1-3 to evaluate the first 100 sentences of the other data file.

First, edit problem1 and add code at the bottom of the script to iterate through each of the first 100 sentences in the test data file and calculate the joint probability of all the words in the sentence under the unigram model. Then write the probability of each sentence to a file unigram_eval.txt, formatted with one probability per line. To do this, you'll update the joint probability of the sentence by multiplying the probabilities of its words together. One easy way is to initialize sentprob = 1 before looping through the words in the sentence, and then update sentprob *= wordprob with the probability of each word. At the end of the loop, sentprob will contain the total joint probability of the whole sentence.

Next, you'll transform this joint probability into a perplexity (of each sentence) and write that to the file instead. To calculate the perplexity, first calculate the length of the sentence in words (be sure to include the punctuation tokens) and store it in a variable sent_len; then compute:

perplexity = 1 / pow(joint_prob, 1.0 / sent_len)

Now write the perplexity of each sentence to the output file instead of the joint probability.

Finally, do the same thing for your other two models. Add code to problem2 to calculate the perplexity of each sentence from the test data file and write it to a file bigram_eval.txt. Similarly, add code to problem3 and write the perplexities under the smoothed model to smoothed_eval.txt. A sketch of the unigram version follows.
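A sketch of the evaluation loop under the unigram model, assuming a hypothetical test file test.txt and a hypothetical helper unigram_prob(word) that returns the probability from problem1's model; the bigram and smoothed versions differ only in how each per-word probability is computed:

    # Appended to problem1.py -- perplexity of the first 100 test sentences.
    with open("test.txt", encoding="utf-8") as f, \
         open("unigram_eval.txt", "w") as out:
        for i, line in enumerate(f):
            if i >= 100:                   # only the first 100 sentences
                break
            words = line.lower().split()   # punctuation tokens count as words
            sentprob = 1.0
            for word in words:
                sentprob *= unigram_prob(word)  # hypothetical helper from problem1
            sent_len = len(words)
            if sentprob > 0:
                perplexity = 1 / pow(sentprob, 1.0 / sent_len)
            else:
                perplexity = float("inf")  # an unseen word under an MLE model
            out.write(f"{perplexity}\n")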
