Question:
In the Pollard assignment you computed a unigram frequency distribution for the Brown corpus. You will need that for this assignment. This time you will build a bigram distribution:

>>> import nltk
>>> from nltk.corpus import brown
>>> from nltk import bigrams
>>> brown_bigrams = bigrams(brown.words())

It is instructive to compare brown.words(), which we used in the last assignment, with brown_bigrams:

>>> brown.words()[:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown_bigrams[:10]
[('The', 'Fulton'), ('Fulton', 'County'), ('County', 'Grand'), ('Grand', 'Jury'), ('Jury', 'said'), ('said', 'Friday'), ('Friday', 'an'), ('an', 'investigation'), ('investigation', 'of'), ('of', "Atlanta's")]

So brown.words() returns a list of the words, while bigrams() returns a list of word pairs. Notice that the second word of the first pair becomes the first word of the second pair, the second word of the second pair becomes the first word of the third, and so on. Since every word in Brown except the last becomes the first word of a bigram, there is exactly one more word token than there are bigram tokens:

>>> len(brown_bigrams)
1161191
>>> len(brown.words())
1161192
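One practical caveat, depending on your NLTK version: in NLTK 3.x, bigrams() returns a generator rather than a list, so the slicing and len() calls shown in the transcript only work if you materialize it first. A minimal sketch, assuming a current NLTK install with the Brown corpus downloaded:

import nltk
from nltk.corpus import brown
from nltk import bigrams

# In NLTK 3.x, bigrams() yields a generator; wrap it in list() so that
# the slicing and len() calls from the transcript above still work.
brown_words = brown.words()
brown_bigrams = list(bigrams(brown_words))

print(brown_bigrams[:10])    # first ten word pairs, as shown above
print(len(brown_bigrams))    # 1161191
print(len(brown_words))      # 1161192, exactly one more token than bigrams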
Create a new frequency distribution of the Brown bigrams. Plot the cumulative frequency distribution of the top 50 bigrams. Then do add-one smoothing on the bigrams. This requires adding one to all the bigram counts, including those that previously had count 0. You will also need to change the unigram counts appropriately. You will compute all possible bigrams using the known vocabulary, so use the keys of the unigram Brown distribution you created before to compute the set of possible bigrams. The vocabulary size from that exercise should be 49815.

Then, having added 1 to all the bigram counts, compute at least the following probabilities:

1. P(the | in) before and after smoothing (P_MLE and P_Laplace);
2. P(in | the) before and after smoothing;
3. P(said | the) before and after smoothing;
4. P(the | said) before and after smoothing.

In some cases you will need to use the unigram counts to compute these probabilities. Remember that the unigram counts must change too when smoothing. Turn in these values and the Python code you used to compute them.
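One possible way to set up the computation is sketched below; it is a sketch, not the official solution. It builds unigram and bigram frequency distributions with nltk.FreqDist, takes the vocabulary size from the unigram keys (the assignment says this should be 49815 for the earlier exercise), and applies the standard formulas P_MLE(w2 | w1) = C(w1, w2) / C(w1) and P_Laplace(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V). The names brown_unigrams, brown_bigrams_fd, p_mle, and p_laplace are illustrative, and whether to lowercase the words is an assumption carried over from however you built the Pollard distribution.

import nltk
from nltk.corpus import brown
from nltk import bigrams, FreqDist

words = brown.words()

# Unigram and bigram frequency distributions. Case handling (lowercasing
# or not) should follow whatever you did in the Pollard assignment.
brown_unigrams = FreqDist(words)
brown_bigrams_fd = FreqDist(bigrams(words))

# Cumulative frequency plot of the 50 most common bigrams (needs matplotlib).
brown_bigrams_fd.plot(50, cumulative=True)

# Vocabulary size: the keys of the unigram distribution. Per the assignment,
# this should come out to 49815 for the earlier exercise.
V = len(brown_unigrams)

def p_mle(w2, w1):
    # Unsmoothed (maximum likelihood) estimate of P(w2 | w1).
    return brown_bigrams_fd[(w1, w2)] / brown_unigrams[w1]

def p_laplace(w2, w1):
    # Add-one estimate: each of the V possible continuations of w1 gains
    # one count, so the unigram count in the denominator grows by V.
    return (brown_bigrams_fd[(w1, w2)] + 1) / (brown_unigrams[w1] + V)

for w1, w2 in [('in', 'the'), ('the', 'in'), ('the', 'said'), ('said', 'the')]:
    print(f"P({w2} | {w1}): MLE = {p_mle(w2, w1):.6g}, "
          f"Laplace = {p_laplace(w2, w1):.6g}")

The loop covers the four requested conditional probabilities, printing each one before and after smoothing.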
