Question: Create a new frequency distribution of the Brown bigrams. Plot the cumulative frequency distribution of the top 50 bigrams.
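A minimal sketch of the counting and cumulative steps, using a small made-up token list in place of brown.words() so it runs without the corpus. The helper names bigram_counts and cumulative_top are hypothetical, not part of the assignment. With NLTK available, nltk.FreqDist(brown_bigrams).plot(50, cumulative=True) should draw the corresponding curve directly.

```python
from collections import Counter
from itertools import accumulate, islice

def bigram_counts(tokens):
    """Count adjacent word pairs (the same pairs nltk.bigrams would yield)."""
    return Counter(zip(tokens, islice(tokens, 1, None)))

def cumulative_top(counts, k=50):
    """Return the k most frequent bigrams and their running total of counts."""
    top = counts.most_common(k)
    labels = [bigram for bigram, _ in top]
    running = list(accumulate(count for _, count in top))
    return labels, running

# Toy stand-in for brown.words(); hypothetical data, not the Brown corpus.
tokens = "the cat sat in the hat and the cat sat in the sun".split()
counts = bigram_counts(tokens)
labels, running = cumulative_top(counts, k=3)
print(labels)
print(running)
```

The running totals are what a cumulative plot shows on its y-axis, with the ranked bigrams along the x-axis.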

Then do add-one smoothing on the bigrams. This requires adding one to all the bigram counts, including those that previously had count 0. You will also need to change the unigram counts appropriately. Compute all possible bigrams from the known vocabulary: use the keys of the unigram Brown distribution you created before to build the set of possible bigrams. The vocabulary size from that exercise should be 49815. Then, having added 1 to all the bigram counts, you must compute at least the following probabilities:
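As an illustration of "all possible bigrams from the known vocabulary", the sketch below builds the full add-one table for a toy token list. The function name add_one_table and the toy data are assumptions made for this example.

```python
from collections import Counter
from itertools import islice, product

def add_one_table(tokens):
    """Build the full add-one bigram table over the observed vocabulary."""
    vocab = sorted(set(tokens))
    observed = Counter(zip(tokens, islice(tokens, 1, None)))
    # Every ordered pair of vocabulary words gets count + 1,
    # including pairs that never occurred (observed count 0).
    smoothed = {pair: observed[pair] + 1 for pair in product(vocab, repeat=2)}
    return vocab, observed, smoothed

# Toy stand-in for brown.words(); hypothetical data, not the Brown corpus.
tokens = "the cat sat in the hat and the cat sat in the sun".split()
vocab, observed, smoothed = add_one_table(tokens)
print(len(vocab), len(smoothed))
print(smoothed[("in", "the")], smoothed[("hat", "sun")])
```

With a vocabulary of 49815 words the full table would have 49815 × 49815 = 2,481,534,225 entries, so for Brown it is probably more practical to keep only the observed counts and treat every missing pair as having smoothed count 1.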

P(the | in) before and after smoothing (P_{\text{mle}} and P_{\text{laplace}});
P(in the) before and after smoothing;
P(said the) before and after smoothing;
P(the | said) before and after smoothing.
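One possible way to compute the four quantities is sketched below. It assumes that P(w1 w2) means the joint probability of the bigram among all bigram tokens and that P(w2 | w1) is normalized by the unigram count of w1. The function name bigram_probs and the toy tokens are hypothetical; for the assignment, brown.words() would be passed instead.

```python
from collections import Counter
from itertools import islice

def bigram_probs(tokens, w1, w2):
    """MLE and add-one estimates for P(w2 | w1) and P(w1 w2)."""
    words = list(tokens)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, islice(words, 1, None)))
    V = len(unigrams)            # vocabulary size
    N = sum(bigrams.values())    # number of bigram tokens
    c12 = bigrams[(w1, w2)]
    # Assumption: the unigram count stands in for the number of bigrams
    # starting with w1 (they differ by one only for the corpus-final word).
    c1 = unigrams[w1]
    return {
        "cond_mle": c12 / c1 if c1 else 0.0,
        "cond_laplace": (c12 + 1) / (c1 + V),
        "joint_mle": c12 / N,
        # Adding 1 to each of the V*V possible bigrams adds V*V to the total.
        "joint_laplace": (c12 + 1) / (N + V * V),
    }

# Toy stand-in for brown.words(); hypothetical data, not the Brown corpus.
tokens = "the cat sat in the hat and the cat sat in the sun".split()
for name, value in bigram_probs(tokens, "in", "the").items():
    print(name, round(value, 4))
```

The same call with ("said", "the") would cover the remaining two items in the list.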

In some cases you will need to use the unigram counts to compute these probabilities. Remember that the unigram counts must change too when smoothing.
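One way to see why the unigram counts change, written as a sketch under the assumption that c(w_1) counts the bigrams whose first word is w_1 (this equals the unigram count except for the final token of the corpus) and that V is the vocabulary size, here 49815. The starred symbols for smoothed counts are notation introduced for this illustration.

```latex
\begin{align*}
c^{*}(w_1, w_2) &= c(w_1, w_2) + 1 \\
c^{*}(w_1) &= \sum_{w_2 \in \mathcal{V}} \bigl(c(w_1, w_2) + 1\bigr) = c(w_1) + V \\
P_{\text{mle}}(w_2 \mid w_1) &= \frac{c(w_1, w_2)}{c(w_1)} \\
P_{\text{laplace}}(w_2 \mid w_1) &= \frac{c^{*}(w_1, w_2)}{c^{*}(w_1)} = \frac{c(w_1, w_2) + 1}{c(w_1) + V}
\end{align*}
```

Each word can be followed by any of the V vocabulary words, so adding one to every bigram count raises each unigram (context) count by V.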

To start this assignment, download the Brown corpus.

[3] import nltk
    nltk.download('brown')

    [nltk_data] Downloading package brown to /root/nltk_data...
    [nltk_data]   Package brown is already up-to-date!
    True

Background

In the Pollard assignment you computed a unigram frequency distribution for the Brown corpus. You will need that for this assignment. This time you will do a bigram distribution:

[4] import nltk
    from nltk.corpus import brown
    from nltk import bigrams
    brown_bigrams = list(bigrams(brown.words()))

It is instructive to compare brown.words(), which we used in the last assignment, with brown_bigrams:

[5] brown.words()[:10]

    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']

[6] brown_bigrams[:10]

    [('The', 'Fulton'), ('Fulton', 'County'), ('County', 'Grand'), ('Grand', 'Jury'), ('Jury', 'said'), ('said', 'Friday'), ('Friday', 'an'), ('an', 'investigation'), ('investigation', 'of'), ('of', "Atlanta's")]

So brown.words() returns the words of the corpus in order, while brown_bigrams is a list of word pairs. Notice that the second word of the first pair becomes the first word of the second pair, the second word of the second pair becomes the first word of the third, and so on. Since every word in Brown except the last becomes the first word of a bigram, there is exactly one more word token than there are bigram tokens.
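The token-count relationship can be checked on the ten words shown in the excerpt, without downloading the corpus. The helper bigrams_of is a hypothetical stand-in that pairs adjacent tokens in the same way nltk.bigrams does.

```python
from itertools import islice

def bigrams_of(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, islice(tokens, 1, None)))

# The first ten Brown tokens as listed in the excerpt.
tokens = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said',
          'Friday', 'an', 'investigation', 'of']
pairs = bigrams_of(tokens)

# One fewer bigram than tokens: the last word never starts a pair.
print(len(tokens), len(pairs))

# The second word of each pair is the first word of the next pair.
chained = all(a[1] == b[0] for a, b in zip(pairs, pairs[1:]))
print(chained)
```

On the full corpus the same check would be len(brown.words()) == len(brown_bigrams) + 1.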
