Question: This section of the homework will walk you through coding a Naive Bayes classifier that can distinguish between positive and negative reviews (at some level of accuracy).

Question 2.1 (5 pts) To start, implement the update_model function in hw_1.py. Make sure to read the function comments so you know what to update. Also review the NaiveBayes class variables in the def __init__ method of the NaiveBayes class to get a sense of which statistics are important to keep track of. Once you have implemented update_model, run the train_model function using the code below. You'll need to provide the path to the dataset you downloaded to run the code.

In [ ]:

nb = NaiveBayes(PATH_TO_DATA, tokenizer=tokenize_doc)
nb.train_model()
if len(nb.vocab) == 252165:
    print "Great! The vocabulary size is {}".format(252165)
else:
    print "Oh no! Something seems off. Double check your code before continuing. Maybe a mistake in update_model?"
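The statistics update_model maintains can be sketched as follows. This is a minimal illustration with assumed attribute names (class_total_doc_counts, class_total_word_counts, class_word_counts); use the names actually defined in hw_1.py's __init__, which may differ:

```python
from collections import defaultdict

POS_LABEL = 'pos'
NEG_LABEL = 'neg'

class NaiveBayesSketch:
    """Sketch of the count statistics update_model must maintain.
    Attribute names here are assumptions, not the hw_1.py API."""
    def __init__(self):
        # number of training documents seen per class
        self.class_total_doc_counts = {POS_LABEL: 0.0, NEG_LABEL: 0.0}
        # total number of tokens seen per class
        self.class_total_word_counts = {POS_LABEL: 0.0, NEG_LABEL: 0.0}
        # per-class mapping word -> count
        self.class_word_counts = {POS_LABEL: defaultdict(float),
                                  NEG_LABEL: defaultdict(float)}
        # every distinct word seen in any class
        self.vocab = set()

    def update_model(self, bow, label):
        """Update counts from one document's bag of words.
        bow: dict mapping word -> count for that document."""
        self.class_total_doc_counts[label] += 1
        for word, count in bow.items():
            self.class_word_counts[label][word] += count
            self.class_total_word_counts[label] += count
            self.vocab.add(word)

nb = NaiveBayesSketch()
nb.update_model({'great': 2, 'film': 1}, POS_LABEL)
nb.update_model({'boring': 1, 'film': 1}, NEG_LABEL)
print(len(nb.vocab))  # 3 distinct words seen so far
```

If these counters are updated for every document during training, the vocabulary-size check above will pass on the full dataset.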

Exploratory analysis

Let's begin to explore the count statistics stored by the update_model function. Use the provided top_n function to find the top 10 most common words in the positive class and the top 10 most common words in the negative class. You don't have to code anything to do this.

In [ ]:

print "TOP 10 WORDS FOR CLASS " + POS_LABEL + ":"
for tok, count in nb.top_n(POS_LABEL, 10):
    print '', tok, count
print ''
print "TOP 10 WORDS FOR CLASS " + NEG_LABEL + ":"
for tok, count in nb.top_n(NEG_LABEL, 10):
    print '', tok, count
print ''
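For intuition, a function like top_n can be implemented by sorting a class's word counts in descending order. This is a hedged sketch with made-up toy counts, not the provided implementation:

```python
# Toy stand-in for the per-class word counts the model accumulates
class_word_counts = {
    'pos': {'the': 50, 'great': 12, 'film': 30},
    'neg': {'the': 48, 'boring': 9, 'film': 28},
}

def top_n(label, n):
    """Return the n most frequent (word, count) pairs for a class."""
    counts = class_word_counts[label]
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_n('pos', 2))  # [('the', 50), ('film', 30)]
```

Note that even in this toy example the most frequent words ('the', 'film') appear near the top of both classes, which foreshadows the next question.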

Question 2.2 (5 points)

Will the top 10 words of the positive/negative classes help discriminate between the two classes? Do you imagine that processing other English text will result in a similar phenomenon?

Answer in one or two sentences here.

Question 2.3 (5 pts)

The Naive Bayes model assumes that all features are conditionally independent given the class label. For our purposes, this means that the probability of seeing a particular word in a document with class label y is independent of the rest of the words in that document. Implement the p_word_given_label function. This function calculates P (w|y) (i.e., the probability of seeing word w in a document given the label of that document is y).
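The quantity to compute is the maximum-likelihood estimate P(w|y) = count(w, y) / (total tokens in class y). A minimal sketch with toy counts (the dictionary names are assumptions, standing in for the statistics your class tracks):

```python
# Toy per-class counts standing in for the model's accumulated statistics
class_word_counts = {'pos': {'fantastic': 4, 'film': 10},
                     'neg': {'fantastic': 1, 'film': 9, 'boring': 6}}
class_total_word_counts = {'pos': 14.0, 'neg': 16.0}

def p_word_given_label(word, label):
    """Maximum-likelihood estimate of P(w|y):
    count of w in class y divided by total tokens in class y.
    (No smoothing yet, so unseen words get probability 0.)"""
    return class_word_counts[label].get(word, 0.0) / class_total_word_counts[label]

print(p_word_given_label('fantastic', 'pos'))  # 4/14, about 0.2857
```

Dividing by the class's total token count (not its document count) is what makes the per-class word probabilities sum to 1.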

Use your p_word_given_label function to compute the probability of seeing the word fantastic given each sentiment label. Repeat the computation for the word boring.

In [ ]:

print "P('fantastic'|pos):", nb.p_word_given_label("fantastic", POS_LABEL)
print "P('fantastic'|neg):", nb.p_word_given_label("fantastic", NEG_LABEL)
print "P('boring'|pos):", nb.p_word_given_label("boring", POS_LABEL)
print "P('boring'|neg):", nb.p_word_given_label("boring", NEG_LABEL)

Which word has a higher probability given the positive class, fantastic or boring? Which word has a higher probability given the negative class? Is this what you would expect?

Answer in one or two sentences here.
