Question: Training the Naive Bayes classifier and classifying new documents
Training the Naive Bayes classifier and classifying new documents
Write a Python function that uses a training set of documents to estimate the probabilities in the Naive Bayes model. Return some data structure containing the probabilities. The input parameter of this function should be a list of documents with sentiment labels, i.e. a list of pairs like train_docs above. It could look something like this:
def train_nb(training_documents): ...
(return the data you need to classify new instances)
Note: When you are calculating the feature probabilities you must apply Laplace smoothing. Try different values and discuss the results you get with different smoothing parameters.
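A minimal sketch of what such a training function might look like is given here. It is one possible design, not the required one. It assumes that each training item is a (label, tokens) pair; swap the unpacking order if your train_docs stores the tokens first. The parameter name alpha and the dictionary keys "log_prior", "log_likelihood" and "vocabulary" are made up for this illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(training_documents, alpha=1.0):
    """Estimate Naive Bayes parameters with Laplace (add-alpha) smoothing.

    Assumption: each item in training_documents is a (label, tokens) pair.
    Returns a dict of log probabilities (hypothetical layout).
    """
    if alpha <= 0:
        # With alpha == 0 an unseen (word, class) pair would give log(0).
        raise ValueError("alpha must be positive")

    label_counts = Counter()             # number of documents per class
    token_counts = defaultdict(Counter)  # token frequencies per class
    vocabulary = set()

    for label, tokens in training_documents:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)

    n_docs = sum(label_counts.values())
    vocab_size = len(vocabulary)

    # log P(class): relative frequency of each class in the training set
    log_prior = {c: math.log(n / n_docs) for c, n in label_counts.items()}

    # log P(word | class) = log((count + alpha) / (total + alpha * |V|))
    log_likelihood = {}
    for c in label_counts:
        denominator = sum(token_counts[c].values()) + alpha * vocab_size
        log_likelihood[c] = {
            w: math.log((token_counts[c][w] + alpha) / denominator)
            for w in vocabulary
        }

    return {
        "log_prior": log_prior,
        "log_likelihood": log_likelihood,
        "vocabulary": vocabulary,
    }
```

To compare smoothing parameters, you could call train_nb(train_docs, alpha=a) for a few values of a (for instance 0.01, 0.1, 1 and 10) and evaluate each resulting classifier on held-out data.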
Then write a Python function that classifies a new document. The inputs are 1) the probabilities returned by the first function; 2) the document to classify, which is a list of tokens.
def classify_nb(classifier_data, document):
    ...
    return result
result should be a tuple where the first element is the predicted class label, and the second element is the probability estimate for that class label.
Note that the function above should return a normalized probability, i.e. the probability estimates for pos and neg should sum to 1. You also need to think about numeric problems in this function. If you multiply many small probabilities, you may run into problems with numeric precision: the product underflows to zero. To handle this problem, I recommend that you compute the logarithms of the probabilities instead of the probabilities themselves. To compute the logarithm in Python, use the function log in the math library.
Note that log(P1 * P2) = log(P1) + log(P2), so if you use log probabilities you will sum them instead of multiplying.
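The log-space computation and the normalization described above could be sketched as follows. This is an illustration under assumptions, not the reference solution: it assumes that classifier_data is a dictionary with the made-up keys "log_prior" (mapping each label to log P(label)) and "log_likelihood" (mapping each label to a dictionary of log P(token | label)). Tokens that never occurred in training are simply skipped, which is one common convention. The normalization uses the log-sum-exp trick, so the scores are never converted back to raw probabilities before the largest one has been subtracted.

```python
import math


def classify_nb(classifier_data, document):
    """Return (predicted_label, normalized_probability) for a list of tokens.

    Assumption: classifier_data has the hypothetical keys "log_prior" and
    "log_likelihood" described in the text above.
    """
    log_prior = classifier_data["log_prior"]
    log_likelihood = classifier_data["log_likelihood"]

    # Sum log probabilities instead of multiplying probabilities.
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for token in document:
            # Skip tokens that were never seen during training.
            if token in log_likelihood[label]:
                score += log_likelihood[label][token]
        scores[label] = score

    best_label = max(scores, key=scores.get)
    best_score = scores[best_label]

    # Log-sum-exp: subtracting the largest score before exponentiating
    # keeps every exponent <= 0, so the sum is at least 1 and never 0.
    log_total = best_score + math.log(
        sum(math.exp(s - best_score) for s in scores.values())
    )
    return best_label, math.exp(best_score - log_total)
```

Because the returned probability is the best class's share of the total, the estimates for pos and neg sum to 1 by construction.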
Evaluating the classifier
We will evaluate the classifier carefully in the second assignment. In this assignment, we just compute the accuracy, i.e. the number of correctly classified documents divided by the total number of documents. Write a function that classifies each document in the test set, compares each label to the gold-standard label, and returns the accuracy.
def classify_nb(classifier_data, document): ... (return the guess of the classifier)
What accuracy do you get when evaluating the classifier on the test set?
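The accuracy computation described above might be sketched like this. The function name accuracy_nb and its classify parameter are hypothetical; classify stands in for your own classify_nb, so that the sketch is self-contained. As before, it assumes that each test item is a (label, tokens) pair.

```python
def accuracy_nb(classifier_data, test_documents, classify):
    """Return the fraction of test documents that are classified correctly.

    Assumptions: each item in test_documents is a (gold_label, tokens) pair,
    and classify(classifier_data, tokens) returns (label, probability).
    """
    if not test_documents:
        raise ValueError("test_documents must not be empty")

    correct = 0
    for gold_label, tokens in test_documents:
        predicted_label, _ = classify(classifier_data, tokens)
        if predicted_label == gold_label:
            correct += 1
    return correct / len(test_documents)
```

With your own functions in place, a call would look something like accuracy_nb(classifier_data, test_docs, classify_nb).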
Error analysis
Find a few misclassified documents and comment on why you think they were hard to classify.
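One way to collect candidates for the error analysis is sketched below. The helper find_misclassified and its parameters are invented for this illustration, and it again assumes (label, tokens) pairs and a classify function that returns (label, probability). It lists the errors the classifier was most confident about first, since those are often the most informative ones to inspect.

```python
def find_misclassified(classifier_data, test_documents, classify, limit=5):
    """Return up to `limit` misclassified documents, most confident first.

    Each returned item is (probability, gold_label, predicted_label, tokens).
    Assumption: test_documents holds (gold_label, tokens) pairs.
    """
    errors = []
    for gold_label, tokens in test_documents:
        predicted_label, probability = classify(classifier_data, tokens)
        if predicted_label != gold_label:
            errors.append((probability, gold_label, predicted_label, tokens))

    # Sort by the classifier's confidence in its (wrong) prediction.
    errors.sort(key=lambda item: item[0], reverse=True)
    return errors[:limit]
```

Printing the tokens of each returned document next to its gold and predicted labels makes it easier to spot patterns such as negation, sarcasm, or reviews that mix positive and negative remarks.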
