Training the Naive Bayes classifier and classifying new documents

Write a Python function that uses a training set of documents to estimate the probabilities in the Naive Bayes model. Return some data structure containing the probabilities. The input parameter of this function should be a list of documents with sentiment labels, i.e. a list of pairs like train_docs above. It could look something like this:

def train_nb(training_documents): ...

(return the data you need to classify new instances)

Note: When you are calculating the feature probabilities you must apply Laplace smoothing. Try different values and discuss the results you get with different smoothing parameters.
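A minimal sketch of the training step, under stated assumptions: each training item is taken to be a (tokens, label) pair, the smoothing parameter is exposed as a hypothetical keyword argument alpha, and the returned dictionary layout ("log_priors", "log_likelihoods", "vocabulary") is just one possible choice of data structure, not something the assignment prescribes.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(training_documents, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods.

    Assumption: each item in training_documents is a (tokens, label) pair.
    alpha is the smoothing parameter; alpha=1.0 is add-one smoothing.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive, otherwise log(0) can occur")

    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # token counts per label
    vocabulary = set()
    for tokens, label in training_documents:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    n_docs = sum(label_counts.values())
    log_priors = {}
    log_likelihoods = {}
    for label, n_label in label_counts.items():
        log_priors[label] = log(n_label / n_docs)
        # Laplace smoothing: add alpha to every count, and alpha * |V|
        # to the denominator so the probabilities still sum to 1.
        denominator = sum(word_counts[label].values()) + alpha * len(vocabulary)
        log_likelihoods[label] = {
            word: log((word_counts[label][word] + alpha) / denominator)
            for word in vocabulary
        }

    return {
        "log_priors": log_priors,
        "log_likelihoods": log_likelihoods,
        "vocabulary": vocabulary,
    }
```

To compare smoothing parameters, one option is to call train_nb with several values of alpha (for example 0.01, 0.1, 1 and 10) and evaluate each resulting classifier on the same held-out data.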

Then write a Python function that classifies a new document. The inputs are 1) the probabilities returned by the first function; 2) the document to classify, which is a list of tokens.

def classify_nb(classifier_data, document):
    ...
    return result

result should be a tuple where the first element is the predicted class label, and the second element is the probability estimate for that class label.

Note that the function above should return a normalized probability, i.e. the probability estimates for pos and neg should sum to 1. You also need to think about numerical problems in this function. If you multiply many small probabilities, you may run into problems with numeric precision: the product underflows to zero. To handle this problem, I recommend that you compute the logarithms of the probabilities instead of the probabilities themselves. To compute the logarithm in Python, use the function log in the math library.
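The precision problem can be reproduced in a few lines. The snippet below is only an illustration: the 200 probabilities of 0.01 are made-up values, chosen so that the true product (1e-400) is smaller than the smallest positive Python float.

```python
from math import log

# 200 hypothetical word probabilities, each equal to 0.01.
probabilities = [0.01] * 200

# Multiplying them directly underflows: the result is stored as 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Summing the logarithms keeps a usable finite number instead.
log_sum = sum(log(p) for p in probabilities)

print(product)  # 0.0
print(log_sum)  # roughly -921.03
```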

Note that log(P1 * P2) = log(P1) + log(P2), so if you use log probabilities you will sum them instead of multiplying.
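One way the classification function could look is sketched below. It assumes classifier_data is a dictionary with the keys "log_priors" and "log_likelihoods" (a layout chosen here for illustration), skips tokens that were never seen in training (one common convention, not the only one), and normalizes the result in log space with the log-sum-exp trick. The toy_data values are hand-picked illustrative numbers, not trained estimates.

```python
from math import exp, log


def classify_nb(classifier_data, document):
    """Return (predicted_label, normalized_probability) for a list of tokens."""
    log_priors = classifier_data["log_priors"]
    log_likelihoods = classifier_data["log_likelihoods"]

    # Sum log probabilities instead of multiplying probabilities.
    scores = {}
    for label, log_prior in log_priors.items():
        score = log_prior
        for token in document:
            # Assumption: tokens unseen in training are simply ignored.
            if token in log_likelihoods[label]:
                score += log_likelihoods[label][token]
        scores[label] = score

    # Normalize with log-sum-exp: subtracting the largest score before
    # calling exp() keeps the best class from underflowing to zero.
    best_label = max(scores, key=scores.get)
    max_score = scores[best_label]
    log_total = max_score + log(sum(exp(s - max_score) for s in scores.values()))
    return best_label, exp(scores[best_label] - log_total)


# Hand-built toy data in the assumed format (illustrative values only).
toy_data = {
    "log_priors": {"pos": log(0.5), "neg": log(0.5)},
    "log_likelihoods": {
        "pos": {"good": log(0.4), "bad": log(0.2), "great": log(0.4)},
        "neg": {"good": log(0.25), "bad": log(0.5), "great": log(0.25)},
    },
}
label, prob = classify_nb(toy_data, ["good", "great", "unseen"])
print(label, round(prob, 3))
```

Because the normalization happens after subtracting the maximum score, the function still returns a valid probability for documents with thousands of tokens, where the unnormalized product would be far below floating-point range.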

Evaluating the classifier

We will evaluate the classifier carefully in the second assignment. In this assignment, we just compute the accuracy, i.e. the number of correctly classified documents divided by the total number of documents. Write a function that classifies each document in the test set, compares each label to the gold-standard label, and returns the accuracy.

def classify_nb(classifier_data, document): ... (return the guess of the classifier)
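The accuracy computation described above could be sketched as follows. The function name evaluate_nb and the extra classify parameter are assumptions made so the example is self-contained; in the assignment, classify would be your own classify_nb. Evaluation documents are assumed to be (tokens, gold_label) pairs.

```python
def evaluate_nb(classifier_data, evaluation_documents, classify):
    """Return the fraction of documents whose predicted label is correct.

    classify is any function with the classify_nb interface:
    classify(classifier_data, tokens) -> (label, probability).
    Assumption: evaluation_documents is a list of (tokens, gold_label) pairs.
    """
    if not evaluation_documents:
        raise ValueError("evaluation set is empty")

    correct = 0
    for tokens, gold_label in evaluation_documents:
        predicted_label, _ = classify(classifier_data, tokens)
        if predicted_label == gold_label:
            correct += 1
    return correct / len(evaluation_documents)
```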

What accuracy do you get when evaluating the classifier on the test set?

Error analysis

Find a few misclassified documents and comment on why you think they were hard to classify.
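A small helper can make it easier to find the misclassified documents. This is only a sketch: the name find_misclassified and the classify parameter are hypothetical, and documents are again assumed to be (tokens, gold_label) pairs. Sorting by the classifier's confidence puts the most confident mistakes first, which is one reasonable place to start reading.

```python
def find_misclassified(classifier_data, evaluation_documents, classify):
    """Collect (tokens, gold, predicted, probability) for each wrong prediction.

    classify is any function with the classify_nb interface:
    classify(classifier_data, tokens) -> (label, probability).
    """
    errors = []
    for tokens, gold_label in evaluation_documents:
        predicted_label, probability = classify(classifier_data, tokens)
        if predicted_label != gold_label:
            errors.append((tokens, gold_label, predicted_label, probability))

    # Most confident mistakes first.
    errors.sort(key=lambda error: error[3], reverse=True)
    return errors
```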
