Question:

List the code and output.

1. Import the Python nltk and random packages. Load the movie_reviews corpus (1000 positive files and 1000 negative files) from nltk. How many words are there in this corpus? What are the two movie review categories? For more details about this corpus, run movie_reviews.readme().

2. Create a Python list named documents. Each list element contains the words used in a movie review and the review's category. Randomly shuffle the list.

3. Create a list named word_features that contains the 2000 most frequent words in the overall corpus. These 2000 words should not include stop words or punctuation marks.

4. Define the document_features function that shows whether each review file contains each of the 2000 most frequent words. Apply the function to each element of the documents list, and then create a list named featuresets that combines each review file's document features with its category.

5. Split featuresets into test_set (the first 100 review files) and train_set (the other 1900 review files). Apply nltk's NaiveBayesClassifier to the train_set. What is the trained model's out-of-sample prediction accuracy for the test_set? Show the top 15 most informative words for the Naive Bayes classifier.

6. Use twenty-fold cross-validation to show how the Naive Bayes classifier performs over different subsets of featuresets. Display the twenty out-of-sample prediction accuracy rates and the overall prediction accuracy (i.e., the average of the twenty accuracy rates).
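Tasks 1 and 2 could be approached along the following lines. This is a minimal sketch rather than a verified solution: build_documents and load_movie_review_documents are hypothetical helper names, the optional seed argument is an addition for repeatable shuffles, and the nltk calls assume the movie_reviews corpus can be downloaded. No output is listed here because it has to come from running the code locally.

```python
import random


def build_documents(corpus, seed=None):
    """Pair each review's word list with its category, then shuffle the pairs."""
    documents = [
        (list(corpus.words(fileid)), category)
        for category in corpus.categories()
        for fileid in corpus.fileids(category)
    ]
    # seed is an assumption added for repeatability; leave it as None for a fresh shuffle.
    random.Random(seed).shuffle(documents)
    return documents


def load_movie_review_documents():
    """Tasks 1 and 2 on the real corpus; needs nltk and a one-time corpus download."""
    import nltk  # imported lazily so build_documents stays usable without nltk
    from nltk.corpus import movie_reviews

    nltk.download("movie_reviews", quiet=True)
    print("Number of words:", len(movie_reviews.words()))
    print("Categories:", movie_reviews.categories())
    return build_documents(movie_reviews)
```

Calling documents = load_movie_review_documents() prints the word count and the two categories, then returns the shuffled list of (words, category) pairs.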
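For tasks 3 and 4, one possible reading is sketched below. The helper name build_word_features is hypothetical, and two choices here are assumptions: a token counts as punctuation when every character in it is a punctuation mark, and feature keys follow a "contains(word)" naming pattern. The stop word set is passed in by the caller, so the functions themselves do not depend on nltk.

```python
import string
from collections import Counter


def build_word_features(words, stop_words, n=2000):
    """Return the n most frequent words, skipping stop words and punctuation tokens."""
    counts = Counter()
    for word in words:
        token = word.lower()
        if token in stop_words:
            continue
        # Assumption: a token made up only of punctuation characters is a punctuation mark.
        if all(ch in string.punctuation for ch in token):
            continue
        counts[token] += 1
    return [token for token, _ in counts.most_common(n)]


def document_features(document, word_features):
    """Map each feature word to True or False depending on whether the review contains it."""
    document_words = set(document)  # set membership is much faster than list membership
    return {f"contains({word})": (word in document_words) for word in word_features}
```

With nltk installed and the stopwords corpus downloaded, word_features = build_word_features(movie_reviews.words(), set(stopwords.words("english"))) followed by featuresets = [(document_features(d, word_features), c) for (d, c) in documents] would produce the two lists the tasks ask for.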
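Tasks 5 and 6 might be sketched as follows. The names k_fold_accuracies and evaluate_naive_bayes are hypothetical, and the folds are taken as equal contiguous slices of the already shuffled featuresets, which is one simple way to do twenty-fold cross-validation rather than the only one. Accuracy values are not listed because they vary with the shuffle.

```python
def k_fold_accuracies(featuresets, train_fn, accuracy_fn, k=20):
    """Hold out each of k equal, contiguous folds once and return the k accuracies."""
    fold_size = len(featuresets) // k  # any remainder items stay in every training set
    accuracies = []
    for i in range(k):
        start, end = i * fold_size, (i + 1) * fold_size
        test_fold = featuresets[start:end]
        train_folds = featuresets[:start] + featuresets[end:]
        model = train_fn(train_folds)
        accuracies.append(accuracy_fn(model, test_fold))
    return accuracies


def evaluate_naive_bayes(featuresets):
    """Tasks 5 and 6 with nltk; featuresets is the shuffled list from task 4."""
    import nltk  # imported lazily so k_fold_accuracies works without nltk

    # Task 5: first 100 reviews for testing, the remaining reviews for training.
    test_set, train_set = featuresets[:100], featuresets[100:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print("Test accuracy:", nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(15)

    # Task 6: twenty-fold cross-validation and the mean of the twenty accuracies.
    accuracies = k_fold_accuracies(
        featuresets, nltk.NaiveBayesClassifier.train, nltk.classify.accuracy, k=20
    )
    for fold, acc in enumerate(accuracies, start=1):
        print(f"Fold {fold:2d}: {acc:.3f}")
    print("Mean accuracy:", sum(accuracies) / len(accuracies))
    return accuracies
```

With 2000 feature sets and k=20, each fold holds 100 reviews, so every review is used for testing exactly once and for training in the other nineteen rounds.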
