Question:

The zip file text.zip contains the following texts, downloaded from Project Gutenberg and the Internet Archive. The texts haven't been vetted and so might contain errors, objectionable material, etc.

Test Files
Alcott-Lousia May-Little Women.txt
Austen-Jane-Sense and Sensibility.txt
Bronte Charlotte-Villette.txt
Conrad-Joseph-Heart of Darkness.txt
Dickens-Charles David Copperfield.txt

Training Files
Alcott-Lousia May-Eight Cousins.txt
Alcott-Lousia May-Jos Boys.txt
Alcott-Lousia May-Little Men.txt
Austen-Jane-Mansfield Park.txt
Austen-Jane-Northanger Abbey.txt
Austen-Jane-Persuasion.txt
Austen-Jane-Pride and Prejudice.txt
Bronte-Charlotte-Jane Eyre.txt
Bronte-Charlotte Professor.txt
Bronte-Charlotte-Shirley.txt
Bronte-Charlotte-Villette.txt
Conrad-Joseph-Lord Jim.txt
Conrad-Joseph-Nostromo.txt
Conrad-Joseph-Secret Agent.txt
Conrad-Joseph-Secret Sharer.txt
Dickens-Charles-Bleak House.txt
Dickens-Charles-Christmas Carol.txt
Dickens-Charles-Hard Times.txt
Dickens-Charles-Life And Adventures Of Nicholas Nickleby.ext
Dickens-Charles-Pickwick Papers.txt

For this assignment, use Python and sklearn.naive_bayes to create classifiers to identify a text's author. Specifically, write routines to tokenize the text, breaking each work into groups of 500 tokens (use NLTK's word_tokenize to perform the tokenization). Generate training and testing sets by associating the author of the work with each sequence of 500 tokens. The prefix of the file name encodes the last and first name of the author.

Afterwards, create classifiers using both MultinomialNB and BernoulliNB (see https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html) to map the token sequences to the author of the original text. Train and test the classifiers on the texts indicated above. Report the accuracy and generate a confusion matrix for each classifier.

Experiment further, generating two additional classifiers. Can you improve the accuracy at all on the test set by, e.g., removing stop words, or re-sampling the training set to balance the classes? Report the accuracy and generate a confusion matrix for your modified classifiers.

Ensure your program can be run simply by invoking test(directory), where directory is the name of the directory holding the original text files. Your output should look something like the output shown on the next page (don't assume the numbers shown are reasonable).

[Sample output: for each classifier, a 5x5 confusion matrix with rows and columns labeled Alcott, Austen, Bronte, Conrad, Dickens, followed by an accuracy line, e.g. Multinomial=0.8994 and Bernoulli=0.9635]
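The core steps described in the question (label each work by the author prefix of its file name, cut it into 500-token groups, then fit a Naive Bayes model on token counts) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the helper names (author_from_filename, chunk_tokens, nltk_tokenize, build_dataset, evaluate) are hypothetical, a trailing group shorter than 500 tokens is dropped, and the author label is taken as the text before the first hyphen or space because some listed file names use a space where others use a hyphen. The question does not say how the training and test files are laid out inside the directory, so the caller is assumed to supply the two file lists.

```python
import re
from pathlib import Path


def author_from_filename(filename):
    """Return the author's last name: the text before the first '-' or space."""
    return re.split(r"[-\s]", Path(filename).stem, maxsplit=1)[0]


def chunk_tokens(tokens, size=500):
    """Split a token list into consecutive groups of `size` tokens.

    Assumption: a trailing group shorter than `size` is dropped; the
    assignment does not say how to treat the remainder.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def nltk_tokenize(text):
    # Imported lazily so the helpers above work without NLTK installed.
    # word_tokenize needs NLTK's tokenizer data to be downloaded first.
    from nltk.tokenize import word_tokenize
    return word_tokenize(text)


def build_dataset(paths, tokenize=nltk_tokenize, size=500):
    """Return (chunks, labels): one author label per token group."""
    chunks, labels = [], []
    for path in paths:
        path = Path(path)
        tokens = tokenize(path.read_text(encoding="utf-8", errors="ignore"))
        for chunk in chunk_tokens(tokens, size):
            chunks.append(chunk)
            labels.append(author_from_filename(path.name))
    return chunks, labels


def evaluate(model, train, test):
    """Fit `model` on token counts; return (accuracy, confusion matrix)."""
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score, confusion_matrix

    (train_x, train_y), (test_x, test_y) = train, test
    # The chunks are already tokenized, so the analyzer passes them through.
    vectorizer = CountVectorizer(analyzer=lambda chunk: chunk)
    model.fit(vectorizer.fit_transform(train_x), train_y)
    predicted = model.predict(vectorizer.transform(test_x))
    labels = sorted(set(train_y) | set(test_y))
    return (accuracy_score(test_y, predicted),
            confusion_matrix(test_y, predicted, labels=labels))
```

With train = build_dataset(training_paths) and test = build_dataset(test_paths), calling evaluate(MultinomialNB(), train, test) and evaluate(BernoulliNB(), train, test) (both classes imported from sklearn.naive_bayes) would give the accuracy and confusion matrix for each classifier. BernoulliNB binarizes the counts by default, so the same count features serve both models. The two additional classifiers could reuse evaluate after filtering stop words out of each chunk or after re-sampling the training chunks so every author has the same number.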
