Question: Please implement a Naive Bayes classifier in Python 3 and apply it to the mushroom data sets provided: mushroom - training.data, and mushroom - testing.data.

Please implement a Naive Bayes classifier in Python3 and apply it to the mushroom data sets provided: mushroom-training.data, and mushroom-testing.data. You are to train using just the training set, but provide classification accuracy for both the training and testing data sets after learning has occurred. Apply the classifier to these datasets first using no m-estimate, then consider how using m =1 affects the accuracy.
In addition to providing the code and results just described, please provide thorough and well-considered answers to the following questions:
Is the naive Bayes assumption valid and correct here? If not, does it matter here?
Did using an m-estimate help improve the classifier performance? Why or why not?
You may use and extend the python code I provide to you if you choose, or you may implement your own code from scratch to complete this assignment. Please submit your answers in the BlackBoard assignment submission field text box, but I will grade your source code by pulling from your GitHub repository. So make sure it is pushed by the due date.
The file format is a very simple format to be used for a variety of classification (supervised learning) and clustering (unsupervised learning) tasks. It can represent categorical, ordinal (integer), or numeric (real) data. Attributes are named, and for classification tasks, one attribute is typically called "class". In such cases, the dataset will create a new attribute for each instance called "assigned-class", though no value is bound for that attribute for any instance by default. The "class" attribute is the true concept class value as given in the data file, while the "assigned-class" attribute is intended to facilitate the assignment of a class value by some classifier.
There are two types of lines expected in a data file:
Lines specifying attribute information;
Lines specifying instance data.
In the first case, lines must be in the form:
: :
The attribute name can be any string that does not contain a ":" character, but they must be unique for each attribute. Attribute type can be "cat", "ord", or "num" for categorical, ordinal, or numeric data, respectively. In the case of categorical data, the attribute values specify all the allowable values in that category. In the case of ordinal or numeric data, the Dataset file reader will establish a range based on the minimal and maximal values in the list provided.
In the second case, lines must be in the form:
,,...,
There must be the same number of values in the line as their are attribute lines in the file. The position of the lines does not matter, but the order does. That is, the code will assume that the values are ordered in the same order as attributes are defined in the file. The instance data is stored in the data set as a list of dictionaries. Each instance is a dictionary keyed by attribute name.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!