Question: Implement the ID3 decision tree learning algorithm that we discussed in class.

Implement the ID3 decision tree learning algorithm that we discussed in class. To simplify the implementation, your system only needs to handle binary classification tasks (i.e., each instance will have a class value of 0 or 1). In addition, you may assume that all attributes are binary-valued (i.e., the only possible attribute values are 0 and 1) and that there are no missing values in the training or test data. Some sample training files (train.dat, train2.dat) and test files (test.dat, test2.dat) are available from the assignment page of the course website.

In these files, only lines containing non-space characters are relevant. The first relevant line holds the attribute names. Each following relevant line defines a single example. Each column holds an example's value for the attribute named at the head of the column. The last column (labeled "class") holds the class label for the example. In all of the following experiments, you should use this last "class" attribute to train the tree and to determine whether the tree classifies an example correctly.

When building a decision tree, if you reach a leaf node but still have examples that belong to different classes (i.e., the node is not pure), choose the most frequent class among the instances at that leaf node. If you reach a leaf node and have no examples left, or the examples are equally split among multiple classes, choose the class that is most frequent in the entire training set. Do not implement pruning. A word on tie-breaking: when choosing attributes using information gain, if two or more attributes achieve the highest information-gain value, break ties by choosing the earliest one in the list of attributes (assuming that the attributes in the first line of a training file are read in left-to-right order).
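For concreteness, here is a minimal sketch of that recursion in Python. The representation (examples as lists of 0/1 ints whose last entry is the class, a dict per internal node) is an illustrative choice, not a required interface; info_gain is a helper sketched later, under the entropy requirement below.

def majority_class(labels, default):
    # Most frequent class at an impure leaf; if the classes are equally
    # split, fall back to the majority class of the entire training set.
    n0, n1 = labels.count(0), labels.count(1)
    if n0 == n1:
        return default
    return 0 if n0 > n1 else 1

def id3(examples, attrs, default):
    if not examples:
        return default                          # no examples left at this node
    labels = [ex[-1] for ex in examples]
    if len(set(labels)) == 1:
        return labels[0]                        # pure node
    if not attrs:
        return majority_class(labels, default)  # impure leaf, no attributes left
    # max() returns the first maximal element, so on information-gain ties
    # the earliest (leftmost) attribute wins, matching the tie-breaking rule.
    best = max(attrs, key=lambda a: info_gain(examples, a))
    node = {"attr": best, "branches": {}}
    rest = [a for a in attrs if a != best]
    for v in (0, 1):
        subset = [ex for ex in examples if ex[best] == v]
        node["branches"][v] = id3(subset, rest, default)
    return node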

Your program should be able to handle any binary classification task with any number of binary-valued attributes. Consequently, both the number and names of the attributes, as well as the number of training and test instances, should be determined at runtime. In other words, these values should not be hard-coded in your program.

Your program should allow only three arguments to be specified in the command-line invocation of your program: a training file, a test file, and the maximum number of training instances to consider. There should be no graphical user interface (GUI) of any kind. Any program that does not conform to the above specification will receive no credit.

Use logarithm base 2 when computing entropy and define 0 log2 0 to be 0.
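A sketch of the entropy and information-gain computations under that convention (the function names are illustrative; the 0 log2 0 = 0 case is handled by skipping zero-probability classes):

import math

def entropy(labels):
    # Entropy in bits of a list of 0/1 class labels; the 0 * log2(0)
    # term is defined to be 0, so zero-probability classes are skipped.
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in (0, 1):
        p = labels.count(c) / n
        if p > 0:
            h -= p * math.log2(p)
    return h

def info_gain(examples, attr):
    # Entropy of the class column minus the weighted entropy of the
    # class column after splitting on the 0/1 attribute at index attr.
    labels = [ex[-1] for ex in examples]
    remainder = 0.0
    for v in (0, 1):
        subset = [ex[-1] for ex in examples if ex[attr] == v]
        if subset:
            remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder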

In the input files, only lines containing non-space characters are relevant, as mentioned previously. In particular, empty lines may appear anywhere in an input file, including the beginning and the end of the file. Care should be taken to skip over these empty lines.
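One way to handle this is to filter blank lines while reading, which also lets the attribute names and instance counts be discovered at runtime as required. A sketch (read_dataset is an illustrative name, not a required one):

def read_dataset(path):
    # First non-blank line holds the attribute names (ending in "class");
    # each later non-blank line holds one example as 0/1 values.
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]  # skip blank lines
    names = rows[0]
    examples = [[int(v) for v in row] for row in rows[1:]]
    return names, examples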

Input files may be specified with absolute or relative paths.

Your program should be invoked as one of the following, depending on your language:

python main.py <trainFile> <testFile> <maxTrainingInstances>
java Main <trainFile> <testFile> <maxTrainingInstances>
./a.out <trainFile> <testFile> <maxTrainingInstances>

where the decision tree will be built with all instances found in the specified training file if maxTrainingInstances ≤ 0.
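A sketch of the corresponding entry point, reusing the read_dataset and id3 helpers from the sketches above (the ≤ 0 convention follows the statement just given):

import sys

def main():
    train_path, test_path, max_n = sys.argv[1], sys.argv[2], int(sys.argv[3])
    names, train = read_dataset(train_path)  # paths may be absolute or relative
    _, test = read_dataset(test_path)
    if max_n > 0:
        train = train[:max_n]                # otherwise keep every instance
    labels = [ex[-1] for ex in train]
    default = max((0, 1), key=labels.count)  # majority class of the training set
    tree = id3(train, list(range(len(names) - 1)), default)
    return tree, names, train, test

if __name__ == "__main__":
    main()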

Build a decision tree using the training instances and print to stdout the tree in the same format as the example tree shown below.

wesley = 0 :
| honor = 0 :
| | barclay = 0 : 1
| | barclay = 1 : 0
| honor = 1 :
| | tea = 0 : 0
| | tea = 1 : 1
wesley = 1 : 0

According to this tree, if wesley = 0 and honor = 0 and barclay = 0, then the class value of the corresponding instance should be 1. In other words, the value appearing before a colon is an attribute value, and the value appearing after a colon is a class value.
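A sketch of a printer for this format, assuming the dict-based node representation from the ID3 sketch above ("| " marks each level of depth):

def print_tree(node, names, depth=0):
    for v in (0, 1):
        child = node["branches"][v]
        line = "| " * depth + f"{names[node['attr']]} = {v} :"
        if isinstance(child, dict):
            print(line)                  # internal node: recurse one level down
            print_tree(child, names, depth + 1)
        else:
            print(f"{line} {child}")     # leaf: class value after the colon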

Use the learned decision tree to classify the training instances. Print to stdout the accuracy of the tree. (In this case, the tree has been trained and tested on the same data set.) The accuracy should be computed as the percentage of examples that were correctly classified. For example, if 86 of 90 examples are classified correctly, then the accuracy of the decision tree would be 95.6%. (Note that the accuracy on the training instances will be 100% if and only if the training instances are consistent.)

 Accuracy on training set (90 instances): 95.6% 

Use the learned decision tree to classify the test instances. Print to stdout the accuracy of the tree. (In this case, the decision tree has been trained and tested on different data sets.)

 Accuracy on test set (10 instances): 60.0% 
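Both accuracy lines can come from one helper. A sketch, again assuming the representation and variables used in the earlier sketches:

def classify(tree, example):
    while isinstance(tree, dict):            # walk down until a leaf (a class)
        tree = tree["branches"][example[tree["attr"]]]
    return tree

def accuracy(tree, examples):
    correct = sum(classify(tree, ex) == ex[-1] for ex in examples)
    return 100.0 * correct / len(examples)   # e.g. 86 of 90 -> 95.6 (1 decimal)

print(f"Accuracy on training set ({len(train)} instances): "
      f"{accuracy(tree, train):.1f}%")
print(f"Accuracy on test set ({len(test)} instances): "
      f"{accuracy(tree, test):.1f}%")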

Now, we want to investigate how the amount of training data affects the accuracy of the resulting decision tree. Plot a learning curve (i.e., a graph of the accuracy of your algorithm on the test set against different training set sizes) by re-training your learning algorithm on train.dat using training set sizes of 100, 200, 300, ..., 800. Briefly comment on the shape of the curve. Does it exhibit the usual properties of a learning curve? (We suggest that you plot the graph using Excel, but if you choose to draw the graph by hand, you need to scan it so that you can submit it online. We will not accept hardcopy submissions.)
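A sketch of the re-training loop, reusing the earlier sketches; the printed (size, accuracy) pairs can be pasted into Excel for the plot:

for n in range(100, 900, 100):               # training set sizes 100, ..., 800
    tree_n = id3(train[:n], list(range(len(names) - 1)), default)
    print(n, accuracy(tree_n, test))         # one point of the learning curve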

