Question: Implement the ID 3 decision tree classification algorithm and apply it to the breast cancer dataset ( breast - cancer.arff ) from UCI Machine Learning

Implement the ID

3

decision tree classification algorithm and apply it to the breast cancer dataset

(

breast

-

cancer.arff

)

from UCI Machine Learning Repository.

Some of the features contain the value

" ? "

which represents missing values. Solve this problem by filling in the missing values with an approach of your choice, and justify your choice.

To avoid overfitting of the tree, use at least one approach for pre

-

pruning and at least one approach for post

-

pruning:

1 .

Pre

-

pruning approaches for the tree:

-

Constant N defining the maximum depth of the tree.

-

Constant K defining the minimum number of training examples in a node

(

leaf

) .

-

Constant G defining the minimum information gain required for a split.

2 .

Post

-

pruning approaches for the tree:

-

Error Estimation or Reduced Error Pruning.

-

Chi

-

Square test

(

2)

pruning

.

-

Minimal Cost

-

Complexity Pruning.

You will be awarded bonus points for implementing additional pruning approaches.

For testing the algorithm, split the data into training and testing sets with an

80

20

ratio

.

Before splitting, shuffle the data. The split must be stratified to maintain the class ratio

(201

without recurrence and

85

with recurrence

)

in the resulting training and test sets.

As input, accept three possible values:

0, 1,

2

0

means that only pre

-

pruning is used.

1

means that only post

-

pruning is used.

2

means both pre

-

pruning and post

-

pruning approaches are used.

If there are multiple pruning methods, they should be specified with corresponding letters after the number. For example, if all three pre

-

pruning approaches are chosen, they will be coded as

0

N K G

.

For post

-

pruning approaches, the letters will be E for Error Estimation, X for Chi

-

Square test, and C for Cost

-

Complexity pruning.

For the output, provide:

1 .

Train Set Accuracy:

-

Accuracy on the training set after training the model.

2 . 10 -

Fold Cross

-

Validation Results:

-

Accuracy for each fold of the

10 -

fold cross

-

validation.

-

Average accuracy and standard deviation for cross

-

validation.

3 .

Test Set Accuracy:

-

Accuracy on the test set.

Example Input and Output:

Input:

0

Output:

1 .

Train Set Accuracy:

Accuracy:

70.90 %

10 -

Fold Cross

-

Validation Results:

Accuracy Fold

1

71.00 %

Accuracy Fold

2

69.72 %

Accuracy Fold

3

71.30 %

Accuracy Fold

4

73.05 %

Accuracy Fold

5

69.53 %

Accuracy Fold

6

69.53 %

Accuracy Fold

7

73.16 %

Accuracy Fold

8

71.53 %

Accuracy Fold

9

69.06 %

Accuracy Fold

10

71.09 %

Average Accuracy:

70.90 %

Standard Deviation:

1.37 %

2 .

Test Set Accuracy:

Accuracy:

69.07 %

Notes:

The solution should be implemented in C

+ + .

The code should be a result of your own understanding of the problem and not from pre

-

executed solutions.

Do not use embedded STL functions to manipulate data.

You are allowed to use data structures like DataFrame if needed.

Bonus: Implement the Random Forest Algorithm.

The data in the breast

-

cancer.data file looks like this:

-

recurrence

-

events,

30 - 39,

premeno,

30 - 34, 0 - 2,

, 3,

left,left

_

low,no

-

recurrence

-

events,

40 - 49,

premeno,

20 - 24, 0 - 2,

, 2,

right,right

_

,

-

recurrence

-

events,

40 - 49,

premeno,

20 - 24, 0 - 2,

, 2,

left,left

_

low,no

-

recurrence

-

events,

60 - 69,

40, 15 - 19, 0 - 2,

, 2,

right,left

_

,

-

recurrence

-

events,

40 - 49,

premeno,

0 - 4, 0 - 2,

, 2,

right,right

_

low,no

-

recurrence

-

events,

60 - 69,

40, 15 - 19, 0 - 2,

, 2,

left,left

_

low,no

. . .

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!

Requirements Read the give information deeply and Drawing conclusions refers to information that is implied or inferred. ... Using these clues to give for deeper understanding And provide the details...

You can use any software to plot and/or to calculate values/data, but if you do, provide (copy/paste) here the code. Data sets relevant for this HW can be found at the UCI Machine Learning...

Task 2: Perceptron for binary classification. Perceptron is a supervised learning algorithm for classification or regression. In supervised learning, you are given a data set of pairs, where the...

I want test and train accuraciies in one valu Task 2: Perceptron for binary classification. Perceptron is a supervised learning algorithm for classification or regression. In supervised learning, you...

I NEED Test And Train accuracies in ONE Variable , TO show just one line in run screen Task 2: Perceptron for binary classification. Perceptron is a supervised learning algorithm for classification...

Jupiter Notebook We have covered some of the limitations of single layer neural networks in class, but they are still powerful learning systems that provide a good way to begin learning about how to...

2. Practicum Problems It is suggested that a Jupyter/IPython notebook be used for the programmatic components. 2.1 Problem 1 Load the iris sample dataset from sklearn (load_iris()) into Python using...

The database BreastCancer_Wisconsin: .. _breast_cancer_dataset: Breast cancer wisconsin (diagnostic) dataset -------------------------------------------- **Data Set Characteristics:** :Number of...

Analytics Report Overview The purpose of this task is to provide students with practical experience in writing a data analytical report to provide useful insights, pattern and trends in a chosen...

1. Business Understanding Students are expected to identify a data analytics task of your choice. You have to detail the Business Understanding part of your problem under this heading which basically...

The following information was available to reconcile Holiday Handicrafts general ledger balance and the bank statement balance as of December 31, 2015. a. According to the accounting record, the...

amazon questions Mark for follow up Question 6 of 7. During Loadout, how do you find the carts of bags and packages for your route? O Ask a Problem Solver (an Amazon associate) O It is always the...

10. Create an application, using the following names for the solution and project, respectively: Car Solution and Car Project. Save the application in the VB2015\Chap10 folder. Create the interface...

Question 26(1 point) Listen ReadSpeaker webReader: Listen What is a standard form contract? Question 26 options: An agreement where the parties to the agreement negotiate the timing of delivery of...