Question: Implement the ID3 decision tree classification algorithm and apply it to the breast cancer dataset (breast-cancer.arff) from UCI Machine Learning

Implement the ID3 decision tree classification algorithm and apply it to the breast cancer dataset (breast-cancer.arff) from UCI Machine Learning Repository.
Some of the features contain the value "?", which represents missing values. Solve this problem by filling in the missing values with an approach of your choice, and justify your choice.
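One common choice for categorical features is to replace each "?" with the most frequent value of that column (the mode). The following is a minimal sketch of that idea, not a required approach. It assumes each row is stored as a vector of strings and that std::vector and std::string are acceptable as containers while the counting is done with hand-written loops; the names Row, columnMode and fillMissingWithMode are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One row of the dataset: every field kept as a string (assumed layout).
using Row = std::vector<std::string>;

// Returns the most frequent non-missing value in column `col`.
// Ties are broken in favour of the value that was seen first.
std::string columnMode(const std::vector<Row>& rows, std::size_t col) {
    std::vector<std::string> values;  // distinct values seen so far
    std::vector<int> counts;          // counts[i] belongs to values[i]
    for (std::size_t r = 0; r < rows.size(); ++r) {
        const std::string& v = rows[r][col];
        if (v == "?") continue;  // skip missing entries
        bool found = false;
        for (std::size_t i = 0; i < values.size(); ++i) {
            if (values[i] == v) { ++counts[i]; found = true; break; }
        }
        if (!found) { values.push_back(v); counts.push_back(1); }
    }
    if (values.empty()) return "?";  // column has no known value at all
    std::size_t best = 0;
    for (std::size_t i = 1; i < values.size(); ++i) {
        if (counts[i] > counts[best]) best = i;
    }
    return values[best];
}

// Replaces every "?" in every column with that column's mode.
void fillMissingWithMode(std::vector<Row>& rows) {
    if (rows.empty()) return;
    for (std::size_t col = 0; col < rows[0].size(); ++col) {
        const std::string mode = columnMode(rows, col);
        for (std::size_t r = 0; r < rows.size(); ++r) {
            if (rows[r][col] == "?") rows[r][col] = mode;
        }
    }
}
```

A possible justification is that the mode is cheap to compute and keeps every filled value inside the attribute's existing categories. To avoid leaking test information, the modes could be computed on the training rows only and then applied to the test rows.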
To avoid overfitting of the tree, use at least one approach for pre-pruning and at least one approach for post-pruning:
1. Pre-pruning approaches for the tree:
- Constant N defining the maximum depth of the tree.
- Constant K defining the minimum number of training examples in a node (leaf).
- Constant G defining the minimum information gain required for a split.
2. Post-pruning approaches for the tree:
- Error Estimation or Reduced Error Pruning.
- Chi-Square test (χ²) pruning.
- Minimal Cost-Complexity Pruning.
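The three pre-pruning constants can be checked together right before a node is split. The sketch below shows entropy and information gain for the two-class case and a stopping test built from N, K and G. The struct and function names (Counts, PrePruning, shouldStop) are hypothetical, and treating "minimum number of examples" as a check on the node about to be split is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Class counts at a node: examples without and with recurrence.
struct Counts { int neg; int pos; };

// Shannon entropy (in bits) of a two-class distribution.
double entropy(const Counts& c) {
    const int n = c.neg + c.pos;
    if (n == 0) return 0.0;
    const int parts[2] = {c.neg, c.pos};
    double h = 0.0;
    for (int i = 0; i < 2; ++i) {
        if (parts[i] == 0) continue;  // 0 * log(0) is taken as 0
        const double p = static_cast<double>(parts[i]) / n;
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting `parent` into `branches`
// (one Counts entry per attribute value).
double informationGain(const Counts& parent, const std::vector<Counts>& branches) {
    const int n = parent.neg + parent.pos;
    if (n == 0) return 0.0;
    double remainder = 0.0;
    for (std::size_t i = 0; i < branches.size(); ++i) {
        const int m = branches[i].neg + branches[i].pos;
        remainder += static_cast<double>(m) / n * entropy(branches[i]);
    }
    return entropy(parent) - remainder;
}

// Pre-pruning constants: N (max depth), K (min examples), G (min gain).
struct PrePruning { int maxDepth; int minExamples; double minGain; };

// True when the node should become a leaf instead of being split.
bool shouldStop(int depth, int examples, double bestGain, const PrePruning& p) {
    return depth >= p.maxDepth || examples < p.minExamples || bestGain < p.minGain;
}
```

In a recursive tree builder, shouldStop would be called with the gain of the best attribute; when it returns true the node is labelled with its majority class.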
You will be awarded bonus points for implementing additional pruning approaches.
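For the Chi-Square option, one way to decide whether a split is worth keeping is to compare the observed class counts in each branch with the counts expected if the attribute were independent of the class. The sketch below computes that statistic for the two-class case; ClassCounts, chiSquareStatistic and splitIsSignificant are hypothetical names, and the critical value is left to the caller.

```cpp
#include <cstddef>
#include <vector>

// Observed class counts in a node or branch.
struct ClassCounts { int neg; int pos; };

// Pearson chi-square statistic of a split: sum over branches and classes
// of (observed - expected)^2 / expected, where the expected count assumes
// each branch has the same class proportions as the parent.
double chiSquareStatistic(const ClassCounts& parent,
                          const std::vector<ClassCounts>& branches) {
    const double n = parent.neg + parent.pos;
    if (n <= 0.0) return 0.0;
    double stat = 0.0;
    for (std::size_t i = 0; i < branches.size(); ++i) {
        const double m = branches[i].neg + branches[i].pos;
        const double expNeg = m * parent.neg / n;
        const double expPos = m * parent.pos / n;
        if (expNeg > 0.0) {
            const double d = branches[i].neg - expNeg;
            stat += d * d / expNeg;
        }
        if (expPos > 0.0) {
            const double d = branches[i].pos - expPos;
            stat += d * d / expPos;
        }
    }
    return stat;
}

// A split is kept when the statistic reaches the chosen critical value;
// otherwise the subtree is a candidate for collapsing into a leaf.
bool splitIsSignificant(double statistic, double criticalValue) {
    return statistic >= criticalValue;
}
```

With two classes the degrees of freedom equal the number of branches minus one. For a two-way split at the 0.05 level the critical value is 3.841; attributes with more values need the value for their own degrees of freedom from a chi-square table.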
For testing the algorithm, split the data into training and testing sets with an 80:20 ratio. Before splitting, shuffle the data. The split must be stratified to maintain the class ratio (201 without recurrence and 85 with recurrence) in the resulting training and test sets.
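A stratified split can be obtained by grouping row indices by class, shuffling each group, and sending the same fraction of every group to the test set. The sketch below does this with a hand-written Fisher-Yates shuffle. The names Split, shuffleIndices and stratifiedSplit are hypothetical, and using std::mt19937 from <random> as the random source is an assumption about what the assignment's STL restriction permits.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Row indices assigned to each part of the split.
struct Split {
    std::vector<std::size_t> train;
    std::vector<std::size_t> test;
};

// Hand-written Fisher-Yates shuffle of an index list.
void shuffleIndices(std::vector<std::size_t>& idx, std::mt19937& rng) {
    for (std::size_t i = idx.size(); i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        const std::size_t j = pick(rng);
        const std::size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

// Shuffles the rows of each class separately and sends `testFraction`
// of every class (rounded to the nearest integer) to the test set.
Split stratifiedSplit(const std::vector<std::string>& labels,
                      double testFraction, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<std::string> classes;               // distinct labels
    std::vector<std::vector<std::size_t> > byClass; // row indices per label
    for (std::size_t r = 0; r < labels.size(); ++r) {
        std::size_t c = 0;
        while (c < classes.size() && classes[c] != labels[r]) ++c;
        if (c == classes.size()) {
            classes.push_back(labels[r]);
            byClass.push_back(std::vector<std::size_t>());
        }
        byClass[c].push_back(r);
    }
    Split out;
    for (std::size_t c = 0; c < byClass.size(); ++c) {
        shuffleIndices(byClass[c], rng);
        const std::size_t nTest =
            static_cast<std::size_t>(byClass[c].size() * testFraction + 0.5);
        for (std::size_t k = 0; k < byClass[c].size(); ++k) {
            if (k < nTest) out.test.push_back(byClass[c][k]);
            else out.train.push_back(byClass[c][k]);
        }
    }
    return out;
}
```

With 201 and 85 examples and a test fraction of 0.2, this rounding puts 40 and 17 rows in the test set and 161 and 68 in the training set. The returned index lists are grouped by class, so they can be shuffled once more if row order matters later.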
As input, accept three possible values: 0, 1, or 2:
0 means that only pre-pruning is used.
1 means that only post-pruning is used.
2 means both pre-pruning and post-pruning approaches are used.
If there are multiple pruning methods, they should be specified with corresponding letters after the number. For example, if all three pre-pruning approaches are chosen, they will be coded as 0 N K G. For post-pruning approaches, the letters will be E for Error Estimation, X for Chi-Square test, and C for Cost-Complexity pruning.
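The input line can be parsed by reading the mode digit first and then turning each following letter into a flag. The sketch below is one possible reading of the format. PruningConfig and parsePruningInput are hypothetical names, whitespace between tokens is treated as optional, and what to do when no letters follow the digit (as in the example input 0) is left to the caller because the assignment does not say.

```cpp
#include <cstddef>
#include <string>

// Parsed form of an input line such as "0 N K G" or "2 N E".
struct PruningConfig {
    int mode = -1;  // 0 = pre only, 1 = post only, 2 = both, -1 = invalid
    bool maxDepth = false;        // N
    bool minExamples = false;     // K
    bool minGain = false;         // G
    bool errorEstimation = false; // E
    bool chiSquare = false;       // X
    bool costComplexity = false;  // C
};

// Reads the mode digit, then sets one flag per recognised letter.
// Unrecognised characters after the digit are ignored.
PruningConfig parsePruningInput(const std::string& line) {
    PruningConfig cfg;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char ch = line[i];
        if (ch == ' ' || ch == '\t' || ch == '\r') continue;
        if (cfg.mode == -1) {
            if (ch < '0' || ch > '2') return cfg;  // invalid mode digit
            cfg.mode = ch - '0';
            continue;
        }
        switch (ch) {
            case 'N': cfg.maxDepth = true; break;
            case 'K': cfg.minExamples = true; break;
            case 'G': cfg.minGain = true; break;
            case 'E': cfg.errorEstimation = true; break;
            case 'X': cfg.chiSquare = true; break;
            case 'C': cfg.costComplexity = true; break;
            default: break;
        }
    }
    return cfg;
}
```

A caller would typically reject a result whose mode is -1 and pick a default method when a mode is given without any letters.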
For the output, provide:
1. Train Set Accuracy:
- Accuracy on the training set after training the model.
2. 10-Fold Cross-Validation Results:
- Accuracy for each fold of the 10-fold cross-validation.
- Average accuracy and standard deviation for cross-validation.
3. Test Set Accuracy:
- Accuracy on the test set.
Example Input and Output:
Input: 0
Output:
1. Train Set Accuracy:
Accuracy: 70.90%
10-Fold Cross-Validation Results:
Accuracy Fold 1: 71.00%
Accuracy Fold 2: 69.72%
Accuracy Fold 3: 71.30%
Accuracy Fold 4: 73.05%
Accuracy Fold 5: 69.53%
Accuracy Fold 6: 69.53%
Accuracy Fold 7: 73.16%
Accuracy Fold 8: 71.53%
Accuracy Fold 9: 69.06%
Accuracy Fold 10: 71.09%
Average Accuracy: 70.90%
Standard Deviation: 1.37%
2. Test Set Accuracy:
Accuracy: 69.07%
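The average and standard deviation lines of the example can be reproduced from the ten fold accuracies. The sketch below uses the population form of the standard deviation (dividing by the number of folds), which gives the 1.37% shown above; dividing by one less than the number of folds would give about 1.45% instead. Summary and summarize are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean and standard deviation of a list of accuracies.
struct Summary { double mean; double stddev; };

// Computes the mean and the population standard deviation
// (variance divided by the number of values) with plain loops.
Summary summarize(const std::vector<double>& acc) {
    Summary s = {0.0, 0.0};
    if (acc.empty()) return s;
    double sum = 0.0;
    for (std::size_t i = 0; i < acc.size(); ++i) sum += acc[i];
    s.mean = sum / acc.size();
    double squared = 0.0;
    for (std::size_t i = 0; i < acc.size(); ++i) {
        const double d = acc[i] - s.mean;
        squared += d * d;
    }
    s.stddev = std::sqrt(squared / acc.size());
    return s;
}
```

Feeding it the ten fold values from the example yields a mean of about 70.90 and a standard deviation of about 1.37.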
Notes:
The solution should be implemented in C++.
The code should reflect your own understanding of the problem and must not be copied from existing solutions.
Do not use built-in STL functions to manipulate data.
You are allowed to use data structures like DataFrame if needed.
Bonus: Implement the Random Forest Algorithm.
The data in the breast-cancer.data file looks like this:
no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no
no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,left_low,no
...
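Each record is a single comma-separated line with no quoting, so it can be split with a plain character loop. The sketch below assumes that layout and uses the hypothetical name splitCsvLine; it also drops a trailing carriage return in case the file has Windows line endings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one comma-separated record into its fields.
// Assumes no quoted fields and no escaped commas.
std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char ch = line[i];
        if (ch == ',') {
            fields.push_back(current);  // field ends at the comma
            current.clear();
        } else if (ch != '\r') {
            current += ch;
        }
    }
    fields.push_back(current);  // last field has no trailing comma
    return fields;
}
```

For the first sample row this produces ten fields, with the class label at index 0.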
