Question: Implement the ID 3 decision tree classification algorithm and apply it to the breast cancer dataset ( breast - cancer.arff ) from UCI Machine Learning
Implement the IDdecision tree classification algorithm and apply it to the breast cancer dataset breastcancer.arfffrom UCI Machine Learning Repository.
Some of the features contain the value which represents missing values. Solve this problem by filling in the missing values with an approach of your choice, and justify your choice.
To avoid overfitting of the tree, use at least one approach for prepruning and at least one approach for postpruning:
Prepruning approaches for the tree:
Constant N defining the maximum depth of the tree.
Constant K defining the minimum number of training examples in a node leaf
Constant G defining the minimum information gain required for a split.
Postpruning approaches for the tree:
Error Estimation or Reduced Error Pruning.
ChiSquare test pruning
Minimal CostComplexity Pruning.
You will be awarded bonus points for implementing additional pruning approaches.
For testing the algorithm, split the data into training and testing sets with an :ratio Before splitting, shuffle the data. The split must be stratified to maintain the class ratio without recurrence and with recurrencein the resulting training and test sets.
As input, accept three possible values: or :
means that only prepruning is used.
means that only postpruning is used.
means both prepruning and postpruning approaches are used.
If there are multiple pruning methods, they should be specified with corresponding letters after the number. For example, if all three prepruning approaches are chosen, they will be coded as N K GFor postpruning approaches, the letters will be E for Error Estimation, X for ChiSquare test, and C for CostComplexity pruning.
For the output, provide:
Train Set Accuracy:
Accuracy on the training set after training the model.
Fold CrossValidation Results:
Accuracy for each fold of the fold crossvalidation.
Average accuracy and standard deviation for crossvalidation.
Test Set Accuracy:
Accuracy on the test set.
Example Input and Output:
Input:
Output:
Train Set Accuracy:
Accuracy:
Fold CrossValidation Results:
Accuracy Fold :
Accuracy Fold :
Accuracy Fold :
Accuracy Fold :
Accuracy Fold :
Accuracy Fold :
Accuracy Fold :
Accuracy Fold :
Accuracy Fold :
Accuracy Fold :
Average Accuracy:
Standard Deviation:
Test Set Accuracy:
Accuracy:
Notes:
The solution should be implemented in C
The code should be a result of your own understanding of the problem and not from preexecuted solutions.
Do not use embedded STL functions to manipulate data.
You are allowed to use data structures like DataFrame if needed.
Bonus: Implement the Random Forest Algorithm.
The data in the breastcancer.data file looks like this:
norecurrenceevents,premeno,noleft,leftlow,no
norecurrenceevents,premeno,noright,rightupno
norecurrenceevents,premeno,noleft,leftlow,no
norecurrenceevents,genoright,leftupno
norecurrenceevents,premeno,noright,rightlow,no
norecurrenceevents,genoleft,leftlow,no
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
