Question: Implement the ID3 decision tree classification algorithm and apply it to the dataset breast-cancer.arff

Implement the ID3 decision tree classification algorithm and apply it to the dataset breast-cancer.arff.
Some features contain the value "?", indicating missing values. Solve the task by filling in these missing values using an approach of your choice, and justify your chosen approach.
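One possible approach to the missing values is mode imputation: replace each "?" with the most frequent known value of that feature. It keeps every row and keeps the feature categorical, which ID3 expects. The sketch below is only an illustration; the `Row` alias and the function name `imputeWithMode` are assumptions, not part of the task. To avoid leaking test data, the mode would normally be computed on the training rows only.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One record: all values (class label and features) kept as strings.
using Row = std::vector<std::string>;

// Replace every "?" in column `col` with the most frequent known value of
// that column. Ties go to the lexicographically smallest value, because
// std::map iterates its keys in sorted order.
void imputeWithMode(std::vector<Row>& rows, std::size_t col) {
    std::map<std::string, int> counts;
    for (const Row& r : rows)
        if (r[col] != "?") ++counts[r[col]];
    if (counts.empty()) return;  // no known value in this column; leave as-is

    std::string mode;
    int best = 0;
    for (const auto& kv : counts) {
        if (kv.second > best) {
            best = kv.second;
            mode = kv.first;
        }
    }
    for (Row& r : rows)
        if (r[col] == "?") r[col] = mode;
}
```

Calling `imputeWithMode(rows, c)` once per feature column fills the whole table.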
To avoid overfitting the tree, use at least one approach for pre-pruning and one for post-pruning:
Pre-Pruning:
Constant N: Defines the maximum depth of the tree.
Constant K: Defines the minimum number of training examples required in a node (to turn it into a leaf).
Constant G: Defines the minimum information gain required to split a node.
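The three constants can be checked together before a node is split. The following is a minimal sketch, assuming class labels and feature values are stored as strings; the struct `PrePruning`, its default thresholds, and the function names are hypothetical and chosen only for illustration. It reads K as "a node with fewer than K examples becomes a leaf", which is one interpretation of the wording above.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Shannon entropy (in bits) of a list of class labels.
double entropy(const std::vector<std::string>& labels) {
    if (labels.empty()) return 0.0;
    std::map<std::string, int> counts;
    for (const std::string& l : labels) ++counts[l];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / labels.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting `labels` by the parallel column `values`.
double informationGain(const std::vector<std::string>& values,
                       const std::vector<std::string>& labels) {
    std::map<std::string, std::vector<std::string>> groups;
    for (std::size_t i = 0; i < labels.size(); ++i)
        groups[values[i]].push_back(labels[i]);
    double remainder = 0.0;
    for (const auto& kv : groups)
        remainder += static_cast<double>(kv.second.size()) / labels.size() *
                     entropy(kv.second);
    return entropy(labels) - remainder;
}

// Hypothetical thresholds; the task leaves the actual values open.
struct PrePruning {
    int maxDepth = 5;             // N
    std::size_t minSamples = 10;  // K
    double minGain = 0.01;        // G
};

// True when the node should become a leaf instead of being split.
bool shouldStop(const PrePruning& p, int depth, std::size_t samples,
                double bestGain) {
    return depth >= p.maxDepth || samples < p.minSamples || bestGain < p.minGain;
}
```

When only one variant is selected (for example input 0 K), the other two conditions can be disabled by setting their thresholds to non-restrictive values.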
Post-Pruning:
Model error estimation when pruning the tree (Error Estimation or Reduced Error Pruning).
Chi-Squared (χ²) test.
Minimal Cost-Complexity Pruning.
Any additional pruning approaches will be considered as a bonus.
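Of the post-pruning options, the chi-squared test is the easiest to show in isolation, because it needs only the class counts in the children of a split. The sketch below assumes those counts are available as a table `counts[child][class]`; the function names are illustrative, and the critical value is supplied by the caller rather than computed.

```cpp
#include <cstddef>
#include <vector>

// counts[c][k] = number of training examples of class k in child c.
// Returns the chi-squared statistic of the split against the hypothesis
// that every child has the same class distribution as the parent.
double chiSquared(const std::vector<std::vector<int>>& counts) {
    if (counts.empty()) return 0.0;
    const std::size_t classes = counts[0].size();
    std::vector<double> classTotal(classes, 0.0);
    std::vector<double> childTotal(counts.size(), 0.0);
    double total = 0.0;
    for (std::size_t c = 0; c < counts.size(); ++c) {
        for (std::size_t k = 0; k < classes; ++k) {
            classTotal[k] += counts[c][k];
            childTotal[c] += counts[c][k];
            total += counts[c][k];
        }
    }
    double stat = 0.0;
    for (std::size_t c = 0; c < counts.size(); ++c) {
        for (std::size_t k = 0; k < classes; ++k) {
            double expected = classTotal[k] * childTotal[c] / total;
            if (expected > 0.0) {  // skip empty cells
                double diff = counts[c][k] - expected;
                stat += diff * diff / expected;
            }
        }
    }
    return stat;
}

// Collapse the split into a leaf when it is not statistically significant.
// `criticalValue` depends on the chosen significance level and on
// (children - 1) * (classes - 1) degrees of freedom.
bool shouldPrune(const std::vector<std::vector<int>>& counts,
                 double criticalValue) {
    return chiSquared(counts) <= criticalValue;
}
```

For a binary split of a two-class problem there is one degree of freedom, and the critical value at the 5% level is 3.841. Post-pruning would apply `shouldPrune` bottom-up, starting from the nodes whose children are all leaves.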
For testing the algorithm, split the dataset into training and testing sets in an 80:20 ratio, ensuring the data is shuffled before splitting. The split should be stratified to maintain the class distribution (201 non-recurrence, 85 recurrence) in the newly formed training and test sets.
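A stratified split can be obtained by shuffling the rows and then splitting each class separately. The sketch below assumes the class label is the first field of each row, as in the sample records of this question; the `Row` alias, the function name, and the explicit seed parameter are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

using Row = std::vector<std::string>;

// Shuffle the rows, then send the first `trainFraction` of every class to
// the training set and the rest to the test set, so both parts keep roughly
// the original class proportions. Assumes the class label is column 0.
std::pair<std::vector<Row>, std::vector<Row>>
stratifiedSplit(std::vector<Row> rows, double trainFraction, unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(rows.begin(), rows.end(), rng);

    std::map<std::string, std::vector<Row>> byClass;
    for (const Row& r : rows) byClass[r[0]].push_back(r);

    std::vector<Row> train, test;
    for (const auto& kv : byClass) {
        const std::vector<Row>& group = kv.second;
        // Round to the nearest whole number of training rows.
        std::size_t nTrain =
            static_cast<std::size_t>(group.size() * trainFraction + 0.5);
        for (std::size_t i = 0; i < group.size(); ++i)
            (i < nTrain ? train : test).push_back(group[i]);
    }
    // Mix the classes again so neither set is ordered by class.
    std::shuffle(train.begin(), train.end(), rng);
    std::shuffle(test.begin(), test.end(), rng);
    return std::make_pair(train, test);
}
```

With 201 non-recurrence and 85 recurrence rows and a fraction of 0.8, this rounding gives 161 + 68 = 229 training rows and 40 + 17 = 57 test rows.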
Input:
The program should take one of three possible values as input: 0, 1, or 2:
0: Use only the pre-pruning approach.
1: Use only the post-pruning approach.
2: Use both pre-pruning and post-pruning approaches.
If multiple types of pruning approaches are available, they are specified with the corresponding letter after the approach type. For example:
For pre-pruning: N, K, and G.
For post-pruning: E, X, and C.
For example:
Input 0 applies all implemented pre-pruning approaches.
Input 0 K applies only the pre-pruning variant that defines the minimum number of training examples.
Other combinations follow a similar logic.
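The input format can be parsed as a mode number followed by optional letters. The sketch below is one reading of the rules above and rests on assumptions the task does not spell out: letters are separated by whitespace or written together, a letter that does not belong to the selected mode is rejected, and for mode 2 a side with no letters falls back to all of its variants. The names `PruningConfig` and `parsePruningInput` are hypothetical.

```cpp
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

struct PruningConfig {
    std::set<char> pre;   // subset of {N, K, G}
    std::set<char> post;  // subset of {E, X, C}
};

// Parse a line such as "0", "0 K", or "2 N E".
PruningConfig parsePruningInput(const std::string& line) {
    const std::string preAll = "NKG";
    const std::string postAll = "EXC";
    std::istringstream in(line);
    int mode = -1;
    if (!(in >> mode) || mode < 0 || mode > 2)
        throw std::invalid_argument("mode must be 0, 1, or 2");

    PruningConfig cfg;
    std::string token;
    while (in >> token) {
        for (char ch : token) {
            if (mode != 1 && preAll.find(ch) != std::string::npos)
                cfg.pre.insert(ch);
            else if (mode != 0 && postAll.find(ch) != std::string::npos)
                cfg.post.insert(ch);
            else
                throw std::invalid_argument("unexpected pruning letter");
        }
    }
    // No letters given for an active side: use all its variants.
    if (mode != 1 && cfg.pre.empty())
        cfg.pre.insert(preAll.begin(), preAll.end());
    if (mode != 0 && cfg.post.empty())
        cfg.post.insert(postAll.begin(), postAll.end());
    return cfg;
}
```

For example, `parsePruningInput("0 K")` yields a configuration whose `pre` set contains only K and whose `post` set is empty.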
Output:
The program should output the following:
Train Set Accuracy:
The accuracy of the model on the training set when using the standard split.
10-Fold Cross-Validation Results:
Accuracy for each fold.
Average accuracy and standard deviation across the folds.
Test Set Accuracy:
Accuracy on the test set.
Example Input:
0
Example Output:
1. Train Set Accuracy:
    Accuracy: 70.90%
10-Fold Cross-Validation Results:
    Accuracy Fold 1: 71.00%
    Accuracy Fold 2: 69.72%
    Accuracy Fold 3: 71.30%
    Accuracy Fold 4: 73.05%
    Accuracy Fold 5: 69.53%
    Accuracy Fold 6: 69.53%
    Accuracy Fold 7: 73.16%
    Accuracy Fold 8: 71.53%
    Accuracy Fold 9: 69.06%
    Accuracy Fold 10: 71.09%
    Average Accuracy: 70.90%
    Standard Deviation: 1.37%
2. Test Set Accuracy:
    Accuracy: 69.07%
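The reported figures can be computed with three small helpers. This is a sketch with illustrative function names. The task does not say whether the standard deviation should divide by n or by n - 1; the version below divides by n (population form), which reproduces the 70.90% average and 1.37% deviation of the ten fold accuracies listed in the example output.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Percentage of predictions that match the true labels.
double accuracyPercent(const std::vector<std::string>& predicted,
                       const std::vector<std::string>& actual) {
    if (actual.empty()) return 0.0;
    std::size_t correct = 0;
    for (std::size_t i = 0; i < actual.size(); ++i)
        if (predicted[i] == actual[i]) ++correct;
    return 100.0 * correct / actual.size();
}

// Arithmetic mean of the fold accuracies.
double mean(const std::vector<double>& xs) {
    if (xs.empty()) return 0.0;
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / xs.size();
}

// Population standard deviation (divides by n, an assumption; see above).
double stdDev(const std::vector<double>& xs) {
    if (xs.empty()) return 0.0;
    const double m = mean(xs);
    double sumSq = 0.0;
    for (double x : xs) sumSq += (x - m) * (x - m);
    return std::sqrt(sumSq / xs.size());
}
```

Printing with `std::fixed` and `std::setprecision(2)` from `<iomanip>` gives the two-decimal format used in the example.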
Notes:
Use data structures such as DataFrames where appropriate.
Compare the results achieved with different approaches to avoid overfitting.
As a bonus, try to implement the Random Forest algorithm.
Solve this AI task in C++. The first input should be the breast cancer records as comma-separated strings, something like this:
no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
followed by 0, 1, or 2.
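Reading this input could look like the sketch below. It assumes one record per line, that every record line contains at least one comma, and that the first non-empty line without a comma is the pruning selection (for example "0" or "0 K"). The names `ProgramInput`, `splitRecord`, and `readInput` are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split one comma-separated record into its fields.
// In the sample rows, field 0 is the class label and nine features follow.
std::vector<std::string> splitRecord(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ',')) fields.push_back(field);
    return fields;
}

struct ProgramInput {
    std::vector<std::vector<std::string>> rows;  // parsed records
    std::string modeLine;                        // e.g. "0", "0 K", "2"
};

// Read records until the first non-empty line without a comma, which is
// taken to be the pruning selection (an assumption about the input layout).
ProgramInput readInput(std::istream& in) {
    ProgramInput result;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line.find(',') == std::string::npos) {
            result.modeLine = line;
            break;
        }
        result.rows.push_back(splitRecord(line));
    }
    return result;
}
```

In the program itself this would be called as `readInput(std::cin)`.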
