Question:

Use JupyterLab. The provided code is below, and the questions are at the end:
import numpy as np
import pandas as pd
import seaborn as sns
import math
from sklearn import preprocessing
from sklearn import datasets
from sklearn.tree import plot_tree
from sklearn.tree import export_text
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
import sklearn
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
np.random.seed(1)
----------
Loading the digits dataset (classification).
The digits dataset has 1797 records.
Each record is an 8x8 image (64 dimensions), and there are 10 class labels for this dataset.
Each image (record) is labeled by the number it represents.
The intensities of the original pixels are binned to values ranging from 0 to 16.
-----------
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
----
#Learning about how the data is stored
print("X_shape", X.shape, "\ny_shape", y.shape)
print(X[0:2,]) #Check the values for the first two images
print(y[0:2]) #Print the class labels for the first two images
-----
#Show the first image
plt.gray()
plt.matshow(X[0,:].reshape(8,8)) #show the first image, first reshape the 64 values vector into an 8x8 matrix
plt.show()
------
#plotting the first 8 images
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(6,3))
for id, ax in enumerate(axes.flatten()):
    image = X[id,:].reshape(8,8)
    ax.set_axis_off()
    #ax.imshow(image, cmap=plt.cm.gray_r) #You can try this and comment the line below
    ax.imshow(image, cmap='gray')
    ax.set_title("Label: %i"% y[id], fontsize =9)
plt.tight_layout()
plt.show()
--------
# Split data into 70% train and 30% test subsets
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
print("Training Data",X_train.shape)
print("Testing Data",X_test.shape)
counts, bins = np.histogram(y_test)
print("Number of records in each class", counts)
plt.stairs(counts, bins)
-------
Q1-A Train a decision tree on the training data and report the training and testing accuracy of the decision tree.
Q1-B Plot the first 8 images in the testing dataset.
The title of each subfigure should be "True: label Predicted: label".
Q1-C Plot the first 8 images in the testing dataset that were misclassified.
The title of each subfigure should be "True: label Predicted: label".
Q1-D Print the classification report using classification_report from metrics in sklearn
Q1-E Plot the confusion matrix using ConfusionMatrixDisplay
Q1-F (5 points) Plot the decision tree using plot_tree
Q1-G Cross Validation
Report the accuracies for the 5-fold cross validation (use cv=5).
The cross validation method takes the decision tree model, the entire dataset, and the class labels.
For this line:
print("%0.2f accuracy with a standard deviation of %0.2f"%(scores.mean(), scores.std()))
this is a sample output
[0.80833333 0.71944444 0.79665738 0.82729805 0.79108635]
0.79 accuracy with a standard deviation of 0.04
Q1-H Random Forest Classifier
Train a random forest on X_train and report the accuracy on X_test
Use 100 trees in the random forest classifier. Recall the number of records in X_train (1257).
Fine-tune max_samples (try different numbers) for RandomForestClassifier
to achieve an accuracy higher than 91% (a big improvement from the 78%).
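
One possible way to approach the non-plotting parts of Q1 (A, C, D, G and H) is sketched below. It is only a starting point under stated assumptions, not a reference answer: random_state=0 and the list of candidate max_samples values are choices made here for repeatability and illustration, and nothing in this sketch guarantees that any of those values reaches the 91% target.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Same unshuffled 70/30 split as in the provided code.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)

# Q1-A: fit a decision tree and compare training and testing accuracy.
# random_state=0 is an assumption, added only so reruns build the same tree.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, y_pred)
print("Decision tree: train %0.2f, test %0.2f" % (train_acc, test_acc))

# Q1-C: positions (within the test set) of the first 8 misclassified images,
# together with the subplot titles the question asks for.
wrong = np.flatnonzero(y_pred != y_test)[:8]
titles = ["True: %i Predicted: %i" % (y_test[i], y_pred[i]) for i in wrong]
print(titles)

# Q1-D: per-class precision, recall and F1 on the test set.
print(classification_report(y_test, y_pred))

# Q1-G: 5-fold cross validation on the entire dataset.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)
print("%0.2f accuracy with a standard deviation of %0.2f"
      % (scores.mean(), scores.std()))

# Q1-H: random forest with 100 trees, trying several max_samples values.
# The candidate values are illustrative; the last one uses every training record.
rf_acc = {}
for max_samples in (100, 300, 600, 900, X_train.shape[0]):
    forest = RandomForestClassifier(
        n_estimators=100, max_samples=max_samples, random_state=0)
    forest.fit(X_train, y_train)
    rf_acc[max_samples] = accuracy_score(y_test, forest.predict(X_test))
    print("max_samples=%d: test accuracy %0.3f" % (max_samples, rf_acc[max_samples]))
```

The titles list can be passed to ax.set_title inside the same kind of 2x4 subplot loop used earlier for the first 8 images, reshaping X_test[i] to 8x8 for each index i in wrong.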
---------------------
Q2 Finding the best split using gini index
data = np.array([[1,2,3,1],[2,3,3,0],[3,2,2,1],[2,2,6,1],[1,2,5,1],[1,3,2,0],[2,3,6,0],[3,3,4,1]])
print("Values\n", data[:,:-1])
print("Class Label",data[:,-1])
n = data.shape[0]
d = data.shape[1]-1 #number of feature columns (the last column is the class label and is ignored)
---------
Q2-A (10 points) Write a function that computes the gini_index of a dataset D
Use math.pow(P_positive/n, 2) to calculate (P_positive/n)^2
If the data has zero records, the gini_index is zero.
The last column of the dataset is the class label.
#Write a function that computes the gini_index for a dataset
#1-((P_positive/n)^2+(P_negative/n)^2)
#use math.pow(P_positive/n,2) to calculate (P_positive/n)^2
#If the data has zero records, the gini_index is zero
#The last column of the dataset is the class label
def get_gini_index(D):
    n = D.shape[0]
    gini_index = "calculate it"
    #Write your code here
    return(gini_index)
print(get_gini_index(data)) #You should get 0.46875
-----
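
A minimal sketch of how get_gini_index could be filled in, following the formula 1-((P_positive/n)^2+(P_negative/n)^2) given in the comments and assuming the class label is binary (0 or 1). The helper split_gini is hypothetical and not part of the provided code; it only illustrates how the gini index might then be used to score a candidate split, as the Q2 title suggests.

```python
import math
import numpy as np

data = np.array([[1,2,3,1],[2,3,3,0],[3,2,2,1],[2,2,6,1],
                 [1,2,5,1],[1,3,2,0],[2,3,6,0],[3,3,4,1]])

def get_gini_index(D):
    """Gini index of D, whose last column is a binary (0/1) class label."""
    n = D.shape[0]
    if n == 0:
        # By the convention stated in the question, an empty dataset has gini 0.
        return 0.0
    p_positive = int(np.sum(D[:, -1] == 1))  # records with label 1
    p_negative = n - p_positive              # records with label 0
    return 1 - (math.pow(p_positive / n, 2) + math.pow(p_negative / n, 2))

def split_gini(D, col, threshold):
    """Hypothetical helper: weighted gini after splitting on D[:, col] <= threshold."""
    n = D.shape[0]
    if n == 0:
        return 0.0
    left = D[D[:, col] <= threshold]
    right = D[D[:, col] > threshold]
    # Each side contributes its gini index weighted by its share of the records.
    return ((left.shape[0] / n) * get_gini_index(left)
            + (right.shape[0] / n) * get_gini_index(right))

print(get_gini_index(data))    # the question states this should be 0.46875
print(split_gini(data, 1, 2))  # split on the second feature at value 2
```

A lower weighted gini means purer child nodes, so comparing split_gini across columns and thresholds is one way to look for the best split.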
