K-Nearest Neighbours Classifier Now we can start building the actual machine learning model. There are many...
Fantastic news! We've Found the answer you've been seeking!
Question:
Transcribed Image Text:
K-Nearest Neighbours Classifier Now we can start building the actual machine learning model. There are many classification algorithms in scikit-learn that we could use. Here we will use a k-nearest neighbors classifier, which is easy to understand. Building this model only consists of storing the training set. To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point. Then it assigns the label of this training point to the new data point. Out [4]: ● All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator classes. The k-nearest neighbors classification algorithm is implemented in the KNeighbors Classifier class in the neighbors module. Before we can use the model, we need to instantiate the class into an object. This is when we will set any parameters of the model. The most important parameter of KNeighbors Classifier is the number of neighbors (i.e., K), which we will set to 1 for our first exploration. Model Training: To build the model on the training set, we call the 'fit' method of the knn object, which takes as arguments the NumPy array X_train containing the training data and the NumPy array y_train of the corresponding training labels. In [4] # Import the KNN classifier from sklearn.neighbors import KNeighborsClassifier # Build a KNN classifier model clf_knn KNeighbors Classifier(n_neighbors=1) # Train the model with the training data clf_knn.fit (X_train, y_train) KNeighbors Classifier KNeighbors Classifier (n_neighbors=1) Prediction: We can now make predictions using this model on new data for which we might not know the correct labels. Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. What species of iris would this be? We can put this data into a NumPy array, again by calculating the shape-that is, the number of samples (1) multiplied by the number of features (4): In [5]: # Produce the features of a testing data instance X_new np.array([[5, 2.9, 1, 0.2]]) print("X_new.shape: {}".format(X_new.shape)) #Predict the result label of X_new: y_new_pred = clf_knn.predict (X_new) print("The predicted class is: \n", y_new_pred) X_new.shape: (1, 4) The predicted class is: [0] Our model predicts that this new iris belongs to the class 0, meaning its species is setosa. But how do we know whether we can trust our model? We don't know the correct species of this sample, which is the whole point of building the model! Evaluating Model: This is where the test set that we created earlier comes in. This data was not used to build the model, but we do know what the correct species is for each iris in the test set. So, we can use the trained model to predict these data instances and calculate the accuracy to evaluate how good the model is. Task 1 Write code to calculate the accuracy score In [1] # [Your code here ...] Parameter Tuning with Cross Validation (CV) In this section, we'll explore a CV method that can be used to tune the hyperparameter K using the above training and test data. Scikit-learn comes in handy with its cross_val_score method. We specifiy that we are performing 10 folds with the cv=KFold(n_splits=10, shuffle=True) parameter and that our scoring metric should be accuracy since we are in a classification setting. In each iteration, the training data take 90% of the total data while testing data takes 10%. The average on the accuracies reported from each iteration will make the testing accuracy more robust than just a single split of the data. Manual tuning with cross validation: Plot the misclassification error versus K. You need to figure out the possible values of K. If the number of possible values is too big, you can take some values with a certain step, e.g., K = 1, 5, 10, ... with a step of 5. In [8]: from sklearn.model_selection import cross_val_score, KFold import matplotlib.pyplot as plt cv_scores = [] cv_scores_std = [] k_range= range(1, 135, 5) for i in k_range: clf KNeighbors Classifier(n_neighbors scores = cross_val_score(clf, iris_data.data, iris_data.target, scoring='accuracy', cv=KFold (n_splits=10, shuffl cv_scores.append(scores.mean()) cv_scores_std.append(scores.std()) = i) #Plot the relationship plt.errorbar(k_range, cv_scores, yerr=cv_scores_std, marker='x', label='Accuracy') plt.ylim( [0.1, 1.1]) plt.xlabel('$K$') plt.ylabel('Accuracy') plt. legend (loc='best') plt.show() It can be seen that the accuracy first goes up when K increases. It peeks around 15. Then, it keeps going down. Particularly, the performance (measured by the score mean) and its robustness/stableness (measured by the score std) drop substantially around K=85. One possible reason is that when K is bigger than 85, the model suffers from the underfitting issue severely. Automated Parameter Tuning: Use the GridSearchCV method to accomplish automatic model selection. Task 2 Check against the figure plotted above to see if the selected hyperparameter K can lead to the highest misclassification accuracy. In [2] # [Your code here ...] Task 3 It can be seen that GridSearchCV can help us to the automated hyperparameter tuning. Actually, it also store the intermediate results during the search process. The attribute 'cv_results_' of GridSearchCV contains much such informaiton. For example, this attribute contains the 'mean_test_score' and 'std_test_score' for the cross validation. Make use of this information to produce a plot similar to what we did in the manual way. Please check if the two plots comply with each other. In [3]: # [Your code here ..] 2. Naive Bayes Classifier Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Bayes'theorem states the following relationship, given class variable y and dependent feature vector x₁ through xn, P(y | x₁, Using the naive conditional independence assumption, we have ,xn) P(y)P(x₁, . xn | y) P(x₁,...,xn) n P(y | x₁, ..., Xn) × P(y)¶¶P(xi | y) ∞ i=1 n ŷ = arg max P(y) | P(x₁ | y), y i=1 Then, we can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(x; | y); the former is then the relative frequency of class y in the training set. References: H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS. P(x¡ | y) = = Gaussian Naive Bayes Gaussian NB implements the Gaussian Naive Bayes algorithm for classification on the data sets where features are continuous. The likelihood of the features is assumed to be Gaussian: 1 2πσ, P ( - (* ₁ = 1,)²2) 203 exp The parameters oy and μy are estimated using maximum likelihood. Demo: In this demo, we show how to build a Gaussian Naive Bayes classifier. In [31] import pandas as pd from sklearn.datasets import make_classification from sklearn.naive_bayes import GaussianNB import warnings warnings.filterwarnings("ignore") In [16]: # Generate a synthetica 2D dataset X, y = make_classification(n_samples=50, n_features=2, n_informative=2, n_redundant=0, n_classes=3, n_clusters_per_class=1, weights=None, flip_y=0.01, class_sep=0.5, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=42) # Data split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42) # Visualize the generated data colors = ['blue', 'yellow', 'green'] for i, color in enumerate (colors): plt.scatter (X_train[y_train i, 0], X_train [y_train i, 1], c-color) plt.scatter (X_test[:, 0], X_test[:,1], c='red', marker='x', label='Testing Data') plt. legend (loc='best') plt.show() == Task 4 Given the training data generated as follows: In [21]: X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) np.array([1, 1, 1, 2, 2, 2]) y = # Firstly, let's do the parameter estimation manually without using the model |X_0_C_1=X [y==1] [:,0] |X_1_C_1=X [y==1] [:,1] X_0_C_2=X [y==2][:,0] X_1_C_2=X [y==2][:,1] manual_means = np.array([[X_0_C_1.mean(), X_1_C_1.mean()], [X_0_C_2.mean(), X_1_C_2.mean()]]) np.set_printoptions(precision=4) print('Means estaimated manually: \n', manual_means) manual_vars = np.array([[X_0_C_1.var(), X_1_C_1.var()], [X_0_C_2.var(), X_1_C_2.var()]]) print('Variances estaimated manually: \n', manual_vars) Means estaimated manually: [[-2. -1.3333] [ 2. 1.3333]] Variances estaimated manually: [[0.6667 0.2222] [0.6667 0.2222]] Train a GaussianNB model and print out the learned model parameters (parameters of probability distributions). And check if the learned parameters comply with the manually estimated ones as shown above. Predict the label of a data [-0.8,-1]. In [4]: # [Your code here ...] Task 5 Given the training data generated as follows. The number of data instances (6) is small while the demensionality of the data is relatively highly (100). In [40]: X = np. random.randint (5, size=(6, 100)) y np.array([1, 2, 3, 4, 5, 6]) Train a MultinomialNB model, and predict the label of a data X_new = [[1,2,1,0,2,3,0,3,2,1,1,3,3,0,4,2,2,0,0,2,2,3,4,4,4,4,0,3,3, 1,1,1,2,3,1,3,0,2,2,0,4,2,4,3,2,0,1,1,1,2,3,0,0,3,4,3,3,4, 2, 1,0,0,0,0,4,1,2,0,0,4,4,0,4,1,3,1,1,1,3,1,1,1,4,3,1,1,3, 2,0,0,0,3,4,1,1,4,3,2,3,4]]: In [6]: # [Your code here ...] In our lecture, we discussed that if there is no occurence of some feature values, zero probabilities will appear. To overcome this issue, Laplace correction (smoothing) is proposed, as shown in the follow formula. In the MultinomialNB implementation, the parameter 'alpha' controls the way we apply smoothing. The default value is 'alpha=1.0'. Please create and train a model with no Laplace smoothing for the above data set. Compare the leaned model parameters (probabilities) with the case 'alpha=1', by checking if there are zero probabilities (note that due to the accuracy issue, zero might be represented as a signficantly small number by the computer). In [7]: # [Your code here ...] p(xyily) = Nyi + a Ny + an In [ ]: ● In [ ]: Comparasion on Iris data In [8]: # [Your code here Task 6 Compare the prediction accuaracy between KNN clasifier (use the optimal K you've identied) and Gaussian Naive Bayes. Use 10-cross validation to report the accuracy mean and standard deviation (Note this is to ensure the comparison is based on robust performace). Which classifidation mdoel is more accurate on Iris data set? Use t-test to show if the difference is statistically significant. .] K-Nearest Neighbours Classifier Now we can start building the actual machine learning model. There are many classification algorithms in scikit-learn that we could use. Here we will use a k-nearest neighbors classifier, which is easy to understand. Building this model only consists of storing the training set. To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point. Then it assigns the label of this training point to the new data point. Out [4]: ● All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator classes. The k-nearest neighbors classification algorithm is implemented in the KNeighbors Classifier class in the neighbors module. Before we can use the model, we need to instantiate the class into an object. This is when we will set any parameters of the model. The most important parameter of KNeighbors Classifier is the number of neighbors (i.e., K), which we will set to 1 for our first exploration. Model Training: To build the model on the training set, we call the 'fit' method of the knn object, which takes as arguments the NumPy array X_train containing the training data and the NumPy array y_train of the corresponding training labels. In [4] # Import the KNN classifier from sklearn.neighbors import KNeighborsClassifier # Build a KNN classifier model clf_knn KNeighbors Classifier(n_neighbors=1) # Train the model with the training data clf_knn.fit (X_train, y_train) KNeighbors Classifier KNeighbors Classifier (n_neighbors=1) Prediction: We can now make predictions using this model on new data for which we might not know the correct labels. Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. What species of iris would this be? We can put this data into a NumPy array, again by calculating the shape-that is, the number of samples (1) multiplied by the number of features (4): In [5]: # Produce the features of a testing data instance X_new np.array([[5, 2.9, 1, 0.2]]) print("X_new.shape: {}".format(X_new.shape)) #Predict the result label of X_new: y_new_pred = clf_knn.predict (X_new) print("The predicted class is: \n", y_new_pred) X_new.shape: (1, 4) The predicted class is: [0] Our model predicts that this new iris belongs to the class 0, meaning its species is setosa. But how do we know whether we can trust our model? We don't know the correct species of this sample, which is the whole point of building the model! Evaluating Model: This is where the test set that we created earlier comes in. This data was not used to build the model, but we do know what the correct species is for each iris in the test set. So, we can use the trained model to predict these data instances and calculate the accuracy to evaluate how good the model is. Task 1 Write code to calculate the accuracy score In [1] # [Your code here ...] Parameter Tuning with Cross Validation (CV) In this section, we'll explore a CV method that can be used to tune the hyperparameter K using the above training and test data. Scikit-learn comes in handy with its cross_val_score method. We specifiy that we are performing 10 folds with the cv=KFold(n_splits=10, shuffle=True) parameter and that our scoring metric should be accuracy since we are in a classification setting. In each iteration, the training data take 90% of the total data while testing data takes 10%. The average on the accuracies reported from each iteration will make the testing accuracy more robust than just a single split of the data. Manual tuning with cross validation: Plot the misclassification error versus K. You need to figure out the possible values of K. If the number of possible values is too big, you can take some values with a certain step, e.g., K = 1, 5, 10, ... with a step of 5. In [8]: from sklearn.model_selection import cross_val_score, KFold import matplotlib.pyplot as plt cv_scores = [] cv_scores_std = [] k_range= range(1, 135, 5) for i in k_range: clf KNeighbors Classifier(n_neighbors scores = cross_val_score(clf, iris_data.data, iris_data.target, scoring='accuracy', cv=KFold (n_splits=10, shuffl cv_scores.append(scores.mean()) cv_scores_std.append(scores.std()) = i) #Plot the relationship plt.errorbar(k_range, cv_scores, yerr=cv_scores_std, marker='x', label='Accuracy') plt.ylim( [0.1, 1.1]) plt.xlabel('$K$') plt.ylabel('Accuracy') plt. legend (loc='best') plt.show() It can be seen that the accuracy first goes up when K increases. It peeks around 15. Then, it keeps going down. Particularly, the performance (measured by the score mean) and its robustness/stableness (measured by the score std) drop substantially around K=85. One possible reason is that when K is bigger than 85, the model suffers from the underfitting issue severely. Automated Parameter Tuning: Use the GridSearchCV method to accomplish automatic model selection. Task 2 Check against the figure plotted above to see if the selected hyperparameter K can lead to the highest misclassification accuracy. In [2] # [Your code here ...] Task 3 It can be seen that GridSearchCV can help us to the automated hyperparameter tuning. Actually, it also store the intermediate results during the search process. The attribute 'cv_results_' of GridSearchCV contains much such informaiton. For example, this attribute contains the 'mean_test_score' and 'std_test_score' for the cross validation. Make use of this information to produce a plot similar to what we did in the manual way. Please check if the two plots comply with each other. In [3]: # [Your code here ..] 2. Naive Bayes Classifier Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Bayes'theorem states the following relationship, given class variable y and dependent feature vector x₁ through xn, P(y | x₁, Using the naive conditional independence assumption, we have ,xn) P(y)P(x₁, . xn | y) P(x₁,...,xn) n P(y | x₁, ..., Xn) × P(y)¶¶P(xi | y) ∞ i=1 n ŷ = arg max P(y) | P(x₁ | y), y i=1 Then, we can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(x; | y); the former is then the relative frequency of class y in the training set. References: H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS. P(x¡ | y) = = Gaussian Naive Bayes Gaussian NB implements the Gaussian Naive Bayes algorithm for classification on the data sets where features are continuous. The likelihood of the features is assumed to be Gaussian: 1 2πσ, P ( - (* ₁ = 1,)²2) 203 exp The parameters oy and μy are estimated using maximum likelihood. Demo: In this demo, we show how to build a Gaussian Naive Bayes classifier. In [31] import pandas as pd from sklearn.datasets import make_classification from sklearn.naive_bayes import GaussianNB import warnings warnings.filterwarnings("ignore") In [16]: # Generate a synthetica 2D dataset X, y = make_classification(n_samples=50, n_features=2, n_informative=2, n_redundant=0, n_classes=3, n_clusters_per_class=1, weights=None, flip_y=0.01, class_sep=0.5, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=42) # Data split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42) # Visualize the generated data colors = ['blue', 'yellow', 'green'] for i, color in enumerate (colors): plt.scatter (X_train[y_train i, 0], X_train [y_train i, 1], c-color) plt.scatter (X_test[:, 0], X_test[:,1], c='red', marker='x', label='Testing Data') plt. legend (loc='best') plt.show() == Task 4 Given the training data generated as follows: In [21]: X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) np.array([1, 1, 1, 2, 2, 2]) y = # Firstly, let's do the parameter estimation manually without using the model |X_0_C_1=X [y==1] [:,0] |X_1_C_1=X [y==1] [:,1] X_0_C_2=X [y==2][:,0] X_1_C_2=X [y==2][:,1] manual_means = np.array([[X_0_C_1.mean(), X_1_C_1.mean()], [X_0_C_2.mean(), X_1_C_2.mean()]]) np.set_printoptions(precision=4) print('Means estaimated manually: \n', manual_means) manual_vars = np.array([[X_0_C_1.var(), X_1_C_1.var()], [X_0_C_2.var(), X_1_C_2.var()]]) print('Variances estaimated manually: \n', manual_vars) Means estaimated manually: [[-2. -1.3333] [ 2. 1.3333]] Variances estaimated manually: [[0.6667 0.2222] [0.6667 0.2222]] Train a GaussianNB model and print out the learned model parameters (parameters of probability distributions). And check if the learned parameters comply with the manually estimated ones as shown above. Predict the label of a data [-0.8,-1]. In [4]: # [Your code here ...] Task 5 Given the training data generated as follows. The number of data instances (6) is small while the demensionality of the data is relatively highly (100). In [40]: X = np. random.randint (5, size=(6, 100)) y np.array([1, 2, 3, 4, 5, 6]) Train a MultinomialNB model, and predict the label of a data X_new = [[1,2,1,0,2,3,0,3,2,1,1,3,3,0,4,2,2,0,0,2,2,3,4,4,4,4,0,3,3, 1,1,1,2,3,1,3,0,2,2,0,4,2,4,3,2,0,1,1,1,2,3,0,0,3,4,3,3,4, 2, 1,0,0,0,0,4,1,2,0,0,4,4,0,4,1,3,1,1,1,3,1,1,1,4,3,1,1,3, 2,0,0,0,3,4,1,1,4,3,2,3,4]]: In [6]: # [Your code here ...] In our lecture, we discussed that if there is no occurence of some feature values, zero probabilities will appear. To overcome this issue, Laplace correction (smoothing) is proposed, as shown in the follow formula. In the MultinomialNB implementation, the parameter 'alpha' controls the way we apply smoothing. The default value is 'alpha=1.0'. Please create and train a model with no Laplace smoothing for the above data set. Compare the leaned model parameters (probabilities) with the case 'alpha=1', by checking if there are zero probabilities (note that due to the accuracy issue, zero might be represented as a signficantly small number by the computer). In [7]: # [Your code here ...] p(xyily) = Nyi + a Ny + an In [ ]: ● In [ ]: Comparasion on Iris data In [8]: # [Your code here Task 6 Compare the prediction accuaracy between KNN clasifier (use the optimal K you've identied) and Gaussian Naive Bayes. Use 10-cross validation to report the accuracy mean and standard deviation (Note this is to ensure the comparison is based on robust performace). Which classifidation mdoel is more accurate on Iris data set? Use t-test to show if the difference is statistically significant. .]
Expert Answer:
Answer rating: 100% (QA)
Based on the images you have provided it appears that you are working through a series of tasks related to machine learning classifiers specifically focusing on the KNearest Neighbors KNN Classifier a... View the full answer
Related Book For
Financial Accounting and Reporting a Global Perspective
ISBN: 978-1408076866
4th edition
Authors: Michel Lebas, Herve Stolowy, Yuan Ding
Posted Date:
Students also viewed these programming questions
-
The ABC team estimated that for the three pilot-test bank branches, the retail and business customer lines experienced the annual activity levels(inthousands) as shown in Exhibit C. For example,...
-
Compare and contrast cognitive psychology and behaviorism? what are some cognitive psychology and behaviorism potential uses in therapy? with examples. how might you use cognitive psychology (or...
-
K-Nearest Neighbours Classifier Now we can start building the actual machine learning model. There are many classification algorithms in scikit-learn that we could use. Here we will use a k-nearest...
-
Exercises 5-8: Sort the list of numbers from smallest to largest and display the result in a table. (a) Determine the maximum and minimum values. (b) Calculate the mean and median. Round each result...
-
The walls of a prison cell are perpendicular to the four cardinal compass directions. On the first day of spring, light from the rising Sun enters a rectangular window in the eastern wall. The light...
-
Valleydale Company incurred the following costs during the month: direct labor, $120,000; factory overhead, $108,000; and direct materials purchases, $160,000. Inventories were costed as follows:...
-
Use a stem-and-leaf plot to display the data, which represent the runs scored by a batsman in a World Cup series. Organize the data using the indicated type of graph. Describe any patterns. 70 75 71...
-
Vasquez Ltd. is a retailer operating in Edmonton, Alberta. Vasquez uses the perpetual inventory method. All sales returns from customers result in the goods being returned to inventory; the inventory...
-
1-What is the significance of these three matrices regarding their relevance for strategic planning specific for southwest airlines. 2- Describe the key information for each and how information from...
-
Given the following state for the Bankers Algorithm. 6 processes P0 through P5 4 resource types: A (15 instances); B (6 instances) C (9 instances); D (10 instances) Snapshot at time T0: a. Verify...
-
What this does this statement mean and how to disagree? Give proper reason. connection is largely contingent: it just so happens, given the particular distributions created by this era of global...
-
Let's consider a football sled of mass 200kg. When coach blows his whistle, a lineman applies a 2500N force for 5s. Assuming nothing stops the sled or slows it down, what would its final velocity be...
-
Provide an additional play-centered game to help develop letter awareness game and oral language.
-
On 1st January, 2015, Vinod drew and Pramod accepted a bill at three months for ~ 2,000. On 4th January, 2015 Vinod discounted the bill with his bank at 15% p.a. and remitted half the proceeds to...
-
The phase of the satellite is 45d and the phase of the receiver is 165d. The wave length is 500 cm. (a) What is the phase difference? (b) What is the corresponding wavelength for the phase difference
-
An aircraft weighs 139,209 lbs. If the acceleration is 7.47 ft/sec^2, what is the net force, in pounds, applied to the aircraft?
-
Let Aut (G) & Inn (G) denote the autmorphism of G and inner automorphism of G respectively. Then which of the following is correct? If G is abelian Aut G is abelian If G is non abelian then G can be...
-
Why is it important to understand the macro-environment when making decisions about an international retail venture?
-
Toyota Motor Corporation (Toyota) primarily conducts business in the automotive industry. Toyota also conducts business in finance and other industries. It is organized in three segments: automotive...
-
Britten Inc. is a large European civil engineering and construction enterprise. They have just finished building, for their own use, a large hangar, which will serve as both a warehouse for their...
-
Which of the following statements best describes the purpose of financial reporting? (Explain why you chose your option and why you rejected the others) (a) Provide a listing of an entitys resources...
-
For the composite properties and environmental conditions described in Examples 3.6, 4.7, and 5.3, determine the hygrothermally degraded values of the longitudinal and transverse tensile strengths....
-
The filament-wound E-glass/epoxy pressure vessel described in Example 4.4 is to be used in a hot-wet environment with temperature \(T=100^{\circ} \mathrm{F}\) \(\left(38^{\circ} \mathrm{C} ight)\)...
-
A carbon/epoxy lamina is clamped between rigid plates in a mold (Figure 5.17), while curing at a temperature of \(125^{\circ} \mathrm{C}\). After curing, the lamina/mold assembly (still clamped...
Study smarter with the SolutionInn App