Question: Breast Cancer Data Analysis

Q2-1 Finding a good k (25 points)

The 'breast cancer.txt' data set has 699 samples with 9 features. Each sample is classified into 2 classes (index '2' for benign and index '4' for malignant). Note that the class information is in the last column of the data file.

i) Apply k-means clustering with various k = (1, 2, 3, ..., 8) and find the corresponding J for each k (for J, refer to HW2). (15 pt)
ii) Apply hierarchical clustering. (15 pt)

Make sure that you ignore the last column when you run the clustering methods. For additional information about the data, refer to the included Excel file or visit https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

Q2-2 Comparison of k-means and GMM (30 points)

Since you know the actual class of the data, calculate the ground-truth accuracy P for k-means and GMM. Note that the number of clusters is fixed at 2.

P = (number of correctly predicted samples) / (total number of samples)    (1)

Hint) Use the 'predict' function for GMM and the 'fit_predict' function for k-means in the scikit-learn package.
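One possible way to approach both parts with scikit-learn is sketched below. It is only an outline under stated assumptions, not a verified solution: HW2 is not available here, so J is assumed to be the within-cluster sum of squared distances (which scikit-learn exposes as `inertia_`); Ward linkage is an arbitrary choice for the hierarchical step; and the helper names (`kmeans_objective_curve`, `hierarchical_labels`, `two_cluster_accuracy`, `compare_kmeans_gmm`) are hypothetical. Because cluster ids (0/1) are arbitrary and do not line up with the class indices (2/4), the accuracy helper tries both possible assignments and keeps the better one.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture


def kmeans_objective_curve(X, ks=range(1, 9), seed=0):
    """Return {k: J} for each k, assuming J is the within-cluster sum of squares."""
    curve = {}
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        curve[k] = float(model.inertia_)
    return curve


def hierarchical_labels(X, n_clusters=2):
    """Agglomerative (bottom-up) clustering; Ward linkage is an assumed choice."""
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    return model.fit_predict(X)


def two_cluster_accuracy(y_true, cluster_ids):
    """P = correct / total for two clusters, under the better of the two label mappings."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    class_a, class_b = np.unique(y_true)  # e.g. 2 (benign) and 4 (malignant)
    direct = np.where(cluster_ids == 0, class_a, class_b)
    flipped = np.where(cluster_ids == 0, class_b, class_a)
    return float(max(np.mean(direct == y_true), np.mean(flipped == y_true)))


def compare_kmeans_gmm(X, y, seed=0):
    """Accuracy P of k-means and GMM with the number of clusters fixed at 2."""
    km_ids = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(X)
    gmm_ids = gmm.predict(X)
    return {
        "kmeans": two_cluster_accuracy(y, km_ids),
        "gmm": two_cluster_accuracy(y, gmm_ids),
    }
```

To use this on the assignment data, X would be every column except the last and y the last column, for example `X, y = data[:, :-1], data[:, -1]` after loading the file into a numeric array. How the file should be loaded depends on its delimiter, which is not stated in the question; if the file keeps an ID column or '?' placeholders for missing values, as the original UCI file does, those would need to be handled before clustering.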
