Question: Part 3: Clustering

Part 3: Clustering: This part is concerned with the file:
/DataMining/data/arff/UCI/credit-g.arff.
This part clusters the credit-g data from Part 1. Use only the attributes duration, age, credit amount and job. The aim is to determine the number of clusters in the data and assess whether any of the clusters are meaningful.
1. Run the K-means clustering algorithm on this data for the following values of K: 1,2,3,4,5,10,20. Analyse the resulting clusters. What do you conclude? Provide your reasoning.
2. Choose a value of K and run the algorithm with different seeds. What is the effect of changing the seed? Provide your explanation.
3. Run the EM algorithm on this data with the default parameters and describe the output and your analysis.
4. The EM algorithm can be quite sensitive to whether the data is normalized or not. Use the Weka Normalize filter
(Preprocess --> Filter --> unsupervised --> attribute --> Normalize)
to normalize the numeric attributes. What difference does this make to the clustering runs? Provide your reasoning.
5. The EM algorithm can be quite sensitive to the values of minLogLikelihoodImprovementCV, minStdDev and minLogLikelihoodImprovementIterating. Explore the effect of changing these values. What do you conclude?
6. How many clusters do you think are in the data? Give a plain English language description of one of them.
7. Compare the use of K-means and EM for these clustering tasks. Which do you think is best? Why?
8. What golden nuggets did you find, if any?
Report length: up to one page.
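
For readers who want to see what tasks 1 and 2 are probing, here is a minimal sketch in plain Python rather than Weka. It is not the assignment's method and does not reproduce Weka's SimpleKMeans. It runs a basic Lloyd's K-means on synthetic data, rescales the attributes to [0, 1] first, and reports the within-cluster sum of squared errors (SSE) for several values of K and several seeds. The helper names (`make_toy_data`, `min_max_normalize`, `kmeans`) and the toy data are assumptions made purely for illustration. The real credit-g file is not read.

```python
import random


def make_toy_data(seed=0, per_group=50):
    """Synthetic stand-in for (duration, age, credit amount, job).

    Hypothetical data for illustration only; NOT the real credit-g values.
    """
    rng = random.Random(seed)
    centres = [(12, 25, 1500, 1), (24, 40, 4000, 2), (48, 55, 9000, 3)]
    rows = []
    for dur, age, amount, job in centres:
        for _ in range(per_group):
            rows.append([
                rng.gauss(dur, 3), rng.gauss(age, 4),
                rng.gauss(amount, 400), rng.gauss(job, 0.3),
            ])
    return rows


def min_max_normalize(rows):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, seed, max_iter=100):
    """Basic Lloyd's K-means; returns (centroids, within-cluster SSE)."""
    rng = random.Random(seed)
    # The seed only controls which rows become the initial centroids.
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(max_iter):
        # Assignment step: attach each row to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda j: sq_dist(row, centroids[j]))
            clusters[nearest].append(row)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min(sq_dist(row, c) for c in centroids) for row in rows)
    return centroids, sse


if __name__ == "__main__":
    data = min_max_normalize(make_toy_data())
    for k in (1, 2, 3, 4, 5, 10, 20):
        _, sse = kmeans(data, k, seed=10)
        print(f"K={k:2d}  SSE={sse:.3f}")
    for seed in (1, 2, 3):
        _, sse = kmeans(data, 3, seed=seed)
        print(f"K=3 seed={seed}  SSE={sse:.3f}")
```

SSE typically falls as K grows, so a low SSE alone does not show that a clustering is meaningful. A common heuristic is to look for the K beyond which the drop flattens. Different seeds pick different starting centroids, so the algorithm can settle in different local optima and report different SSE values for the same K.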
