Question: You will be using Churn-Modeling.csv for this one. Here is a description of the variables. Some of the Variables are self-explanatory Variable Description Surname Last
You will be using Churn-Modeling.csv for this one. Here is a description of the variables. Some of the Variables are self-explanatory
| Variable | Description |
| Surname | Last Name of the customer |
| Tenure | Number of Years the customer has been with the bank |
| Balance | Account Balance in USD |
| NumOfProducts | Number of Banks products to which the customer subscribed |
| HasCrCard | 1 if the customer has a credit card with the bank 0 If the customer doesnt have a card with the bank |
| IsActiveMember | Whether customer made a transaction with the bank in the last 1 year |
| EstimatedSalary | The annual salary of the customer in USD |
| Exited | 1 If the customer decided to quit the bank 0 If the customer decided to stay with the bank |
Load and attach the Churn-Modeling.csv dataset into a variable mydata. Conduct an exploration of what variables it contains and what the values look like. How many rows does the dataset have? Convert the variables HasCrCard, IsActiveMember ,Gender and Exited into factors. What is the Surname of the customer with the highest EstimatedSalary? (1 Point)
Use CreditScore, Age and Balance to perform K-means clustering on the trees. You need to use the preProcess() and predict() functions in caret library to normalize the data in the range 0 to 1. Use K=3, 4 and 5. Which one is better (Use the ratio of inter cluster to intra cluster distance for this one)? (2 Points)
Plot the clusters using these variables for K=3, 4 and 5 i.e., three separate plots (Using CreditScore and Balance on X and Y axes and coloring them by the cluster number to which they belong). Which value of K do you think is better now? Use the Silhouette Method for this one (1.5 Points)
Use the table() command to tabulate the relationship between cluster assignment and the variable Exited for the best choice of K from the previous question. Do you see any clusters that contain predominantly Exited clusters. If so, describe the characteristics of customers in that cluster using the values of the variables used in clustering (Hint : Compare the centroid values of this cluster to the other clusters). (1.5 Points)
Use the variables Tenure, NumOfProducts and EstimatedSalary this time to run a k-means clustering algorithm (Use the scale() function instead of preprocess() and predict() this time). Run it for K=3,4, 5 and 6. Use the elbow method to determine the appropriate number of clusters this time (3 Points)
Use the table() command to tabulate the relationship between cluster assignment and the variable Exited for the best choice of K from the previous question. Do you see any clusters that contain predominantly Exited clusters. If so, describe the characteristics of customers in that cluster using the values of the variables used in clustering (Hint : Compare the centroid values of this cluster to the other clusters). (1 Point)
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
