3. (16 points) In this question we will explore the correlation between the features in the dataset credit_risk_dataset.csv. Load this dataset from shared/data/credit_risk_dataset.csv. More information about the data can be found here: https://www.kaggle.com/datasets/laotse/credit-risk-dataset/data

(a) (2 points) Check whether there are any missing values, i.e. NAs, in the data. For this, explore the dataframe.isna() function.
    i. Report the names of the columns that have NAs.
    ii. Drop all rows that have NAs.

(b) (2 points) Now we will analyze only a subset of the dataframe. Create a subset of the dataframe containing only the columns person_age, person_income, loan_amnt, loan_percent_income, cb_person_cred_hist_length.

(c) (4 points) Find the correlation between the columns in the data using dataframe.corr(). Pick a pair of covariates and interpret their correlation. Which two predictors are the most highly correlated? The least? Do these correlations make sense in context?

(d) (1 point) Using matplotlib.pyplot, draw a scatter plot with person_income on the X-axis and loan_amnt on the Y-axis.

(e) (3 points) Study the plot from Q.3(d).
    i. Do you identify any outliers?
    ii. If yes, then suggest a transformation of the data that would reduce the influence of those outliers.
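The steps in parts (a) to (e) can be sketched in pandas roughly as follows. This is a minimal sketch, not a model answer. It assumes the file and column names use underscores (credit_risk_dataset.csv, person_age, and so on), as on the Kaggle page. The helper names (columns_with_na, extreme_pairs, log_transform, plot_income_vs_loan) are hypothetical and introduced only for illustration. "Most" and "least" correlated are taken here to mean largest and smallest absolute correlation, and the log transform is only one common option for part (e)(ii).

```python
import numpy as np
import pandas as pd

# Column names for part (b); the underscores are an assumption based on the Kaggle page.
SUBSET_COLS = [
    "person_age",
    "person_income",
    "loan_amnt",
    "loan_percent_income",
    "cb_person_cred_hist_length",
]


def columns_with_na(df: pd.DataFrame) -> list:
    """(a)(i) Names of the columns that contain at least one NA."""
    na_counts = df.isna().sum()  # number of NAs per column
    return na_counts[na_counts > 0].index.tolist()


def extreme_pairs(corr: pd.DataFrame):
    """(c) Pairs with the largest and smallest absolute correlation."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().abs().sort_values()
    return pairs.index[-1], pairs.index[0]


def log_transform(df: pd.DataFrame, cols) -> pd.DataFrame:
    """(e)(ii) One possible transformation: log(1 + x) on the chosen columns."""
    out = df.copy()
    out[cols] = np.log1p(out[cols])
    return out


def plot_income_vs_loan(df: pd.DataFrame):
    """(d) Scatter plot of person_income (X) against loan_amnt (Y)."""
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it

    fig, ax = plt.subplots()
    ax.scatter(df["person_income"], df["loan_amnt"], s=5, alpha=0.5)
    ax.set_xlabel("person_income")
    ax.set_ylabel("loan_amnt")
    return fig
```

A possible way to chain these, assuming the path from the question: read the file with pd.read_csv("shared/data/credit_risk_dataset.csv"), call columns_with_na(df), drop incomplete rows with df.dropna(), select df[SUBSET_COLS], compute subset.corr(), and pass the result to extreme_pairs. Calling plot_income_vs_loan on the subset before and after log_transform(subset, ["person_income", "loan_amnt"]) shows how much the transformation compresses the extreme incomes.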
