3. (16 points) In this question we will explore the correlation between the features in the dataset credit_risk_dataset.csv. Load this dataset from shared/data/credit_risk_dataset.csv. More information about the data can be found here: https://www.kaggle.com/datasets/laotse/credit-risk-dataset/data

(a) (2 points) Check whether there are any missing values, i.e. NAs, in the data. For this, explore the dataframe.isna() function.
    i. Report the names of the columns that have NAs.
    ii. Drop all rows that have NAs.

(b) (2 points) Now we will analyze only a subset of the dataframe. Create a subset of the dataframe containing only the columns person_age, person_income, loan_amnt, loan_percent_income, cb_person_cred_hist_length.

(c) (4 points) Find the correlation between the columns in the data using dataframe.corr(). Pick a pair of covariates and interpret their correlation. Which two predictors are the most highly correlated? The least? Do these correlations make sense in context?

(d) (1 point) Using matplotlib.pyplot, draw a scatter plot with person_income on the X-axis and loan_amnt on the Y-axis.

(e) (3 points) Study the plot from Q.3(d).
    i. Do you identify any outliers?
    ii. If yes, suggest a transformation of the data that would reduce the influence of those outliers.
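Parts (a) through (e) could be approached along the following lines. This is only a minimal sketch, not a verified solution, and it rests on several assumptions: the column names and file name use underscores as in the Kaggle dataset; the helper names (columns_with_na, extreme_pairs, plot_income_vs_loan, run_all) are made up for illustration; "most" and "least" correlated are read as largest and smallest absolute correlation; and a log10 transform of person_income is just one common candidate for part (e)(ii). No results are shown, since they depend on the actual data.

```python
import numpy as np
import pandas as pd

# Assumed column names (underscored, as in the Kaggle credit risk dataset).
SUBSET_COLS = [
    "person_age",
    "person_income",
    "loan_amnt",
    "loan_percent_income",
    "cb_person_cred_hist_length",
]


def columns_with_na(df):
    """(a)(i) Names of the columns that contain at least one NA."""
    return df.columns[df.isna().any()].tolist()


def extreme_pairs(corr):
    """(c) Pairs with the largest and smallest absolute correlation."""
    # Keep only the strict upper triangle, so each pair appears once
    # and the diagonal of 1.0s is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.abs().where(mask).stack().dropna()
    return pairs.idxmax(), pairs.idxmin()


def plot_income_vs_loan(df):
    """(d) Raw scatter plot, and (e)(ii) the same plot with log10 income."""
    # Imported here so the non-plotting steps do not need matplotlib.
    import matplotlib.pyplot as plt

    fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    ax_raw.scatter(df["person_income"], df["loan_amnt"], s=5, alpha=0.3)
    ax_raw.set_xlabel("person_income")
    ax_raw.set_ylabel("loan_amnt")
    # Assumption: incomes are strictly positive, so log10 is defined.
    ax_log.scatter(np.log10(df["person_income"]), df["loan_amnt"], s=5, alpha=0.3)
    ax_log.set_xlabel("log10(person_income)")
    plt.show()


def run_all(path="shared/data/credit_risk_dataset.csv"):
    """Run parts (a) to (e) in order on the file at `path`."""
    df = pd.read_csv(path)
    print("Columns with NAs:", columns_with_na(df))   # (a)(i)
    df = df.dropna()                                  # (a)(ii)
    subset = df[SUBSET_COLS]                          # (b)
    corr = subset.corr()                              # (c) Pearson by default
    print(corr)
    most, least = extreme_pairs(corr)
    print("Most correlated pair:", most)
    print("Least correlated pair:", least)
    plot_income_vs_loan(subset)                       # (d), (e)
```

Calling run_all() from a directory that contains shared/data/credit_risk_dataset.csv prints the NA columns and the correlation matrix, then shows the two scatter plots side by side. Points that sit far to the right in the raw panel are the candidates to discuss in (e)(i); whether the log scale pulls them in enough is a judgment to make from the plot itself.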
