Question: Based on the data we've collected, we would like to conduct a regression analysis and predict the Median Market Value of the houses
In [18]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')
In [19]:
import scipy.stats as stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
import datetime
In [20]:
boston_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/boston_housing.csv'
boston_df = pd.read_csv(boston_url)
boston_df = boston_df.drop(['Unnamed: 0'], axis=1)
Question 1: Build the Regression model
1. Split the data into train (80%) and test (20%) sets. Set the random seed = 2600 and show the shape of the two sets (5 points)
2. Define the dependent variable and independent variables in the train set (5 points)
3. Get the VIF of the independent variables (10 points)
4. From the VIF output, can you tell if there is a multicollinearity problem among the predictors? If yes, which predictors are involved? (5 points)
5. Build the regression model with the train set (10 points)
6. Which predictor is the most insignificant one? (5 points)
7. What is the impact of an additional weighted distance to the five Boston employment centres on the median market value of owner occupied homes? (10 points)
Question 2: Residual Analysis
1. Create the residual plot using the train set (4 plots in 1) (10 points)
2. Drop all insignificant predictors at the 0.05 level from the full model and build a reduced regression model (10 points)
Question 3: Model Performance Evaluation
1. Predict the value of Y in the test set using the full and reduced models (10 points)
2. Compute the RMSE of the full and reduced models using the test set (10 points)
3. Based on the Adjusted R squared and RMSE, which model has better performance? (5 points)
