Question 1:
# partition the data into training (60%) and validation (40%) sets
predictors = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT']
outcome = 'MEDV'

# partition the data
# Create a dataframe called X with the columns in the predictors[] list above
# Make sure to turn text columns (categorical) values into dummy variable columns
# MISSING 1 line of code

# Create a dataframe (technically a Series) called y containing the outcome column
# MISSING 1 line of code

# Split the data into 40/60 validation and training datasets with a random state of 1
# MISSING 1 line of code
print('Training set:', train_X.shape, 'Validation set:', valid_X.shape)
output:
Training set: (303, 12) Validation set: (203, 12)
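One possible way to fill in the three missing lines, given only as a sketch: it assumes the BostonHousing data has already been read into a pandas DataFrame named housing_df (a name not shown in the question) and uses pandas get_dummies together with scikit-learn's train_test_split.

import pandas as pd
from sklearn.model_selection import train_test_split

# housing_df is assumed to be the already-loaded BostonHousing DataFrame
# X: the predictor columns, with categorical (text) columns turned into dummy variables
X = pd.get_dummies(housing_df[predictors], drop_first=True)

# y: the outcome column as a Series
y = housing_df[outcome]

# 60% training / 40% validation split with random_state=1
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)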
Question 2:
# backward elimination
def train_model(variables):
    model = LinearRegression()
    model.fit(train_X[variables], train_y)
    return model

def score_model(model, variables):
    return AIC_score(train_y, model.predict(train_X[variables]), model)
# Run the backward_elimination function
# MISSING 1 line of code
print("Best Subset:", best_variables)
output:
Variables: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT
Start: score=1807.23
Step: score=1805.30, remove AGE
Step: score=1803.57, remove INDUS
Step: score=1803.57, remove None
Best Subset: ['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT']
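One possible completion of the missing line, assuming backward_elimination is the helper from the dmba package (the same package AIC_score is typically imported from in this exercise); it returns the fitted model together with the list of retained variables.

from dmba import backward_elimination  # assumed source of the helper

# run backward elimination over all training predictors, printing each step
best_model, best_variables = backward_elimination(train_X.columns, train_model, score_model, verbose=True)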
Question 3:

# forward selection
# The initial model is the constant model - this requires special handling in train_model and score_model
# Write the train_model function (starting with "def")
# MISSING 6 lines of code
def .......

# Write the score_model function (starting with "def")
# MISSING 4 lines of code
def .....

# Run the forward_selection function
# MISSING 1 line of code
print("Best Subset:", best_variables)
output:
Variables: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT
Start: score=2191.75, constant
Step: score=1934.91, add LSTAT
Step: score=1874.18, add RM
Step: score=1842.54, add PTRATIO
Step: score=1837.69, add CHAS
Step: score=1835.00, add NOX
Step: score=1817.90, add DIS
Step: score=1811.82, add ZN
Step: score=1810.16, add CRIM
Step: score=1808.01, add RAD
Step: score=1803.57, add TAX
Step: score=1803.57, add None
Best Subset: ['LSTAT', 'RM', 'PTRATIO', 'CHAS', 'NOX', 'DIS', 'ZN', 'CRIM', 'RAD', 'TAX']
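A sketch of how the two missing functions and the selection call could look, assuming forward_selection and AIC_score come from the dmba package: the empty variable list stands for the constant model, which is handled as a special case and scored against the training mean.

from dmba import forward_selection, AIC_score  # assumed source of the helpers
from sklearn.linear_model import LinearRegression

def train_model(variables):
    # the constant model has no predictors to fit
    if len(variables) == 0:
        return None
    model = LinearRegression()
    model.fit(train_X[variables], train_y)
    return model

def score_model(model, variables):
    # score the constant model by predicting the training mean (one parameter)
    if len(variables) == 0:
        return AIC_score(train_y, [train_y.mean()] * len(train_y), model, df=1)
    return AIC_score(train_y, model.predict(train_X[variables]), model)

# run forward selection over all training predictors, printing each step
best_model, best_variables = forward_selection(train_X.columns, train_model, score_model, verbose=True)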
Question 4:

# stepwise (both) method
# Run the stepwise_selection function
# MISSING 1 line of code
print("Best Subset:", best_variables)
output:
Variables: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT
Start: score=2191.75, constant
Step: score=1934.91, add LSTAT
Step: score=1874.18, add RM
Step: score=1842.54, add PTRATIO
Step: score=1837.69, add CHAS
Step: score=1835.00, add NOX
Step: score=1817.90, add DIS
Step: score=1811.82, add ZN
Step: score=1810.16, add CRIM
Step: score=1808.01, add RAD
Step: score=1803.57, add TAX
Step: score=1803.57, unchanged None
Best Subset: ['LSTAT', 'RM', 'PTRATIO', 'CHAS', 'NOX', 'DIS', 'ZN', 'CRIM', 'RAD', 'TAX']
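The single missing line, as a sketch that assumes stepwise_selection comes from the dmba package and reuses the train_model and score_model functions defined for forward selection.

from dmba import stepwise_selection  # assumed source of the helper

# run stepwise (both-directions) selection, printing each step
best_model, best_variables = stepwise_selection(train_X.columns, train_model, score_model, verbose=True)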
Question 5:

# Re-run the regression, but this time fit the model with the best subset of variables from the reductions above

# Define the outcome and predictor variables
outcome = 'MEDV'
predictors = ['LSTAT', 'RM', 'PTRATIO', 'CHAS', 'NOX', 'DIS', 'ZN', 'CRIM', 'RAD', 'TAX']

# Create a dataframe called X containing the new predictor columns
# MISSING 1 line of code

# Create a dataframe (Series) called y containing the outcome column
# MISSING 1 line of code

# fit the regression model y on X
# MISSING 2 lines of code

# print the intercept
# MISSING 1 line of code

# print the predictor column names and the coefficients
# MISSING 1 line of code

# print performance measures (training set)
print(" Model performance on training data:")
# MISSING 1 line of code

# predict prices in validation set, print first few predicted/actual values and residuals
# MISSING 1 line of code
result = pd.DataFrame({'Predicted': house_lm_pred, 'Actual': valid_y, 'Residual': valid_y - house_lm_pred})
# print performance measures (validation set)
print(" Model performance on validation data:")
# MISSING 1 line of code
output:
intercept  38.95615649828231

  Predictor  coefficient
0     LSTAT    -0.514444
1        RM     3.480964
2   PTRATIO    -0.804964
3      CHAS     2.359986
4       NOX   -17.866926
5       DIS    -1.438596
6        ZN     0.066221
7      CRIM    -0.114137
8       RAD     0.262455
9       TAX    -0.011166
Model performance on training data:
Regression statistics
                      Mean Error (ME) : -0.0000
       Root Mean Squared Error (RMSE) : 4.5615
            Mean Absolute Error (MAE) : 3.1662
          Mean Percentage Error (MPE) : -3.4181
Mean Absolute Percentage Error (MAPE) : 16.4898
Model performance on validation data:
Regression statistics
                      Mean Error (ME) : -0.0393
       Root Mean Squared Error (RMSE) : 5.0771
            Mean Absolute Error (MAE) : 3.5746
          Mean Percentage Error (MPE) : -5.1561
Mean Absolute Percentage Error (MAPE) : 16.9733
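A sketch of how the missing lines in Question 5 could be filled in. It assumes the same housing_df DataFrame as in Question 1, that the 60/40 split is re-created with random_state=1 so the training and validation rows match the earlier partition (the question does not show how the earlier split is reused), and that regressionSummary comes from the dmba package.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from dmba import regressionSummary  # assumed source of the helper

# housing_df is assumed to be the already-loaded BostonHousing DataFrame
# X: the reduced predictor set; y: the outcome column
X = pd.get_dummies(housing_df[predictors], drop_first=True)
y = housing_df[outcome]

# re-create the 60/40 partition with random_state=1 (assumption: rows match the earlier split)
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)

# fit the regression model on the training data
house_lm = LinearRegression()
house_lm.fit(train_X, train_y)

# intercept and coefficients
print('intercept', house_lm.intercept_)
print(pd.DataFrame({'Predictor': X.columns, 'coefficient': house_lm.coef_}))

# performance measures on the training set
print(" Model performance on training data:")
regressionSummary(train_y, house_lm.predict(train_X))

# predicted prices, actual values, and residuals for the validation set
house_lm_pred = house_lm.predict(valid_X)
result = pd.DataFrame({'Predicted': house_lm_pred, 'Actual': valid_y,
                       'Residual': valid_y - house_lm_pred})
print(result.head())

# performance measures on the validation set
print(" Model performance on validation data:")
regressionSummary(valid_y, house_lm_pred)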
