6.1: a. Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for?

Answer: The data should be partitioned because we want to both build a model and get an honest estimate of how well it will perform. One partition is used to build the model, which captures the relationship between the predictor variables and the target variable; the other is used to validate the model's predictions.

The training dataset is used to build (fit) the model: the algorithm "discovers" the model from this data. For example, in a linear regression the training dataset is used to fit the model, i.e. to compute the regression coefficients; in a neural network model it is used to obtain the network weights.

Once a model is built on training data, we need to measure its accuracy on unseen data. For this, the model should be applied to a dataset that was not used in the training process, one where the actual value of the target variable is known. The discrepancy between the actual value and the predicted value of the target variable is the prediction error, and some form of average error (MSE or average % error) measures the overall accuracy of the model. To get a realistic estimate of how the model would perform on unseen data, we set aside part of the original data and exclude it from the training process. This set-aside portion is the validation dataset. After fitting the model on the training dataset, we test its performance on the validation dataset: the model makes predictions for data that were not used to fit it, which gives an unbiased estimate of how well the model performs, and we compute measures of error that reflect the prediction accuracy. (A code sketch of this workflow appears after part c below.)

Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. The model output is included in Vamshi_xlminer(a).

b. Write the equation for predicting the median house price from the predictors in the model.

Answer: The regression output is:

Input variables   Coefficient     Std. Error   p-value      SS
Constant term     -48.48129654    3.31145191   0            108214.8203
CRIM                0.01802258    0.41390905   0.96531355      675.9454346
CHAS                1.44161892    1.70304573   0.39830926      415.0697937
RM                 11.46477509    0.51676112   0              8140.248535

So the regression equation is:

MEDV = -48.48129654 + 0.01802258*CRIM + 1.44161892*CHAS + 11.46477509*RM

c. What is the predicted median house price for a tract in the Boston area that does not bound the Charles river, has a crime rate of 0.1, and an average number of rooms of 6? What is the prediction error?

Answer: Substituting CRIM = 0.1, CHAS = 0, and RM = 6:

MEDV = -48.48129654 + 0.01802258*0.1 + 1.44161892*0 + 11.46477509*6
MEDV = 20.30916

Since MEDV is in $1000s, the predicted median house price is $20,309.16.
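To make the partition-and-validate workflow from parts (a) and (b) concrete, here is a minimal Python sketch. The file name BostonHousing.csv and the 60/40 split are assumptions for illustration; the assignment itself was done in XLMiner, so the exact partition (and hence the fitted coefficients) may differ from the table above.

```python
# Minimal sketch of partition -> fit -> validate, assuming a hypothetical
# file "BostonHousing.csv" with columns CRIM, CHAS, RM, and MEDV.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("BostonHousing.csv")
X, y = df[["CRIM", "CHAS", "RM"]], df["MEDV"]

# The training partition fits the coefficients; the validation partition is
# held out so the error estimate comes from records the model never saw.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
print(f"Intercept: {model.intercept_:.4f}, coefficients: {model.coef_}")
print(f"Validation RMSE: {rmse:.4f}")
```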
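As a quick arithmetic check of part (c), plugging the reported coefficients into the equation reproduces the predicted value:

```python
# Hand-check of part (c) using the coefficients reported above.
b0, b_crim, b_chas, b_rm = -48.48129654, 0.01802258, 1.44161892, 11.46477509
medv = b0 + b_crim * 0.1 + b_chas * 0 + b_rm * 6
print(round(medv, 5))   # 20.30916 -> about $20,309, since MEDV is in $1000s
```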
d. Reduce the number of predictors:

i. Which predictors are likely to be measuring the same thing among the 14 predictors? Discuss the relationship among INDUS, NOX, and TAX.

Answer: Several variables measure levels of industrialization, and they are expected to be positively correlated: INDUS, NOX, and TAX. We expect a positive relationship among them because a high proportion of non-retail business acres tends to come with higher taxes and more pollution.

ii. Compute the correlation table between the 13 numerical predictors and search for highly correlated pairs. These have potential redundancy and can cause multicollinearity. Choose which ones to remove based on this table.

Answer: From the attached Excel correlation table, the highly correlated pairs are:

NOX and INDUS: correlation coefficient = 0.71945
TAX and INDUS: correlation coefficient = 0.47595
AGE and NOX: correlation coefficient = 0.65142
DIS and NOX: correlation coefficient = -0.70564
DIS and AGE: correlation coefficient = -0.6757
TAX and RAD: correlation coefficient = 0.51235

According to the correlation table, we might remove INDUS, AGE, and TAX.

iii. Use exhaustive search to reduce the remaining predictors as follows: First, choose the top three models. Then run each of these models separately on the training set, and compare their predictive accuracy on the validation set using RMSE, average error, and lift charts. Finally, give the best model.

Answer: Based on the attached Excel sheets, the 10-variable model with CRIM, ZN, CHAS, RM, NOX, DIS, RAD, PTRATIO, B, and LSTAT is the best model for predicting Boston house prices.
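The pair list in d(ii) came from XLMiner, but the same scan is easy to reproduce. The sketch below assumes the df data frame from the earlier sketch holds all columns of the Boston housing data; the 0.6 cutoff is illustrative, not fixed by the exercise.

```python
# Sketch of part d(ii): list predictor pairs with high absolute correlation.
import itertools

predictors = df.drop(columns=["MEDV", "CAT. MEDV"], errors="ignore")
corr = predictors.corr()

high = [(a, b, corr.loc[a, b])
        for a, b in itertools.combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) > 0.6]          # illustrative cutoff
for a, b, r in sorted(high, key=lambda t: -abs(t[2])):
    print(f"{a:8s} {b:8s} {r:7.4f}")
```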
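The exhaustive search itself was run in XLMiner. As a rough stand-in, the sketch below tries every subset of the 10 predictors that remain after dropping INDUS, AGE, and TAX, and ranks subsets by validation RMSE; XLMiner's best-subset routine additionally reports Cp and adjusted R-squared, and the lift-chart comparison is not reproduced here.

```python
# Rough Python stand-in for the exhaustive search in part d(iii).
import itertools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

candidates = ["CRIM", "ZN", "CHAS", "NOX", "RM",
              "DIS", "RAD", "PTRATIO", "B", "LSTAT"]
X_train, X_valid, y_train, y_valid = train_test_split(
    df[candidates], df["MEDV"], test_size=0.4, random_state=1)

results = []
for k in range(1, len(candidates) + 1):
    for subset in itertools.combinations(candidates, k):
        cols = list(subset)
        m = LinearRegression().fit(X_train[cols], y_train)
        rmse = mean_squared_error(y_valid, m.predict(X_valid[cols])) ** 0.5
        results.append((rmse, cols))

# The three lowest-RMSE subsets are the "top three" models to compare.
for rmse, cols in sorted(results)[:3]:
    print(f"{rmse:.3f}  {cols}")
```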