6.1: a. Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for?

Answer: The data should be partitioned because we want to both build a model and get an honest estimate of how well it will perform. One partition is used to build the model, which captures the relationship between the predictor variables and the target variable; the other is used to validate the model's predictions.

The training dataset is used to build (fit) the model: the algorithm "discovers" the model from this data. For example, in a linear regression the training dataset is used to fit the model, i.e. to compute the regression coefficients; in a neural network model it is used to obtain the network weights.

Once a model is built on training data, we need to measure its accuracy on unseen data. For this, the model should be applied to a dataset that was not used in the training process, one where the actual value of the target variable is known. The discrepancy between the actual value and the predicted value of the target variable is the prediction error, and some form of average error (MSE or average % error) measures the overall accuracy of the model. To get a realistic estimate of how the model would perform on unseen data, we set aside part of the original data and exclude it from the training process. This set-aside portion is the validation dataset. After fitting the model on the training dataset, we test its performance on the validation dataset: the model makes predictions for data that were not used to fit it, which gives an unbiased estimate of how well the model performs, and we compute measures of error that reflect the prediction accuracy. (A code sketch of this workflow appears after part c below.)

Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. The model output is included in Vamshi_xlminer(a).

b. Write the equation for predicting the median house price from the predictors in the model.

Answer: The regression output is:

Input variables   Coefficient     Std. Error   p-value      SS
Constant term     -48.48129654    3.31145191   0            108214.8203
CRIM                0.01802258    0.41390905   0.96531355      675.9454346
CHAS                1.44161892    1.70304573   0.39830926      415.0697937
RM                 11.46477509    0.51676112   0              8140.248535

So the regression equation is:

MEDV = -48.48129654 + 0.01802258*CRIM + 1.44161892*CHAS + 11.46477509*RM

c. What is the predicted median house price for a tract in the Boston area that does not bound the Charles river, has a crime rate of 0.1, and an average number of rooms of 6? What is the prediction error?

Answer: Substituting CRIM = 0.1, CHAS = 0, and RM = 6:

MEDV = -48.48129654 + 0.01802258*0.1 + 1.44161892*0 + 11.46477509*6
MEDV = 20.30916

Since MEDV is in $1000s, the predicted median house price is $20,309.16.
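To make the partition-and-validate workflow from parts (a) and (b) concrete, here is a minimal Python sketch. The file name BostonHousing.csv and the 60/40 split are assumptions for illustration; the assignment itself was done in XLMiner, so the exact partition (and hence the fitted coefficients) may differ from the table above.

```python
# Minimal sketch of partition -> fit -> validate, assuming a hypothetical
# file "BostonHousing.csv" with columns CRIM, CHAS, RM, and MEDV.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("BostonHousing.csv")
X, y = df[["CRIM", "CHAS", "RM"]], df["MEDV"]

# The training partition fits the coefficients; the validation partition is
# held out so the error estimate comes from records the model never saw.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
print(f"Intercept: {model.intercept_:.4f}, coefficients: {model.coef_}")
print(f"Validation RMSE: {rmse:.4f}")
```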
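As a quick arithmetic check of part (c), plugging the reported coefficients into the equation reproduces the predicted value:

```python
# Hand-check of part (c) using the coefficients reported above.
b0, b_crim, b_chas, b_rm = -48.48129654, 0.01802258, 1.44161892, 11.46477509
medv = b0 + b_crim * 0.1 + b_chas * 0 + b_rm * 6
print(round(medv, 5))   # 20.30916 -> about $20,309, since MEDV is in $1000s
```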
d. Reduce the number of predictors:

i. Which predictors are likely to be measuring the same thing among the 14 predictors? Discuss the relationship among INDUS, NOX, and TAX.

Answer: Several variables measure levels of industrialization, and they are expected to be positively correlated: INDUS, NOX, and TAX. We expect a positive relationship among them because a high proportion of non-retail business acres tends to come with higher taxes and more pollution.

ii. Compute the correlation table between the 13 numerical predictors and search for highly correlated pairs. These have potential redundancy and can cause multicollinearity. Choose which ones to remove based on this table.

Answer: From the attached Excel correlation table, the highly correlated pairs are:

NOX and INDUS: correlation coefficient = 0.71945
TAX and INDUS: correlation coefficient = 0.47595
AGE and NOX: correlation coefficient = 0.65142
DIS and NOX: correlation coefficient = -0.70564
DIS and AGE: correlation coefficient = -0.6757
TAX and RAD: correlation coefficient = 0.51235

According to the correlation table, we might remove INDUS, AGE, and TAX.

iii. Use exhaustive search to reduce the remaining predictors as follows: First, choose the top three models. Then run each of these models separately on the training set, and compare their predictive accuracy on the validation set using RMSE, average error, and lift charts. Finally, give the best model.

Answer: Based on the attached Excel sheets, the 10-variable model with CRIM, ZN, CHAS, RM, NOX, DIS, RAD, PTRATIO, B, and LSTAT is the best model for predicting Boston house prices.
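The pair list in d(ii) came from XLMiner, but the same scan is easy to reproduce. The sketch below assumes the df data frame from the earlier sketch holds all columns of the Boston housing data; the 0.6 cutoff is illustrative, not fixed by the exercise.

```python
# Sketch of part d(ii): list predictor pairs with high absolute correlation.
import itertools

predictors = df.drop(columns=["MEDV", "CAT. MEDV"], errors="ignore")
corr = predictors.corr()

high = [(a, b, corr.loc[a, b])
        for a, b in itertools.combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) > 0.6]          # illustrative cutoff
for a, b, r in sorted(high, key=lambda t: -abs(t[2])):
    print(f"{a:8s} {b:8s} {r:7.4f}")
```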
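The exhaustive search itself was run in XLMiner. As a rough stand-in, the sketch below tries every subset of the 10 predictors that remain after dropping INDUS, AGE, and TAX, and ranks subsets by validation RMSE; XLMiner's best-subset routine additionally reports Cp and adjusted R-squared, and the lift-chart comparison is not reproduced here.

```python
# Rough Python stand-in for the exhaustive search in part d(iii).
import itertools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

candidates = ["CRIM", "ZN", "CHAS", "NOX", "RM",
              "DIS", "RAD", "PTRATIO", "B", "LSTAT"]
X_train, X_valid, y_train, y_valid = train_test_split(
    df[candidates], df["MEDV"], test_size=0.4, random_state=1)

results = []
for k in range(1, len(candidates) + 1):
    for subset in itertools.combinations(candidates, k):
        cols = list(subset)
        m = LinearRegression().fit(X_train[cols], y_train)
        rmse = mean_squared_error(y_valid, m.predict(X_valid[cols])) ** 0.5
        results.append((rmse, cols))

# The three lowest-RMSE subsets are the "top three" models to compare.
for rmse, cols in sorted(results)[:3]:
    print(f"{rmse:.3f}  {cols}")
```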