Question:
The datasets for this assignment are available in the Excel files "Housing_Train.xlsx" and "Housing_Test.xlsx". The data provides information (collected several years ago) on median housing values from neighborhoods in Boston, MA. It includes the median home value (which will be our y-variable) and 11 predictor variables (which will be our prospective x-variables). The definitions of the variables are:

CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centers
RAD - index of accessibility to radial highways
PTRATIO - pupil-teacher ratio by town
LSTAT - % lower status of the population
MEDV - median value of owner-occupied homes in $1000's

Scenario

You are working as a business consultant for a real estate finance firm that is interested in creating a proprietary model of housing prices and has collected historic data to use. Your task is to develop a model to accomplish this goal using multiple regression.

The firm would like to measure the performance of any proposed model based on the root mean square error (RMSE). This is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where $y_i$ is the actual value for MEDV, $\hat{y}_i$ is the predicted value for a given model (e.g. multiple regression equation), and $n$ is the number of observations.

Because the firm is interested in how any model will perform on new data where the current housing price is unknown, the company has separated the historic data into two sets. You should fit models (i.e. run the regression and find the coefficients) using the training dataset (Housing_Train.xlsx), and calculate the RMSE to evaluate model performance using the testing dataset (Housing_Test.xlsx).
In the analysis, you will conduct some initial descriptive analysis (visualization and correlation matrix) in addition to fitting 3 models (1 with all variables, and 2 models with your choice of a subset of variables). The tasks in the next sections will provide instructions to walk you through the process.
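As a rough illustration of the RMSE calculation described above, here is a minimal Python sketch. The function name `rmse` and the sample numbers are made up for this example and do not come from the assignment files; the assignment itself can be completed entirely in Excel.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Hypothetical MEDV values (in $1000's) and predictions, for illustration only.
actual_medv = [24.0, 21.6, 34.7, 33.4]
predicted_medv = [25.0, 20.6, 32.7, 35.4]
print(rmse(actual_medv, predicted_medv))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same units as MEDV ($1000's).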
Part I: Visualize the Data

1. (20 points) Using the training data, visualize the distribution of housing values by creating a histogram of MEDV. What do you notice about the distribution? Provide a 2-3 sentence description. Include the figure in the Word file with your answers.

Part II: Correlation Matrix & Multicollinearity

2. (20 points) Construct a correlation matrix for all the variables, including MEDV. The best way to accomplish this task is to use the data analysis add-in (the same add-in used to run regressions), and select "Correlation" from the list of options in the initial window. The result will be a table that gives the correlation of all pairwise combinations of variables, formatted as follows (example provided for 3 variables):

      x1               x2               x3
x1    1
x2    CORREL(x2,x1)    1
x3    CORREL(x3,x1)    CORREL(x3,x2)    1

Multicollinearity exists if there is high correlation between independent (x) variables. Use a threshold of 0.7 for high correlation between independent (x) variables. Does multicollinearity exist in the dataset? If so, between which variables? (Do not include correlation with MEDV, the y-variable, here. We want a high correlation with the y-variable.) Include the correlation table in the Word file with your answers.
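The multicollinearity screen in Part II can also be sketched outside Excel. The Python below is only an illustration: the helper names `pearson` and `high_correlation_pairs` and the toy columns are hypothetical, and comparing the absolute value of the correlation against 0.7 (so that strong negative correlations are flagged too) is an assumption, not something the assignment states.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def high_correlation_pairs(columns, target, threshold=0.7):
    """Return (name_a, name_b, r) for predictor pairs with |r| above threshold.

    `columns` maps a variable name to its list of values. The target
    (y) column is skipped, since only x-to-x correlation matters here.
    """
    predictors = [name for name in columns if name != target]
    flagged = []
    for a, b in combinations(predictors, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:  # assumption: sign is ignored
            flagged.append((a, b, r))
    return flagged

# Made-up toy values for illustration only; NOT taken from Housing_Train.xlsx.
toy = {
    "x1": [0.40, 0.45, 0.50, 0.60, 0.70],
    "x2": [2.0, 4.0, 6.0, 10.0, 18.0],
    "x3": [0, 1, 0, 0, 1],
    "y": [30.0, 27.0, 24.0, 20.0, 15.0],
}
for a, b, r in high_correlation_pairs(toy, target="y"):
    print(f"{a} vs {b}: r = {r:.2f}")
```

In this toy data only the x1/x2 pair crosses the threshold, so a reduced model would keep at most one of the two.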
Part III: Fitting Models

3. (20 points) Model A: Fit a multiple regression model that includes all 11 x-variables.
   a. How good is the model fit?
   b. Are any of the variables NOT statistically significant in the model? If so, which ones?
   c. Use this model to calculate the RMSE for the testing dataset. Include the value of the RMSE in the Word file.
   Include your regression summary in the Word file with your answers.

4. (20 points) Model B: Next, consider a model with fewer variables. Select 4-6 x-variables to include where none of the variables selected exhibit multicollinearity with each other and where all variables were statistically significant in step 3. Fit a multiple regression with these variables using the training data. (This will be Model B.)
   a. How good is the model fit?
   b. Are any of the variables NOT statistically significant in the model? If so, which ones?
   c. Using the testing data, calculate the RMSE for this model.

5. (20 points) Model C: Repeat #4 to fit a second model with a different subset of variables. (This will be Model C.) Answer parts a-c for the new model.

Part IV: Interpretation & Insights

6. (50 points) Consider the models that you have fit. Write a summary and interpretation of the results that includes the following:
   a. An examination of whether the coefficients for the variables in Models B and C are the same as or different from their respective coefficients in Model A. Comment on similarities and differences.
   b. A discussion of how multicollinearity can affect estimated coefficients.
   c. What variables appear especially important in the prediction of house values in this dataset? Explain.
   d. A comparison of the performance of the three models. Which model performs best? Which performs worst?
   e. Given the results, how would you advise the company to proceed?

Parts I-III may be completed as responses in full sentences to each question. Part IV should be organized as a small report (1-2 pages) with an introductory paragraph.
Each item (a-e) could be organized within its own paragraph, or a few topics may be combined within a paragraph. Do not write the response as one block of text; use a clear organizational structure with separate paragraphs for separate ideas.
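The train/test workflow used for Models A-C (fit coefficients on the training data, then score RMSE on the testing data) can be sketched as follows. This is a dependency-free illustration under stated assumptions: `fit_ols`, `predict`, and `rmse` are hypothetical helper names, the numbers are invented toy data rather than values from the Excel files, and the sketch does not produce the R-squared or p-values that Excel's regression output provides for parts a and b.

```python
import math

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones for the intercept
    k = len(rows[0])
    # Augmented system (X'X | X'y).
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for idx in range(k):
            if idx != col:
                factor = A[idx][col] / A[col][col]
                A[idx] = [a - factor * b for a, b in zip(A[idx], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, X):
    """Apply the fitted equation to each row of predictor values."""
    return [coefs[0] + sum(b * v for b, v in zip(coefs[1:], row)) for row in X]

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Invented toy data: the training y follows 10 + 2*x1 - 1*x2 exactly.
train_X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
train_y = [10.0, 13.0, 12.0, 15.0, 15.0]
test_X = [[6, 2], [0, 3]]
test_y = [21.0, 6.0]

coefs = fit_ols(train_X, train_y)  # coefficients come from training data only
print("coefficients:", [round(c, 4) for c in coefs])
print("test RMSE:", round(rmse(test_y, predict(coefs, test_X)), 4))
```

The key design point is that the testing rows never enter `fit_ols`, so the test RMSE estimates how the model would perform on data where the housing price is unknown.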
