Question: The datasets for this assignment are available in the Excel files Housing_Train.xlsx and Housing_Test.xlsx. The data provides information (collected several years ago) on median housing

The datasets for this assignment are available in the Excel files "Housing_Train.xlsx" and "Housing_Test.xlsx". The data provides information (collected several years ago) on median housing values from neighborhoods in Boston, MA. It includes the median home value (which will be our y-variable) and 11 predictor variables (which will be our prospective x-variables). The definitions of the variables are: CRIM - per capita crime rate by town ZN - proportion of residential land zoned for lots over 25,000 sq.ft. INDUS - proportion of non-retail business acres per town. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise) NOX - nitric oxides concentration (parts per 10 million) RM - average number of rooms per dwelling AGE - proportion of owner-occupied units built prior to 1940 DIS - weighted distances to five Boston employment centers RAD - index of accessibility to radial highways PTRATIO - pupil-teacher ratio by town LSTAT - % lower status of the population MEDV - Median value of owner-occupied homes in $1000's Scenario You are working as a business consultant for a real estate finance firm that is interested in creating a proprietary model of housing prices and has collected historic data to use. Your task is to develop a model to accomplish this goal using multiple regression. The firm would like to measure the performance of any proposed model based on the root mean square error (RMSE). This is calculated as = ( )2/ where is the actual value for MEDV and is the predicted value for a given model (e.g. multiple regression equation). Because the firm is interested in how any model will perform on new data where the current housing price is unknown, the company has separated the historic data into two sets. You should fit models (i.e. run the regression and find the coefficients) using the training dataset (Housing_Train.xlsx), and calculate the RMSE to evaluate model performance using the testing dataset (Housing_Test.xlsx). In the analysis, you will conduct some initial descriptive analysis (visualization and correlation matrix) in addition to fitting 3 models (1 with all variables, and 2 models with your choice of a subset of variables). The tasks in the next sections will provide instructions to walk you through the process.

Part I: Visualize the Data 1. (20 points) Using the training data, visualize the distribution of housing values by creating a histogram of MEDV. What do you notice about the distribution? Provide a 2-3 sentence description. Include the figure in the Word file with your answers. Part II: Correlation Matrix & Multicollinearity 2. (20 points) Construct a correlation matrix for all the variables, including MEDV. The best way to accomplish this task is to use the data analysis add-in (the same add-in used to run regressions), and select "Correlation" from the list of options in the initial window. The result will be a table that gives the correlation of all pairwise combinations of variables, formatted as follows (example provided for 3 variables): x1 x2 x3 x1 1 x2 CORREL(x2,x1) 1 x3 CORREL(x3,x1) CORREL(x3,x2) 1 Multicollinearity exists if there is high correlation between independent (x) variables. Use a threshold of 0.7 for high correlation between independent (x) variables. Does multicollinearity exist in the dataset? If so, between which variables? (Do not include correlation with MEDV, the y-variable, here. We want a high correlation with the y-variable.) Include the correlation table in the Word file with your answers.

Part III: Fitting Models 3. (20 points) Model A: Fit a multiple regression model that includes all 11 x-variables. a. How good is the model fit? b. Are any of the variables NOT statistically significant in the model? If so, which ones? c. Use this model to calculate the RMSE for the testing dataset. Include the value of the RMSE in the Word file. Include your regression summary in the Word file with your answers. 4. (20 points) Model B: Next, consider a model with fewer variables. Select 4-6 x- variables to include where none of the variables selected exhibit multicollinearity with each other and where all variables were statistically significant in step 3. Fit a multiple regression with these variables using the training data. (This will be Model B.) a. How good is the model fit? b. Are any of the variables NOT statistically significant in the model? If so, which ones? c. Using the testing data, calculate the RMSE for this model. 5. (20 points) Model C: Repeat #4, to fit a second model with a different subset of variables. (This will be Model C.) Answer parts a-c for the new model. Part IV: Interpretation & Insights 6. (50 points) Consider the models that you have fit. Write a summary and interpretation of the results that includes the following: a. An examination of whether the coefficients for the variables in Models B and C are the same or different than their respective coefficient in Model A. Comment on similarities and differences. b. Discuss how multicollinearly can affect estimated coefficients. c. What variables appear especially important in the prediction of house values in this dataset? Explain. d. A comparison of the performance of the three models. Which model performs best? Which performs worst? e. Given the results, how would you advise the company to proceed? Parts I-III may be completed as responses in full sentences to each question. Part IV should be organized as a small report (1-2 pages) with an introductory paragraph. Each item (a-e) could be organized within its own paragraph, or a few topics may be combined within a paragraph. Do not write the response as one block of text; use a clear organizational structure with separate paragraphs for separate ideas.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Mathematics Questions!

Students will find the York City Case Study provided for review. There is a maximum of three pages in this work. All ethical decisions affect others (by definition) and, as Aristotle points out,...

I'm looking to see if someone can help me with this homework. I have attached the documents that are needed. I'm not good with financial reports. Thanks the home depot annual report 2012 AnnuAl...

Use the formulae bellow to find ROCE for HOME DEPOT using there 10-k. ROCE = RNOA + LEVERAGE * (RNOA - NET BORROWING COST) 10-K bellow:- Table of Contents UNITED STATES SECURITIES AND EXCHANGE...

3 COLLEGE ALGEBRA - TRIGONOMETRY Business and Finance (MAT115) This course will start with a review of basic algebra (factoring, solving linear equations, and equalities, etc.) and proceed to a study...

The Evolving Strategy of Police: A Minority View By Hubert Williams and Patrick V. Murphy ed ~~fbg g? l!q'q . .. there is an underside to every age about which history does not often speak, because...

\fSUPPORT FOR THE CULTURAL DIVERSITY AND POLICING (CDAP) PROJECT AND ALL OF ITS PRODUCTS HAVE BEEN PROVIDED BY THE BUREAU OF JUSTICE ASSISTANCE GRANT #2001-DD-BX-K003. OPINIONS STATED IN THIS PAPER...

ber81438_exr_001-068.qxd 8/11/09 7:38 PM Page 51 EXERCISE 10.3: D EVELOPING AN E MPLOYEE B ENEFITS P ROGRAM * Overview Chapter 10 provides an overview of types of employee benefits programs. The...

Download the most recent financial statements of Medtronic and Boston Scientific.From the attached ratio sheet pick 6 relevant ratios and compare the two companies.Then from the ratios you selected...

Find the equivalent reluctance seen by the windings. d1= 2 cm d2= 5 cm d3= 1.5 cm d4= 3 cm d5= 0.8 cm d6= 3 d7= 0.5 cm d8= 2.5 cm d9= 15 cm cm d10= 0.6 cm ur= 250 k= 2 cm (20 Points) Thickness: kd;...

Refer to the previous exercise. The following printout shows results for the females in the sample, where X = the number answering yes. Explain how to interpret each item, in context. Sample x N...

Chester Corporation earned a profit of $ 5 6 . 8 3 5 mil last year. Chester's profit would be placed in which category on the cash flow statement? Select : 1 Save Answer Nowhere. Profits are not cash...

What is SQL DROP TABLE Statement??