Question:
ST 352 Assignment 6: Inference using Multiple Linear Regression

One way colleges measure success is by graduation rates. A committee was wondering what factors might have an effect on graduation rates at small colleges or universities (enrollments under 5000). The committee randomly selected 22 colleges or universities in the United States with enrollments under 5000 students. They collected information on the 6-year graduation rate, the median SAT score of students accepted to the college, student-related expenses per full-time student, and whether the college had both males and females or just one sex. The data are in the graduation.csv data set available on Canvas. The variable names and descriptions are given below.

1. college: name of college
2. rate: 6-year graduation rate
3. SAT: median SAT score of students accepted to the college
4. expenses: student-related expense per full-time student (dollars)
5. students: 0 if college has both males and females; 1 if college has only males or only females

You are to do a multiple linear regression analysis. Use this data set to answer the questions that follow. Start the regression analysis by checking the conditions. Answer the following questions:

Check of conditions:

1) (8 points total) Questions relating to checking for high correlation between the explanatory variables:

a] Why do we need to check for "high correlation" between the explanatory variables? (1 pt)

b] Create a scatterplot matrix in R and include it here. When creating this graph, include the response variable as well (so that scatterplots that include the response variable appear in the top row of the scatterplot matrix). (1 pt)

c] Recall that the relationship between two quantitative variables must be linear to assess correlation. Using the scatterplot matrix created in part b, is the relationship between SAT and expense linear?
(1 pt)

d] If you feel the relationship between SAT and expense is not linear, a log transformation of one of those two variables should be performed. To decide which, use the scatterplot matrix created in part b and look at the scatterplots of rate versus SAT and rate versus expense. Which of those two scatterplots looks less linear? Transform the explanatory variable of the scatterplot that looks less linear. (1 pt)

e] If you did a transformation of one of the explanatory variables in part d, re-create the scatterplot matrix with the transformed explanatory variable and include the new scatterplot matrix here. Assess the correlation between explanatory variables. (Recall that "students" is categorical and that assessing correlation really can't be done (or may be difficult to do) with categorical variables. Therefore, you may only be able to assess the "high correlation between explanatory variables" for the two quantitative explanatory variables.)

- By looking at the scatterplot matrix, do any of the explanatory variables appear to be highly correlated with each other? Explain. (1 pt)
- Obtain a correlation coefficient for the comparison of the two quantitative variables. Does the correlation coefficient support your answer to the first bulleted item above? Explain. (1 pt)

Even if you did not feel that the two quantitative explanatory variables were highly correlated with each other, do part f anyway.

f] (2 pts) If two explanatory variables are highly correlated, one should be removed. Although the choice of which one to remove is up to you, here is a way that may help in making the decision: In R, do a multiple regression analysis with one of the quantitative explanatory variables as the response variable and all other explanatory variables as the "predictors". For example, let's suppose you decided that expenses should be transformed in part d above.
Run a multiple regression with SAT as the response variable and log(expenses) and students as the "predictors". Record the R-square from that analysis. Repeat using the other quantitative explanatory variable as the response variable. For example, run a multiple regression analysis with log(expenses) as the response variable and SAT and students as the "predictors". Record the R-square from that analysis. The idea is that an explanatory variable may be highly correlated with all other explanatory variables. If it is, then the R-square value when that variable is the response variable will be high (indicating that a high percentage of the variation in that explanatory variable is being explained by the other explanatory variables, which implies that it is "highly correlated" with all of the other explanatory variables). Which of the two quantitative explanatory variables had a higher R-square when it was the response variable? This is the variable that would be removed if you felt that one should be removed in part e above.

If you felt that one of the quantitative explanatory variables should be removed because of high correlation, do so and continue the rest of the analysis without the highly correlated variable.

Before assessing the other conditions, obtain the following graphs:

- Normal probability plot of the residuals
- Residual plot of the residuals versus predicted (a.k.a. "fitted") values

Include these graphs here. (2 pts total for graphs)

2) Outlier condition: (3 points total)

a] Using the residual plot of the residuals versus the predicted values, are there any outliers? If so, identify the outlier(s). (You may need the residual plots versus each explanatory variable to help identify any outliers. If you believe there are outliers and you use the residual plots versus each explanatory variable to help identify the outliers, you do NOT need to include them here.) (1 pt)

b] In general, what should be done if there are outliers?
(1 pt)

c] If you believe there is an outlier in this problem, do you feel it should be removed from the analysis? Explain. (1 pt)

3) Linearity condition: (2 points total)

a] What graph(s) is (are) used to determine if a linear relationship exists between the response variable and each of the explanatory variables? (1 pt)

b] Using that graph(s), do you feel that the linearity condition is satisfied? Explain. (1 pt)

4) Constant Variation condition: (2 points total)

a] What graph(s) is (are) used to determine if the spread in the residuals is constant? (1 pt)

b] Using that graph(s), do you feel that the constant variation condition is satisfied? Explain. (1 pt)

5) Normality condition: (2 points total)

a] What graph(s) is (are) used to determine if the residuals are normally distributed? (1 pt)

b] Using that graph(s), do you feel that the normality condition is satisfied? Explain. (1 pt)

6) Do you feel that a transformation of the response variable is necessary? Explain. (1 pt)

Whether you felt a transformation of the response variable was necessary or not, answer the remaining questions in this problem assuming no transformation of the response variable was necessary. (However, if you did a log transformation of one of the explanatory variables in #1 above, that explanatory variable should remain transformed.)

7) (4 points total) Conduct a test to determine if at least one of the explanatory variables is useful in predicting graduation rate.

a] Write the null and alternative hypotheses for this test. (1 pt)

b] Using the output from R, give the appropriate test statistic (with degrees of freedom), but do not include the output. (1 pt)

c] State your conclusion in terms of the problem supported with a p-value. (2 pts)

8) (8 points total) Perform a backwards selection process to find a model that includes only significant predictors of the response variable.
Use the model obtained after performing the backwards selection process to answer the following questions:

a] Write the regression equation, defining the terms in the equation. (2 points)

b] Predict the graduation rate for a small college that includes both male and female students, has a median SAT of 950, and student-related expenses of $12,500. (Note: your final model may not include all of these variables. If it does not, then ignore the values given in this problem for variables not in your final model.) (2 points)

c] What percent of the variation in the response variable is explained by your final regression model? (1 pt)

d] What recommendation would you make to the committee as far as what factor or factors may have a strong impact on graduation rates at small colleges and universities? Briefly discuss. (3 points)
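
As a rough illustration of the R workflow these questions describe, the sketch below runs the same steps on simulated data. Everything about the data is an assumption: the values are randomly generated stand-ins, not the contents of graduation.csv; the column name log_expenses is made up for the example; expenses is taken to be the variable that gets log-transformed (mirroring the example in part f); and the 0.05 cutoff in the backward-selection loop is one common choice, not something the assignment specifies. Its output therefore says nothing about the real answers.

```r
# Simulated stand-in for graduation.csv (hypothetical values, NOT the real data).
# With the real file one would start from: grad <- read.csv("graduation.csv")
set.seed(352)
n <- 22
grad <- data.frame(
  SAT      = round(runif(n, 850, 1300)),
  expenses = round(exp(rnorm(n, mean = log(12000), sd = 0.4))),
  students = rep(c(0, 1), times = c(16, 6))  # 0 = both sexes, 1 = single sex
)
grad$rate <- -90 + 0.05 * grad$SAT + 12 * log(grad$expenses) + rnorm(n, sd = 4)

# Part 1d (assumed outcome): log-transform expenses
grad$log_expenses <- log(grad$expenses)

# Parts 1b/1e: scatterplot matrix; listing the response first puts its
# scatterplots in the top row
pairs(grad[, c("rate", "SAT", "log_expenses", "students")])

# Part 1e: correlation between the two quantitative explanatory variables
r_xx <- cor(grad$SAT, grad$log_expenses)

# Part 1f: regress each quantitative explanatory variable on the others
r2_sat <- summary(lm(SAT ~ log_expenses + students, data = grad))$r.squared
r2_exp <- summary(lm(log_expenses ~ SAT + students, data = grad))$r.squared

# Full model, normal probability plot, and residuals-versus-fitted plot
full <- lm(rate ~ SAT + log_expenses + students, data = grad)
qqnorm(resid(full)); qqline(resid(full))
plot(fitted(full), resid(full), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Part 7: overall F-test (statistic, numerator df, denominator df, p-value)
fs <- summary(full)$fstatistic
p_overall <- unname(pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE))

# Part 8: backward selection by p-value, dropping the least significant
# term until every remaining term has p <= 0.05 (assumed cutoff)
model <- full
repeat {
  if (length(attr(terms(model), "term.labels")) == 0) break
  tab   <- drop1(model, test = "F")
  pvals <- tab[["Pr(>F)"]][-1]              # first row is "<none>"
  if (max(pvals) <= 0.05) break
  worst <- rownames(tab)[-1][which.max(pvals)]
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

# Parts 8b/8c: prediction for the described college and R-square of the
# final model; columns not in the final model are ignored by predict()
new_college <- data.frame(SAT = 950, log_expenses = log(12500), students = 0)
pred <- predict(model, newdata = new_college)
r2_final <- summary(model)$r.squared
```

With the real data, the simulated block at the top would be replaced by the read.csv() call, assuming the file's column names match the variable list in the assignment. Printing summary(full) and summary(model) gives the coefficient tables needed for the hypothesis test and the regression equation.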