Question: This problem may be considered a complete SAS project. Import an Excel data set named air pollution.xls using proc import to create a SAS data set. The data shown in Table B.7 consist of air pollution, measured as SO2 content in the air, and other related variables for 41 U.S. cities. SO2 is the response variable y, and the explanatory variables are AvTemp (x1), NumFirms (x2), Population (x3), WindSpeed (x4), AvPrecip (x5), and PrecipDays (x6), respectively.
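The import step might be sketched as follows. This is only a minimal sketch: the folder path, the libref proj, the output data set name airpollution, and the worksheet name Sheet1 are all assumptions, not given in the problem, and dbms=xls assumes SAS/ACCESS Interface to PC Files is available.

```sas
/* Program 1: import the Excel file into a permanent SAS data set.   */
/* The folder, libref, data set name, and sheet name are assumed.    */
libname proj "C:\sasproject";              /* hypothetical folder */

proc import datafile="C:\sasproject\air pollution.xls"
            out=proj.airpollution          /* hypothetical data set name */
            dbms=xls
            replace;
   sheet="Sheet1";                         /* assumed worksheet name */
   getnames=yes;                           /* first row holds variable names */
run;

/* Quick check that the seven variables arrived with the expected names. */
proc contents data=proj.airpollution varnum;
run;
```

Storing the data set in a permanent library rather than WORK is what lets a second program reach it with the same libname statement.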
In a second SAS program, access this SAS dataset and perform a variable subset selection procedure using proc reg as described below.
a. Use proc sgscatter to obtain a scatter plot matrix and proc reg to perform a preliminary assessment of the pairwise relationships between y and x1, x2, x3, x4, x5, and x6. On this basis alone, select a few explanatory variables that may be good predictors in a multiple regression model. Using the plot and the correlation matrix, find the four variables that are most strongly correlated among the explanatory variables. Based on the above analysis alone, suggest the explanatory variables that are most strongly involved in multicollinearity when fitting the full model.
b. Use a SAS procedure to fit a first-order multiple regression model to all 41 cities. Discuss the fit of this model using the ANOVA table, R², and the estimates table. Use other diagnostic tools, including output statistics and plots of residuals, to examine the adequacy of the model (use the diagnostic panel of plots). Examining the plots, Obs #31 appears to be an influential y-outlier. Use the diagnostic statistics to confirm this conjecture.
c. Identify any least squares estimates of the regression coefficients (the β̂'s) from the fit of the model in part b) that have a sign (positive or negative) different from what you would expect for the parameter. Is this an indication of multicollinearity? Use the standard errors of the parameter estimates to show that these are poorly estimated. Do the variance inflation factors (VIFs) identify these parameters?
d. Remove the explanatory variables x3 and x6 from the full six-variable model and use a multiple regression model to relate y to x1, x2, x4, and x5 only. What can you observe about the multicollinearity in the new model? Is there improvement in the accuracy of estimation of the parameters of this model (e.g., smaller standard errors, more significant t-statistics)? Justify your answers.
e. Remove the case you determined above in part b) to be a possible outlier from the data and use all 6 variables for the analyses described in the following three parts:
i. Use a SAS procedure to do all possible regressions containing no fewer than 2 and no more than 4 explanatory variables. Print statistics for only the 4 best models in each case. Construct a plot of the Cp statistic for all models with “reasonable” Cp values. Select one model each with 2, 3, and 4 explanatory variables for the purpose of predicting the annual mean concentration of sulfur dioxide in a city, indicating your reasons for selecting each model. There may be several possible choices, i.e., there may be many “good” models, but give arguments for each of your choices. Primarily, use s², R², and Cp in your arguments. Select one of these models as your final model and provide arguments supporting your choice.
ii. Use the backward elimination subset selection procedure with a significance level of 0.05 for deleting variables to select a possible model. State the model selected and report estimates of parameters and the analysis of variance table for this model.
iii. Use the stepwise subset selection procedure, with significance levels of 0.10 for entry and 0.05 for deletion of variables, respectively, to select a possible model. State the model selected and report estimates of parameters and the analysis of variance table for this model.
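One way the proc reg calls for parts a through e could be laid out is sketched below. It is a hedged outline, not a worked solution: the data set name proj.airpollution, the macro variable xvars, the work data set air40, and the assumption that the imported column names match the names in the problem statement are all illustrative, and deleting the 31st row assumes that Obs #31 in the output corresponds to that row.

```sas
/* Assumes proj.airpollution exists with variables named as in the problem. */
%let xvars = AvTemp NumFirms Population WindSpeed AvPrecip PrecipDays;
ods graphics on;

/* Part a: scatter plot matrix of the response and all six regressors. */
proc sgscatter data=proj.airpollution;
   matrix SO2 &xvars;
run;

/* Parts a-d: CORR prints the correlation matrix; VIF, R, and INFLUENCE */
/* add collinearity and case diagnostics to the output.                 */
proc reg data=proj.airpollution corr plots(label)=(diagnostics);
   full:    model SO2 = &xvars / vif r influence;
   reduced: model SO2 = AvTemp NumFirms WindSpeed AvPrecip / vif;
run;
quit;

/* Part e: drop the suspected outlier (assumed to be the 31st row). */
data work.air40;
   set proj.airpollution;
   if _n_ = 31 then delete;
run;

/* e.i: all subsets with 2 to 4 regressors, 4 best models per size. */
proc reg data=work.air40 plots(only)=cp(label);
   model SO2 = &xvars / selection=rsquare start=2 stop=4 best=4
                        cp mse adjrsq;
run;
quit;

/* e.ii and e.iii: backward elimination and stepwise selection. */
proc reg data=work.air40 plots=none;
   backward: model SO2 = &xvars / selection=backward slstay=0.05;
   stepwise: model SO2 = &xvars / selection=stepwise
                                  slentry=0.10 slstay=0.05;
run;
quit;
```

With selection=rsquare, the mse option reports the error mean square for each subset, which is the s² the problem asks to compare alongside R² and Cp.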
A selected subset of the SAS data set baseball, available from the SASHELP library and containing data on baseball player salaries in 1986/87, is used in the following two problems. There are 21 variables and 71 observations in this data set. To obtain information about the variables, run the SAS code ods select position; proc contents data=baseball varnum; run;. Note that the variable attributes table is named position when the option varnum is used with proc contents. You may print the first 5 observations by running the SAS code proc print data=baseball(obs=5); run; to observe a few values for the above variables. Fit logsalary as a first-order multiple regression model of the variables nAtBat, nHits, nHome, nRuns, nRBI, nBB, yrMajor, crAtBat, crHits, crHome, crRuns, crRbi, crBB, nOuts, nAssts, and nError. Use the glmselect procedure to perform model selection using two different approaches as described below:
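The setup steps named in this paragraph might be laid out as in the sketch below. It assumes the 71-observation subset already exists as work.baseball (how the subset is chosen is not stated here), and because the two selection approaches are not given, the glmselect step only fits the full model with selection=none as a placeholder.

```sas
/* Assumes work.baseball (the 71-observation subset) already exists. */

/* List the variables in creation order; with VARNUM the attributes */
/* table is named Position.                                          */
ods select position;
proc contents data=baseball varnum;
run;

/* Look at the first five observations. */
proc print data=baseball(obs=5);
run;

/* Full first-order model; SELECTION=NONE is a placeholder to be     */
/* replaced by whichever selection method each approach calls for.   */
proc glmselect data=baseball;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB YrMajor
                     CrAtBat CrHits CrHome CrRuns CrRbi CrBB
                     nOuts nAssts nError
         / selection=none;
run;
```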
