Understanding Regression Part 2 - Multiple Regression
By Dr. Ardith Baker
Oral Roberts University

In the Part 1 paper, we explored simple linear regression with respect to understanding the basic concepts and output results. In this paper, we will extend those same concepts to multiple regression. Remember that in simple linear regression, there is only one independent variable (X) and one dependent variable (Y). In addition, in simple linear regression, we chose the independent variable because we believed that it influenced the dependent variable in some way. In other words, we believed that there was a relationship between the independent and dependent variables. In multiple regression, this same relationship holds except that we have more than one independent variable (but still only one dependent variable). Because we believe that each of the multiple independent variables has a relationship with the dependent variable, we can use those relationships to try to better explain what is happening with our dependent variable. So, the main difference between simple linear regression and multiple regression is the number of independent variables. The interpretation of the multiple regression output is also similar, but there are some differences that we need to discuss. Let's begin with stating the hypothesis.

Hypothesis test: Multiple regression is also a type of hypothesis test in which we evaluate the significance of the slopes of the regression lines. Note that each independent variable will have its own trendline and associated slope. Therefore, there are multiple slopes, but only one intercept. Similar to simple linear regression, if there is no relationship, then the individual slope of a regression line will be very close to 0 (a flat trendline). However, if there is a relationship, then the individual slope can be positive (increasing trendline) or negative (decreasing trendline). Note that the slopes are independent of each other. In other words, one slope may be positive while another is negative, or they may all be positive, or they may all be negative. [The key is that the independent variables are not correlated with each other; they should only be correlated with the dependent variable. Independent variables that are correlated with each other are called collinear. If this is the case, one of the collinear independent variables should be removed from the model because it is redundant. The assumption of multicollinearity addresses this issue but is not covered in this paper.]

Let's illustrate multiple regression with an example. Let's say that we are interested in predicting the average bank balance of our customers, and we collect the following data on 102 of our customers. In this example, the dependent variable is Average Bank Balance (because we believe that it is being influenced) and the independent variables are Median Age, Median Years Education, Median Income, Median Home Value, and Median Household Wealth (because we believe that they are the influencers). A snapshot of this data is given below.
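As a quick aside on the collinearity note above: if you have the data outside Excel, a simple screen for collinear predictors is the pairwise correlation matrix of the independent variables. The sketch below is a minimal example in Python; the file name bank_customers.csv and the column names are hypothetical stand-ins for your own data.

    import pandas as pd

    # Hypothetical file and column names - substitute your own.
    df = pd.read_csv("bank_customers.csv")
    predictors = ["MedianAge", "MedianYearsEducation", "MedianIncome",
                  "MedianHomeValue", "MedianHouseholdWealth"]

    # Pairwise correlations among the independent variables; off-diagonal
    # values near -1 or +1 suggest collinear (redundant) predictors.
    print(df[predictors].corr().round(2))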
First, we will set up our hypotheses similar to simple linear regression except that we will have more than one slope (one for each independent variable). Remember that β (the Greek letter beta) is the unknown population slope while b is the known sample slope, and that we can calculate each bi (one for each independent variable, i = 1, ..., j) from the sample data and use it to test the hypothesis about the associated unknown population parameter βi. Since we have more than one β, we will label them with numbered subscripts, where subscript i = 1 represents the first independent variable in the list, i = 2 the second independent variable in the list, and so on down the list to i = j, the last independent variable. In our example, we have 5 independent variables; therefore we will have β1, β2, β3, β4, and β5 representing the independent variables Median Age, Median Years Education, Median Income, Median Home Value, and Median Household Wealth, respectively (respectively means in that order). So the hypothesis is stated as:

H0: β1 = β2 = β3 = β4 = β5 = 0 (i.e., no relationships; all 5 of the trendlines are flat lines)
H1: At least one βi ≠ 0 (i.e., there is at least one relationship; NOTE: all 5 β's do not have to be significant in order to reject H0.)

If we fail to reject H0, then we conclude that there are no significant relationships and therefore we cannot use our chosen independent variables to predict the dependent variable. We would throw out all the independent variables and start over. If we reject H0, then we conclude that there is at least one significant relationship between one or more independent variables and the dependent variable. The problem is that at this point, we don't know which of the independent variables has the significant relationship. So, we must look further into the regression output to determine this.

First, let's run the analysis. Similar to simple linear regression, we will go to the Data menu and use the option Data Analysis - Regression to start the analysis. In the resulting dialog (see below), select your dependent variable (Y) first and then your independent variables (X1 through X5; NOTE: Excel labels them in the order that you highlight them). Note in the above dialog that I checked the option Labels; this is because when I selected each variable, I also highlighted the variable name. By selecting this option, Excel knows to use the first row as the variable names and will use those names in the regression output. This is very useful for identifying the variables in the output. Otherwise, Excel uses X's and Y as the variable names, which is confusing. A word of caution! If you select the Labels option but don't highlight the variable names, then Excel will still use the first row as your variable names, only it won't make any sense at all and your data set will be reduced by one. If you don't select Labels but do highlight the names, then Excel won't know what to do because some of your data will be nonnumeric. So, be careful!! Note also that I checked the options Residuals, Residual Plots, and Normal Probability Plots. Let's talk about these with respect to the regression output. So, after clicking OK, the output appears in a separate worksheet as shown below.
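If you ever need to reproduce this output outside Excel, a library such as Python's statsmodels reports the same quantities - R Square, Adjusted R Square, the F statistic with its p-value (Excel's Significance F), and the per-variable coefficient t-tests. A minimal sketch, again using the hypothetical file and column names from earlier:

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("bank_customers.csv")     # hypothetical file name
    y = df["AverageBankBalance"]               # dependent variable
    X = sm.add_constant(df[["MedianAge", "MedianYearsEducation",
                            "MedianIncome", "MedianHomeValue",
                            "MedianHouseholdWealth"]])

    model = sm.OLS(y, X).fit()
    # summary() shows R-squared, Adj. R-squared, the F statistic,
    # Prob (F-statistic) = Excel's Significance F, and each coefficient
    # with its t statistic and p-value.
    print(model.summary())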
To read the output, let's start from the top down.

Regression Statistics: Start with this section of the output. Let's examine a few indicators of a good regression model (i.e., of our choice of independent variables). (NOTE: It's always good to check the number of Observations to make sure you did the analysis correctly. If there is one less than there should be, then perhaps you checked the Labels option without highlighting the labels.)

1. Multiple R = 0.9731. This indicator of a good regression model examines the correlation between the dependent variable and all of the independent variables taken together. Multiple R answers two questions: (1) Is there a relationship? and (2) If there is a relationship, is it strong, moderate, or weak, and which direction is it going? We can interpret Multiple R exactly as we did r, the individual correlation coefficient.
o Multiple R ranges between -1 ≤ R ≤ +1.
o If R is positive, the relationship is increasing (i.e., if X increases, Y increases).
o If R is negative, the relationship is decreasing (i.e., if X increases, Y decreases).
o If R is close to either -1 or +1, then we consider the relationship to be strong.
o If R is close to 0, then we consider the relationship to be weak, or even no relationship.
The problem with this is that the decision of whether there is a relationship or not, and how strong that relationship is, is very subjective. For example, I may think that R = 0.6 is a strong relationship while you may think it is only moderately strong. So, while Multiple R is a good indicator of relationship, it is not definitive. In our example, Multiple R = 0.9731, which is very close to +1. We could interpret that as a strong positive relationship, so we can take that as a good sign that we are on the right track with our choice of independent variables to predict our dependent variable. However, if Multiple R is unsatisfactory to you, then stop here and find different independent variables. If it is satisfactory, then go to the next indicator.

2. Adjusted R-square = 0.9441. The Adjusted R-square is a very important indicator of a good regression model because it tells you how much of the variability in the dependent variable is explained by all of the independent variables. In multiple regression, we use the Adjusted R-square rather than R-square because the Adjusted R-square calculation adjusts the R-square value based on the number of independent variables and their contribution to predicting the dependent variable (for more information, see http://www.investopedia.com/ask/answers/012615/whats-differencebetween-rsquared-and-adjusted-rsquared.asp ). R-square itself is calculated by squaring Multiple R, the overall correlation coefficient, and the Adjusted R-square then adjusts that value for the number of independent variables. So while the correlation coefficient can be positive or negative, Adjusted R-square normally falls between 0 and 1: 0 ≤ Adjusted R-square ≤ 1, or 0% ≤ Adjusted R-square ≤ 100%. The closer Adjusted R-square is to 1 (or 100%), the better the relationship, because it means that our choice of independent variables is explaining almost all of the changes or variability in the dependent variable. Again, how close is close enough is very subjective. So, if the Adjusted R-square is satisfactory to you, then go on to the ANOVA section. Otherwise, stop and find new independent variables. In our example, Adjusted R-square = 0.9441, which means that approximately 94% of the variability (change) in the dependent variable Average Bank Balance can be explained by the independent variables (Median Age, Median Years Education, Median Income, Median Home Value, and Median Household Wealth). This is very good because it means that almost all of the variability is explained.
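For reference, the adjustment applied is the standard one:

    Adjusted R-square = 1 - (1 - R-square) * (n - 1) / (n - k - 1)

where n is the number of observations and k is the number of independent variables. You can check it against our output: with R-square = 0.9731^2 ≈ 0.9469, n = 102, and k = 5, we get 1 - (0.0531)(101/96) ≈ 0.9441, matching the Adjusted R-square above.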
However, it also means that approximately 6% of the variability in Average Bank Balance can be explained by one or more other independent variables. So, we could choose to add more independent variables and rerun the model, or we could choose to remove some of the independent variables and rerun the model. In order to decide what to do next, let's go to the ANOVA section of the output and examine it further.

ANOVA section: Our indicators so far are good. But we need more information. So, let's proceed further in the analysis by looking at the ANOVA (Analysis of Variance) section. This section of the output is where we test the hypothesis:

H0: β1 = β2 = β3 = β4 = β5 = 0 (i.e., no relationships; all 5 of the trendlines are flat lines)
H1: At least one βi ≠ 0 (i.e., there is at least one relationship)

As mentioned earlier, we use the F distribution to test these hypotheses. The regression output gives us the F statistic that comes from the data but does not give us the Critical F value with which to compare it, so we can't use that to make a decision about our hypotheses. However, the output does give us the Significance F, which is essentially the p-value, and which we can compare to our chosen value for alpha, the significance level (typically, alpha = 0.05, 0.01, or 0.10). NOTE: Unless you are told otherwise, use alpha = 0.05. Before we can do that, we need to know how to make the decision of whether we reject or fail to reject H0.

Decision Rule:
o If the Significance F is less than or equal to alpha, then we reject H0 and conclude the alternative hypothesis (H1) that at least one slope is not equal to 0. In other words, there is at least one significant relationship. If this is the case, then we know that at least one significant relationship exists. However, which one(s)? This hypothesis test does not tell us, so we would need to look further in the analysis in order to determine which of the independent variables to keep in the model and which to remove from the model.
o If the Significance F is greater than alpha, then we fail to reject H0 and conclude that there is not enough evidence to prove otherwise, so there are no significant relationships. If this is the case, then we did not choose good independent variables. This means that the independent variables are not good predictors of our dependent variable, and using the equation of the regression line to predict Average Bank Balance would do us no good; therefore, we would have to discard our independent variables and choose new ones.

So, using our example above, Significance F = 1.5184E-59, and let's use alpha = 0.05. At first glance you might think that 1.5184 is greater than 0.05 and make the decision to fail to reject H0. However, you would be wrong. In reality, when numbers are very, very large or very, very small, Excel switches to scientific notation. The "E-59" means that you move the decimal point 59 places to the left, which means the p-value is definitely smaller than alpha = 0.05. Therefore, we reject H0 and conclude that there is at least one significant relationship. To determine which of the independent variables have the significant relationship, we need to go to the next part of the output - the individual t-tests.
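If you are ever unsure how a value in scientific notation compares to alpha, you can let a language parse it for you. A trivial check in Python, using only the numbers quoted above:

    # Excel's "1.5184E-59" is scientific notation: 1.5184 x 10^-59.
    p_value = float("1.5184E-59")
    alpha = 0.05
    print(p_value < alpha)   # True, so we reject H0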
Remember that the F test is a test of the hypothesis for the overall model, and even if we reject H0, it could be that some of the independent variables, if examined individually, would not have a significant relationship with the dependent variable. Thus, we need to test each individually (H0: βi = 0, H1: βi ≠ 0) using a t-test. The results of this test are displayed in the t Stat and P-value columns of the output. Similar to the F test, the output does not give you the Critical t value with which to compare the t Stat, so we can't use that to make a decision about the individual hypotheses. Instead, let's look at the individual P-values. We will compare these to alpha = 0.05 and follow the decision rules outlined above. Notice that Excel displays the variable names in the first column instead of X1, X2, X3, X4, and X5 because we highlighted the column headings and selected the Labels option in the Regression dialog.

To examine this output, first skip the Intercept, since we are only interested in the individual independent variables. Start with Median Age: the p-value = 1.01152E-06, which is a very small value, less than alpha = 0.05. Thus we would reject H0 and conclude that there is a significant relationship between Median Age and Average Bank Balance. In other words, Median Age would be a good predictor of Average Bank Balance. Let's examine the remaining independent variables below:

Independent Variable      p-value       Alpha   Decision                    Conclusion
Median Age                1.01152E-06   0.05    Reject H0: β1 = 0           Significant relationship - keep in model
Median Years Education    0.0541        0.05    Fail to Reject H0: β2 = 0   Insignificant relationship - remove from model
Median Income             0.0005        0.05    Reject H0: β3 = 0           Significant relationship - keep in model
Median Home Value         0.4075        0.05    Fail to Reject H0: β4 = 0   Insignificant relationship - remove from model
Median Household Wealth   1.84838E-09   0.05    Reject H0: β5 = 0           Significant relationship - keep in model

All of the indicators up to this point (i.e., Multiple R is good, Adjusted R-square is good, and we rejected H0 for the overall model) look good. However, you can see in the table above that some of the individual independent variables' p-values are not significant. Therefore, we now conclude that the overall model is not a good one and we should not use it. Instead, we should remove the insignificant independent variables and rerun the model. Let's give that a try and see what results we get.
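In Excel we do this drop-and-refit step by hand (see below); if you are scripting instead, it can be automated. A sketch continuing the hypothetical statsmodels fit from earlier:

    # Keep only predictors whose individual t-test p-values are <= alpha.
    pvals = model.pvalues.drop("const")          # per-variable p-values
    keep = pvals[pvals <= 0.05].index.tolist()

    # Refit using only the significant predictors.
    reduced = sm.OLS(y, sm.add_constant(df[keep])).fit()
    print(reduced.summary())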
Because all the independent variables must be in contiguous columns (i.e., side by side), I deleted the insignificant variables. Now, go to Data - Data Analysis - Regression and highlight only Median Age, Median Income, and Median Household Wealth as the independent variables and Average Bank Balance as the dependent variable. When you click OK, you get the following results. Starting at the top:

Multiple R = 0.9704, which indicates a very strong positive correlation. Good so far!

Adjusted R Square = 0.9399, which indicates that approximately 94% of the variation in Average Bank Balance can be explained by Median Age, Median Income, and Median Household Wealth. This is a good indicator because only approximately 6% of the variability can be explained by one or more other variables.

Significance F = 2.54304E-60. This p-value is very, very small when compared to alpha = 0.05. Thus, we would reject H0: β1 = β2 = β3 = 0 and conclude that there is at least one significant relationship.

t-tests: When looking at the individual t-tests for the independent variables (skip the Intercept), you can see that all of the p-values are less than alpha. Therefore, we would keep all three independent variables in the model.

Decision: We have a good model that we can now use to predict Average Bank Balance.

Assumptions: Well, everything is great so far, but we still need to check one more thing - our assumptions. Assumptions are statements that we make because we believe that they are true. For example, I assume that you have already read Understanding Regression Output Part 1. In other words, I believe this to be true. However, I would have to check with you to verify that this is true. Similarly, we make several assumptions in regression. The three we are going to examine are Linearity, Homoscedasticity, and Normality of Errors.

Linearity - Multiple regression is still linear regression, just with multiple independent variables. Therefore, we assume that the data are linear in nature. That is, if we plot each Xi-Y pair in a scatterplot, we can show that the two variables are related and that this relationship can be best explained by a straight line (i.e., linear trendline). If this assumption is not met, then we cannot use linear regression to obtain the regression equation. This can be verified using the Residual Plot. Note that in multiple regression, there is one residual plot for each independent variable, and they must each be checked independently of the others.

Homoscedasticity - This is the assumption of equal variances across all values of the independent variable. If this assumption is not met, then we cannot use linear regression. This can also be verified using the Residual Plot.

Normality of Errors - We assume that the errors (also referred to as residuals) are Normally distributed (i.e., distributed according to the Normal distribution). Residuals or errors are calculated by subtracting the predicted value (found using the regression equation) from the actual value. A perfect prediction would have an error of 0 (zero). So, the larger the error, the worse the prediction. We can verify this using the Normal Probability Plot.

Testing the Assumptions:

Residual Plots: Remember that we had to select this option in the Regression dialog in order to get these plots of Residuals vs. values of the independent variable (one residual plot for each independent variable). What we are looking for are randomly scattered data points about the 0 line. For example, in our Residual Plot for Median Age, the data points do not appear to form any distinct pattern. Therefore, we can say that our assumptions of Linearity and Homoscedasticity are met.

[Figure: Median Age Residual Plot - residuals scattered randomly about the 0 line]

Similarly, we can examine the other two residual plots and come to the same conclusion - the data points are randomly scattered about 0, thus the assumptions of Homoscedasticity and Linearity are met.

[Figure: Median Income Residual Plot - residuals scattered randomly about the 0 line]

[Figure: Median Household Wealth Residual Plot - residuals scattered randomly about the 0 line]
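Outside Excel, the same plots are easy to produce. A sketch with matplotlib, continuing the hypothetical reduced fit from the earlier snippets (df, keep, and reduced are defined there):

    import matplotlib.pyplot as plt

    # One residual plot per predictor kept in the reduced model; we want a
    # patternless scatter about the zero line (no funnels, no curves).
    for name in keep:
        plt.figure()
        plt.scatter(df[name], reduced.resid)
        plt.axhline(0, color="gray")
        plt.xlabel(name)
        plt.ylabel("Residuals")
        plt.title(f"{name} Residual Plot")
    plt.show()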
Normal Probability Plot: Again, we have to select this option in the Regression dialog in order to get this plot. In order for this assumption to be met, the data points need to form a relatively straight line, as shown with our data below (slight deviation is OK). So, we can conclude that the assumption of normality of errors is met.

[Figure: Average Bank Balance Normal Probability Plot - data points fall along a relatively straight line against Sample Percentile]

NOTE: For more information about all of the assumptions in Regression Analysis and what to do if one of the assumptions is not met, see: http://people.duke.edu/~rnau/testing.htm. However, this is not required reading.

Making the Prediction: Now that we have decided that the model is good, we need to state the equation of the regression line. In multiple regression, this equation takes the form:

Y = β0 + β1X1 + β2X2 + β3X3

where Y is the dependent variable, Average Bank Balance; X1, X2, and X3 are values of the independent variables Median Age, Median Income, and Median Household Wealth, respectively; and β1, β2, and β3 are the slopes for the independent variables Median Age, Median Income, and Median Household Wealth, respectively. Note that in the above equation, the β's come from the regression output (the column labeled Coefficients).

You can choose any values for the X's in order to predict the dependent variable Y. In our example, we have the following:

Intercept: β0 = -3115.40
Slopes: β1 = 301.87, β2 = 0.2119, β3 = 0.0638

Therefore, the equation of the regression line is:

Average Bank Balance = -3115.40 + 301.87*Median Age + 0.2119*Median Income + 0.0638*Median Household Wealth

Let's say we are interested in the following values: Median Age = 56, Median Income = $50,000, and Median Household Wealth = $75,000; what would be the predicted Average Bank Balance? Simply substitute these values into the regression equation and calculate:

Average Bank Balance = -3115.40 + 301.87*56 + 0.2119*$50,000 + 0.0638*$75,000 = $29,169.32

Conclusion: So, from the top down:

Multiple R - good
Adjusted R-square - good
Significance F - we rejected H0, good
Individual p-values - good
Assumptions - all met

Conclusion: We have a good model. Now we can comfortably use the regression equation to predict our dependent variable. However, if at any point we would have said "bad" or "not met" or "fail to reject H0," we would have had to stop, fix the problem (usually by choosing different independent variables), and rerun the model before making any predictions.
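As a quick check of the prediction arithmetic above, here is the same calculation in plain Python (the coefficients are the ones from the reduced-model output; note the negative intercept):

    # Reduced-model coefficients from the regression output.
    b0, b1, b2, b3 = -3115.40, 301.87, 0.2119, 0.0638
    age, income, wealth = 56, 50_000, 75_000

    balance = b0 + b1*age + b2*income + b3*wealth
    print(round(balance, 2))   # 29169.32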
Problem 5-2

Name:

Date of Sale   Winning Bid   Estimated Value   Number of Bids Received   Type of Property   Buyer
22-May-10      1,028,885     1,176,800         6                         0                  0
6-Jun-10         372,455       427,100         4                         1                  1
13-Jun-10        940,016     1,090,200         7                         1                  1
29-Jun-10        288,703       336,700         6                         1                  1
14-Jul-10      2,166,426     2,445,300         7                         1                  0
6-Aug-10         751,420       848,100         4                         0                  0
13-Aug-10      1,305,550     1,553,500         6                         1                  1
28-Sep-10        211,412       278,300         9                         1                  1
5-Oct-10         331,677       385,900         5                         0                  1
28-Oct-10      2,749,662     3,074,100         7                         0                  0
12-Nov-10      1,156,443     1,284,900         5                         0                  0
28-Nov-10      1,956,165     2,203,100         5                         0                  1
5-Dec-10         348,470       431,300         7                         1                  1
20-Dec-10      1,530,358     1,735,600         7                         0                  0
5-Jan-11       1,883,154     2,105,300         6                         1                  0
20-Jan-11        393,344       446,100         6                         0                  0
12-Feb-11        609,573       695,400         6                         0                  1
27-Feb-11      4,920,658     5,806,700         9                         1                  1
6-Mar-11         377,600       472,200         5                         0                  1
13-Apr-11        466,776       593,000         5                         1                  1
29-Apr-11        892,531     1,032,900         3                         0                  1
14-May-11      1,025,087     1,146,000         8                         0                  0
22-May-11        517,964       629,500         7                         0                  1

Type of Property: 0 = Commercial, 1 = Residential
Buyer: 0 = Nonresident, 1 = Resident

1. What factors affect the winning bid? To answer this, perform a multiple regression analysis with all of the independent variables in the model. Then, look at the p-values for the individual coefficients and determine which independent variables should be kept in the model (use alpha = 0.05). NOTE: Label the resulting worksheet tab "Original Regression".

Leave in which independent variable(s):
Take out which independent variable(s):

2. Build a model to predict winning bid. To answer this, remove the non-significant independent variables from the model and rerun it. NOTE: For Excel's Data Analysis, independent variables must be in contiguous columns, so you may have to copy and paste only the variables that you want into a new worksheet and then run the regression. NOTE: Label the reduced dataset worksheet "Reduced Data" and the resulting worksheet tab "Reduced Regression".

What is the adjusted R2 for this new model?
What does the adjusted R2 tell you about this model?
Are all the slopes for the individual coefficients significant for this reduced model?
Based on the above evidence, is the reduced model a good model?
What is the resulting regression equation? Use the actual coefficient values and variable names.

Copy and paste from the Regression output:
Intercept:
Estimated Value Slope:
Buyer Slope:

3. What is the point prediction for the winning bid for a residential property coming up for auction on July 10, 2011 with an estimated value of $1,500,000? Use the regression equation in part 2 for the point prediction.

Winning Bid Forecast
Buyer: Nonresident:
Buyer: Resident:

4. Does this prediction change if there is a lot of interest in the property (that is, many bids are expected)? Compute a 95% prediction interval (PI) about the point predictions in part 3.

Copy and paste from the Regression output:
Standard Error:

Winning Bid 95% PI
Buyer: Nonresident - Lower PI:  Upper PI:
Buyer: Resident - Lower PI:  Upper PI:

5. Does the prediction change if the buyer is expected to be a nonresident?
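NOTE: The worksheet does not spell out the interval formula, but a common classroom approximation for part 4 (which is presumably why the Standard Error blank is provided) is:

    95% PI ≈ point prediction ± t(0.025, n - k - 1) * Standard Error

where Standard Error is the value from the Regression Statistics block of the output, n is the number of observations, and k is the number of independent variables in the reduced model. The exact prediction interval is slightly wider, since it also accounts for the uncertainty in the estimated coefficients.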