Question: Explain the five-step process for evaluating a supervised machine learning model using multiple regression analysis.


Answer:

The first step is to determine whether the model is logical. This is done by looking at the algebraic sign of each coefficient in the model (all of the b_i values). These signs must be consistent with business/economic logic; if they are not, you should NOT use the model. If you are looking at a model of car sales as a function of income and the interest rate, business logic would tell you that as incomes go up, car sales would also go up. Thus, the coefficient for income should be positive (+). The opposite would be true for the interest rate: when interest rates go up, buying a car becomes more expensive (for most people ... relatively few car sales are cash sales). Thus, the coefficient for the interest rate should be negative (-).

The second step is to determine whether the relationship is statistically significant at the desired level of confidence (usually 95%). We do this with a t-test. If the model is logical and the calculated t-ratio has an absolute value greater than 1.645 (the critical value for a one-tailed test at the 95% level), we have evidence of a statistically significant relationship. If this test fails, we do not believe that the slope is nonzero, and thus there is not a statistically significant relationship.

The third step is to determine the explanatory power of the model. The coefficient of determination tells us this; in multiple regression the coefficient of determination is the adjusted R2, which is interpreted as a percentage. Suppose the adjusted R2 is 0.856. That tells us that 85.6% of the variation in the dependent variable is explained by the model.

The fourth step is to test for serial correlation. This can be done using a Durbin-Watson (DW) test. The DW statistic is always between zero and four. The closer DW is to four, the more likely there is negative serial correlation; the closer DW is to zero, the more likely there is positive serial correlation. The ideal DW is two.
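The diagnostics in steps 2 through 4 (t-ratios, adjusted R2, and the Durbin-Watson statistic) can all be computed from an ordinary-least-squares fit. The sketch below uses only NumPy; the simulated car-sales data, the function name `evaluate_ols`, and the seeds are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def evaluate_ols(X, y):
    """Fit y = X b by least squares and return the diagnostics used in
    steps 2-4: coefficients, t-ratios, adjusted R^2, and Durbin-Watson."""
    n, k = X.shape                              # k includes the intercept column
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sse = resid @ resid
    # Standard errors of the coefficients, then t-ratios (step 2)
    sigma2 = sse / (n - k)
    cov_b = sigma2 * np.linalg.inv(X.T @ X)
    t_ratios = b / np.sqrt(np.diag(cov_b))
    # Adjusted R^2: the coefficient of determination for multiple regression (step 3)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    # Durbin-Watson: squared successive residual differences over SSE (step 4)
    dw = np.sum(np.diff(resid) ** 2) / sse
    return b, t_ratios, adj_r2, dw

# Simulated data mirroring the car-sales example (assumed, for illustration)
rng = np.random.default_rng(0)
n = 60
income = rng.normal(50, 5, n)
rate = rng.normal(6, 1, n)
sales = 10 + 0.8 * income - 2.0 * rate + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), income, rate])
b, t, adj_r2, dw = evaluate_ols(X, sales)
# Step 1: sign check -- income coefficient should be +, interest-rate coefficient -
# Step 2: slopes with |t| > 1.645 suggest statistical significance
# Step 4: DW between 1.5 and 2.5 raises no serial-correlation red flag
```

With a strong simulated signal, the sign check passes, both slope t-ratios are well above 1.645, and the DW statistic sits near its ideal value of two.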
As a rule of thumb, if 1.5 ≤ DW ≤ 2.5, there may not be a serial correlation problem.

The fifth step is to check for multicollinearity. Multicollinearity exists when there is an overlap in the way two or more independent variables influence the dependent variable. For example, suppose that it is believed that income affects auto sales. This seems quite reasonable. But suppose further that as independent variables we include national income, disposable personal income, and disposable personal income per capita. All three of these capture one central concept: buying power. The three independent variables would be highly correlated, likely at about 0.95. The result of multicollinearity is that some coefficients will be overstated while others are understated. In the example of using three measures of income, one or two may even have a negative coefficient, when in fact we expect auto sales to be directly related to income. This is an example of a model being overspecified. One way to determine whether there could be multicollinearity between pairs of independent variables is to look at the correlation matrix of all independent variables. If we see a correlation between two independent variables that has an absolute value in the 0.9s, one of those variables should be removed from the model. Generally, we also want to avoid correlations in the 0.8s, and avoiding correlations in the 0.7s is also wise. Unfortunately, with business data the independent variables are often highly correlated, so we must be careful not to overfit the model.
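The step-5 correlation-matrix screen can be operationalized by flagging every pair of independent variables whose absolute correlation reaches a cutoff. This is a minimal NumPy sketch; the function name `flag_collinear_pairs`, the variable names, and the simulated income series are assumptions chosen to mirror the three-measures-of-income example in the text.

```python
import numpy as np

def flag_collinear_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, r) for every pair of independent variables
    whose absolute pairwise correlation meets or exceeds `threshold`.
    The 0.9 cutoff follows the rule of thumb in the text."""
    corr = np.corrcoef(X, rowvar=False)     # columns of X are variables
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                flagged.append((names[i], names[j], round(corr[i, j], 3)))
    return flagged

# Three simulated income measures sharing one underlying buying-power factor
rng = np.random.default_rng(1)
base = rng.normal(100, 10, 50)
national = base + rng.normal(0, 1, 50)
disposable = 0.8 * base + rng.normal(0, 1, 50)
per_capita = disposable / 3 + rng.normal(0, 0.3, 50)

X = np.column_stack([national, disposable, per_capita])
pairs = flag_collinear_pairs(X, ["national", "disposable", "per_capita"])
# All three pairs are flagged: each correlation is in the 0.9s,
# so at least one of these variables should be dropped from the model.
```

Because all three series are driven by the same underlying factor, every pairwise correlation lands in the 0.9s, reproducing the overspecification problem described above.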
