Question: Read the article "Using Binary Logistic Regression to Investigate High Employee Turnover," available under Content for Module 6 (the two PowerPoint presentations in the attachment), by clicking the link below, and answer the following questions:

  1. What were the reasons for high turnover at the manufacturing company?
  2. How was Jeff Parks able to diagnose the cause?
  3. What are your recommendations for what the company should do to reduce the turnover rate without any legal challenges from the people who were not hired?
Article link: http://blog.minitab.com/blog/real-world-quality-improvement/using-binary-logistic-regression-to-investigate-high-employee-turnover/comment-submitted

Module 6: Logistic Regression

Statistical Techniques (by dependent and independent variable type)
- Continuous Y, continuous X: Regression
- Continuous Y, categorical X: T-Test / ANOVA
- Categorical Y, continuous X: Logistic Regression
- Categorical Y, categorical X: Contingency Tables / Chi-Square

Multiple regression: the dependent variable is continuous. Logistic regression: the dependent variable is categorical (generally binary). In both cases, the independent variables can be any combination of continuous and categorical variables.

What is Logistic Regression?
- A popular and powerful classification technique
- Can be used for profiling or classification (e.g., identifying factors that differentiate male and female CEOs)
- A supervised technique

When to use Logistic Regression?
- When the target variable is categorical, not continuous
- In some cases we may choose to convert continuous data, or data with multiple outcomes (levels), into binary data
- Binary targets (exactly 2 values: response vs. no response; default vs. no default, etc.)
- Ordinal targets (more than 2 values, ordered from low to high: low vs. medium vs. high risk, etc.)
- Nominal targets (more than 2 values, no particular order: product A vs. product B vs. product C vs. product D; Buy/Hold/Sell stocks)

LR Application Examples
- Will a credit card applicant pay off a bill or not? (Yes/No)
- Will a mortgage applicant default? (Yes/No)
- Will someone who receives a direct mail solicitation respond to it? (Yes/No)
- When we call the customer, will they pick up the phone? (Yes/No)
- Whether a patient has a given disease (e.g., diabetes) based on observed characteristics of the patient (age, gender, BMI, blood tests, ...)
- Whether a voter will vote for the Democratic or Republican candidate based on age, gender, income, race, state of residence, ...
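At the heart of the technique is the logistic function, which maps any real-valued score to a probability between 0 and 1. A minimal sketch in Python, where the input `z` is a hypothetical stand-in for the linear score (the b0 + b1·x part of the model):

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: P(y=1 | z) = e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))

# A score of 0 maps to a probability of exactly 0.5;
# large positive scores approach 1, large negative scores approach 0.
print(logistic(0))            # 0.5
print(round(logistic(4), 3))  # 0.982
print(round(logistic(-4), 3)) # 0.018
```

This S-shaped curve, rather than a straight line, is what the slides fit to the data.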
LR Applications (continued)
- Classifying customers as returning or non-returning (classification)
- Finding factors that differentiate between male and female top executives (profiling)
- Predicting the approval or disapproval of a loan based on information such as credit scores and demographics (classification)
- Dependent variable: buy/don't buy, success/failure, default/no-default, fraud/no-fraud (not limited to binary variables)

Titanic Passengers
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. One reason so many people died was that there were not enough lifeboats for the passengers and crew. Some groups of people were more likely to survive than others.

Questions related to the Titanic tragedy
- What sorts of people were likely to survive?
- Can we accurately predict which passengers survived the tragedy?
- Were there some key characteristics of the survivors?
- Were some passenger groups more likely to survive than others?
Logistic regression is an appropriate technique for answering these questions. https://www.kaggle.com/c/titanic

Logistic Regression (LR)
- An extension of linear regression concepts to situations where the outcome variable is categorical.
- The predicted value from an LR model is an estimated probability of the dependent variable's values, e.g., the probability that a person commits fraud, or the probability that people evacuate their homes when a hurricane hits.
- LR can be used to determine which (independent) variables are important or critical in predicting the dependent variable.
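To make the "predicted value is an estimated probability" idea concrete, here is a minimal sketch that fits a one-predictor logistic regression by gradient ascent on the log-likelihood. The data set and learning settings are made up for illustration; they are not the Titanic data or the method used in the slides (which use JMP's Fit Model platform):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up data: x could be, say, a fare amount; y = 1 means "survived".
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0,   0,   0,   1,   1,   1]

# Fit log(p/(1-p)) = b0 + b1*x by gradient ascent on the log-likelihood.
b0, b1, step = 0.0, 0.0, 0.1
for _ in range(5000):
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0 += step * g0
    b1 += step * g1

# The fitted model returns a probability between 0 and 1, not a raw number:
p_low, p_high = sigmoid(b0 + b1 * 1.0), sigmoid(b0 + b1 * 6.0)
print(round(p_low, 3), round(p_high, 3))
```

A case with a low x value gets a predicted survival probability near 0, and a case with a high x value gets one near 1; the analyst then chooses a cutoff to turn those probabilities into class predictions.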
Differences between Multiple Regression and Logistic Regression

Multiple Regression
- Dependent variable is continuous
- Seeks to predict the numerical value of the dependent variable Y based on the values of independent variables (both categorical and continuous)
- Many assumptions

Logistic Regression
- Dependent variable is categorical
- Seeks to predict the probability (via the logit) that the dependent variable falls into a category, based on the values of independent (predictor) variables
- Fewer assumptions than the regression model

Logistic Regression (Logit Model)
- The goal is to find a function of the dependent variable Y, called the logit, that relates the predictors to the 0/1 outcome.
- The logistic (or logit) function is

  P(y | x) = e^x / (1 + e^x)

[Figure: logistic regression fits an S-shaped curve of P(y | x) against x, compared with the straight line fit by linear regression.]
- The term "logistic regression" comes from the logistic curve that is fit to this model, rather than the straight line fit in linear regression.

Calculation of Odds
In LR, the dependent variable (the logit) is modeled as:

  log(p / (1 - p)) = b0 + b1*X1 + ... + bk*Xk

where p is the probability that the dependent variable Y = 1 and (1 - p) is the probability that Y = 0.
- The ratio p/(1-p) is called the odds of belonging to category 1 (Y = 1): odds = p/(1-p), or p = odds/(1+odds).
- Example: if the probability of winning a game is 0.8 (p = 0.8), the odds of winning are 0.8/(1-0.8) = 0.8/0.2 = 4. On average, you would win four times for every one time you lose.

Relationship between Odds and Probability
- Probability = odds/(1+odds); odds = p/(1-p)
- Probability ranges from 0 to 1; odds range from 0 to infinity
- When the probability = 0, odds = 0
- When the probability = 0.5, odds = 1
- When the probability = 1, odds = infinity

Odds Ratio
The odds ratio is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. The term is also used to refer to sample-based estimates of this ratio.
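The odds/probability conversions above are simple enough to check directly. A short sketch reproducing the slide's game example (p = 0.8 implies 4:1 odds):

```python
def odds_from_prob(p):
    """odds = p / (1 - p)"""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """p = odds / (1 + odds)"""
    return odds / (1.0 + odds)

# The slide's example: a 0.8 probability of winning means 4:1 odds --
# on average, four wins for every loss (up to floating-point rounding).
print(odds_from_prob(0.8))   # ~4.0
print(prob_from_odds(4.0))   # 0.8
print(odds_from_prob(0.5))   # 1.0: probability 0.5 means even odds
```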
These groups might be men and women, an experimental group and a control group, or any other dichotomous classification. If the probabilities of the event in the two groups are p1 (first group) and p2 (second group), then the odds ratio is

  OR = (p1/q1) / (p2/q2), where q_x = 1 - p_x.

Odds Ratio (continued)
- An odds ratio of 1 indicates that the condition or event under study is equally likely to occur in both groups.
- An odds ratio greater than 1 indicates that the condition or event is more likely to occur in the first group.
- An odds ratio less than 1 indicates that the condition or event is less likely to occur in the first group.
- The odds ratio must be nonnegative if it is defined.

Odds Ratio Example
Suppose that in a sample of 100 men, 90 drank beer in the previous week, while in a sample of 100 women only 20 drank beer in the same period. The odds of a man drinking beer are 90 to 10, or 9:1 = 9, while the odds of a woman drinking beer are only 20 to 80, or 1:4 = 0.25. The odds ratio of men vs. women drinking beer is thus 9/0.25 = 36, showing that men are much more likely to drink beer than women.

Assumptions of Logistic Regression
"In logistic regression no assumptions are made about the distributions of the independent variables. However, the independent variables should not be highly correlated with one another (collinearity) because this could cause problems with estimation."
- If a variable you think should be statistically significant is not, consult the correlation coefficients.
- If two variables are highly correlated, try dropping the less theoretically important of the two.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1065119/

Titanic: Who Survived?
Answers based on the LR analysis:
- First- and second-class females, particularly if they were young and traveling alone, had a very good chance of surviving.
- Women and children had higher odds of surviving than others.
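The beer-drinking example above can be computed directly from the odds-ratio formula:

```python
def odds_ratio(p1, p2):
    """Odds ratio of group 1 vs. group 2: (p1/q1) / (p2/q2), where q = 1 - p."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# The slide's example: 90 of 100 men drank beer vs. 20 of 100 women.
p_men, p_women = 90 / 100, 20 / 100
print(round(odds_ratio(p_men, p_women), 6))  # 36.0: odds of 9 vs. 0.25
```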
The next several slides analyze the Titanic data.

Titanic Visualization

Mosaic Plot: Survived by Sex
- Among those who survived, 67.8% were female and 32.2% were male.
- Among the females, 72.75% survived; among the males, 19.1% survived.

Mosaic Plot: Survived by Passenger Class
- Among those who survived, 40% were 1st class, 23.8% were 2nd class, and 36.2% were 3rd class.
- Among 1st-class passengers, 61.92% survived; among 2nd-class passengers, 42.96% survived; among 3rd-class passengers, 25.53% survived.

Simple LR: Survived by Age
- The survival rate for younger passengers was higher than for older passengers.
- The model is not significant at alpha = 0.05, but it is significant at alpha = 0.1.

Simple LR: Survived by Fare
- Passengers who paid a lower fare had a lower probability of surviving (a probability of not surviving around 0.75), while people who paid more than $200 had a probability of surviving above 0.80.
- In other words: higher class -> higher fare -> higher survival rate.
- The model is significant because of the low p-value (less than alpha = 0.05).

Changing the Value Ordering of the Variable Survived
Since we are interested in the survival rates, apply the Value Ordering column property so that 1 (Yes, survived) appears first and 0 (No) second in the reports (Confusion Matrix, Odds Ratios, etc.): select the column Survived, go to Cols > Column Info, select Value Ordering under Column Properties, move "1" up, then click Apply and OK.

Multiple LR: Fit Model (Stepwise)
Note: if you use the Stepwise platform to develop the LR model, you will not get an option to display odds ratios.
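The mosaic-plot percentages above read a 2x2 table in two directions: "among the survivors, what share were female?" versus "among the females, what share survived?". A sketch with hypothetical counts (the numbers below are invented for the demo, not taken from the slides):

```python
# Hypothetical counts for a 2x2 table of Sex x Survived.
counts = {
    ("female", "yes"): 233, ("female", "no"): 81,
    ("male",   "yes"): 109, ("male",   "no"): 468,
}

survivors = sum(n for (sex, surv), n in counts.items() if surv == "yes")
females   = sum(n for (sex, surv), n in counts.items() if sex == "female")

# "Among the survivors, X% were female" (share of a column) ...
pct_female_given_survived = 100 * counts[("female", "yes")] / survivors
# ... versus "among the females, Y% survived" (share of a row).
pct_survived_given_female = 100 * counts[("female", "yes")] / females

print(round(pct_female_given_survived, 1))
print(round(pct_survived_given_female, 1))
```

The two percentages come from the same cell count but use different denominators, which is why a mosaic plot can be read in both directions.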
To get odds ratios, run the LR model without Stepwise and remove the non-significant independent variables one at a time using the backward elimination method.

Fit Model: Independent Variables
- Sex, Passenger Class, Age, Siblings & Spouses, and Port are all significant at the 0.05 level; Fare and Parents & Children are not.
- The overall model is significant because the p-value is very low (less than alpha = 0.05).

Fit Model: R-Square
- The interpretation is not the same as in linear regression, but these values can be read as an approximation of the variance in the outcome accounted for by the IVs.
- The Entropy R-Square for the model (training set) is 0.3064. R-Square is higher for the training set than for the test set, indicating a little overfitting in this model.
- The misclassification rate for the training set is 0.2049 (about 20%), meaning the model predicts about 80% of cases correctly. For the test set, the misclassification rate is 0.2688 (about 27%). The difference between the training and test misclassification rates again indicates a little overfitting.

Lack of Fit
- If the "fit" model is significant and the "lack of fit" model is not significant, the fit model is good; there is no need to add any cross effects of the IVs.
- If the "fit" model is significant and the "lack of fit" model is also significant, the fit model can be improved by adding cross effects of the independent variables.

Parameter Estimates (for Not Survived)
- From the parameter estimates, female passengers have a higher survival rate than males, and young passengers have higher survival rates than older passengers.
- The chi-square is highest for Sex, second highest for passenger class (2 vs. 1), and next for Age. This tells us which variables are most and least important in predicting whether a passenger survived.
- The Effect Likelihood Ratio Tests provide the same information.
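The misclassification rate and accuracy quoted above come straight from a confusion matrix. A sketch with a hypothetical confusion matrix (the counts are invented for illustration, not the model's actual results):

```python
# Hypothetical confusion-matrix counts: rows = actual class, columns = predicted.
tp, fn = 240, 100   # actual Yes, predicted Yes / predicted No
fp, tn = 80,  458   # actual No,  predicted Yes / predicted No

total = tp + fn + fp + tn
misclassification_rate = (fn + fp) / total   # off-diagonal share
accuracy = 1 - misclassification_rate        # correct classification rate

print(round(misclassification_rate, 4))
print(round(accuracy, 4))
```

A misclassification rate around 0.205 on these made-up counts corresponds to roughly 80% accuracy, mirroring the arithmetic in the slide.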
You can use either one to test the significance of the IVs.

Prediction Profiler (Probability)
- Shows the relationship between the IVs and the dependent variable, and can be used to compute the probability of survival for a given situation.
- For example, as age increases, the probability of survival decreases; as the number of siblings increases, the probability of survival decreases.

Unit Odds Ratios (Titanic)
Note that the value ordering of the variable Survived has been changed so that Yes appears first and No second.
- For every additional year of age, the odds of surviving are multiplied by 0.97; for every additional sibling/spouse, the odds of surviving are multiplied by 0.699.
- Equivalently, for every additional year of age, the odds of not surviving are multiplied by 1.03; for every additional sibling, the odds of not surviving are multiplied by 1.43.

Range Odds Ratios
- As age increases from its minimum to its maximum, the odds of surviving are multiplied by 0.09 (the higher the age, the lower the odds of survival). As the number of siblings increases from its minimum to its maximum, the odds of surviving are multiplied by 0.057 (a single passenger had higher odds of survival than passengers traveling with spouses and siblings).
- Equivalently, as age increases from minimum to maximum, the odds of not surviving are multiplied by 10.81; as the number of siblings increases from minimum to maximum, the odds of not surviving are multiplied by 17.51.

Odds Ratios: Passenger Class, Sex, and Port
- Reported for Survived as the odds of Yes versus No.

Confusion Matrix
- Important in determining the accuracy of the model: it tabulates misclassifications (errors in prediction). Correct classification rate = 1 - misclassification rate.

Receiver Operating Characteristic (ROC) Curve: Test Data
- The area under the curve (AUC) is a measure of how well the model classifies the data. The diagonal line (a random classifier) has an AUC of 0.5.
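Unit and range odds ratios are both derived from a fitted coefficient b on the log-odds scale: the unit odds ratio is e^b (the odds multiplier per one-unit increase in the predictor), and the range odds ratio is e^(b*(max - min)). A sketch with a hypothetical age coefficient, chosen only to roughly echo the 1.03-per-year pattern above (it is not the slides' fitted value):

```python
import math

b_age = 0.0296                  # hypothetical coefficient for age (log-odds per year)
age_min, age_max = 0.42, 80.0   # hypothetical observed age range

unit_or = math.exp(b_age)                         # odds multiplier per extra year
range_or = math.exp(b_age * (age_max - age_min))  # odds multiplier min -> max age

print(round(unit_or, 2))
print(round(range_or, 1))

# The range odds ratio is just the unit odds ratio compounded over the range:
assert abs(unit_or ** (age_max - age_min) - range_or) < 1e-9
```

This is why a per-year odds ratio that looks tiny (1.03) can become a large range odds ratio (over 10) across the full span of ages.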
A perfect classifier has an AUC of 1.0. The area under the curve for Survived = Yes is 0.8541, indicating that the model predicts better than the random classification model.

Lift Curve: Test Data
- Another measure of how well a model can classify outcomes. Lift measures how much "richness" in the response we achieve by applying a classification rule to the data.
- A lift curve plots lift (Y-axis) against the portion of the data (X-axis). The higher the lift at a given portion, the better the model is at correctly classifying the outcome within that portion.
- For Survived = Yes, the lift at portion = 0.20 is approximately 2. This means that among the top 20% of the model's predicted probabilities, the number of actual Yes outcomes is 2 times higher than we would expect if we had chosen 20% of the data set at random.
- If the model is not classifying the data well, the lift will hover around 1.0 across all of the portion values.

Pitfalls
- Usually one of the events is rare. Rule of thumb: train on an equal number of zeros and ones, then adjust later.
- Logistic regression gives the probability of a "one"; the analyst still has to decide the cutoff percentage that determines the predicted outcome.
- Modeling a target with more than 2 values gets increasingly complex; break it out into several binary models.

Summary
- Logistic regression is similar to linear regression, except that it is used with a categorical response.
- It can be used for explanatory tasks (profiling) or predictive tasks (classification).
- The predictors are related to the response Y via a nonlinear function called the logit.
- As in linear regression, the number of predictors can be reduced via variable selection.
- Logistic regression can be generalized to more than two classes.
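Lift at a given portion can be computed directly from predicted probabilities and actual outcomes: sort by predicted probability, take the top portion, and compare its response rate to the overall response rate. A sketch on a tiny made-up data set (scores and outcomes are invented for illustration):

```python
def lift_at(portion, probs, actuals):
    """Lift = response rate among the top `portion` of cases ranked by
    predicted probability, divided by the overall response rate."""
    ranked = sorted(zip(probs, actuals), key=lambda t: t[0], reverse=True)
    k = max(1, int(round(portion * len(ranked))))
    top_rate = sum(y for _, y in ranked[:k]) / k
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate

# Made-up scores and outcomes: 10 cases, 4 of which are actual "Yes" (1).
probs   = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
actuals = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

# Top 20% = 2 cases, both actual Yes: rate 1.0 vs. overall 0.4 -> lift 2.5.
print(lift_at(0.20, probs, actuals))
```

At portion = 1.0 the top group is the whole data set, so the lift is exactly 1.0, which is why every lift curve ends at 1.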
