Data Mining And Predictive Analytics 2nd Edition Daniel T Larose, Chantal D Larose - Solutions
What is the relationship between a lift chart and a gains chart?
If lift and gains measure the proportion of hits, regardless of the cost matrix, why can we use lift charts and gains charts in the presence of misclassification costs?
What would it mean for a model to have a lift of 2.5 at the 15th percentile?
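As a rough illustration of the lift and gains questions above, the following sketch computes both quantities for the top-ranked share of records. It assumes the common convention in which gains is the share of all hits captured and lift is the hit rate among the selected records divided by the overall hit rate; the textbook's exact definitions may differ. The function name and the sample scores and labels are hypothetical.

```python
def gains_and_lift(scores, labels, fraction):
    """Return (gains, lift) for the top `fraction` of records ranked by score.

    Assumed convention (not taken from the textbook):
      gains = hits in the selected records / all hits
      lift  = hit rate in the selected records / overall hit rate
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    hits_top = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    gains = hits_top / total_hits
    lift = (hits_top / n_top) / (total_hits / len(ranked))
    return gains, lift


# Hypothetical model scores and true outcomes (1 = hit) for ten records.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

gains, lift = gains_and_lift(scores, labels, 0.2)
print(gains, lift)
```

Under this convention the two measures carry the same information: lift equals gains divided by the fraction of records selected. Read the same way, a lift of 2.5 at the 15th percentile would mean that the 15% of records the model ranks highest contain hits at 2.5 times the overall rate.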
Take a shot at improving the regression of MPG_BC 0.75 on the principal components. For example, you may wish to tweak the Box–Cox transformation, or you may wish to use an indicator variable for the luxury cars. Using whatever means you can bring to bear, obtain your best model that deals with
Using the four criteria from Chapter 5, determine the best number of principal components to extract for the gas mileage data.
Do you see some curvature in the residuals versus fitted values plot? Produce a plot of the residuals against each of the predictors. Any curvature? Add a quadratic term of one of the predictors (e.g., weight²) to the model, and see if this helps.
Use a Box–Cox transformation to try to eliminate the skewness in the normal probability plot.
Build two parallel models, one where we account for multicollinearity, and another where we do not. For which purposes may each of these models be used?
Determine which variables might be superfluous.
Determine which variables must be made into indicator variables.
Build the best multiple regression model you can for the purposes of predicting head injury severity, using all the other variables as the predictors.
Perform all possible regressions. Did the variable selection algorithms find the best regression?
For Exercises 45–49, use the crash data set, found on the book web site.
Build the best multiple regression model you can for the purposes of predicting the response, using the gender ratio as the response, and all the other variables as the predictors. Compare and contrast the results from the forward selection, backward elimination, and stepwise variable selection
Next, build the best multiple regression model you can for the purposes both of predicting the response and of profiling the predictors’ individual relationship with the response. Make sure you account for multicollinearity.
(Extra credit). Write a script that will perform all possible regressions. Did the variable selection algorithms find the best regression?
Apply the best subsets procedure, and compare against the previous methods.
Build the best multiple regression model you can for the purposes of predicting calories, using all the other variables as the predictors. Do not worry about whether the predictor coefficients are stable or not. Compare and contrast the results from the forward selection, backward elimination, and
Clearly and completely express the interpretation for the coefficient for PCT_U18. Discuss whether this makes sense.
Suppose we omit TOT_POP from the model and rerun the regression. Explain what will happen to the value of R².
How many towns are included in the sample?
Note that the variable PCT_O64 was excluded. Explain why this variable was automatically excluded from the analysis by the software. (Hint: Consider the analogous case of using too many indicator variables to define a particular categorical variable.)
Suppose a certain food was predicted to have 60 calories fewer than it actually has, based on its content of the predictor variables. Would this be considered unusual? Explain specifically how you would determine this.
For Exercises 28–29, next consider the multiple regression output from SPSS
Clearly and completely express the interpretation for the coefficient for sodium.
Discuss the presence of multicollinearity. Evaluate the strength of evidence for the presence of multicollinearity. On the basis of this, should we turn to principal components analysis?
Which predictor is negatively associated with the response? Explain how you know this.
Suppose we omit cholesterol from the model and rerun the regression. Explain what will happen to the value of R².
Which of the predictors probably does not belong in the model? Explain how you know this. What might be your next step after viewing these results?
How are we to interpret the value of b0, the coefficient for the constant term? Is this coefficient significantly different from zero? Explain how this makes sense.
How many foods are included in the sample?
What is the typical error in prediction? (Hint: This may take a bit of digging.)
What is the conclusion regarding the significance of the overall regression? How do you know? Does this mean that all of the predictors are important? Explain.
What is the response? What are the predictors?
Explain the circumstances under which the value for R² would reach 100%. Now explain how the p-value for any test statistic could reach zero.
Suppose we wished to limit the number of predictors in the regression model to a lesser number than those obtained using the default settings in the variable selection criteria. How should we alter each of the selection criteria? Now, suppose we wished to increase the number of predictors. How then
Describe the behavior of Mallows’ Cp statistic, including the heuristic for choosing the best model.
Describe how the best subsets procedure works. Why not always use the best subsets procedure?
Describe the differences and similarities among the forward selection procedure, the backward elimination procedure, and the stepwise procedure.
Compare and contrast the effects that multicollinearity has on the point and interval estimates of the response versus the values of the predictor coefficients.
Which statistics report the presence of multicollinearity in a set of predictors? Explain, using the formula, how this statistic works. Also explain the effect that large and small values of this statistic will have on the standard error of the coefficient.
Explain some of the drawbacks of a set of predictors with high multicollinearity.
Explain the difference between the sequential sums of squares and the partial sums of squares. For which procedures do we need these statistics?
Explain what it means when R²_adj is much less than R².
Discuss the concept of the level of significance (α). At what value should it be set? Who should decide the value of α? What if the observed p-value is close to α? Describe a situation where a particular p-value will lead to two different conclusions, given two different values for α.
When using indicator variables, explain the meaning and interpretation of the indicator variable coefficients, graphically and numerically.
Construct indicator variables for the categorical variable class, which takes four values, freshman, sophomore, junior, and senior.
Explain the difference between the t-test and the F-test for assessing the significance of the predictors.
Clearly explain why s and R²_adj are preferable to R² as measures for model building.
Indicate whether the following statements are true or false. If the statement is false, alter it so that the statement becomes true.
a. If we would like to approximate the relationship between a response and two continuous predictors, we would need a plane.
b. In linear regression, while the
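For the exercise above on constructing indicator variables for class, here is a minimal sketch, assuming the usual approach of k − 1 indicators for a categorical variable with k levels. The choice of freshman as the reference category and the `is_...` names are hypothetical; any one level could serve as the reference.

```python
LEVELS = ["freshman", "sophomore", "junior", "senior"]
REFERENCE = "freshman"  # assumption: an arbitrary choice of reference category


def class_indicators(value):
    """Encode one value of `class` as three 0/1 indicators (four levels minus one)."""
    if value not in LEVELS:
        raise ValueError(f"unknown class level: {value!r}")
    return {
        f"is_{level}": int(value == level)
        for level in LEVELS
        if level != REFERENCE
    }


print(class_indicators("junior"))
print(class_indicators("freshman"))
```

The reference category is represented by all three indicators being zero. Adding a fourth indicator would make the indicators sum to one for every record, which duplicates the constant term; that is the situation the hint about "too many indicator variables" in the PCT_O64 exercise points to.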
In view of the results obtained above, discuss the overall quality and adequacy of our churn classification models.
Construct a single lift chart which includes the better of the two CART models, the better of the two C4.5 models, and the neural network model. Which model is preferable over which regions?
Construct a lift chart for the neural network model. What is the estimated lift at 20%? 33%? 40%? 50%?
Next, apply a neural network model to predict churn. Construct a table containing the same measures as in Exercise 27.
Now turn to a C4.5 decision tree model, and redo Exercises 17–35. Compare the results. Which model is preferable?
Construct a single lift chart for both of the CART models. Which model is preferable over which regions?
Construct a lift chart for the CART model with the adjusted misclassification costs. What is the estimated lift at 20%? 33%? 40%? 50%?
Construct a gains chart for the default CART model. Explain the relationship between this chart and the lift chart.
Construct a lift chart for the default CART model. What is the estimated lift at 20%? 33%? 40%? 50%?
Perform a cost/benefit analysis for the CART model with the adjusted misclassification costs. Use the same cost/benefits assignments as for the default model. Find the overall anticipated cost. Compare with the default model, and formulate a recommendation as to which model is preferable.
Perform a cost/benefit analysis for the default CART model from Exercise 1 as follows. Assign a cost or benefit in dollar terms for each combination of false and true positives and negatives, similarly to Table 15.5. Then, using the contingency table, find the overall anticipated cost.
Based on your answer to the previous exercise, adjust the misclassification costs for your CART model to reduce the prevalence of the more costly type of error. Rerun the CART algorithm. Compare the false positive, false negative, sensitivity, specificity, and overall error rate with the previous
In a typical churn model, in which interceding with a potential churner is relatively cheap but losing a customer is expensive, which error is more costly, a false negative or a false positive (where positive=customer predicted to churn)? Explain.
Apply a CART model for predicting churn. Use default misclassification costs. Construct a table containing the following measures:
a. Accuracy and overall error rate
b. Sensitivity and false-positive rate
c. Specificity and false-negative rate
d. Proportion of true positives and proportion of false
What is meant by a confluence of results?
For model selection, should model evaluation be performed on the training data set or the test data set, and why?
What should one look for when evaluating a gains chart?
Describe the trade-off between reaching out to a large number of customers and having a high expectation of success per contact.
Explain in your own words what is meant by lift.
When misclassification costs are involved, what is the best model evaluation measure?
Are accuracy and overall error rate always the best indicators of a good model?
In your situation from the previous exercise, describe the expected consequences of increasing the false negative cost. Why would these be beneficial?
The text describes a situation where a false positive is worse than a false negative. Describe a situation from the medical field, say from screen testing for a virus, where a false negative would be worse than a false positive. Explain why it would be worse.
If we use a hypothesis testing framework, explain what represents a type I error and a type II error.
Describe the difference between the proportion of false-positives and the false-positive rate.
What is the term used for the proportion of true positives in the medical literature? Why do we prefer to avoid this term in this book?
What is the relationship between false positive rate and sensitivity?
Suppose our model has perfect sensitivity and perfect specificity. What then is our accuracy and overall error rate?
Suppose our model has perfect sensitivity. Why is that insufficient for us to conclude that we have a good model?
True or false: If model A has better accuracy than model B, then model A has fewer false negatives than model B. If false, give a counterexample.
What is the relationship between accuracy and overall error rate?
What is the difference between the total predicted negative and the total actually negative?
What is a false positive? A false negative?
Describe the general form of a contingency table.
What might be a drawback of evaluation measures based on squared error? How might we avoid this?
Describe the trade-off between model complexity and prediction error.
How is the square root of the MSE interpreted?
Why do we not use the average deviation as a model evaluation measure?
What is the minimum descriptive length principle, and how does it represent the principle of Occam’s razor?
Why do we need to evaluate our models before model deployment?
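Several of the questions above turn on how the evaluation measures are computed from a 2×2 contingency table. The sketch below uses the standard definitions as I understand them: rates are conditioned on the actual class, and proportions on the predicted class. The textbook's own table layout should take precedence, and the counts are hypothetical.

```python
def evaluation_measures(tn, fp, fn, tp):
    """Compute common measures from the four cells of a 2x2 contingency table."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "overall_error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),           # among actual positives
        "specificity": tn / (tn + fp),           # among actual negatives
        "false_positive_rate": fp / (fp + tn),   # 1 - specificity
        "false_negative_rate": fn / (fn + tp),   # 1 - sensitivity
        "proportion_true_positives": tp / (tp + fp),   # among predicted positives
        "proportion_false_positives": fp / (tp + fp),  # among predicted positives
    }


# Hypothetical counts for 100 records.
measures = evaluation_measures(tn=80, fp=10, fn=5, tp=5)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts the false-positive rate is 10/90 while the proportion of false positives is 10/15, which shows why the two terms should not be used interchangeably. Accuracy and overall error rate always sum to one.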
Identify the set of outliers in the lower right of the residuals versus fitted values plot. Have we uncovered a natural grouping? Explain how this group would end up in this place in the graph.
Perform the regression of ln pct (ln of percentage over 64) on ln popn, and obtain the regression diagnostics. Explain how taking the ln of percentage over 64 has tamed the residuals versus fitted values plot.
Describe the pattern in the plot of the residuals versus the fitted values. What does this mean? Are the assumptions validated?
Describe the pattern in the normal probability plot of the residuals. What does this mean?
Apply the ln transformation to the predictor, giving us the transformed predictor variable ln popn. Note that the application of this transformation is due solely to the skewness inherent in the variable itself (shown by the scatter plot), and is not the result of any regression diagnostics.
Identify the four cities that appear larger than the bulk of the data in the scatter plot.
Construct a scatter plot of percentage over 64 versus popn. Is this graph very helpful in describing the relationship between the variables?
Construct and interpret a 95% confidence interval for the nutrition rating for a randomly chosen cereal with sodium content of 100.
Open the California data set (Source: US Census Bureau, www.census.gov, and available on the book website, www.DataMiningConsultant.com), which consists of some census
Construct and interpret a 95% confidence interval for the true nutrition rating for all cereals with a sodium content of 100.
What is the typical error in predicting rating based on sodium content?
Construct the graphics for evaluating the regression assumptions. Are they validated?
Put the outlier back in the data set for the rest of the analysis. On the basis of the scatter plot, is there evidence of a linear relationship between the variables? Discuss. Characterize their relationship, if any.