Data Mining And Predictive Analytics 2nd Edition Daniel T Larose, Chantal D Larose - Solutions
On your own, recapitulate the trinary classification analysis undertaken in this chapter using the Loans3 data sets. (Note that the results may differ slightly due to different settings in the CART models.) Report all salient results, including a summary table, similarly to Table 17.9.
Using the results in Tables 17.12 and 17.14, confirm the values for the evaluation measures in Table 17.15.
Adjust Table 17.13 so that there are zeroes on the diagonal and the matrix is scaled, similarly to Table 17.7.
Provide justifications for each of the direct costs given in Table 17.5.
When misclassification costs are involved, what is the best metric for comparing model performance?
Which cost matrix should we use when comparing models?
Why do we adjust our cost matrix so that there are zeroes on the diagonal?
Explain how we determine the principal and interest amounts for the Loans problem.
Express in your own words how we interpret the following measures: a. D-sensitivity, where D represents the denied class in the Loans problem. b. False D rate. c. Proportion of true Ds. d. Proportion of false Ds.
Use the term “diagonal elements of the contingency table” to define (i) accuracy and (ii) overall error rate.
Interpret the proportion of true As and the proportion of false As.
What is the relationship between the proportion of true As and the proportion of false As?
Why do we avoid the term positive predictive value in this book?
How are A-sensitivity and false A rate interpreted?
What is the relationship between false A rate and A-sensitivity?
Explain the Σ notation used in this chapter for the marginal totals and the grand total of the contingency tables.
Explain why the true positive/false positive/true negative/false negative usage is not applicable to classification models with trinary targets.
Finally, assume that 50% of those customers who are in danger of churning, and with whom the company intervenes, will stay with the company, and 50% will churn anyway. Redo Exercises 45–50 under this assumption.
Next, assume the company’s intervention strategy is perfect, and that everyone the company intervenes with to stop churning will not churn. Redo Exercises 45–50 under this assumption.
Construct a table of evaluation measures for the two models, similarly to Table 16.13.
Using the training set, and the cost matrix, develop a CART model for predicting Churn. Call this Model 2.
Using the training set, develop a CART model for predicting Churn. Do not use misclassification costs. Call this Model 1.
Partition the Churn data set into a training data set and a test data set.
Why don’t we rebalance the test data set?
Suppose the classification algorithm of choice had no method of applying misclassification costs. a. What would be the resampling ratio for using rebalancing as a surrogate for misclassification costs? b. How should the training set be rebalanced?
Revenue per customer.
Model cost.
Use Result 3 to readjust the adjusted misclassification costs, so that the readjusted false negative cost is $1. Interpret the readjusted false positive and false negative costs. For Exercises 33–42, consider two classification models: Model 1 is a naïve model with no misclassification costs,
Use Result 3 to readjust the adjusted misclassification costs, so that the readjusted false positive cost is $1. Interpret the readjusted false positive and false negative costs.
Calculate the positive confidence threshold. Use Result 2 to state when the model will make a positive classification.
Use Result 1 to construct the adjusted cost matrix. Interpret the adjusted costs.
Construct the cost matrix. Provide rationales for each cost.
Explain why (i) misclassification costs are needed in this scenario, and (ii) the overall error rate is not the best measure of a good model.
Why does rebalancing work as a surrogate for misclassification costs? Use the following information for Exercises 27–44. Suppose that our client is a retailer seeking to maximize revenue from a direct marketing mailing of coupons to likely customers for an upcoming sale. A positive response
What does it mean to say that the resampling ratio is data-driven?
Explain how we do such rebalancing when the adjusted false positive cost is greater than the adjusted false negative cost.
Why might we need rebalancing as a surrogate for misclassification costs?
What do we mean when we say that the misclassification costs in the case study are data-driven?
What are direct costs? Opportunity costs? Why should we not include both when constructing our cost matrix?
How might Result 3 be of use to an analyst making a presentation to a client?
Explain what is meant by decision invariance under scaling.
Clearly explain how Figure 16.1 demonstrates the positive classification criterion for a C5.0 binary classifier.
Explain the positive classification criterion.
What is the positive confidence threshold?
What is the adjusted false positive cost? The adjusted false negative cost?
What is the difference between confidence and positive confidence?
True or false: We can always adjust the costs in our cost matrix so that the two cells representing correct decisions have zero cost.
Explain decision invariance under row adjustment.
Describe what is meant by the minimum expected cost principle.
True or false: The overall error rate is always the best indicator of a good model.
Proportion of false negatives.
Proportion of true negatives.
Proportion of false positives.
Proportion of true positives.
False negative rate.
Specificity.
False positive rate.
For Exercises 1–8, state what you would expect to happen to the indicated classification evaluation measure, if we increase the false negative misclassification cost, while not increasing the false positive cost. Explain your reasoning. Sensitivity.
Recall the WEKA Logistic example for classifying cereals as either high or low. Compute the probability that the fourth instance from the test set is classified either high or low. Does your probability match that produced by WEKA?
Open the breast cancer data set. Investigate, for each significant predictor, whether the linearity assumption is warranted. If not, ameliorate the situation using the methods discussed in this chapter.
Open the data set, German, which is provided on the textbook website. The data set consists of 20 predictors, both continuous and categorical, and a single response variable, indicating whether the individual record represents a good or bad credit risk. The predictors are as follows, with amounts
Find the probability of high income for a 50-year-old married male with 16 years education working 40 hours per week with capital gains of $6000.
Find the probability of high income for a 20-year-old single female with 12 years education working 20 hours per week with no capital gains or losses.
Construct and interpret 95% confidence intervals for the coefficients for age, sex-male, and educ-squared. Verify that these predictors belong in the model.
Find the estimated logit.
For indicator categories that are not significant, collapse the categories with the reference category. (How are you handling the category with the 0.083 p-value?) Rerun the logistic regression with these collapsed categories. Use the results from your rerunning of the logistic regression for
Consider the results from Table 13.26. Construct the logistic regression model that produced these results.
Construct and interpret a 95% confidence interval for each coefficient. Use Table 13.26 for Exercises 29–31.
Find the probability of high income for someone working 30, 40, 50, and 60 hours per week.
Find the form of the estimated logit.
Construct the logistic regression model developed in the text, with the age² term and the indicator variable age 33–65. Verify that using the quadratic term provides a higher estimate of the probability of high income for the 32-year-old than the 20-year-old. Use Table 13.25 for Exercises 26–28.
Clearly interpret the value of the coefficients for the following predictors: a. Bland chromatin. b. Normal nucleoli.
Calculate the 95% confidence intervals for the following predictor coefficients. a. Clump thickness. b. Mitoses. c. Comment as to the evidence provided by the confidence interval for the mitoses coefficient regarding its significance.
Find the probability that a tumor is malignant, given the following: a. The values for all predictors are at the minimum (1). b. The values for all predictors are at a moderate level (5). c. The values for all predictors are at the maximum (10).
Assume that our level of significance is 0.11. Express the logit, using all significant variables.
Did you drop cell shape uniformity in the previous exercise? Are you surprised that the variable is now a significant predictor? Discuss the importance of retaining variables of borderline significance in the early stages of model building.
Explain why the deviance difference fell, but only by a small amount.
Explain what will happen to the deviance difference if we rerun the model, dropping the nonsignificant variables. Work by analogy with the linear regression case.
Discuss how you should handle the variables with p-values around 0.05, 0.10, or 0.15.
Discuss whether the variables you cited in Exercise 15 should be used in predicting the class of tumor with a new unseen data set.
Which variables do not appear to be significant predictors of breast tumor class? How can you tell?
Without reference to inferential significance, express the form of the logit.
What is the value of the deviance difference? Is the overall logistic regression significant? How can you tell? What does it mean to say that the overall logistic regression is significant?
Explain why, for data that are missing one or more indicator variable values, it would not be appropriate to simply ignore these missing variables when making an estimation. Provide options for the data analyst in this case.
Discuss the use of predictors that turn out to be nonsignificant in estimating the response. When might this be appropriate, if at all? Why would this not be appropriate in general?
Discuss the assumption that the odds ratio is constant across the range of the predictor, with respect to various types of relationships between the predictor and the response. Provide modeling options for when this assumption may not be reflected in the data.
Discuss the role of statistical inference with respect to the huge sample sizes prevalent in data mining.
If the difference between a particular indicator variable and the reference category is not significant, then what should the analyst consider doing?
Describe how we determine the statistical significance of the odds ratio, using a confidence interval.
What is the definition of the odds ratio? What is the relationship between the odds ratio and the slope coefficient β1? For what quantity is the odds ratio sometimes used as an estimate?
What are odds? What is the difference between odds and probability?
Explain clearly how the slope coefficient β1, and its estimate b1, may be interpreted in logistic regression. Provide at least two examples, using both a categorical and a continuous predictor.
Explain what is meant by maximum-likelihood estimation and maximum-likelihood estimators.
By hand, derive the logit result g(x) = β0 + β1x.
Indicate whether the following statements are true or false. If the statement is false, alter it so that the statement becomes true. a. Logistic regression refers to methods for describing the relationship between a categorical response variable and a set of categorical predictor variables. b.
Make sure the target variable takes the flag type. Compare the sign of (weight for Age-to-Neuron1)+(weight for Bias-to-Neuron1)*(weight for Neuron 1-to-Output node) for the good risk output node, as compared to the bad loss output node. Explain whether this makes sense, given the data, and why.
Consider the following quantity: (weight for Age-to-Neuron1)+(weight for Bias-to-Neuron1)*(weight for Neuron 1-to-Output node). Explain whether this makes sense, given the data, and why.
Run a NN model predicting income based only on age. Use the default settings and make sure there is one hidden layer with one neuron.
Compare the neural network model with the classification and regression tree (CART) and C4.5 models for this task in Chapter 11. Describe the benefits and drawbacks of the neural network model compared to the others. Is there convergence or divergence of results among the models? For Exercises
Which variables, in order of importance, are identified as most important for classifying churn?