Data Mining And Predictive Analytics 2nd Edition Daniel T Larose, Chantal D Larose - Solutions
Obtain the Cook’s distance value for the outlier. Is it influential?
Using the scatter plot, explain why the y-intercept changed more than the slope when the outlier was omitted.
Omit the outlier. Perform the same regression. Compare the values of the slope and y-intercept for the two regressions.
Perform the appropriate regression.
We are interested in predicting nutrition rating based on sodium content. Construct the appropriate scatter plot. Note that there is an outlier. Identify this outlier. Explain why this cereal is an outlier.
Suppose someone said that knowing the number of stolen bases a player has explains most of the variability in the number of times the player gets caught stealing. What would you say? For Exercises 72–85, use the Cereals data set.
Clearly interpret the meaning of the slope coefficient.
Calculate and interpret the correlation coefficient.
Inferentially, is there a significant relationship between the two variables? What tells you this?
Interpret the y-intercept. Does this make any sense? Why or why not?
What is the typical error in predicting the number of times a player is caught stealing, given his number of stolen bases?
Find and interpret the statistic that tells you how well the data fit the model.
Perform the regression of the number of times a player has been caught stealing versus the number of stolen bases the player has.
On the basis of the scatter plot, is a transformation to linearity called for? Why or why not?
We are interested in investigating whether there is a linear relationship between the number of times a player has been caught stealing and the number of stolen bases the player has. Construct a scatter plot, with “caught” as the response. Is there evidence of a linear relationship?
List the influential observations, according to Cook’s distance and the F criterion. Next, subset the Baseball data set so that we are working with batters who have at least 100 at bats. Use this data set for Exercises 62–71.
List the high leverage points. Why is Greg Vaughn a high leverage point? Why is Bernie Williams a high leverage point?
List the outliers. What do all these outliers have in common? For Orlando Palmeiro, explain why he is an outlier.
Construct and interpret a 95% prediction interval for a randomly chosen player with a 0.300 batting average. Is this prediction interval useful?
Construct and interpret a 95% confidence interval for the mean number of home runs for all players who had a batting average of 0.300.
Calculate the correlation coefficient. Construct a 95% confidence interval for the population correlation coefficient. Interpret the result.
Construct and interpret a 95% confidence interval for the unknown true slope of the regression line.
Perform the hypothesis test for determining whether a linear relationship exists between the variables.
What percentage of the variability in the ln home runs does batting average account for?
What is the size of the typical error in predicting the number of home runs, based on the player’s batting average?
Estimate the number of home runs (not ln home runs) for a player with a batting average of 0.300.
State the regression equation (from the regression results) in words and numbers.
Write the population regression equation for our model. Interpret the meaning of β0 and β1.
Construct a plot of the residuals versus the fitted values. Do you see strong evidence that the constant variance assumption has been violated? (Remember to avoid the Rorschach effect.) Therefore conclude that the assumptions are validated.
Take the natural log of home runs, and perform a regression of ln home runs on batting average. Obtain a normal probability plot of the standardized residuals from this regression. Does the normal probability plot indicate acceptable normality?
Construct a plot of the residuals versus the fitted values (fitted values refers to the ŷ’s). What pattern do you see? What does this indicate regarding the regression assumptions?
Perform a regression of home runs on batting average. Obtain a normal probability plot of the standardized residuals from this regression. Does the normal probability plot indicate acceptable normality, or is there skewness? If skewed, what type of skewness?
Refer to the previous exercise. Which regression assumption might this presage difficulty for?
What would you say about the variability of the number of home runs, for those with higher batting averages?
Informally, is there evidence of a relationship between the variables?
Construct a scatter plot of home runs versus batting average.
Compare your results for the hypothesis test and the confidence interval. Comment.
Assume normality. Construct a 90% confidence interval for the population correlation coefficient. Interpret the result.
Calculate the correlation coefficient r.
Suppose we let α = 0.10. Perform the hypothesis test to determine if a linear relationship exists between x and y. Assume the assumptions are met.
Interpret the value of the standard error of the estimate, s.
Interpret the value of the slope b1.
Interpret the value of the y-intercept b0.
Carefully state the regression equation, using words and numbers.
As it has been standardized, the response z vmail messages has a standard deviation of 1.0. What would be the typical error in predicting z vmail messages if we simply used the sample mean response and no information about day calls? Now, from the printout, what is the typical error in predicting z
Discuss the usefulness of the regression of z vmail messages on z day calls.
Assuming normality, construct and interpret a 95% confidence interval for the population correlation coefficient.
Use the data in the ANOVA table to find or calculate the following quantities:
[Figure 8.22: Scatter plot of attendance versus winning percentage.]
Is there evidence of a linear relationship between z vmail messages (z-scores of the number of voice mail messages) and z day calls (z-scores of the number of day calls made)? Explain.
What type of transformation or transformations is called for? Use the bulging rule. For Exercises 26–30, use the output from the regression of z vmail messages on z day calls (from the Churn data set) in Table 8.17 to answer the questions.
Is it appropriate to perform linear regression? Why or why not?
Is there an observation that may look as though it is an outlier? Explain. For Exercises 24 and 25, use the scatter plot in Figure 8.23 to answer the questions.
Will the value of s be closer to 10, 100, 1000, or 10,000? Why?
Will the confidence interval for the slope parameter include zero or not? Explain.
Will the p-value for the hypothesis test for the existence of a linear relationship between the variables be small or large? Explain.
Estimate as best you can the values of the regression coefficients b0 and b1.
Describe any correlation between the variables. Interpret this correlation.
A colleague would like to use linear regression to predict whether or not customers will make a purchase, based on some predictor variable. What would you explain to your colleague?
What recourse do we have if the residual analysis indicates that the regression assumptions have been violated? Describe three different rules, heuristics, or family of functions that will help us.
Clearly explain the correspondence between an original scatter plot of the data and a plot of the residuals versus fitted values.
Explain the difference between a confidence interval and a prediction interval. Which interval is always wider? Why? Which interval is probably, depending on the situation, more useful to the data miner? Why?
(a) Explain why an analyst may prefer a confidence interval to a hypothesis test. (b) Describe how a confidence interval may be used to assess significance.
Describe the criterion for rejecting the null hypothesis when using the p-value method for hypothesis testing. Who chooses the value of the level of significance, α? Make up a situation (one p-value and two different values of α) where the very same data could lead to two different
Explain what information is conveyed by the value of the standard error of the slope estimate.
Which values of the slope parameter indicate that no linear relationship exists between the predictor and response variables? Explain how this works.
Explain what statistics from Table 8.11 indicate to us that there may indeed be a linear relationship between x and y in this example, even though the value for r2 is less than 1%.
Explain in your own words the implications of the regression assumptions for the behavior of the response variable y.
Match each of the following regression terms with its definition.
Regression Term / Definition
a. Influential observation / Measures the typical difference between the predicted response value and the actual response value.
b. SSE / Represents the total variability in the values of the response variable
Calculate the values for leverage, standardized residual, and Cook’s distance for the 11th hiker who had hiked for 10 hours and traveled 23 kilometers. Show that, while it is neither an outlier nor of high leverage, it is nevertheless influential.
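The leverage, standardized residual, and Cook's distance asked for here can be sketched for simple linear regression (one predictor) as below. This is a minimal sketch, not the book's worked solution. The function name `regression_diagnostics` and the data are hypothetical (Table 8.3 is not reproduced on this page). The sketch assumes the usual formulas: leverage h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², standardized residual e_i / (s·sqrt(1 - h_i)), and Cook's distance D_i = e_i² / ((m + 1)·s²) · h_i / (1 - h_i)², with m = 1 predictor.

```python
from math import sqrt


def regression_diagnostics(x, y):
    """Leverage, standardized residual, and Cook's distance for each point
    in a simple (one-predictor) least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated y-intercept

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    s = sqrt(sse / (n - 2))   # standard error of the estimate

    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx                  # leverage
        std_res = e / (s * sqrt(1 - h))                      # standardized residual
        cooks_d = (e ** 2 / (2 * s ** 2)) * h / (1 - h) ** 2  # m + 1 = 2
        rows.append({"x": xi, "leverage": h,
                     "std_residual": std_res, "cooks_d": cooks_d})
    return rows


# Hypothetical data for illustration only (not the orienteering data).
hours = [1, 2, 3, 4, 5, 6]
distance = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]
for row in regression_diagnostics(hours, distance):
    print(row)
```

A point whose x-value sits far from the mean of x gets a large leverage, and Cook's distance combines that leverage with the size of the residual, which is why a point can be influential without being extreme on either measure alone.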
Calculate the values for leverage, standardized residual, and Cook’s distance for the hard-core hiker example in the text.
Where would a data point be situated that has the smallest possible leverage?
Calculate the estimated regression equation for the orienteering example, using the data in Table 8.3. Use either the formulas or software of your choice.
Describe the difference between the estimated regression line and the true regression line.
Indicate whether the following statements are true or false. If false, alter the statement to make it true.
a. The least-squares line is that line that minimizes the sum of the residuals.
b. If all the residuals equal zero, then SST = SSR.
c. If the value of the correlation coefficient is negative, this
Suppose we wish to test for difference in population means among three groups.
a. Explain why it is not sufficient to simply look at the differences among the sample means, without taking into account the variability within each group.
b. Describe what we mean by between-sample variability and
Our partition shows that 800 of the 2000 customers in our test set own a tablet, while 230 of the 600 customers in our training set own a tablet. Test whether the partition is valid for this variable, using α = 0.10. Table 6.12 contains the counts for the marital status variable for the training
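One way to check the tablet-ownership proportions is a two-sample Z-test for the difference in proportions, using the counts stated in the exercise. The sketch below is an illustration under that assumption, not the book's solution. The function name `two_proportion_z_test` is hypothetical, and the two-sided p-value is computed from the standard normal distribution via `math.erfc`.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided Z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Counts as given in the exercise: test set first, training set second.
z, p_value = two_proportion_z_test(800, 2000, 230, 600)
alpha = 0.10
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: the proportions differ, so the partition is suspect.")
else:
    print("Do not reject H0: no evidence that the partition is invalid.")
```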
In Chapter 7, we will learn to split the data set into a training data set and a test data set. To test whether there exist unwanted differences between the training and test set, which hypothesis test do we perform, for the following types of variables:
a. Flag variable
b. Multinomial variable
c.
How is the bias–variance trade-off related to the issue of overfitting and underfitting? Is high bias associated with overfitting and underfitting, and why? High variance?
Work with international minutes as follows:
a. Construct a normal probability plot of international minutes.
b. What is preventing this variable from being normally distributed?
c. Construct a flag variable to deal with the situation in (b).
d. Construct a normal probability plot of the derived
Identify the range of customer service calls that should be considered outliers, using:
a. the Z-score method;
b. the IQR method.
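The two outlier rules named above can be sketched as cutoff calculations. This is a minimal sketch under common conventions, which are assumptions here: a value is flagged by the Z-score method when it lies more than 3 sample standard deviations from the mean, and by the IQR method when it lies more than 1.5 × IQR below Q1 or above Q3. The function names and the sample data are hypothetical, and quartile definitions vary between software packages, so the IQR cutoffs may differ slightly from other tools.

```python
from statistics import mean, quantiles, stdev


def zscore_bounds(values, cutoff=3.0):
    """Values outside (lower, upper) have |Z-score| > cutoff."""
    m, s = mean(values), stdev(values)  # stdev is the sample SD
    return m - cutoff * s, m + cutoff * s


def iqr_bounds(values, k=1.5):
    """Values outside (lower, upper) are outliers by the IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical call counts for illustration only (not the churn data).
calls = [0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 9]
z_lo, z_hi = zscore_bounds(calls)
i_lo, i_hi = iqr_bounds(calls)
print("Z-score outliers:", [c for c in calls if c < z_lo or c > z_hi])
print("IQR outliers:", [c for c in calls if c < i_lo or c > i_hi])
```

Because the mean and standard deviation are themselves pulled by extreme values, the Z-score cutoffs tend to be more forgiving of an outlier than the IQR cutoffs, which depend only on the quartiles.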
Explain why we might not want to remove a variable just because it is highly correlated with another variable.
HANDS-ON ANALYSIS
Use the churn data set on the book series web site for the following exercises:
Clarify why each of the binning solutions above is not optimal.
Bin the data into three bins of two records each.
What are the four common methods for binning numerical predictors? Which of these are preferred? Use the following data set for Exercises 28–30: 1, 1, 1, 3, 3, 7
Investigate how the outlier affects the mean and median by doing the following:
a. Find the mean score and the median score, with and without the outlier.
b. State which measure, the mean or the median, the presence of the outlier affects more, and why.
Identify all possible stock prices that would be outliers, using:
a. The Z-score method.
b. The IQR method.
Do the following.
a. Identify the outlier.
b. Verify that this value is an outlier, using the Z-score method.
c. Verify that this value is an outlier, using the IQR method.
What do we look for in a normal probability plot to indicate nonnormality? Use the stock price data for Exercises 24–26.
Find the decimal scaling stock price for the stock price $20.
Compute the Z-score standardized stock price for the stock price $20.
Calculate the midrange stock price.
Find the min–max normalized stock price for the stock price $20.
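The three rescaling methods in the stock price questions above (decimal scaling, Z-score standardization, min–max normalization) can be sketched as follows. This is an illustration under stated assumptions, not the book's solution: the function names and the price list are hypothetical (the stock price data are not reproduced on this page), the Z-score uses the sample standard deviation, and decimal scaling divides by 10^d, where d is the number of digits in the data value with the largest absolute value (assumed to be at least 1).

```python
from statistics import mean, stdev


def min_max(x, values):
    """Min-max normalization: maps min to 0 and max to 1."""
    lo, hi = min(values), max(values)
    return (x - lo) / (hi - lo)


def z_score(x, values):
    """Z-score standardization using the sample standard deviation."""
    return (x - mean(values)) / stdev(values)


def decimal_scaling(x, values):
    """Divide by 10**d, d = digits in the largest absolute value (>= 1)."""
    d = len(str(int(max(abs(v) for v in values))))
    return x / 10 ** d


# Hypothetical prices for illustration only.
prices = [5.0, 10.0, 20.0, 45.0]
print(min_max(20.0, prices))
print(z_score(20.0, prices))
print(decimal_scaling(20.0, prices))
```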
Compute the SD of the stock price. Interpret what this number means.
Calculate the mean, median, and mode stock price.
Make up a classification scheme that is inherently flawed, and would lead to misclassification, as we find in Table 2.2. For example, classes of items bought in a grocery store.
Which of the four methods for handling missing data would tend to lead to an underestimate of the spread (e.g., SD) of the variable? What are some benefits to this method?
Discuss the similarities and differences with CRISP-DM.
CRISP-DM is not the only standard process for data mining. Research an alternative methodology (Hint: Sample, Explore, Modify, Model and Assess (SEMMA), from the SAS Institute).
For each of the following meetings, explain which phase in the CRISP-DM process is represented:
a. Managers want to know by next week whether deployment will take place. Therefore, analysts meet to discuss how useful and accurate their model is.
b. The data mining project manager meets
On your own, recapitulate the trinary classification analysis undertaken in this chapter using the Loans4 data sets. (Note that the results may differ slightly due to different settings in the CART models.) Report all salient results, including a summary table similar to Table 17.15