Get questions and answers for Categorical Data Analysis

Suppose that Y has a bin(n, π) distribution. For the model logit(π) = α, consider testing H0: α = 0 (i.e., π = 0.5). Let π̂ = y/n.

a. From Section 3.1.6, the asymptotic variance of α̂ = logit(π̂) is [nπ(1 – π)]^(–1). Compare the estimated SE for the Wald test and the SE using the null value of π, using test statistic [logit(π̂)/SE]². Show that the ratio of the Wald statistic to the statistic with null SE equals 4π̂(1 – π̂). What is the implication about performance of the Wald test if |α| is large and π̂ tends to be near 0 or 1?

b. Wald inference depends on the parameterization. How does the comparison of tests change with the scale [(π̂ – 0.5)/SE]², where SE is now the estimated or null SE of π̂?

c. Suppose that y = 0 or y = n. Show that the Wald test in part (a) cannot reject H0: π = π0 for any 0 < π0 < 1, whereas the Wald test in part (b) rejects every such π0.
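The ratio in part (a) can be checked numerically. The following is a minimal sketch, not a derivation; the function name `wald_and_null_statistics` and the sample values y = 18, n = 20 are illustrative assumptions.

```python
import math

def wald_and_null_statistics(y, n):
    """Return (Wald statistic, null-SE statistic) for H0: alpha = 0 on the logit scale."""
    p_hat = y / n
    alpha_hat = math.log(p_hat / (1 - p_hat))
    var_est = 1 / (n * p_hat * (1 - p_hat))  # asymptotic variance evaluated at p_hat
    var_null = 1 / (n * 0.5 * 0.5)           # asymptotic variance evaluated at pi = 0.5
    return alpha_hat ** 2 / var_est, alpha_hat ** 2 / var_null

wald, null_se = wald_and_null_statistics(y=18, n=20)  # hypothetical data
ratio = wald / null_se
print(ratio, 4 * 0.9 * (1 - 0.9))  # the two values agree up to rounding
```

As π̂ moves toward 0 or 1 the ratio shrinks toward 0, so the Wald statistic becomes small relative to the null-SE statistic even though the evidence against H0 is strong.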

Construct the log-likelihood function for the model logit[π(x)] = α + βx with independent binomial outcomes of y0 successes in n0 trials at x = 0 and y1 successes in n1 trials at x = 1. Derive the likelihood equations, and show that β̂ is the sample log odds ratio.

A study for several professional sports of the effect of a player’s draft position d (d = 1, 2, 3, ...) of selection from the pool of potential players in a given year on the probability π of eventually being named an all-star used the model logit(π) = α + β log d.

a. Show that π/(1 – π) = e^α d^β. Show that e^α = odds for the first draft pick.

b. In the United States, Berry reported α̂ = 2.3 and β̂ = –1.1 for pro basketball and α̂ = 0.7 and β̂ = –0.6 for pro baseball. This suggests that in basketball a first draft pick is more crucial and picks with high d are relatively less likely to be all-stars. Explain why.

The calibration problem is that of estimating x at which π(x) = π0. For the linear logit model, argue that a confidence interval is the set of x values for which |α̂ + β̂x – logit(π0)|/[var(α̂) + x² var(β̂) + 2x cov(α̂, β̂)]^(1/2) < z_{α/2}.

Prove that the logistic regression curve (5.1) has the steepest slope where π(x) = 1/2. Generalize to model (5.8).

For model (5.1), when π(x) is small, explain why you can interpret exp(β) approximately as π(x + 1)/π(x).

For model (5.1), show that ∂π(x)/∂x = βπ(x)[1 – π(x)].

Let Y denote a subject’s opinion about current laws legalizing abortion (1 = support), for gender h (h = 1, female; h = 2, male), religious affiliation i (i = 1, Protestant; i = 2, Catholic; i = 3, Jewish), and political party affiliation j (j = 1, Democrat; j = 2, Republican; j = 3, Independent). For survey data, software for fitting the model logit[P(Y = 1)] = α + β_h^G + β_i^R + β_j^P reports α̂ = 0.62, β̂_1^G = 0.08, β̂_2^G = –0.08, β̂_1^R = –0.16, β̂_2^R = –0.25, β̂_3^R = 0.41, β̂_1^P = 0.87, β̂_2^P = –1.27, β̂_3^P = 0.40.

a. Interpret how the odds of support depend on religion.

b. Estimate the probability of support for the group most (least) likely to support current laws.

c. If, instead, parameters used constraints β_1^G = β_1^R = β_1^P = 0, report the estimates.
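Parts (b) and (c) come down to arithmetic on the reported estimates. The sketch below is one way to organize it; the dictionary layout and the names `inverse_logit`, `effects`, and `baseline_effects` are assumptions made for illustration.

```python
import math

alpha = 0.62
effects = {
    "gender": {"female": 0.08, "male": -0.08},
    "religion": {"Protestant": -0.16, "Catholic": -0.25, "Jewish": 0.41},
    "party": {"Democrat": 0.87, "Republican": -1.27, "Independent": 0.40},
}

def inverse_logit(eta):
    return 1 / (1 + math.exp(-eta))

# Part (b): take the largest (smallest) estimated effect within each factor.
logit_most = alpha + sum(max(levels.values()) for levels in effects.values())
logit_least = alpha + sum(min(levels.values()) for levels in effects.values())
p_most = inverse_logit(logit_most)    # female, Jewish, Democrat
p_least = inverse_logit(logit_least)  # male, Catholic, Republican

# Part (c): re-express the fit with the first category of each factor as baseline.
first = {name: next(iter(levels.values())) for name, levels in effects.items()}
alpha_baseline = alpha + sum(first.values())
baseline_effects = {
    name: {level: value - first[name] for level, value in levels.items()}
    for name, levels in effects.items()
}
print(round(p_most, 3), round(p_least, 3), round(alpha_baseline, 2))
```

Differences between estimates within a factor are unchanged by the choice of constraints, which is why only the intercept and the individual labels shift in part (c).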

Refer to the prediction equation logit(π̂) = – 10.071 – 0.509c + 0.458x for model (5.13). The means and standard deviations are c̅ = 2.44 and s = 0.80 for color, and x̅ = 26.30 and s = 2.11 for width.

For standardized predictors [e.g., x = (width – 26.3)/2.11], explain why the estimated coefficients of c and x equal –0.41 and 0.97. Interpret these by comparing the partial effects of a 1 standard deviation increase in each predictor on the odds. Describe the color effect by estimating the change in π̂ between the first and last color categories at the mean score for width.

Refer to model (5.2) for the horseshoe crabs using x = width.

a. Show that (i) at the mean width (26.3), the estimated odds of a satellite equal 2.07; (ii) at x = 27.3, the estimated odds equal 3.40; and (iii) since exp(β̂) = 1.64, 3.40 = (1.64)(2.07), and the odds increase by 64%.

b. Based on the 95% confidence interval for β, show that for x near where π = 0.5, the rate of increase in the probability of a satellite per 1-cm increase in x falls between about 0.07 and 0.17.

A survey of high school students on Y = whether the subject has driven a motor vehicle after consuming a substantial amount of alcohol (1 = yes), s = gender (1 = female), r = race (1 = black; 0 = white), and g = grade (g1 = 1, grade 9; g2 = 1, grade 10; g3 = 1, grade 11; g1 = g2 = g3 = 0, grade 12) has prediction equation logit[P̂(Y = 1)] = –0.88 – 0.40s – 0.72r – 2.22g1 – 1.43g2 – 0.58g3 + 0.74rg1 + 0.38rg2 + 0.01rg3.

a. Carefully interpret effects. Explain the interaction by describing the race effect at each grade and the grade effect for each race.

b. Replace r above by r1 (1 = black, 0 = other). The study also measured r2 (1 = Hispanic, 0 = other), with r1 = r2 = 0 for white. Suppose that the prediction equation is as above but with additional terms –0.29r2 + 0.53r2g1 + 0.25r2g2 – 0.06r2g3. Interpret the effects.

Table 5.17 shows estimated effects for a logistic regression model with squamous cell esophageal cancer (Y = 1, yes; Y = 0, no) as the response. Smoking status (S) equals 1 for at least one pack per day and 0 otherwise, alcohol consumption (A) equals the average number of alcoholic drinks consumed per day, and race (R) equals 1 for blacks and 0 for whites. To describe the race × smoking interaction, construct the prediction equation when R = 1 and again when R = 0. Find the fitted YS conditional odds ratio for each case. Similarly, construct the prediction equation when S = 1 and again when S = 0. Find the fitted YR conditional odds ratios. Note that for each association, the coefficient of the cross-product term is the difference between the log odds ratios at the two fixed levels for the other variable. Explain why the coefficient of S represents the log odds ratio between Y and S for whites. To what hypotheses do the P-values for R and S refer?


Table 5.17:

In a study designed to evaluate whether an educational program makes sexually active adolescents more likely to obtain condoms, adolescents were randomly assigned to two experimental groups. The educational program, involving a lecture and videotape about transmission of the HIV virus, was provided to one group but not the other. Table 5.16 summarizes results of a logistic regression model for factors observed to influence teenagers to obtain condoms.

a. Find the parameter estimates for the fitted model, using (1,0) dummy variables for the first three predictors. Based on the corresponding confidence interval for the log odds ratio, determine the standard error for the group effect.

b. Explain why either the estimate of 1.38 for the odds ratio for gender or the corresponding confidence interval is incorrect. Show that if the reported interval is correct, 1.38 is actually the log odds ratio, and the estimated odds ratio equals 3.98.


Table 5.16:

The National Collegiate Athletic Association studied graduation rates for freshman student athletes during the 1984–1985 academic year. The (sample size, number graduated) totals were (796, 498) for white females, (1625, 878) for white males, (143, 54) for black females, and (660, 197) for black males. Analyze and interpret.

Table 5.15 appeared in a national study of 15- and 16-year-old adolescents. The event of interest is ever having sexual intercourse. Analyze, including description and inference about the effects of gender and race, goodness of fit, and summary interpretations.


Table 5.15:

Refer to Table 2.6. Table 5.14 shows the results of fitting a logit model, treating death penalty as the response (1 = yes) and defendant’s race (1 = white) and victims’ race (1 = white) as dummy predictors.

a. Interpret parameter estimates. Which group is most likely to have the yes response? Find the estimated probability in that case.

b. Interpret 95% confidence intervals for conditional odds ratios.

c. Test the effect of defendant’s race, controlling for victims’ race, using a (i) Wald test, and (ii) likelihood-ratio test. Interpret.

d. Test the goodness of fit. Interpret.


Table 5.14:

A study used the 1998 Behavioral Risk Factors Social Survey to consider factors associated with women’s use of oral contraceptives in the United States. Table 5.13 summarizes effects for a logistic regression model for the probability of using oral contraceptives. Each predictor uses a dummy variable, and the table lists the category having dummy outcome 1. Interpret effects. Construct and interpret a confidence interval for the conditional odds ratio between contraceptive use and education.


Table 5.13:

Refer to Table 2.11. Using scores (0, 3, 9.5, 19.5, 37, 55) for cigarette smoking, analyze these data using a logit model. Is the intercept estimate meaningful? Explain.


Table 2.11:

Refer to Table 6.11. The Pearson test of independence has X²(I) = 6.88. For equally spaced scores, the Cochran–Armitage trend test has z² = 6.67 (P = 0.01). Interpret, and explain why results differ so. Analyze the data using a linear logit model. Test independence using the Wald and likelihood-ratio tests, and compare results to the Cochran–Armitage test. Check the fit of the model, and interpret.


Table 6.11:

Hastie and Tibshirani described a study to determine risk factors for kyphosis, severe forward flexion of the spine following corrective spinal surgery. The age in months at the time of the operation for the 18 subjects for whom kyphosis was present were 12, 15, 42, 52, 59, 73, 82, 91, 96, 105, 114, 120, 121, 128, 130, 139, 139, 157 and for 22 of the subjects for whom kyphosis was absent were 1, 1, 2, 8, 11, 18, 22, 31, 37, 61, 72, 81, 97, 112, 118, 127, 131, 140, 151, 159, 177, 206.

a. Fit a logistic regression model using age as a predictor of whether kyphosis is present. Test whether age has a significant effect.

b. Plot the data. Note the difference in dispersion on age at the two levels of kyphosis. Fit the model logit[π(x)] = α + β1x + β2x². Test the significance of the squared age term, plot the fit, and interpret.
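For part (a), the fit can be sketched with a small Newton–Raphson routine using only the standard library. This is an illustrative implementation, not the software the exercise has in mind; the function name `fit_logistic` and the fixed iteration count are assumptions. It covers the linear model only; the quadratic model of part (b) would need a third parameter.

```python
import math

present = [12, 15, 42, 52, 59, 73, 82, 91, 96, 105, 114, 120, 121, 128,
           130, 139, 139, 157]
absent = [1, 1, 2, 8, 11, 18, 22, 31, 37, 61, 72, 81, 97, 112, 118, 127,
          131, 140, 151, 159, 177, 206]
ages = present + absent
kyphosis = [1] * len(present) + [0] * len(absent)

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson for logit[pi(x)] = a + b*x; returns (a, b, SE of b)."""
    a = b = 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        w = [pi * (1 - pi) for pi in p]
        # Score vector (first derivatives of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix entries.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b, math.sqrt(h00 / det)

a_hat, b_hat, se_b = fit_logistic(ages, kyphosis)
wald_z = b_hat / se_b  # Wald statistic for H0: beta = 0
print(a_hat, b_hat, se_b, wald_z)
```

The Wald statistic can be compared with a standard normal distribution; a likelihood-ratio test would instead compare the maximized log-likelihoods with and without the age term.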

Refer to Table 4.2. Using scores {0, 2, 4, 5} for snoring, fit the logistic regression model. Interpret using fitted probabilities, linear approximations, and effects on the odds. Analyze the goodness of fit.


Table 4.2:

For the 23 space shuttle flights before the Challenger mission disaster in 1986, Table 5.12 shows the temperature at the time of the flight and whether at least one primary O-ring suffered thermal distress.

a. Use logistic regression to model the effect of temperature on the probability of thermal distress. Plot a figure of the fitted model, and interpret.

b. Estimate the probability of thermal distress at 31°F, the temperature at the place and time of the Challenger flight.

c. Construct a confidence interval for the effect of temperature on the odds of thermal distress, and test the statistical significance of the effect.


Table 5.12:

In a GLM, suppose that var(Y) = υ(µ) for µ = E(Y). Show that the link g satisfying g′(µ) = [υ(µ)]^(–1/2) has the same weight matrix W^(t) at each cycle. Show this link for a Poisson random component is g(µ) = 2√µ.

For n independent observations from a Poisson distribution, show that Fisher scoring gives µ^(t+1) = y̅ for all t > 0. By contrast, what happens with Newton–Raphson?

Consider the value β̂ that maximizes a function L(β). Let β^(0) denote an initial guess.

a. Using L′(β̂) = L′(β^(0)) + (β̂ – β^(0))L″(β^(0)) + ..., argue that for β^(0) close to β̂, approximately 0 = L′(β^(0)) + (β̂ – β^(0))L″(β^(0)).

Solve this equation to obtain an approximation β^(1) for β̂.

b. Let β^(t) denote approximation t for β̂, t = 0, 1, 2, .... Justify that the next approximation is β^(t+1) = β^(t) – L′(β^(t))/L″(β^(t)).
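The updating rule in part (b) can be illustrated on a one-parameter problem. The sketch below applies it to a binomial log-likelihood L(π) = y log π + (n – y) log(1 – π); the data y = 7, n = 20, the starting value 0.3, and the function name `newton_raphson` are all assumptions chosen for the example.

```python
def newton_raphson(d1, d2, beta0, steps=20):
    """Iterate beta <- beta - L'(beta) / L''(beta) a fixed number of times."""
    beta = beta0
    for _ in range(steps):
        beta = beta - d1(beta) / d2(beta)
    return beta

y, n = 7, 20  # hypothetical binomial data

def d1(p):
    # First derivative of the binomial log-likelihood.
    return y / p - (n - y) / (1 - p)

def d2(p):
    # Second derivative of the binomial log-likelihood.
    return -y / p ** 2 - (n - y) / (1 - p) ** 2

pi_hat = newton_raphson(d1, d2, beta0=0.3)
print(pi_hat, y / n)  # the iterate settles at the sample proportion
```

Each step replaces L by its second-order Taylor expansion around the current value and jumps to the maximum of that parabola, which is why the starting value needs to be reasonably close to β̂.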

For binary observations, consider the model π(x) = 1/2 + (1/π)tan^(–1)(α + βx). Which distribution has cdf of this form? Explain when a GLM using this curve might be more appropriate than logistic regression.

Show the normal distribution N(µ, σ²) with fixed σ satisfies family (4.1), and identify the components. Formulate the ordinary regression model as a GLM.

Identify each variable as nominal, ordinal, or interval.

a. UK political party preference (Labour, Conservative, Social Democrat)

b. Anxiety rating (none, mild, moderate, severe, very severe)

c. Patient survival (in number of months)

d. Clinic location (London, Boston, Madison, Rochester, Montreal)

e. Response of tumor to chemotherapy (complete elimination, partial reduction, stable, growth progression)

f. Favorite beverage (water, juice, milk, soft drink, beer, wine)

g. Appraisal of company’s inventory level (too low, about right, too high)

Each of 100 multiple-choice questions on an exam has four possible answers, one of which is correct. For each question, a student guesses by selecting an answer randomly.

a. Specify the distribution of the student’s number of correct answers.

b. Find the mean and standard deviation of that distribution. Would it be surprising if the student made at least 50 correct responses? Why?

c. Specify the distribution of (n1, n2, n3, n4), where nj is the number of times the student picked choice j.

d. Find E(nj), var(nj), cov(nj, nk), and corr(nj, nk).
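The moments asked for in parts (b) and (d) follow from the standard binomial and multinomial formulas. A short numerical sketch (variable names are illustrative assumptions):

```python
import math

n, choices = 100, 4
p = 1 / choices  # probability of a correct guess, and of picking any given choice

# Part (b): number correct is bin(n, p).
mean_correct = n * p
sd_correct = math.sqrt(n * p * (1 - p))
z_for_50 = (50 - mean_correct) / sd_correct  # how many SDs above the mean 50 is

# Part (d): (n1, n2, n3, n4) is multinomial with equal cell probabilities p.
var_nj = n * p * (1 - p)
cov_jk = -n * p * p
corr_jk = cov_jk / var_nj

print(mean_correct, sd_correct, z_for_50, var_nj, cov_jk, corr_jk)
```

A score of 50 lies more than five standard deviations above the mean, which is why it would be very surprising under pure guessing.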

An experiment studies the number of insects that survive a certain dose of an insecticide, using several batches of insects of size n each. The insects are sensitive to factors that vary among batches during the experiment but were not measured, such as temperature level. Explain why the distribution of the number of insects per batch surviving the experiment might show overdispersion relative to a bin(n, π) distribution.

In his autobiography A Sort of Life, British author Graham Greene described a period of severe mental depression during which he played Russian Roulette. This “game” consists of putting a bullet in one of the six chambers of a pistol, spinning the chambers to select one at random, and then firing the pistol once at one’s head.

a. Greene played this game six times and was lucky that none of them resulted in a bullet firing. Find the probability of this outcome.

b. Suppose that he had kept playing this game until the bullet fired. Let Y denote the number of the game on which it fires. Show the probability mass function for Y, and justify.
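Both parts rest on independent trials with firing probability 1/6. A minimal sketch, with the function name `pmf` as an illustrative assumption:

```python
p_fire = 1 / 6

# Part (a): six independent games, none of which fires.
p_six_survivals = (1 - p_fire) ** 6

# Part (b): Y = game on which the bullet first fires has a geometric distribution.
def pmf(y):
    """P(Y = y): y - 1 games without firing, then one that fires."""
    return (1 - p_fire) ** (y - 1) * p_fire

print(round(p_six_survivals, 4))
print([round(pmf(y), 4) for y in range(1, 7)])
```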

Consider the statement, “Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if she is married and does not want any more children.” For the 1996 General Social Survey, conducted by the National Opinion Research Center (NORC), 842 replied “yes” and 982 replied “no.” Let π denote the population proportion who would reply “yes.” Find the P-value for testing H0: π = 0.5 using the score test, and construct a 95% confidence interval for π. Interpret the results.

Refer to the vegetarianism example in Section 1.4.3. For testing H0: π = 0.5 against Ha: π ≠ 0.5, show that:
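The score test and a score-type (Wilson) interval for the survey counts can be sketched as follows. The helper `normal_cdf` and the use of 1.96 as the normal quantile are assumptions of this sketch; the exercise does not say which 95% interval is intended.

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

y, n = 842, 842 + 982
p_hat = y / n

# Score test: standard error uses the null value pi0 = 0.5.
z = (p_hat - 0.5) / math.sqrt(0.5 * 0.5 / n)
p_value = 2 * (1 - normal_cdf(abs(z)))

# 95% score (Wilson) confidence interval.
zc = 1.96
denom = 1 + zc ** 2 / n
center = (p_hat + zc ** 2 / (2 * n)) / denom
half_width = zc * math.sqrt(p_hat * (1 - p_hat) / n + zc ** 2 / (4 * n ** 2)) / denom
lower, upper = center - half_width, center + half_width

print(round(z, 3), round(p_value, 4), round(lower, 3), round(upper, 3))
```

The interval lies entirely below 0.5, which agrees with the small two-sided P-value.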

a. The likelihood-ratio statistic equals 2[25log(25/12.5)] = 34.7.

b. The chi-squared form of the score statistic equals 25.0.

c. The Wald z or chi-squared statistic is infinite.
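The three statistics can be computed directly, assuming (as the form of the statistic in part (a) suggests) that the example has y = 0 vegetarians among n = 25 students. The helper name `term` is an illustrative assumption.

```python
import math

y, n, pi0 = 0, 25, 0.5  # assumed data: no successes in 25 trials

def term(observed, expected):
    """observed * log(observed / expected), using the convention 0 * log(0) = 0."""
    return 0.0 if observed == 0 else observed * math.log(observed / expected)

# Likelihood-ratio statistic: 2 * sum of observed * log(observed / expected).
lr = 2 * (term(y, n * pi0) + term(n - y, n * (1 - pi0)))

# Score statistic: squared distance from the null, with the null variance.
score = (y - n * pi0) ** 2 / (n * pi0 * (1 - pi0))

# Wald statistic: the estimated variance is 0 when p_hat = 0, so it is infinite.
p_hat = y / n
wald_var = p_hat * (1 - p_hat) / n
wald = math.inf if wald_var == 0 else (p_hat - pi0) ** 2 / wald_var

print(round(lr, 1), score, wald)
```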

In a crossover trial comparing a new drug to a standard, π denotes the probability that the new one is judged better. It is desired to estimate π and test H0: π = 0.5 against Ha: π ≠ 0.5. In 20 independent observations, the new drug is better each time.

a. Find and sketch the likelihood function. Give the ML estimate of π.

b. Conduct a Wald test and construct a 95% Wald confidence interval for π. Are these sensible?

c. Conduct a score test, reporting the P-value. Construct a 95% score confidence interval. Interpret.

d. Conduct a likelihood-ratio test and construct a likelihood-based 95% confidence interval. Interpret.

e. Construct an exact binomial test and 95% confidence interval. Interpret.

f. Suppose that researchers wanted a sufficiently large sample to estimate the probability of preferring the new drug to within 0.05, with confidence 0.95. If the true probability is 0.90, about how large a sample is needed?

Table 1.3 contains Ladislaus von Bortkiewicz’s data on deaths of soldiers in the Prussian army from kicks by army mules (Fisher 1934; Quine and Seneta 1987). The data refer to 10 army corps, each observed for 20 years. In 109 corps-years of exposure, there were no deaths, in 65 corps-years there was one death, and so on. Estimate the mean and test whether probabilities of occurrences in these five categories follow a Poisson distribution (truncated for 4 and above.)


Table 1.3:

Why is it easier to get a precise estimate of the binomial parameter π when it is near 0 or 1 than when it is near 1/2?

Suppose that P(Yi = 1) = 1 – P(Yi = 0) = π, i = 1, ..., n, where {Yi} are independent. Let Y = ∑i Yi.

a. What are var(Y) and the distribution of Y?

b. When {Yi} instead have pairwise correlation ρ > 0, show that var(Y) > nπ(1 – π), overdispersion relative to the binomial. [Altham (1978) discussed generalizations of the binomial that allow correlated trials.]

c. Suppose that heterogeneity exists: P(Yi = 1|π) = π for all i, but π is a random variable with density function g(·) on [0, 1] having mean ρ and positive variance. Show that var(Y) > nρ(1 – ρ). (When π has a beta distribution, Y has the beta-binomial distribution of Section 13.3.)

d. Suppose that P(Yi = 1|πi) = πi, i = 1, ..., n, where {πi} are independent from g(·). Explain why Y has a bin(n, ρ) distribution unconditionally but not conditionally on {πi}.

For the multinomial distribution, show that

corr(nj, nk) = –πjπk/[πj(1 – πj)πk(1 – πk)]^(1/2).

Show that corr(n1, n2) = –1 when c = 2.

Show that the moment generating function (mgf) for the binomial distribution is m(t) = (1 – π + πe^t)^n, and use it to obtain the first two moments. Show that the mgf for the Poisson distribution is m(t) = exp{µ[exp(t) – 1]}, and use it to obtain the first two moments.

Suppose that P(T = tj) = πj, j = 1, 2, .... Show that E(mid-P-value) = 0.5. [Show that ∑j πj(πj/2 + πj+1 + ···) = (∑j πj)²/2.]

For a statistic T with cdf F(t) and p(t) = P(T = t), the mid-distribution function is Fmid(t) = F(t) – 0.5p(t) (Parzen 1997). Given T = t0, show that the mid-P-value equals 1 – Fmid(t0). (It also satisfies E[Fmid(T)] = 0.5 and var[Fmid(T)] = (1/12){1 – E[p²(T)]}.)

From Section 1.4.2 the midpoint π̃ of the score confidence interval for π is the sample proportion for an adjusted data set that adds z²_{α/2}/2 observations of each type to the sample. This motivates an adjusted Wald interval,

π̃ ± z_{α/2}[π̃(1 – π̃)/n*]^(1/2), where n* = n + z²_{α/2}.

Show that the variance π̃(1 – π̃)/n* at the weighted average is at least as large as the weighted average of the variances that appears under the square root sign in the score interval. Thus, this interval contains the score interval.

A likelihood-ratio statistic equals t0. At the ML estimates, show that the data are exp(t0/2) times more likely under Ha than under H0.

Assume that y1, y2,. .., yn are independent from a Poisson distribution.

a. Obtain the likelihood function. Show that the ML estimator µ̂ = y̅.

b. Construct a large-sample test statistic for H0: µ = µ0 using (i) the Wald method, (ii) the score method, and (iii) the likelihood-ratio method.

c. Construct a large-sample confidence interval for µ using (i) the Wald method, (ii) the score method, and (iii) the likelihood-ratio method.

Inference for Poisson parameters can often be based on connections with binomial and multinomial distributions. Show how to test H0: µ1 = µ2 for two populations based on independent Poisson counts (y1, y2), using a corresponding test about a binomial parameter π. How can one construct a confidence interval for µ1/µ2 based on one for π?

A researcher routinely tests using a nominal P(type I error) = 0.05, rejecting H0 if the P-value ≤ 0.05. An exact test using test statistic T has null distribution P(T = 0) = 0.30, P(T = 1) = 0.62, and P(T = 2) = 0.08, where a higher T provides more evidence against the null.

a. With the usual P-value, show that the actual P(type I error) = 0.

b. With the mid-P-value, show that the actual P(type I error) = 0.08.

c. Find P(type I error) in parts (a) and (b) when P(T = 0) = 0.30, P(T = 1) = 0.66, P(T = 2) = 0.04. Note that the test with mid P-value can be conservative or liberal. The exact test with ordinary P-value cannot be liberal.

d. In part (a), a randomized-decision test generates a uniform random variable U from [0, 1] and rejects H0 when T = 2 and U ≤ 5/8 . Show the actual P(type I error) = 0.05. Is this a sensible test?
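The actual error rates in parts (a) through (d) can be tabulated with a few lines of code. This is a sketch under the stated null distributions; the function name `type_i_error` and the small floating-point tolerance are assumptions of the example.

```python
def type_i_error(null_probs, alpha=0.05, mid=False):
    """Actual P(reject H0) when null_probs[t] = P(T = t) and larger t is more extreme."""
    error = 0.0
    for t, p_t in enumerate(null_probs):
        tail = sum(null_probs[t + 1:])             # P(T > t)
        p_value = tail + (0.5 * p_t if mid else p_t)
        if p_value <= alpha + 1e-12:               # tolerance for round-off
            error += p_t
    return error

dist_a = [0.30, 0.62, 0.08]
dist_c = [0.30, 0.66, 0.04]

ordinary_a = type_i_error(dist_a)          # part (a)
mid_a = type_i_error(dist_a, mid=True)     # part (b)
ordinary_c = type_i_error(dist_c)          # part (c), ordinary P-value
mid_c = type_i_error(dist_c, mid=True)     # part (c), mid-P-value
randomized_a = dist_a[2] * 5 / 8           # part (d): reject when T = 2 and U <= 5/8

print(ordinary_a, mid_a, ordinary_c, mid_c, randomized_a)
```

With the first distribution the ordinary P-value never drops to 0.05, while the mid-P-value test rejects whenever T = 2 and so exceeds the nominal level.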

For a binomial parameter π, show how the inversion process for constructing a confidence interval works with 

(a) The Wald test, 

(b) The score test.

For a flip of a coin, let π denote the probability of a head. An experiment tests H0: π = 0.5 against Ha: π ≠ 0.5, using n = 5 independent flips.

a. Show that the true null probability of rejecting H0 at the 0.05 significance level is 0.0 for the exact binomial test and 1/16 using the large-sample score test.

b. Suppose that truly π = 0.5. Explain why the probability that the 95% Clopper–Pearson confidence interval contains π equals 1.0.

(Is there any possible y for which both one-sided tests of H0: π = 0.5 have P-value ≤ 0.025?)
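With n = 5 there are only six possible outcomes, so the actual size of each test can be enumerated. The sketch below assumes the two-sided exact P-value is twice the smaller one-sided tail (capped at 1) and uses 1.96 as the score-test critical value; both conventions are assumptions of the example.

```python
import math

n, pi0, alpha = 5, 0.5, 0.05
pmf = [math.comb(n, y) * pi0 ** n for y in range(n + 1)]  # null distribution of y

def exact_two_sided_p(y):
    lower_tail = sum(pmf[: y + 1])  # P(Y <= y)
    upper_tail = sum(pmf[y:])       # P(Y >= y)
    return min(1.0, 2 * min(lower_tail, upper_tail))

def score_z(y):
    return (y / n - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

exact_size = sum(pmf[y] for y in range(n + 1) if exact_two_sided_p(y) <= alpha)
score_size = sum(pmf[y] for y in range(n + 1) if abs(score_z(y)) >= 1.96)

print(exact_size, score_size)
```

The most extreme outcomes, y = 0 and y = 5, each have one-sided exact P-value 1/32, which is why the exact test can never reject at the 0.05 level here.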

Consider the Wald confidence interval for a binomial parameter π. Since it is degenerate when π̂ = 0 or 1, argue that for 0 < π < 1 the probability the interval covers π cannot exceed [1 – π^n – (1 – π)^n]; hence, the infimum of the coverage probability over 0 < π < 1 equals 0, regardless of n.

Consider the 95% binomial score confidence interval for π. When y = 1, show that the lower limit is approximately 0.18/n; in fact, 0 < π < 0.18/n then falls in an interval only when y = 0. Argue that for large n and π just barely below 0.18/n or just barely above 1 – 0.18/n, the actual coverage probability is about e^(–0.18) = 0.84. Hence, even as n → ∞, this method is not guaranteed to have coverage probability ≥ 0.95 (Agresti and Coull 1998; Blyth and Still 1983).

A binomial sample of size n has y = 0 successes.

a. Show that the confidence interval for π based on the likelihood function is [0.0, 1 – exp(–z²_{α/2}/2n)]. For α = 0.05, use the expansion of an exponential function to show that this is approximately [0, 2/n].

b. For the score method, show that the confidence interval is [0, z²_{α/2}/(n + z²_{α/2})], or approximately [0, 4/(n + 4)] when α = 0.05.

c. For the Clopper–Pearson approach, show that the upper bound is 1 – (α/2)^(1/n), or approximately –log(0.025)/n = 3.69/n when α = 0.05.

d. For the adaptation of the Clopper–Pearson approach using the mid-P-value, show that the upper bound is 1 – α^(1/n), or approximately –log(0.05)/n = 3/n when α = 0.05.
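The four upper bounds and their large-n approximations can be compared numerically. The sketch uses n = 100 and z = 1.96 as illustrative assumptions; the dictionary names are hypothetical.

```python
import math

n, alpha, z = 100, 0.05, 1.96  # hypothetical sample size; z is the 0.975 normal quantile

upper_bounds = {
    "likelihood": 1 - math.exp(-z ** 2 / (2 * n)),
    "score": z ** 2 / (n + z ** 2),
    "clopper_pearson": 1 - (alpha / 2) ** (1 / n),
    "mid_p": 1 - alpha ** (1 / n),
}
approximations = {
    "likelihood": 2 / n,
    "score": 4 / (n + 4),
    "clopper_pearson": 3.69 / n,
    "mid_p": 3 / n,
}

for method, bound in upper_bounds.items():
    print(f"{method:16s} exact {bound:.4f}  approx {approximations[method]:.4f}")
```

The approximations come from 1 – exp(–a/n) ≈ a/n for large n, so the agreement improves as n grows.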

For I × J contingency tables, explain why the variables are independent when the (I – 1)(J – 1) differences πj|i – πj|I = 0, i = 1, ..., I – 1, j = 1, ..., J – 1.

Genotypes AA, Aa, and aa occur with probabilities [θ², 2θ(1 – θ), (1 – θ)²]. A multinomial sample of size n has frequencies (n1, n2, n3) of these three genotypes.

a. Form the log likelihood. Show that θ̂ = (2n1 + n2)/(2n1 + 2n2 + 2n3).

b. Show that –∂²L(θ)/∂θ² = [(2n1 + n2)/θ²] + [(n2 + 2n3)/(1 – θ)²] and that its expectation is 2n/[θ(1 – θ)]. Use this to obtain an asymptotic standard error of θ̂.

c. Explain how to test whether the probabilities truly have this pattern.

Refer to quadratic form (1.16).

For the z_S statistic (1.11), show that z_S² = X² for c = 2.

For testing H0: πj = πj0, j = 1, ..., c, using sample multinomial proportions {π̂j}, the likelihood-ratio statistic (1.17) is

G² = –2n ∑j π̂j log(πj0/π̂j).

Show that G² ≥ 0, with equality if and only if π̂j = πj0 for all j. (Apply Jensen’s inequality to E(–2n log X), where X equals πj0/π̂j with probability π̂j.)

The chi-squared mgf with df = ν is m(t) = (1 – 2t)^(–ν/2), for |t| < ½. Use it to prove the reproductive property of the chi-squared distribution.

An article in the New York Times (Feb. 17, 1999) about the PSA blood test for detecting prostate cancer stated: “The test fails to detect prostate cancer in 1 in 4 men who have the disease (false-negative results), and as many as two-thirds of the men tested receive false-positive results.” Let C (C̅) denote the event of having (not having) prostate cancer, and let + (–) denote a positive (negative) test result. Which is true: P(– | C) = 1/4 or P(C | –) = 1/4? P(C̅ | +) = 2/3 or P(+ | C̅) = 2/3? Determine the sensitivity and specificity.

A diagnostic test has sensitivity = specificity = 0.80. Find the odds ratio between true disease status and the diagnostic test result.
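The odds ratio here is the odds of a positive test among the diseased divided by the odds of a positive test among the non-diseased. A brief sketch (variable names are illustrative):

```python
sensitivity = specificity = 0.80

odds_positive_if_diseased = sensitivity / (1 - sensitivity)   # P(+|D) / P(-|D)
odds_positive_if_healthy = (1 - specificity) / specificity    # P(+|no D) / P(-|no D)
odds_ratio = odds_positive_if_diseased / odds_positive_if_healthy

print(round(odds_ratio, 6))
```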

Table 2.9 is based on records of accidents in 1988 compiled by the Department of Highway Safety and Motor Vehicles in Florida. Identify the response variable, and find and interpret the difference of proportions, relative risk, and odds ratio. Why are the relative risk and odds ratio approximately equal?


Table 2.9:

Consider the following two studies reported in the New York Times.

a. A British study reported (Dec. 3, 1998) that of smokers who get lung cancer, “women were 1.7 times more vulnerable than men to get small-cell lung cancer.” Is 1.7 the odds ratio or the relative risk?

b. A National Cancer Institute study about tamoxifen and breast cancer reported (Apr. 7, 1998) that the women taking the drug were 45% less likely to experience invasive breast cancer than were women taking placebo. Find the relative risk for (i) those taking the drug compared to those taking placebo, and (ii) those taking placebo compared to those taking the drug.

A study (E. G. Krug et al., Internat. J. Epidemiol., 27: 214-221, 1998) reported that the number of gun-related deaths per 100,000 people in 1994 was 14.24 in the United States, 4.31 in Canada, 2.65 in Australia, 1.24 in Germany, and 0.41 in England and Wales. Use the relative risk to compare the United States with the other countries. Interpret.

A newspaper article preceding the 1994 World Cup semifinal match between Italy and Bulgaria stated that “Italy is favored 10–11 to beat Bulgaria, which is rated at 10–3 to reach the final.” Suppose that this means that the odds that Italy wins are 11/10 and the odds that Bulgaria wins are 3/10. Find the probability that each team wins, and comment.
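Converting the stated odds to probabilities uses probability = odds/(1 + odds). A minimal sketch, with `odds_to_prob` as an illustrative helper name:

```python
def odds_to_prob(odds):
    return odds / (1 + odds)

p_italy = odds_to_prob(11 / 10)    # 11/21
p_bulgaria = odds_to_prob(3 / 10)  # 3/13
total = p_italy + p_bulgaria

print(round(p_italy, 3), round(p_bulgaria, 3), round(total, 3))
```

Since one of the two teams must win, the two probabilities should sum to 1; the gap between their sum and 1 is the point to comment on.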

In the United States, the estimated annual probability that a woman over the age of 35 dies of lung cancer equals 0.001304 for current smokers and 0.000121 for nonsmokers (M. Pagano and K. Gauvreau, Principles of Biostatistics, Duxbury Press, Pacific Grove, CA. 1993, p. 134).

a. Find and interpret the difference of proportions and the relative risk. Which measure is more informative for these data? Why?

b. Find and interpret the odds ratio. Explain why the relative risk and odds ratio take similar values.

For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender (female, male) and survival (yes, no) was 11.4. 

a. What is wrong with the interpretation. “The probability of survival for females was 11.4 times that for males”? Give the correct interpretation. When would the quoted interpretation be approximately correct?

b. The odds of survival for females equaled 2.9. For each gender, find the proportion who survived.
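Part (b) follows from the definition of the odds ratio as the female odds divided by the male odds. A short sketch (variable names are illustrative):

```python
odds_ratio = 11.4
odds_female = 2.9
odds_male = odds_female / odds_ratio  # odds ratio = female odds / male odds

p_female = odds_female / (1 + odds_female)
p_male = odds_male / (1 + odds_male)
relative_risk = p_female / p_male

print(round(p_female, 3), round(p_male, 3), round(relative_risk, 2))
```

The relative risk is far smaller than 11.4 because survival was not a rare outcome, so the odds ratio does not approximate the ratio of probabilities here.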

In an article about crime in the United States, Newsweek (Jan. 10, 1994) quoted FBI statistics for 1992 stating that of blacks slain, 94% were slain by blacks, and of whites slain, 83% were slain by whites. Let Y = race of victim and X = race of murderer. Which conditional distribution do these statistics refer to, Y | X or X | Y? What additional information would you need to estimate the probability that the victim was white given that a murderer was white? Find and interpret the odds ratio.

A research study estimated that under a certain condition, the probability that a subject would be referred for heart catheterization was 0.906 for whites and 0.847 for blacks.

a. A press release about the study stated that the odds of referral for cardiac catheterization for blacks are 60% of the odds for whites. Explain how they obtained 60% (more accurately, 57%).

b. An Associated Press story later described the study and said “Doctors were only 60% as likely to order cardiac catheterization for blacks as for whites.” Explain what is wrong with this interpretation. Give the correct percentage for this interpretation. (In stating results to the general public, it is better to use the relative risk than the odds ratio. It is simpler to understand and less likely to be misinterpreted.)
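The contrast between the two parts is the contrast between an odds ratio and a relative risk computed from the same two probabilities. A minimal sketch (variable names are illustrative):

```python
p_white, p_black = 0.906, 0.847

odds_white = p_white / (1 - p_white)
odds_black = p_black / (1 - p_black)
odds_ratio = odds_black / odds_white  # part (a): the figure in the press release
relative_risk = p_black / p_white     # part (b): ratio of the probabilities themselves

print(round(odds_ratio, 3), round(relative_risk, 3))
```

Because both probabilities are close to 1, the odds ratio is much farther from 1 than the relative risk.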

A 20-year cohort study of British male physicians (R. Doll and R. Peto, British Med. J. 2: 1525–1536, 1976) noted that the proportion per year who died from lung cancer was 0.00140 for cigarette smokers and 0.00010 for nonsmokers. The proportion who died from coronary heart disease was 0.00669 for smokers and 0.00413 for nonsmokers.

a. Describe the association of smoking with each of lung cancer and heart disease, using the difference of proportions, relative risk, and odds ratio. Interpret.

b. Which response is more strongly related to cigarette smoking, in terms of the reduction in number of deaths that would occur with elimination of cigarettes? Explain.

Table 2.10 refers to applicants to graduate school at the University of California at Berkeley, for fall 1973. It presents admissions decisions by gender of applicant for the six largest graduate departments. Denote the three variables by A = whether admitted, G = gender, and D = department. Find the sample AG conditional odds ratios and the marginal odds ratio. Interpret, and explain why they give such different indications of the AG association.


Table 2.10:

Based on 1987 murder rates in the United States, an Associated Press story reported that the probability that a newborn child has of eventually being a murder victim is 0.0263 for nonwhite males, 0.0049 for white males, 0.0072 for nonwhite females, and 0.0023 for white females.

a. Find the conditional odds ratios between race and whether a murder victim, given the gender. Interpret. Do these variables exhibit homogeneous association?

b. Half the newborns are of each gender, for each race. Find the marginal odds ratio between race and whether a murder victim.
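The conditional and marginal odds ratios can be computed from the four reported probabilities. The sketch below follows the equal-gender-split assumption stated in part (b); the helper `odds` and the variable names are illustrative.

```python
def odds(p):
    return p / (1 - p)

p = {
    ("nonwhite", "male"): 0.0263, ("white", "male"): 0.0049,
    ("nonwhite", "female"): 0.0072, ("white", "female"): 0.0023,
}

# Part (a): race-by-victim odds ratio within each gender.
or_males = odds(p[("nonwhite", "male")]) / odds(p[("white", "male")])
or_females = odds(p[("nonwhite", "female")]) / odds(p[("white", "female")])

# Part (b): average over gender with weights 1/2 for each race, then compare races.
p_nonwhite = 0.5 * p[("nonwhite", "male")] + 0.5 * p[("nonwhite", "female")]
p_white = 0.5 * p[("white", "male")] + 0.5 * p[("white", "female")]
or_marginal = odds(p_nonwhite) / odds(p_white)

print(round(or_males, 2), round(or_females, 2), round(or_marginal, 2))
```

Unequal conditional odds ratios mean the association is not homogeneous across gender.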

At each age level, the death rate is higher in South Carolina than in Maine, but overall, the death rate is higher in Maine. Explain how this could be possible.

A study of the death penalty for cases in Kentucky between 1976 and 1991 (T. Keil and G. Vito, Amer. J. Criminal Justice 20: 17–36, 1995) indicated that the defendant received the death penalty in 8% of the 391 cases in which a white killed a white, in 2% of the 108 cases in which a black killed a black, in 12% of the 57 cases in which a black killed a white, and in 0% of the 18 cases in which a white killed a black. Form the three-way contingency table, obtain the conditional odds ratios between the defendant’s race and the death penalty verdict, interpret those associations, study whether Simpson’s paradox occurs, and explain why the marginal association is so different from the conditional associations.

Table 2.12 summarizes responses of 91 married couples in Arizona to a question about how often sex is fun. Find and interpret a measure of association between wife’s response and husband’s response.


Table 2.12:

Let D denote having a certain disease and E denote having exposure to a certain risk factor. The attributable risk (AR) is the proportion of disease cases attributable to that exposure.

a. Let P(E̅) = 1 – P(E). Explain why AR= [P(D) –P(D|E̅)]/P(D).

b. Show that AR relates to the relative risk RR by AR = [P(E)(RR – 1)]/[1 + P(E)(RR – 1)].

For given π1 and π2 show that the relative risk cannot be farther than the odds ratio from their independence value of 1.0.

Explain why for three events E1, E2, and E3 and their complements, it is possible that P(E1 | E2) > P(E1 | E̅2) even if both P(E1 | E2, E3) < P(E1 | E̅2, E3) and P(E1 | E2, E̅3) < P(E1 | E̅2, E̅3).

Let πij|k = P(X = i, Y = j|Z = k). Explain why XY conditional independence is πij|k = πi+|k π+j|k for all i and j and k.

For a 2 × 2 × 2 table, show that homogeneous association is a symmetric property, by showing that equal XY conditional odds ratios is equivalent to equal YZ conditional odds ratios.

Smith and Jones are baseball players. Smith has a higher batting average than Jones in each of K years. Is it possible that for the combined data from the K years, Jones has the higher batting average? Explain, using an example to illustrate.
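One way to look for an illustrating example is to write down candidate records and check both conditions mechanically. The sketch below uses K = 2 years with invented (hits, at-bats) figures; the numbers are hypothetical and are not from the text.

```python
from fractions import Fraction

# Hypothetical (hits, at_bats) for each of K = 2 years; invented for illustration.
smith = [(4, 10), (25, 100)]
jones = [(35, 100), (2, 10)]


def yearly_average(hits, at_bats):
    """Batting average for a single year, kept exact with Fraction."""
    return Fraction(hits, at_bats)


def combined_average(records):
    """Batting average after pooling hits and at-bats over all years."""
    total_hits = sum(h for h, _ in records)
    total_at_bats = sum(ab for _, ab in records)
    return Fraction(total_hits, total_at_bats)


smith_higher_each_year = all(
    yearly_average(*s) > yearly_average(*j) for s, j in zip(smith, jones)
)
jones_higher_overall = combined_average(jones) > combined_average(smith)
print(smith_higher_each_year, jones_higher_overall)
```

The reversal in this sketch comes from the very unequal at-bats: each player's pooled average is dominated by the year in which he batted most often.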

When X and Y are conditionally dependent at each level of Z yet marginally independent, Z is called a suppressor variable. Specify joint probabilities for a 2 × 2 × 2 table to show that this can happen 

(a) When there is homogeneous association, 

(b) When the association has opposite direction in the partial tables.

Suppose that {Yij} are independent Poisson variates with means {µij}. Show that P(Yij = nij) for all i, j, conditional on {Yi+ = ni+}, satisfy independent multinomial sampling [i.e., the product of (2.2) for all i] within the rows.

For 2 × 2 tables, Yule (1900, 1912) introduced

Q = (π11π22 – π12π21)/(π11π22 + π12π21),

which he labeled Q in honor of the Belgian statistician Quetelet. It is now called Yule’s Q.

a. Show that for 2 × 2 tables, Goodman and Kruskal’s γ = Q.

b. Show that Q falls between – 1 and 1.

c. State conditions under which Q = – 1 or Q = 1.

d. Show that Q relates to the odds ratio by Q = (θ – 1)/(θ + 1), a monotone transformation of θ from the [0, ∞] scale onto the [–1, +1] scale.
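The relation in part (d) and the boundary cases in part (c) are easy to probe numerically. The sketch below assumes the usual sample form of Yule's Q, (n11 n22 – n12 n21)/(n11 n22 + n12 n21); the function names and the cell counts are hypothetical.

```python
def odds_ratio(n11, n12, n21, n22):
    """Sample odds ratio for a 2 x 2 table (requires n12 * n21 > 0)."""
    return (n11 * n22) / (n12 * n21)


def yules_q(n11, n12, n21, n22):
    """Assumed sample form of Yule's Q: (n11*n22 - n12*n21) / (n11*n22 + n12*n21)."""
    return (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)


# Hypothetical counts, for illustration only.
table = (20, 10, 5, 15)
theta = odds_ratio(*table)
q = yules_q(*table)
q_from_theta = (theta - 1) / (theta + 1)
print(q, q_from_theta)  # the two values should agree

# Boundary cases: an empty off-diagonal gives Q = 1, an empty main diagonal gives Q = -1.
print(yules_q(10, 0, 0, 10), yules_q(0, 10, 10, 0))
```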

When X and Y are ordinal with counts {nij}:

a. Explain why the (n choose 2) = n(n – 1)/2 pairs of observations partition into C + D + TX + TY – TXY, where TX = ∑i ni+(ni+ – 1)/2 pairs are tied on X, TY pairs are tied on Y, and TXY pairs are tied on both X and Y.

b. 

Explain why d is the difference between the proportions of concordant and discordant pairs out of those pairs untied on X (Somers 1962). (For 2 × 2 tables, d equals the difference of proportions, and tau-b equals the correlation between X and Y.)
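The partition in part (a) can be verified by brute force on a small table. The sketch below counts concordant pairs (C), discordant pairs (D), and the three kinds of ties; the function name and the table are hypothetical, and TY and TXY are assumed to be defined analogously to TX using column totals and cell counts.

```python
def pair_counts(table):
    """Return (n, C, D, TX, TY, TXY) for an I x J table of counts with ordered rows and columns."""
    n_rows, n_cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # second observation is higher on X
                for m in range(n_cols):
                    if m > j:                   # also higher on Y: concordant
                        concordant += table[i][j] * table[k][m]
                    elif m < j:                 # lower on Y: discordant
                        discordant += table[i][j] * table[k][m]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    tied_x = sum(r * (r - 1) // 2 for r in row_totals)
    tied_y = sum(c * (c - 1) // 2 for c in col_totals)
    tied_xy = sum(x * (x - 1) // 2 for row in table for x in row)
    return n, concordant, discordant, tied_x, tied_y, tied_xy


# Hypothetical 2 x 2 table of counts.
n, C, D, TX, TY, TXY = pair_counts([[2, 1], [1, 2]])
print(C + D + TX + TY - TXY, n * (n - 1) // 2)  # both sides of the partition
```

TXY is subtracted because pairs tied on both variables are counted once in TX and again in TY.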

The measure of association lambda for nominal variables (Goodman and Kruskal 1954) has V(Y) = 1 – max{π+j} and V(Y|i) = 1 – maxj{πj|i}. Interpret lambda as a proportional reduction in prediction error for predictions which select the response category that is most likely. Show that independence implies λ = 0 but that the converse is not true.

For a 2 × 2 table, consider H0: π11 = θ², π12 = π21 = θ(1 – θ), π22 = (1 – θ)².

a. Show that the marginal distributions are identical and that independence holds.

b. For a multinomial sample, under H0 show that θ̂ = (p1+ + p+1)/2.

c. Explain how to test H0. Show that df = 2 for the test statistic.

Show that X² = n∑i∑j(pij – pi+ p+j)²/(pi+ p+j). Thus, X² can be large when n is large, regardless of whether the association is practically important. Explain why this test, like other tests, simply indicates the degree of evidence against H0 and does not describe strength of association. (“Like fire, the chi-square test is an excellent servant and a bad master,” Sir Austin Bradford Hill, Proc. Roy. Soc. Med. 58: 295–300, 1965.)
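The dependence of the Pearson statistic on n can be seen directly by scaling a table. The sketch below computes the statistic from counts and from sample proportions, then multiplies every count by 10; the function names and the table are hypothetical.

```python
def pearson_from_counts(table):
    """Pearson statistic: sum of (observed - expected)^2 / expected under independence."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


def pearson_from_proportions(table):
    """Same statistic written as n * sum (p_ij - p_i+ p_+j)^2 / (p_i+ p_+j)."""
    n = sum(sum(row) for row in table)
    p = [[x / n for x in row] for row in table]
    p_row = [sum(row) for row in p]
    p_col = [sum(col) for col in zip(*p)]
    return n * sum(
        (p[i][j] - p_row[i] * p_col[j]) ** 2 / (p_row[i] * p_col[j])
        for i in range(len(p_row))
        for j in range(len(p_col))
    )


# Hypothetical table, then the same proportions with ten times the sample size.
small = [[30, 20], [20, 30]]
large = [[10 * x for x in row] for row in small]
x2_small = pearson_from_counts(small)
x2_large = pearson_from_counts(large)
print(x2_small, pearson_from_proportions(small), x2_large)
```

The sample proportions, and so any measure of association based on them, are identical in the two tables; only the statistic changes.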

An I × J table has ordered columns and unordered rows. Ridits (Bross 1958) are data-based column scores. The jth sample ridit is the average cumulative proportion within category j,

r̂j = ∑k<j p+k + p+j/2.

The sample mean ridit in row i is R̂i = ∑j r̂j pj|i. Show that ∑j p+j r̂j = 0.50 and ∑i pi+ R̂i = 0.50.

For multinomial sampling, use the asymptotic variance of log θ̂ to show that for Yule’s Q the asymptotic variance of

Using the delta method, show that the Wald confidence interval for the logit of a binomial parameter π is log[π̂/(1 – π̂)] ± zα/2/√[nπ̂(1 – π̂)]. Explain how to use this interval to obtain one for π itself. [Newcombe (2001) noted that the sample logit is also the midpoint of the score interval for π, on the logit scale. He showed that this logit interval contains the score interval.]
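The back-transformation step can be sketched in a few lines: build the interval on the logit scale, then apply the inverse logit to each endpoint. The function names, the data (y = 15, n = 50), and the use of z = 1.96 as an approximate 97.5th normal percentile are assumptions made for illustration.

```python
import math


def expit(t):
    """Inverse of the logit transform."""
    return 1 / (1 + math.exp(-t))


def logit_wald_interval(y, n, z=1.96):
    """Wald interval for logit(pi) and its back-transform to pi.

    Undefined when y = 0 or y = n, because the sample logit is infinite.
    """
    p_hat = y / n
    center = math.log(p_hat / (1 - p_hat))
    se = 1 / math.sqrt(n * p_hat * (1 - p_hat))
    logit_ci = (center - z * se, center + z * se)
    pi_ci = (expit(logit_ci[0]), expit(logit_ci[1]))
    return logit_ci, pi_ci


# Hypothetical data: 15 successes in 50 trials.
logit_ci, pi_ci = logit_wald_interval(y=15, n=50)
print(logit_ci, pi_ci)
```

Because the inverse logit is monotone, the transformed endpoints always lie in (0, 1), although the resulting interval for π is generally not symmetric about π̂.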

For comparing two binomial samples, show that the standard error (3.1) of a log odds ratio increases as the absolute difference of proportions of successes and failures for a given sample increases.

Is θ̂ the midpoint of large- and small-sample confidence intervals for θ? Why or why not?

An advertisement by Schering Corp. in 1999 for the allergy drug Claritin mentioned that in a pediatric randomized clinical trial, symptoms of nervousness were shown by 4 of 188 patients on loratadine (Claritin), 2 of 262 patients taking placebo, and 2 of 170 patients on chlorpheniramine. In each part below, explain which method you used, and why.

a. Is there inferential evidence that nervousness depends on drug?

b. For the Claritin and placebo groups, construct and interpret a 95% confidence interval for the (i) odds ratio and (ii) difference of proportions suffering nervousness.
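As a starting point for part (b), the sketch below computes large-sample Wald intervals from the Claritin and placebo counts given above. Choosing the Wald method here is an assumption made for illustration: with counts this small, a score or exact method may be preferable, and justifying the choice is part of the exercise. The function names and z = 1.96 are also assumptions.

```python
import math


def wald_ci_odds_ratio(y1, n1, y2, n2, z=1.96):
    """Wald interval for the odds ratio, built on the log scale."""
    n11, n12, n21, n22 = y1, n1 - y1, y2, n2 - y2
    theta_hat = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    log_theta = math.log(theta_hat)
    return theta_hat, (math.exp(log_theta - z * se_log), math.exp(log_theta + z * se_log))


def wald_ci_difference(y1, n1, y2, n2, z=1.96):
    """Wald interval for the difference of two independent binomial proportions."""
    p1, p2 = y1 / n1, y2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, (diff - z * se, diff + z * se)


# Counts from the exercise: 4 of 188 on Claritin, 2 of 262 on placebo.
theta_hat, or_ci = wald_ci_odds_ratio(4, 188, 2, 262)
diff, diff_ci = wald_ci_difference(4, 188, 2, 262)
print(theta_hat, or_ci)
print(diff, diff_ci)
```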

Consider a 3 × 3 table having entries, by row, of (4, 2, 0 / 2, 2, 2 / 0, 2, 4). Conduct an exact test of independence, using X². Assuming ordered rows and columns and using equally spaced scores, conduct an ordinal exact test. Explain why results differ so much.

A study considered the effect of prednisolone on severe hypercalcaemia in women with metastatic breast cancer (B. Kristensen et al., J. Intern. Med. 232: 237–245, 1992). Of 30 patients, 15 were randomly selected to receive prednisolone. The other 15 formed a control group. Normalization in their level of serum-ionized calcium was achieved by 7 of the treated patients and none of the control group. Analyze whether results were significantly better for treatment than for control. Interpret.
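One possible analysis is a one-sided Fisher exact test, which conditions on both margins and uses the hypergeometric distribution. The sketch below is a minimal version under that assumption; the function name is hypothetical, and other analyses (for example, a mid-P-value) could equally be used.

```python
from math import comb


def fisher_one_sided(n11, n12, n21, n22):
    """One-sided Fisher exact P-value: P(first cell >= n11) with both margins fixed."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    n = row1 + row2
    upper = min(row1, col1)
    # math.comb returns 0 when the lower index exceeds the upper, so impossible tables drop out.
    numerator = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(n11, upper + 1))
    return numerator / comb(n, col1)


# Counts from the exercise: 7 of 15 treated and 0 of 15 controls normalized.
p_value = fisher_one_sided(7, 8, 0, 15)
print(p_value)
```

Here the observed table is the most extreme one possible given the margins, so the sum has a single term.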

Table 3.13 shows the results of a retrospective study comparing radiation therapy with surgery in treating cancer of the larynx. The response indicates whether the cancer was controlled for at least two years following treatment. Table 3.14 shows SAS output.

a. Report and interpret the P-value for Fisher’s exact test with (i) Ha: θ > 1, and (ii) Ha: θ ≠ 1. Explain how the P-values are calculated.

b. Interpret the confidence intervals for θ. Explain the difference between them and how they were calculated.

c. Find and interpret the one-sided mid-P-value. Give advantages and disadvantages of this type of P-value.


Table 3.13:


Table 3.14:

A study on educational aspirations of high school students (S. Crysdale, Internat. J. Compar. Sociol. 16: 19–36, 1975) measured aspirations with the scale (some high school, high school graduate, some college, college graduate). The student counts in these categories were (11, 52, 23, 22) when family income was low, (9, 44, 13, 10) when family income was middle, and (9, 41, 12, 27) when family income was high.

a. Test independence of educational aspirations and family income using X² or G². Explain the deficiency of this test for these data.

b. Find the standardized Pearson residuals. Do they suggest any association pattern?

c. Conduct an alternative test that may be more powerful. Interpret.

Refer to Table 7.8. For the combined data for the two genders, yielding a single 4 × 4 table, X² = 11.5 (P = 0.24), whereas using row scores (3, 10, 20, 35) and column scores (1, 3, 4, 5), M² = 7.04 (P = 0.008). Explain why the results are so different.


Table 7.8:

Table 3.12 classifies a sample of psychiatric patients by their diagnosis and by whether their treatment prescribed drugs.

Partition chi-squared into three components to describe differences and similarities among the diagnoses, by comparing (i) the first two rows, (ii) the third and fourth rows, and (iii) the last row to the first and second rows combined and the third and fourth rows combined.


Table 3.12:

Project Blue Book: Analysis of Reports of Unidentified Aerial Objects was published by the U.S. Air Force (Air Technical Intelligence Center at Wright-Patterson Air Force Base) in May 1955 to analyze reports of unidentified flying objects (UFOs). In its Table II, the report classified 1765 sightings later regarded as known objects and 434 sightings later regarded as unknown, according to the object color (nine categories).

The report states: “The chi-square test is applicable only to distributions which have the same number of elements,” so the investigators multiplied all counts in the known category by (434/1765), so each row has 434 observations, before computing X². They reported X² = 26.15 with df = 8. Explain why this is incorrect. What should X² equal?

Refer to Table 2.1. Partition G² for testing whether the incidence of heart attacks is independent of aspirin intake into two components. Interpret.


Table 2.1:

In a study of the relationship between stage of breast cancer at diagnosis (local or advanced) and a woman’s living arrangement, of 144 women living alone, 41.0% had an advanced case; of 209 living with spouse, 52.2% were advanced; of 89 living with others, 59.6% were advanced. The authors reported the P-value for the relationship as 0.02 (D. J. Moritz and W. A. Satariano, J. Clin. Epidemiol. 46: 443–454, 1993). Reconstruct the analysis performed to obtain this P-value.
