Question:

Consider the following "case-control" sample selection method for binary dependent variables. Intuitively, if we are working with a problem in which the event of interest is rare, we want to make sure that we sample a sufficient number of events so that our estimation procedures are reliable. Suppose that we have a large database consisting of \(\left\{y_{i}, \mathbf{x}_{i}\right\}, i=1, \ldots, N\) observations. (For insurance company records, \(N\) could easily be 10 million or more.) We want to make sure to get plenty of \(y_{i}=1\) (corresponding to claims or cases) in our sample, plus a sample of \(y_{i}=0\) (corresponding to nonclaims or controls). Thus, we split the dataset into two subsets. For the first subset, consisting of observations with \(y_{i}=1\), we take a random sample with probability \(\tau_{1}\). Similarly, for the second subset, consisting of observations with \(y_{i}=0\), we take a random sample with probability \(\tau_{0}\). For example, in practice we might use \(\tau_{1}=1\) and \(\tau_{0}=0.10\), corresponding to taking all of the claims and a \(10\%\) sample of nonclaims; thus, \(\tau_{0}\) and \(\tau_{1}\) are considered known to the analyst.

a. Let \(\left\{r_{i}=1\right\}\) denote the event that the observation is selected to be part of the analysis. Determine \(\operatorname{Pr}\left(y_{i}=1, r_{i}=1\right), \operatorname{Pr}\left(y_{i}=0, r_{i}=1\right)\) and \(\operatorname{Pr}\left(r_{i}=1\right)\) in terms of \(\tau_{0}, \tau_{1}\), and \(\pi_{i}=\operatorname{Pr}\left(y_{i}=1\right)\).
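One way to see part (a) (a sketch, assuming selection depends only on \(y_i\) through the known rates \(\tau_1\) and \(\tau_0\)):

\[
\operatorname{Pr}\left(y_{i}=1, r_{i}=1\right)=\tau_{1} \pi_{i}, \qquad \operatorname{Pr}\left(y_{i}=0, r_{i}=1\right)=\tau_{0}\left(1-\pi_{i}\right),
\]

and hence, by the law of total probability,

\[
\operatorname{Pr}\left(r_{i}=1\right)=\tau_{1} \pi_{i}+\tau_{0}\left(1-\pi_{i}\right).
\]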

b. Using the calculations in part (a), determine the conditional probability 

\(\operatorname{Pr}\left(y_{i}=1 \mid r_{i}=1\right)\).
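A sketch of the calculation, applying the definition of conditional probability to the part (a) quantities:

\[
\operatorname{Pr}\left(y_{i}=1 \mid r_{i}=1\right)=\frac{\operatorname{Pr}\left(y_{i}=1, r_{i}=1\right)}{\operatorname{Pr}\left(r_{i}=1\right)}=\frac{\tau_{1} \pi_{i}}{\tau_{1} \pi_{i}+\tau_{0}\left(1-\pi_{i}\right)}.
\]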

c. Now assume that \(\pi_{i}\) has a logistic form: \(\pi(z)=\exp(z)/(1+\exp(z))\) and \(\pi_{i}=\pi\left(\mathbf{x}_{i}^{\prime} \boldsymbol{\beta}\right)\). Rewrite your answer to part (b) using this logistic form.
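As a sketch: writing \(z_{i}=\mathbf{x}_{i}^{\prime} \boldsymbol{\beta}\), substituting \(\pi_{i}=e^{z_{i}} /\left(1+e^{z_{i}}\right)\) into the part (b) ratio and multiplying numerator and denominator by \(1+e^{z_{i}}\) gives

\[
\operatorname{Pr}\left(y_{i}=1 \mid r_{i}=1\right)=\frac{\tau_{1} e^{z_{i}}}{\tau_{1} e^{z_{i}}+\tau_{0}}=\frac{e^{z_{i}+\ln \left(\tau_{1} / \tau_{0}\right)}}{1+e^{z_{i}+\ln \left(\tau_{1} / \tau_{0}\right)}}=\pi\!\left(\mathbf{x}_{i}^{\prime} \boldsymbol{\beta}+\ln \frac{\tau_{1}}{\tau_{0}}\right),
\]

which is again of logistic form.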

d. Write the likelihood of the observed \(y_{i}\)'s (conditional on \(r_{i}=1\), \(i=1, \ldots, n\)). Show how we can interpret this as the usual logistic regression likelihood with the exception that the intercept has changed. Specify the new intercept in terms of the original intercept, \(\tau_{0}\) and \(\tau_{1}\).
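To make the intercept correction concrete, here is a minimal numerical sketch (not from the text; the sample size, true coefficients, and hand-rolled Newton-Raphson fitter are all illustrative assumptions). It simulates a rare-event population, draws a case-control sample with \(\tau_1=1\) and \(\tau_0=0.10\), fits an ordinary logistic regression to the retained observations, and then recovers the original intercept by subtracting \(\ln(\tau_1/\tau_0)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a large population with a rare event (hypothetical parameters).
N = 200_000
beta0, beta1 = -4.0, 1.0                 # true intercept and slope
x = rng.normal(size=N)
pi = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
y = rng.binomial(1, pi)

# Case-control sampling: keep all cases (tau1 = 1), 10% of controls (tau0 = 0.1).
tau1, tau0 = 1.0, 0.10
keep = np.where(y == 1, rng.random(N) < tau1, rng.random(N) < tau0)
xs, ys = x[keep], y[keep]

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of a logistic regression with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        H = X.T @ ((p * (1.0 - p))[:, None] * X)   # observed information
        g = X.T @ (y - p)                          # score vector
        b = b + np.linalg.solve(H, g)
    return b

b_hat = fit_logistic(xs, ys)
# The raw intercept estimates beta0 + ln(tau1 / tau0); subtract the offset.
b0_corrected = b_hat[0] - np.log(tau1 / tau0)
```

Under this sampling scheme the slope estimate remains consistent for \(\beta_1\), while the raw intercept converges to \(\beta_0+\ln(\tau_1/\tau_0)\); with these rates the offset is \(\ln(10)\approx 2.30\).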
