Question:
NAME: Quantitative Analysis Due:

Assignment 5: Sampling Distributions, Estimation, and Confidence Intervals

Sampling Distributions

Inferential statistics involves three distributions:

- A population distribution: variation in the larger group (the population) that we want to know about.
- A distribution of sample observations: variation in the sample that we can observe.
- A sampling distribution: a normal distribution whose mean and standard deviation are unbiased estimates of the population parameters, which allows one to infer the parameters from the sample statistics.

While we can try to take samples and observe the means and the distribution of their means, we are interested in the distribution that represents all the possible samples of a given size from a population. This sampling distribution is a theoretical probability distribution of all possible sample values for the statistic (in this case, the mean) in which we are interested.

The central limit theorem

What does the Central Limit Theorem tell us?

- Even if a population distribution is skewed, the sampling distribution of the mean is approximately normal when the sample size is large.
- As the sample size gets larger, the mean of the sampling distribution approaches (gets closer to) the population mean.
- As the sample size gets larger, the standard error of the sampling distribution decreases (which means that the variability in the sample estimates from sample to sample decreases as N increases).

How we use this information

Like the population distribution of individuals and the distribution of individuals who fall in any given sample, the sampling distribution can be thought of in terms of its mean and its standard deviation. The standard error of the mean is the standard deviation of the sampling distribution.

For a variable, X:

\(\mu_x\) = the mean of the population
\(\bar{X}\) = the mean of the sample
N = sample size
\(\sigma_x\) = population standard deviation
\(S_x\) = sample standard deviation
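The central limit theorem points listed above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the right-skewed population (exponential draws), the sample sizes, the number of repeated samples, and the helper name `simulate_sample_means` are all hypothetical choices made for illustration and are not part of the assignment.

```python
import random
import statistics


def simulate_sample_means(population, n, num_samples, seed=0):
    """Draw num_samples random samples of size n (with replacement)
    from population and return the list of their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=n))
        for _ in range(num_samples)
    ]


# Hypothetical right-skewed population (assumption: exponential draws).
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.fmean(population)

# For each sample size, store (mean of sample means, std. dev. of sample means).
results = {}
for n in (25, 100, 400):
    means = simulate_sample_means(population, n, num_samples=2_000, seed=n)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"N={n}: mean of means={results[n][0]:.3f}, "
          f"spread of means={results[n][1]:.3f}")

print(f"population mean={pop_mean:.3f}")
```

In a run of this sketch, the mean of the sample means should sit close to the population mean, and the spread of the sample means should shrink as N grows, which is the behavior the third bullet describes.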
Furthermore:

\(\mu_{\bar{X}}\) = the mean of the sampling distribution, i.e. the mean of the means from all possible samples of a given size. It will equal the population mean (\(\mu_x\)) with large samples.

S.E. = the standard deviation of the sampling distribution, which we will now call the standard error. It is calculated by dividing the standard deviation of the population (\(\sigma_x\)) by \(\sqrt{N}\):

\[ \text{S.E.} = \frac{\sigma_x}{\sqrt{N}} \]

(Note: When we don't know the value of \(\sigma_x\), we can substitute the sample standard deviation (\(S_x\)) to determine the S.E.)

The standard error therefore is affected by two things: the variability in the population and the size of the sample.

The dilemma is that if sample estimates, i.e. the sample means, vary and if most result in some chance sampling error, how much confidence can we place in them as estimates of the population mean (\(\mu_x\))? The solution lies in that we know from the central limit theorem that the distribution of sample means is normal. Therefore we can determine the probability of a particular sample mean by calculating its z-score. In this case, however, we calculate the z-score as:

\[ Z = \frac{\bar{X} - \mu_{\bar{X}}}{\text{S.E.}} \]

We can now use the area under the normal curve table (on Moodle) to determine how likely a particular value is.

Answer the following questions:

The Law School Admission Test (LSAT) is designed so that test scores are normally distributed. Although the figures may vary slightly from one administration to the next, the mean LSAT score for the population of all test-takers since the last test revision in 1991 is 150, with a standard deviation of 10.

a. If you drew all possible random samples of size 100 from the population of LSAT test-takers and plotted the values of the mean from each sample, the resulting distribution would be the sampling distribution of the mean. Would this sampling distribution be a normal distribution? Why or why not?

b. Suppose that LSAT scores were not normally distributed. For a sample size of 100, would the sampling distribution of the mean be a normal distribution?
Why or why not?

c. What will be the value of the mean of the sampling distribution described in (a)?

d. Calculate the value of the standard error of the mean for the sampling distribution described in (a).

e. Explain what the standard error measures or describes about the sampling distribution.

f. If your sample size were more than 100 cases, would the standard error be the same, higher, or lower? Why?

g. Will the standard error always be lower than the standard deviation? If so, why? If not, why not?

h. For a sample size of 100, in what percentage of samples would you expect to find a sample mean between 149 and 151? Show your work.

i. If you used a sample of 400 test-takers instead of 100, what is the probability that the sample mean would be between 149 and 151? Show your work.

j. What happens to the sampling distribution as the sample size increases from 100 to 400?

Estimation and Confidence Intervals:

In inferential statistics we are interested in making estimates of the population parameter (in this case the mean) from one sample that we draw. We need to have an idea about how good the estimate is. We will begin by thinking about two kinds of estimates. The first is a point estimate. For a point estimate of the mean of the population we use the mean of the sample. However, we would get somewhat different point estimates from different samples. Therefore, we may want to use a second kind of estimate, an interval estimate, i.e. we identify a range of values within which the population parameter will likely fall. This is called a confidence interval.

To determine the confidence intervals we can use the formula:

\[ \text{CI} = \bar{X} \pm Z(\text{S.E.}) \]

Where:

\(\bar{X}\) = sample mean
Z = Z-score for one half the acceptable error
S.E. = the estimated standard error, based on the sample standard deviation

For example: For a 95% confidence interval, we would use a Z-score of 1.96 to calculate the interval.
A 99% confidence interval will be larger (and therefore less precise), and we would use the value 2.58 to construct the interval.

Remember that in calculating the standard error we would use the standard deviation of the population (\(\sigma_x\)) if we knew it. However, we don't usually know it, so we substitute the standard deviation of the sample in the formula; that is what we have done in the equation above.

The value Z(S.E.) can be thought of as the margin of error. To calculate the margin of error and find the 95% confidence interval for the protein intake of a sample of 267 men with a sample mean of 77.0 grams and a sample standard deviation of 58.6 grams, we would proceed as follows:

\[ \text{margin of error} = 1.96(\text{S.E.}) = 1.96 \times \frac{58.6}{\sqrt{267}} = 7.0 \]

The confidence interval is \(77.0 \pm 7.0\). In other words, a range from 70.0 to 84.0 grams.

The correct interpretation is as follows: We can be 95% confident that the mean of the population (\(\mu_x\)) will fall in the interval. In other words, if we took 100 samples of size 267 from the population, we would expect about 95 of those samples to have a confidence interval that contains the population mean.

Answer the following questions:

Television viewing is not only a convenient and inexpensive form of entertainment; it is also a major advertising medium. Because billions of dollars are spent annually to introduce and promote products on television, describing the television viewing habits of various segments of the population is a big business in itself, carried out by the major networks, companies of all types, marketing firms, and even by political candidates. Using data from the 1993 GSS, Table 1 shows the mean hours of daily television viewing among those with different household incomes.

Table 1. Daily Hours of Television Viewing by Level of Household Income

                       Hours of Daily Television Viewing
Household Income       Mean    Standard Deviation    N
Less than $25,000      3.65    2.79                  582
$25,000 or more        2.37    1.55                  845

Source: General Social Survey, 1993.

a.
Explain the difference between a population parameter and a sample statistic.

A population parameter is

A sample statistic is

b. Provide a point estimate of the mean hours of daily television viewing among those with household incomes of $25,000 or more. Remember to report units.

c. Among those with household incomes of $25,000 or more, the 95 percent confidence interval for mean hours of daily television viewing is 2.37 ± .10, where .10 is the margin of error. Interpret this interval in a complete sentence.

d. Using a 95 percent confidence interval, estimate the mean hours of daily television viewing by those with household incomes under $25,000. Show calculations and remember to use units.

e. Will the 99 percent confidence interval give a more precise estimate of the mean hours of daily television viewing in each income group (versus the 95 percent C.I.)? Why or why not?
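The margin-of-error arithmetic in the protein-intake worked example from the confidence interval section can be checked with a few lines of code. This is a minimal sketch: the function name `confidence_interval` and its parameter names are hypothetical, and the inputs (mean 77.0 grams, sample standard deviation 58.6 grams, N = 267, Z = 1.96) are the figures given in that worked example.

```python
import math


def confidence_interval(mean, sd, n, z=1.96):
    """Return (lower, upper, margin) for mean +/- z * (sd / sqrt(n)).

    sd is the sample standard deviation, used as a stand-in for the
    unknown population standard deviation, as the handout describes.
    """
    standard_error = sd / math.sqrt(n)
    margin = z * standard_error
    return mean - margin, mean + margin, margin


# Figures from the protein-intake example in the handout.
low, high, margin = confidence_interval(mean=77.0, sd=58.6, n=267, z=1.96)
print(f"margin of error = {margin:.1f} grams")
print(f"95% CI = {low:.1f} to {high:.1f} grams")
```

Passing `z=2.58` instead gives the corresponding 99% interval, which comes out wider because the same standard error is multiplied by a larger Z value.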