CSSS/SOC/STAT 221 students from the Summer 2019, Fall 2019, Winter 2020, and Summer 2020 quarters were asked to report the lowest temperature they remembered experiencing, reported in degrees Fahrenheit (F). Among the Summer 2019 students, 26 out of 37 reported valid temperatures (the remaining 11 opted not to respond to this question). Among the Fall 2019 students, 159 out of 188 reported valid temperatures (the remaining 29 opted not to respond to this question). Among the Winter 2020 students, 204 out of 238 reported valid temperatures (32 opted not to answer, while 2 more reported temperatures that were impossibly high for that quarter). Among the Summer 2020 students, 51 out of 62 reported valid temperatures (10 opted not to answer, while 1 more reported a temperature that was impossibly high for that quarter). In total, 440 out of 525 students provided valid responses to this question (an 83.81% valid response rate).

The side-by-side box plots below suggest that the temperature experiences of these students were broadly similar across the quarters from Summer 2019 through Summer 2020:

The following figure further reinforces this assertion:

The gray histogram presents the sample distribution of all 440 valid responses. The solid black, red, green, and blue lines present smoothed representations of each quarter's sample distribution (they are similar to the hollow histograms discussed in chapter 2 of your textbook). As you can see, the location, scale, and shape of all four quarter-specific sample distributions are very similar to one another. Their medians vary between 10 and 12.5; their first quartiles vary between -5 and 0; their third quartiles vary between 20 and 24; and their interquartile ranges vary between 20 and 29. All four samples are also clearly left-skewed. When combined into a single sample, the quartiles {Q0, Q1, Q2, Q3, Q4} for all 440 valid cases are {-60, -1, 12, 22, 60} degrees F, implying an interquartile range of 23 degrees.

The dashed line in the above graph represents a new parametric probability distribution model for continuous numerical variables called the Gumbel distribution, which has been fitted to the combined student data. The Gumbel model belongs to a set of models called "extreme value distributions," which are widely used to approximate the probability distributions underlying "lowest value from median" or "highest value from median" variables, for example scores for a particular sporting event observed over multiple years; daily website user traffic; market anomalies; etc. These distributions tend to be left-skewed for "lowest value per observational unit" data or right-skewed for "highest value per observational unit" data. When applied to left-skewed data such as the temperature variable described above, the Gumbel model has two parameters, the location parameter (written here as $\alpha$) and the scale parameter (written here as $\beta$). The probability function for this model based on these two parameters is

$$f(x) = \frac{1}{\beta}\exp\left[\frac{x-\alpha}{\beta} - \exp\left(\frac{x-\alpha}{\beta}\right)\right] \tag{1}$$

The location parameter $\alpha$ is the mode (peak) of the distribution, which is related to the median[1] ($m$) and the mean ($\mu$) as

$$\alpha = m - \beta\ln(\ln 2) \approx m + 0.3665\,\beta, \qquad \alpha \approx \mu + 0.5772\,\beta$$

The scale parameter $\beta$ is related to the interquartile range ($IQR$) and the standard deviation ($\sigma$) as

$$\beta = \frac{IQR}{\ln(\ln 4) - \ln\left(\ln\frac{4}{3}\right)} \approx \frac{IQR}{1.5725}, \qquad \beta = \frac{\sigma\sqrt{6}}{\pi}$$

(where 0.5772 approximates the Euler-Mascheroni constant and $\pi$ is the irrational mathematical constant approximately equal to 3.1416). The dashed line in the graph above is based on point estimates of the two Gumbel parameters, $\hat{\alpha}$ and $\hat{\beta}$, fitted to the combined student temperature data.
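As a rough illustration of how the location and scale relationships can be used, the Python sketch below evaluates the left-skewed (minimum) Gumbel density and backs out parameter values from the combined sample's median (12) and interquartile range (23). The function names are hypothetical, and the quantile-matching values it produces are an assumption for illustration only: they are not the point estimates behind the dashed line, which may have been fitted by a different method.

```python
import math

def gumbel_min_pdf(x, loc, scale):
    """Density of the left-skewed (minimum) Gumbel distribution."""
    z = (x - loc) / scale
    return math.exp(z - math.exp(z)) / scale

def params_from_median_iqr(median, iqr):
    """Solve the median and IQR relationships for location and scale.

    This is simple quantile matching, not a maximum-likelihood fit.
    """
    scale = iqr / (math.log(math.log(4)) - math.log(math.log(4 / 3)))
    loc = median - scale * math.log(math.log(2))
    return loc, scale

# Combined-sample median and IQR reported in the text (12 and 23 degrees F).
loc, scale = params_from_median_iqr(median=12, iqr=23)
print(f"location ~ {loc:.2f}, scale ~ {scale:.2f}")

# The density is highest at the location parameter (the mode).
print(gumbel_min_pdf(loc, loc, scale) > gumbel_min_pdf(loc - 5, loc, scale))
```

Because the location parameter is the mode, the density evaluated at `loc` should exceed the density at any nearby point, which gives a quick sanity check on the implementation.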

The plots on the next five pages present (1) a histogram for each quarter of the student data described above (pages 4-7), as well as for the combined dataset (page 8); (2) histograms for six simulated random samples generated from the fitted Gumbel distribution, each with the same sample size as the real data shown on that page; and (3) a distribution of sample averages calculated for a large number (50,000) of identically generated samples, each with the same sample size as the real data shown on that page. The average of each sample is shown in each of the first seven histograms with a dotted line, while the average of the real sample is represented as a dotted line in the eighth histogram. The lower and upper 2.5% cutoffs of the simulated sample averages are also represented in the eighth histogram with solid vertical lines.
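The simulation behind the last histogram on each page can be sketched as follows. This is a minimal illustration, not the code used to produce the plots: the location and scale values (17 and 15) are placeholder assumptions, the function names are hypothetical, and the number of replications is reduced from 50,000 to 2,000 so the example runs quickly in plain Python.

```python
import math
import random
import statistics

def draw_gumbel_min(rng, loc, scale):
    """One draw from the left-skewed Gumbel: loc + scale * ln(E), E ~ Exp(1)."""
    e = rng.expovariate(1.0)
    while e == 0.0:  # guard against log(0) in the (very rare) zero draw
        e = rng.expovariate(1.0)
    return loc + scale * math.log(e)

def simulate_sample_means(loc, scale, n, reps, seed=0):
    """Average of each of `reps` simulated samples of size `n`."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw_gumbel_min(rng, loc, scale) for _ in range(n))
        for _ in range(reps)
    ]

# Placeholder parameters (assumed); n = 159 matches the Fall 2019 sample size.
means = simulate_sample_means(loc=17.0, scale=15.0, n=159, reps=2000)

# quantiles(..., n=40) returns 39 cut points; the first and last are the
# 2.5% and 97.5% cutoffs of the simulated sample averages.
cuts = statistics.quantiles(means, n=40)
lower, upper = cuts[0], cuts[-1]
print(f"central 95% of simulated averages: ({lower:.2f}, {upper:.2f})")
```

A real sample average that falls between `lower` and `upper` would count as a typical outcome under the assumed model.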

Problem Set 5 questions

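1. The first through third quartiles of the Gumbel model may be calculated as follows

$$Q_1 = \alpha + \beta\ln\left(\ln\frac{4}{3}\right), \qquad Q_2 = \alpha + \beta\ln(\ln 2), \qquad Q_3 = \alpha + \beta\ln(\ln 4)$$

where $\alpha$ is the location parameter and $\beta$ is the scale parameter. Using these three equations and the point estimates of $\alpha$ and $\beta$ given on page 3 above, calculate the first through third quartiles for this model, to the fourth decimal place.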

Q1=______

Q2=______

Q3=______

If the fitted Gumbel distribution is a good model for the combined student data, parameters such as the quartiles should come close to the corresponding sample statistics. Is this true for the quartiles you just calculated compared to the sample quartiles given on page 2? Why or why not?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

2. Many probability distribution models in statistics are unimodal (i.e., they have only one peak). In contrast, multimodal sample distributions are quite common. If you examine the histograms for the student data on pages 4 through 8 (histogram in the first row, first column of each page), which sample histograms appear to have multiple pronounced modes? What appears to determine how many modes are in the sample distributions? What is the name of the principle in statistics that explains this pattern?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

3. If you examine the histograms of simulated samples on pages 4 through 8, does this support the explanation you offered in Question 2? Why or why not?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

4. As sample size increases, the distribution of a sample should approach the location, scale, and shape of its underlying probability distribution. If we assume that all four quarters of student data are sampled from the same population, do you think that the Gumbel distribution provides a reasonable approximation of the underlying probability distribution for this variable? Why or why not?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

5. The last histograms on pages 4 through 8 do not describe the distribution of a single sample but instead the distribution of 50,000 sample averages (i.e. the sample averages calculated for 50,000 separate samples). What is the name of the probability distribution toward which the distributions of sample statistics such as this converge as the number of identically generated samples approaches infinity?

______________________________________________________________________________

6. Do you think that the Gumbel distribution provides a reasonable approximation of the probability distribution toward which these distributions of sample averages converge as the number of samples approaches infinity? Why or why not? If not, what distribution might be a better candidate model? Why?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

7. If the fitted Gumbel distribution is a good model for the combined student data, the sample averages calculated for a real sample should fall within a typical range of sample averages resulting from that model. Based on the last histogram on pages 4 through 8, does the fitted Gumbel distribution appear to be a good model for each quarter's sample, as well as for the combined sample? Why or why not?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

8. Use Equation 1 (page 2; the probability function for the Gumbel model) and the point estimates of the location and scale parameters to evaluate the probability function at the value that you calculated in Question 1 above, to the fourth decimal place. You can (and should) check your solution against the dashed line in the figure on page 2. _______________________________________________________

9. How fun was your calculation in Question 8? (Be honest; there are no wrong answers.)

______________________________________________________________________________

10. Like the normal distribution, the Gumbel distribution goes on in both directions to infinity (negative and positive). Given what you know about the "lowest temperature" variable that CSSS/SOC/STAT 221 students generated (including how invalid cases were excluded, as described in the introduction), how does this limit the appropriateness of the Gumbel model for these data?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

Problem Set 6 questions

Among the Fall 2019 students represented in the first histogram on page 5, thirteen ($x = 13$) out of one hundred and fifty-nine ($n = 159$) students ($\hat{p} = x/n = 13/159 \approx 0.0818$) had never seen temperatures as low as 32 degrees Fahrenheit (i.e. freezing, equal to 0 degrees Celsius). While it is reasonable to expect that some UW students have never experienced freezing temperatures, the question remains open regarding what the population proportion might be ($p$). We can use the sample proportion as a point estimate, $\hat{p}$. However, it is almost certain that $\hat{p}$ is not equal to $p$, i.e. that there is some random sampling error.

To determine plausible values of $p$, we can make use of the concept of a sampling distribution of proportions, which governs the distribution of common, uncommon, and rare values of the sample proportion $\hat{p}$ and relates this distribution to $p$. Without actually knowing $p$, we can make two assumptions that will make our lives easier.

First, we will assume that $x$ (the number of observed successes in a given sample) follows a binomial distribution with a fixed and known sample size $n$ and a fixed but unknown value of the population proportion $p$. The mean and standard deviation of the binomial distribution are

$$\mu_x = np, \qquad \sigma_x = \sqrt{np(1-p)}$$

The sampling distribution of proportions has the same shape as the binomial distribution but is relocated and rescaled to sit over all possible values of the sample proportion $\hat{p} = x/n$ rather than the count of successes $x$. In this case, the mean, variance, and standard deviation are

$$\mu_{\hat{p}} = p \tag{11}$$

$$\sigma^2_{\hat{p}} = \frac{p(1-p)}{n} \tag{12}$$

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \tag{13}$$

Second, under certain conditions we can assume that this sampling distribution is reliably approximated with a normal distribution, using the above mean and standard deviation:

$$\hat{p} \sim N\left(\mu_{\hat{p}} = p,\ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\right)$$
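The rescaling from counts to proportions can be checked numerically. The sketch below is only an illustration under assumed inputs: the value p = 0.10 and the sample size n = 200 are hypothetical, as are the function names. It shows that dividing the binomial mean and standard deviation by n gives the mean and standard deviation of the sample proportion.

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation of the count of successes."""
    return n * p, math.sqrt(n * p * (1 - p))

def proportion_mean_sd(n, p):
    """Mean and standard deviation of the sample proportion."""
    return p, math.sqrt(p * (1 - p) / n)

# Hypothetical values chosen only for illustration.
n, p = 200, 0.10
count_mean, count_sd = binomial_mean_sd(n, p)
prop_mean, prop_sd = proportion_mean_sd(n, p)

print(f"count:      mean = {count_mean:.1f}, sd = {count_sd:.4f}")
print(f"proportion: mean = {prop_mean:.2f}, sd = {prop_sd:.4f}")

# Dividing the count statistics by n recovers the proportion statistics.
print(math.isclose(count_mean / n, prop_mean), math.isclose(count_sd / n, prop_sd))
```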

Now comes the hard part: because we don't know the population proportion $p$, we don't know which sampling distribution is the correct one. Instead, what we will do is identify two sampling distributions for the sample proportion $\hat{p}$: one based on a small value of $p$ (denoted $p_L$) that includes $\hat{p}$ at the upper boundary of its "typical" outcomes, the other based on a large value of $p$ (denoted $p_U$) that includes $\hat{p}$ at the lower boundary of its typical outcomes. If we use $ME$ to stand for the "margin of error," in other words the space between the sample proportion and either of these two hypothetical values of $p$, then we can summarize the relationship between $\hat{p}$ and these two values as $p_L = \hat{p} - ME$ and $p_U = \hat{p} + ME$.

The size of $ME$ depends on two values: a z-score for a particular "confidence level" (denoted $z^*$), and the "standard error" for the sampling distribution (denoted $SE$), according to the equation $ME = z^* \times SE$.

The "confidence level" gives explicit numerical definition to what a typical outcome is. Specifically, the confidence level (denoted $C$) is a chosen proportion of the most probable values of the statistic (in this case, the sample proportion $\hat{p}$). If we assume the standard normal distribution, we can easily find a value of the z-score $z^*$ that corresponds with any chosen value of $C$. For example, if $C = 0.95$, then $z^* = 1.96$, because 95% of the area under the standard normal distribution falls within the central interval $(-1.96, +1.96)$, while 2.5% falls to the left of -1.96 and another 2.5% falls to the right of +1.96. We have to choose what our confidence level is, but some values of $z^*$ for typical confidence levels are $z^* = 1.645$ for 90%, $z^* = 1.960$ for 95%, and $z^* = 2.576$ for 99%.

The "standard error" is a special name that we give to the standard deviation of a normally approximated sampling distribution, in this case $\sigma_{\hat{p}}$ from Equations 12 and 13.
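The z-score for any confidence level can be looked up with Python's standard library rather than a printed table. The sketch below is a minimal illustration; the helper name `z_star` is hypothetical. It relies on the fact that a central area equal to the confidence level leaves half of the remaining area in each tail.

```python
from statistics import NormalDist

def z_star(confidence):
    """z-score bounding the central `confidence` area of the standard normal."""
    # The upper bound sits at cumulative probability 0.5 + confidence / 2.
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {z_star(level):.3f}")
```

For a 95% confidence level the upper bound sits at cumulative probability 0.975, which is why 2.5% of the area lies in each tail.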

Note that if we do not know the population proportion $p$, then we also do not know the standard error $SE = \sqrt{p(1-p)/n}$. However, we will approximate $SE$ using the sample proportion $\hat{p}$ as

$$SE \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

since the size of $SE$ should not vary greatly from this approximation for any value of $p$ between the small and large hypothetical values $p_L$ and $p_U$ if the "success-failure" condition is met. What is the success-failure condition?

$$n\hat{p} \ge 10 \quad \text{and} \quad n(1-\hat{p}) \ge 10$$

In words, the normal approximation of the sampling distribution of proportions with a fixed standard error of $\sqrt{\hat{p}(1-\hat{p})/n}$ is acceptable if and only if there are at least ten successes and ten failures in the sample. If there are either fewer than ten successes or fewer than ten failures in the sample (or fewer than ten of both), then the success-failure condition is violated. Note that this success-failure condition is best-suited to a 95% confidence level.
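The success-failure check and the approximate standard error can be written as two small functions. This is a sketch under assumptions: the function names are hypothetical, and the counts used in the examples (40 or 4 successes out of 200) are made up for illustration rather than taken from the student data.

```python
import math

def success_failure_met(successes, n, minimum=10):
    """True when the sample has at least `minimum` successes and failures."""
    failures = n - successes
    return successes >= minimum and failures >= minimum

def approx_standard_error(successes, n):
    """Standard error with the sample proportion substituted for p."""
    p_hat = successes / n
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical samples: 40 successes out of 200, then 4 successes out of 200.
print(success_failure_met(40, 200))   # enough successes and failures
print(success_failure_met(4, 200))    # too few successes
print(f"SE for 40/200: {approx_standard_error(40, 200):.4f}")
```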

1. Is the success-failure condition met for the Fall 2019 students? Why or why not?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

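If we assume that the success-failure condition is met, the confidence interval for the population proportion $p$ given the sample proportion $\hat{p}$ and the margin of error $ME$ is

$$(\hat{p} - ME,\ \hat{p} + ME)$$

which can be abbreviated as

$$\hat{p} \pm ME$$

and then rewritten as

$$\hat{p} \pm z^* \times SE$$

where $z^*$ is the z-score for the chosen confidence level and $SE$ is the standard error.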
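The interval calculation can be sketched in a few lines of Python. The function name and the example counts (40 successes out of 200) are hypothetical and chosen only for illustration. The sketch assumes the success-failure condition has already been checked and uses the sample proportion in the standard error.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # z-score for the level
    se = math.sqrt(p_hat * (1 - p_hat) / n)          # approximate standard error
    margin = z * se                                  # margin of error
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 40 successes out of 200 observations.
for level in (0.90, 0.95, 0.99):
    low, high = proportion_ci(40, 200, level)
    print(f"{level:.0%}: ({low:.4f}, {high:.4f})")
```

Only the z-score changes between confidence levels, so any difference in interval width comes entirely from that term.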

2. Calculate the 95% confidence interval for $p$ given the values for $\hat{p}$, $n$, and $z^*$ given on page 10. Be sure to write the confidence interval as two numbers, with the smaller number written first, in parentheses, separated by a comma: (lower, upper).

______________________________________________________________________________

3. Calculate the 90% confidence interval. Be sure to write it in the same form that you wrote your answer to question 2.

______________________________________________________________________________

4. Calculate the 99% confidence interval. Be sure to write it in the same form that you wrote your answer to questions 2 and 3.

______________________________________________________________________________

5. What pattern do you notice for confidence intervals as they increase from 90% to 95% to 99% confidence intervals, and what is determining this pattern?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

6. Confidence intervals identify ranges of "plausible" estimates of the values of unknown parameters. A "plausible" value is a value that seems to be consistent with the data we observe. However, in practice confidence intervals sometimes exclude the true value. We call a confidence interval "accurate" if it contains the true value and "inaccurate" if it does not contain the true value. Increasing the size of a confidence interval will make it "more accurate" because it contains a wider range of plausible values for a parameter estimate. However, increasing the size of a confidence interval also makes it less precise, so as accuracy increases, precision decreases. For which confidence level out of 90%, 95%, and 99% is the confidence interval most precise but least accurate? Why? For which confidence level out of 90%, 95%, and 99% is the confidence interval most accurate but least precise? Why?

Most precise but least accurate: ____________________________________________________

Why? ________________________________________________________________________

Most accurate but least precise: ____________________________________________________

Why? ________________________________________________________________________

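7. [2 points] If we assume the Gumbel model with the fitted location and scale parameters $\hat{\alpha}$ and $\hat{\beta}$, the probability that a person drawn at random has never experienced temperatures lower than 32 degrees Fahrenheit is

$$P(X \ge 32) = \exp\left[-\exp\left(\frac{32 - \hat{\alpha}}{\hat{\beta}}\right)\right]$$

Is this value considered a "plausible" value for the population proportion $p$ according to each of the confidence intervals you calculated above? Why or why not?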

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

8. [2 points] Let's hypothesize a specific value $p_0$ of the population proportion $p$ for the fall students. This will be our "null hypothesis." We can identify a normal approximation of the sampling distribution of proportions assuming this hypothetical value, using Equations 11 through 13 above. The "typical" (95% most probable) range of sample proportions assuming this value and the sample size $n$ given on page 11 is

$$p_0 \pm 1.96\sqrt{\frac{p_0(1-p_0)}{n}}$$

Is the sample proportion given on page 11 among the typical values of sample proportions, according to this calculation?

__________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
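The "typical range under a hypothesized proportion" idea in Question 8 can be sketched as below. Everything numeric here is an assumption for illustration: the null value 0.15, the sample size 200, and the observed proportions are hypothetical, as are the function names. Note that the standard error uses the hypothesized proportion, not the sample proportion.

```python
import math

def typical_range(p_null, n, z=1.96):
    """Central 95% range of sample proportions if p_null were true."""
    se = math.sqrt(p_null * (1 - p_null) / n)  # SE under the null value
    return p_null - z * se, p_null + z * se

def is_typical(p_hat, p_null, n, z=1.96):
    """True when the observed proportion falls inside the typical range."""
    low, high = typical_range(p_null, n, z)
    return low <= p_hat <= high

# Hypothetical null value and sample size.
low, high = typical_range(p_null=0.15, n=200)
print(f"typical range: ({low:.4f}, {high:.4f})")
print(is_typical(0.12, 0.15, 200), is_typical(0.05, 0.15, 200))
```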

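[1] For those who are curious, the cumulative distribution function (i.e. the "area under the curve to the left of x") for location parameter $\alpha$ and scale parameter $\beta$ is

$$F(x) = 1 - \exp\left[-\exp\left(\frac{x-\alpha}{\beta}\right)\right]$$

If $F(x)$ is set to 0.5 (corresponding with the median) and the above equation is rearranged to solve for $x$ and simplified, the median is identified as

$$m = \alpha + \beta\ln(\ln 2)$$

More generally, any value of x corresponding with a cumulative probability $P$ may be calculated as follows (the 1st quartile is $x_{0.25}$ while the 3rd quartile is $x_{0.75}$)

$$x_P = \alpha + \beta\ln\left[-\ln(1-P)\right]$$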
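The rearrangement described in this footnote can be verified numerically. The sketch below assumes the left-skewed (minimum) form of the Gumbel distribution; the function names and the location and scale values (17 and 15) are hypothetical placeholders, not the fitted estimates. It confirms that the quantile function undoes the cumulative distribution function.

```python
import math

def gumbel_min_cdf(x, loc, scale):
    """Area under the left-skewed Gumbel curve to the left of x."""
    return 1 - math.exp(-math.exp((x - loc) / scale))

def gumbel_min_quantile(prob, loc, scale):
    """Value of x whose cumulative probability equals `prob` (0 < prob < 1)."""
    return loc + scale * math.log(-math.log(1 - prob))

# Placeholder parameters (assumed for illustration only).
loc, scale = 17.0, 15.0
for prob in (0.25, 0.50, 0.75):
    x = gumbel_min_quantile(prob, loc, scale)
    # Feeding the quantile back into the CDF should return the probability.
    print(f"P = {prob:.2f}: x = {x:.4f}, F(x) = {gumbel_min_cdf(x, loc, scale):.2f}")
```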
