Question:

For this section, we will aggregate the individual plays into games.

Use group by on "game_id" to aggregate the games. Include the following columns:

{"home_score": "first", "away_score": "first", "week": "first", "home_team": "first", "away_team": "first", "roof": "first", "wind": "median", "temp": "median", "play_id": "size"}

Call the result "games". Create a plot that shows the number of games played each week. The season is composed of a regular season, in which all teams play, and postseason playoffs, in which only some teams play. Using the plot, how many weeks are in a regular season?
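A minimal sketch of the aggregation step, assuming a play-by-play table named `plays` with one row per play. The toy rows, team codes, and the name `games_per_week` are hypothetical and only illustrate the mechanics; the real table would be loaded from the course data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy play-by-play table (one row per play).
plays = pd.DataFrame({
    "game_id": ["g1", "g1", "g1", "g2", "g2", "g3"],
    "play_id": [1, 2, 3, 1, 2, 1],
    "home_score": [24, 24, 24, 10, 10, 31],
    "away_score": [17, 17, 17, 13, 13, 28],
    "week": [1, 1, 1, 1, 1, 2],
    "home_team": ["BUF", "BUF", "BUF", "KC", "KC", "DAL"],
    "away_team": ["NYJ", "NYJ", "NYJ", "LV", "LV", "PHI"],
    "roof": ["outdoors"] * 5 + ["closed"],
    "wind": [10.0, 12.0, 8.0, 5.0, 7.0, None],
    "temp": [40.0, 41.0, 39.0, 60.0, 62.0, None],
})

# Aggregation spec copied from the question text.
agg_spec = {
    "home_score": "first", "away_score": "first", "week": "first",
    "home_team": "first", "away_team": "first", "roof": "first",
    "wind": "median", "temp": "median", "play_id": "size",
}

# One row per game; "play_id" now holds the number of plays in the game.
games = plays.groupby("game_id").agg(agg_spec)

# Number of games in each week, drawn as a bar chart.
games_per_week = games.groupby("week").size()
ax = games_per_week.plot(kind="bar")
ax.set_xlabel("week")
ax.set_ylabel("number of games")
```

On the full data set, the weeks where the bar height drops sharply would mark the start of the playoffs.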

b. Some people think teams benefit from playing at home. Compute the difference between the home team score and the away team score and store it as a column (call it "home_away_score").

Plot this new variable. Do you see evidence of this claim?
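As a rough sketch of this step, the score difference is a column subtraction. The four games below are made-up values, and `share_home_wins` is a hypothetical helper name used only for illustration.

```python
import pandas as pd

# Hypothetical game-level scores (not real results).
games = pd.DataFrame({
    "home_score": [24, 10, 31, 20],
    "away_score": [17, 13, 28, 20],
})

# Positive values mean the home team outscored the away team.
games["home_away_score"] = games["home_score"] - games["away_score"]

# Numeric summaries that complement a histogram of the new column.
summary = games["home_away_score"].describe()
share_home_wins = (games["home_away_score"] > 0).mean()
print(summary)
print(share_home_wins)
```

On the real table, `games["home_away_score"].hist()` would draw the distribution; a center to the right of zero would be consistent with a home advantage.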

c. Suppose these games represent a sample from all possible games that could have been played in 2022. Let $D$ be the home and away teams' score difference. Test the hypothesis:

$H_0: E(D) = 0$ against $H_1: E(D) \neq 0$

at the 5% level or create a 95% confidence interval for $E(D)$. What do you conclude about this hypothesis? Interpret it as evidence for or against the claim of home field advantage.
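One possible way to carry out this test is a one-sample t-test on the score differences. The sketch below uses ten made-up differences in place of the real "home_away_score" column and computes the statistic by hand before cross-checking it against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical score differences standing in for games["home_away_score"].
diff = np.array([7, -3, 3, 0, 14, -10, 3, 6, -7, 4], dtype=float)

n = diff.size
mean = diff.mean()
se = diff.std(ddof=1) / np.sqrt(n)  # standard error of the sample mean

# Two-sided test of a zero mean difference.
t_stat = mean / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# 95% confidence interval for the mean difference.
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * se, mean + t_crit * se)

# Cross-check against scipy's built-in one-sample t-test.
res = stats.ttest_1samp(diff, popmean=0.0)
print(t_stat, p_value, ci)
```

The two routes agree: the null hypothesis is rejected at the 5% level exactly when the 95% interval excludes zero.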

Part (d)

One theory of home game advantage states that teams that play outdoors in cold weather are acclimated to cold weather, while teams that do not play outdoors will not perform as well in outdoor games.

We will ask a slightly simpler question and ask if the average home and away difference in outdoor games is larger than in indoor games.

To do this, identify whether each game is played outdoors. Investigate the "roof" column and create a column (call it "is_outdoors") that has the value True if the game is played outdoors and False otherwise.

Use a box plot to explore whether games played outdoors have different home and away score differences than non-outdoor games.
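A sketch of how the flag and box plot could be built. The roof labels shown here ("outdoors", "open", "dome", "closed") are assumed values, so check them against `value_counts()` on the real column. Treating an open retractable roof as outdoors is also an assumption, not something the question specifies.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical games; the roof labels are assumed, not taken from the data.
games = pd.DataFrame({
    "roof": ["outdoors", "dome", "closed", "open", "outdoors", "dome"],
    "home_away_score": [7, -3, 3, 0, 14, -10],
})

# Inspect which categories actually occur before deciding on a rule.
print(games["roof"].value_counts())

# Assumption: "outdoors" and "open" both count as outdoor games.
outdoor_labels = {"outdoors", "open"}
games["is_outdoors"] = games["roof"].isin(outdoor_labels)

# Side-by-side box plots of the score difference for the two groups.
ax = games.boxplot(column="home_away_score", by="is_outdoors")
ax.set_ylabel("home score minus away score")
```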

Part (e)

Perform a difference of means hypothesis test to test the hypothesis that the average score difference is the same for outdoor and non-outdoor games against the alternative that it is different. At the 5% level (or using a 95% confidence interval), what do you conclude?
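One way to sketch this test is Welch's two-sample t-test, which does not assume equal variances in the two groups. That choice is an assumption here, since the question does not say which variant to use, and the two arrays below are made-up score differences.

```python
import numpy as np
from scipy import stats

# Hypothetical score differences, split by the is_outdoors flag.
outdoor = np.array([7, 0, 14, 3, 10, -4], dtype=float)
indoor = np.array([-3, 3, -10, 6, 1], dtype=float)

# Welch's t-test (equal_var=False) for a difference in means.
res = stats.ttest_ind(outdoor, indoor, equal_var=False)

# The same quantities by hand, plus a 95% confidence interval.
diff_means = outdoor.mean() - indoor.mean()
v1 = outdoor.var(ddof=1) / outdoor.size
v2 = indoor.var(ddof=1) / indoor.size
se = np.sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom.
dof = (v1 + v2) ** 2 / (v1 ** 2 / (outdoor.size - 1) + v2 ** 2 / (indoor.size - 1))
t_crit = stats.t.ppf(0.975, dof)
ci = (diff_means - t_crit * se, diff_means + t_crit * se)
print(res.statistic, res.pvalue, ci)
```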

Part (f)

Another way to perform this test is to use linear regression. If we write:

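$E(Y \mid X = x) = \alpha + \beta x$

with $Y$ the score difference and $X$ the 0/1 outdoors indicator, then the difference of means between $X = 1$ and $X = 0$ is

$E(Y \mid X = 1) - E(Y \mid X = 0) = (\alpha + \beta) - (\alpha + 0) = \beta$.

The hypothesis test will use a slightly different standard error calculation, but it will still be a valid way to test this hypothesis or get confidence intervals.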

Use sm.OLS to perform a linear regression of "home_away_score" on "is_outdoors". You will need to convert the "is_outdoors" variable to a numeric 1/0 version first. This can be done by using .astype('int') to create a column of 0 and 1 values.

Display the confidence intervals for each coefficient. For the is_outdoors coefficient, what do you see?

## quick example
import pandas as pd
tf = pd.Series([True, False, False, True])
tf.astype("int")

Part (g)

If our theory is that outdoor games help the home team because of the weather, perhaps we can use measured temperature and wind to see if decreasing temperature and increasing wind increase the home team's score over the away team's.

You will notice that there is some amount of missingness in the "temp" and "wind" columns. Create a column that tracks whether either is missing for each game.

Compute the conditional probability of missing either of these measurements for the different "roof" categories. What do you notice?
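A small sketch of the missingness check, assuming the column name "weather_missing" (hypothetical) and made-up rows. The mean of a boolean column within each group is the conditional proportion of True values, which estimates the conditional probability.

```python
import numpy as np
import pandas as pd

# Hypothetical games; roof labels and weather values are made up.
games = pd.DataFrame({
    "roof": ["outdoors", "outdoors", "dome", "dome", "closed", "open"],
    "temp": [40.0, 55.0, np.nan, np.nan, np.nan, 70.0],
    "wind": [10.0, np.nan, np.nan, np.nan, np.nan, 5.0],
})

# True when either measurement is missing for the game.
games["weather_missing"] = games["temp"].isna() | games["wind"].isna()

# Estimated P(missing | roof): group mean of the boolean indicator.
p_missing_by_roof = games.groupby("roof")["weather_missing"].mean()
print(p_missing_by_roof)
```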

Part (h)

Imputation is the process of filling in missing values with reasonable guesses. In this case, since we are missing measurements for the non-outdoor games, we will assume that the wind is zero. For the temperature, we will assume that most, but perhaps not all, indoor games are warmer than outdoor games, and use the 90th percentile (0.9 quantile) of the observed "temp" as our imputation value.

The .fillna(VALUE, inplace=True) method can be used to update our table with the imputed values.

Create a scatter plot of the home and away score difference against wind and temperature, with temperature on the x-axis and wind as the size of the dot. Does this plot support the idea that temperature is an influential factor in home vs. away scores?
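The imputation and plot could be sketched as follows, using made-up values. Here `fillna` is called on the whole DataFrame with a dictionary of per-column fill values, which is one way to apply the in-place update. The marker-size scaling (`10 + 10 * wind`) is an arbitrary choice so that zero-wind games remain visible.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical games; the last two rows mimic indoor games with no weather data.
games = pd.DataFrame({
    "temp": [30.0, 40.0, 50.0, 60.0, 70.0, np.nan, np.nan],
    "wind": [15.0, 10.0, 5.0, 8.0, 2.0, np.nan, np.nan],
    "home_away_score": [7, -3, 3, 0, 14, -10, 3],
})

# 0.9 quantile of the observed temperatures (NaN values are skipped).
temp_fill = games["temp"].quantile(0.9)

# Fill wind with 0 and temp with the quantile, updating the table in place.
games.fillna({"wind": 0.0, "temp": temp_fill}, inplace=True)

# Temperature on the x-axis, score difference on the y-axis, wind as dot size.
fig, ax = plt.subplots()
ax.scatter(games["temp"], games["home_away_score"], s=10 + 10 * games["wind"])
ax.set_xlabel("temp")
ax.set_ylabel("home score minus away score")
```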

Part (i)

Perform a multiple linear regression using "is_outdoors" (converted to 0 and 1), "wind", and "temp". Print out the parameters and 95% confidence intervals.

For each factor, holding the others constant, would we reject the hypothesis that the conditional mean of the score difference is independent of the factor?

Part (j)

Review the results in this section. Write up one paragraph summarizing the results. What have we learned about home field advantage?
