Question: NFL Play-by-play Modeling and Classification
![NFL Play-by-play Modeling and Classification (notebook screenshot)](https://dsd5zvtm8ll6.cloudfront.net/si.experts.images/questions/2024/11/672d9b22a4050_082672d9b2283c15.jpg)




NFL Play-by-play Modeling and Classification

In [10]:

    import os
    import pandas as pd
    import numpy as np
    import seaborn as sb
    import statsmodels.api as sm
    import sklearn.tree as tr
    import sklearn.ensemble as ens
    import sklearn.linear_model as slm
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt
    pd.set_option('display.max_colwidth', None)
    pd.set_option('mode.chained_assignment', None)

In [11]:

    section = 100

In [12]:

    base = "/scratch/stats206s%sw23_class_root/stats206s%sw23_class/materials/data" % (section, section)
    nfl = pd.read_csv(os.path.join(base, "NFL_play_by_play_2022.csv.gz"))
    nfl.shape

Out [12]: (50147, 340)

These data record play-by-play information for all games in the 2022 National Football League (NFL) season. They were downloaded using the nflverse package for the R programming language (another statistics and data science environment), lightly edited, and saved in a tabular format for us to use in Python.

There are many measurements for each play, some of which are computed values from nflverse. Here's a brief list using the data dictionary.

In [13]:

    nfl_data_dictionary = pd.read_csv(os.path.join(base, "NFL_play_by_play_data_dictonary.csv"), index_col = "Field")
    nfl_data_dictionary.loc[["play_id", "game_id", "home_team", "away_team", "posteam", "defteam", "yardline_100", "down", "ydstogo", "touchdown", "play_type"]]

Out [13]:

| Field | Description | Type |
| --- | --- | --- |
| play_id | Numeric play id that when used with game_id and drive provides the unique identifier for a single play. | numeric |
| game_id | Ten digit identifier for NFL game. | character |
| home_team | String abbreviation for the home team. | character |
| away_team | String abbreviation for the away team. | character |
| posteam | String abbreviation for the team with possession. | character |
| defteam | String abbreviation for the team on defense. | character |
| yardline_100 | Numeric distance in the number of yards from the opponent's endzone for the posteam. | numeric |
| down | The down for the given play. | numeric |
| ydstogo | Numeric yards in distance from either the first down marker or the endzone in goal down situations. | numeric |
| touchdown | Binary indicator for if the play resulted in a TD. | numeric |
| play_type | String indicating the type of play: pass (includes sacks), run (includes scrambles), punt, field goal, kickoff, extra point, qb_kneel, qb_spike, no_play (timeouts and penalties), and missing for rows indicating end of play. | character |

Question 1

Part (a)

For this section, we will aggregate the individual plays into games. Use groupby on "game_id" to aggregate the games. Include the following columns:

    {"home_score": "first", "away_score": "first", "week": "first", "home_team": "first", "away_team": "first", "roof": "first", "wind": "median", "temp": "median", "play_id": "size"}

Call the result games.

Create a plot that shows the number of games played each week. The season is composed of a regular season, in which all teams play, and post-season playoffs, in which only some teams play. Using the plot, how many weeks are in a regular season?

In [ ]:

Part (b)

Some people think teams benefit from playing at home. Compute the difference between the home team score and the away team score and store it as a new column (call it "home_away_score"). Plot this new variable. Do you see evidence of this claim?

In [ ]:

Part (c)

Suppose these games represent a sample from all possible games that could have been played in 2022. Let $X$ be the home and away teams' score difference. Test the hypothesis $H_0: E(X) = 0$ against $H_1: E(X) \neq 0$ at the 5% level, or create a 95% confidence interval for $E(X)$. What do you conclude about this hypothesis? Interpret it as evidence for or against the claim of home field advantage.
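As a rough illustration of Parts (a) through (c), the sketch below runs the same kind of aggregation on a tiny made-up play table and builds a large-sample 95% interval for the mean score difference. The table `plays`, the result `toy_games`, and the helper `mean_ci` are hypothetical names, and the numbers are invented rather than taken from the NFL file. The 1.96 multiplier is a normal approximation; with only a handful of games a t quantile would be the more careful choice.

```python
import numpy as np
import pandas as pd

# Made-up play-level rows: three games with a few plays each (not real NFL data).
plays = pd.DataFrame({
    "game_id": ["g1", "g1", "g1", "g2", "g2", "g3"],
    "play_id": [1, 2, 3, 1, 2, 1],
    "week": [1, 1, 1, 1, 1, 2],
    "home_score": [24, 24, 24, 10, 10, 17],
    "away_score": [17, 17, 17, 13, 13, 17],
    "wind": [5.0, 7.0, 9.0, np.nan, np.nan, 3.0],
})

# One row per game: "first" keeps the game-level value, "median" summarizes
# a per-play measurement, and "size" counts the plays in the game.
toy_games = plays.groupby("game_id").agg(
    {"home_score": "first", "away_score": "first", "week": "first",
     "wind": "median", "play_id": "size"}
)
toy_games["home_away_score"] = toy_games["home_score"] - toy_games["away_score"]

# Number of games in each week (the quantity the Part (a) plot displays).
games_per_week = toy_games.groupby("week").size()


def mean_ci(x, z=1.96):
    """Approximate 95% interval for a mean: estimate +/- z * standard error."""
    x = np.asarray(x, dtype=float)
    se = x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - z * se, x.mean() + z * se


low, high = mean_ci(toy_games["home_away_score"])
print(games_per_week)
print(low, high)
```

If the interval excludes 0, the two-sided test at the 5% level rejects $H_0$; if it contains 0, it does not.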
In [ ]:

Part (d)

One theory of home game advantage states that teams that play outdoors in cold weather are acclimated to cold weather, while teams that do not play outdoors will not perform as well in outdoor games. We will ask a slightly simpler question: is the average home and away difference in outdoor games larger than in indoor games?

To do this, we need to identify whether a game is played outdoors. Investigate the "roof" column and create a new column (call it "is_outdoors") that has the value True if the game is played outdoors and False otherwise.

In [ ]:

Use a box plot to explore whether games played outdoors have different home and away score differences than non-outdoor games.

In [ ]:

Part (e)

Perform a difference of means hypothesis test of the hypothesis that the average score difference is the same for both outdoor and non-outdoor games against the alternative that it is different. At the 5% level (or using a 95% confidence interval), what do you conclude?

In [ ]:

Part (f)

Another way to perform this test is to use linear regression. If we write

$$E(Y \mid X = x) = a + bx$$

then the difference of means is

$$E(Y \mid X = 1) - E(Y \mid X = 0) = (a + b) - (a + b \cdot 0) = b$$

The hypothesis test will use a slightly different standard error calculation, but it will still be a valid way to test this hypothesis or get confidence intervals.

Use sm.OLS to perform a linear regression of "home_away_score" on "is_outdoors". You will need to convert the "is_outdoors" variable to a numeric version first. This can be done by using .astype('int') to create a new column of 0 and 1 values. Display the confidence intervals for each coefficient. For the is_outdoors coefficient, what do you see?

In [5]:

    # quick example
    ti = pd.Series([True, False, False, True])
    ti.astype("int")

Out [5]:

    0    1
    1    0
    2    0
    3    1
    dtype: int64

In [ ]:

Part (g)

If our theory is that outdoor games help the home team because of the weather, perhaps we can use measured temperature and wind to see whether decreasing temperature and increasing wind increase the home team's score over the away team's.

You will notice that there is some amount of missingness in the "temp" and "wind" columns. Create a new column that tracks whether either is missing for each game. Compute the conditional probability of missing either of these measurements for the different "roof" categories. What do you notice?

In [ ]:

Part (h)

Imputation is the process of filling in missing values with reasonable guesses. In this case, since we are missing measurements for the non-outdoor games, we will assume that the wind is zero. For the temperature, we will assume that most indoor games are warmer than outdoor games, but perhaps not all, and use the 90th quantile of the observed "temp" as our imputation value. The .fillna(VALUE, inplace = True) method can be used to update our table with the imputed values.

Create a scatter plot of the home and away score difference against wind and temperature, with temperature on the x-axis and wind as the size of the dot. Does this plot support the idea that temperature is an influential factor in home vs. away scores?

In [ ]:

Part (i)

Perform a multiple linear regression using "is_outdoors" (converted to 0 and 1), "wind", and "temp". Print out the parameters and 95% confidence intervals. For each factor, holding the others constant, would we reject the hypothesis that the conditional mean of the score difference is independent of the factor?
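The ideas in Parts (f) through (h) can be tried on a small made-up table before touching the real data. The sketch below assumes hypothetical roof labels ("outdoors", "dome", "closed") and invented scores; the actual categories in the "roof" column should be inspected as Part (d) asks. It uses plain least squares from NumPy instead of sm.OLS, purely to check that the slope on a 0/1 regressor matches the difference in group means.

```python
import numpy as np
import pandas as pd

# Made-up games (not real NFL data); the roof labels are assumed for illustration.
toy = pd.DataFrame({
    "roof": ["outdoors", "outdoors", "dome", "dome", "outdoors", "closed"],
    "temp": [30.0, 55.0, np.nan, np.nan, 70.0, np.nan],
    "wind": [12.0, 4.0, np.nan, np.nan, np.nan, np.nan],
    "home_away_score": [10, -3, 7, 1, 4, -6],
})

# Part (g): flag games missing either measurement, then average the flag
# within each roof category to estimate P(missing | roof).
toy["weather_missing"] = toy["temp"].isna() | toy["wind"].isna()
p_missing = toy.groupby("roof")["weather_missing"].mean()

# Part (h): impute wind with 0 and temp with the 90th quantile of observed temps.
temp_fill = toy["temp"].quantile(0.9)
toy["wind"] = toy["wind"].fillna(0.0)
toy["temp"] = toy["temp"].fillna(temp_fill)

# Part (f): with a 0/1 regressor, the least-squares slope b equals the
# difference between the two group means.
toy["is_outdoors"] = (toy["roof"] == "outdoors").astype("int")
X = np.column_stack([np.ones(len(toy)), toy["is_outdoors"]])
y = toy["home_away_score"].to_numpy(dtype=float)
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

by_group = toy.groupby("is_outdoors")["home_away_score"].mean()
diff_of_means = by_group.loc[1] - by_group.loc[0]

print(p_missing)
print(temp_fill, b, diff_of_means)
```

Assigning the result of fillna back to the column is used here in place of inplace = True; both approaches leave the table with the imputed values.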
In [ ]:

Part (j)

Review the results in this section. Write up one paragraph summarizing the results. What have you learned about home field advantage?

Question 2

Part (a)

In the previous question we focused on the difference in scores between the home and away team, but you only need to win by one point to claim victory. Create a new column called 'home_win' in the games table that is true if the home team won. We will be using it with several routines that require numeric values, so use .astype('int') on it right away.

Estimate the conditional probability of a home team win and create a 95% confidence interval for the population proportion (recall, the approximate standard error will be $\sqrt{\hat{p}(1 - \hat{p})/n}$, where $\hat{p}$ is the estimated conditional probability and $n$ is the number of games). Interpret the result as it pertains to the question of home field advantage.

In [ ]:

Part (b)

One aspect of this analysis we haven't taken into account is the particular abilities of some teams versus others. To use the teams in the analysis, we need to replace the single column of team names for the home team with a set of binary columns, one for each team. This way of encoding variables is sometimes called "dummy encoding" or "one-hot encoding". Here's an example:

In [6]:

    ex = pd.DataFrame({"letter": ['a', 'c', 'b', 'b', 'c'], "number": [8, 7, -2, 3, 5]})
    pd.get_dummies(ex)

Out [6]: (table with columns number, letter_a, letter_b, letter_c; each letter_* column is a 0/1 indicator for that letter)

Create a table with the indicator of whether the home team won, the week, the wind, the temp, whether the game was outdoors, and the home and away team names. Use pd.get_dummies() to turn the home and away team names into dummy encoded columns. Call this table full_season.
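A minimal sketch of both steps in this question, on invented data: the team labels ("AAA", "BBB", "CCC") and scores are made up, `toy` and `encoded` are hypothetical names, and a tie is counted as "not a home win" only because of how the comparison is written. Passing dtype=int to pd.get_dummies is a choice made here so the indicators come out as 0 and 1; recent pandas versions return booleans by default.

```python
import numpy as np
import pandas as pd

# Made-up games with fictitious teams (not real NFL results).
toy = pd.DataFrame({
    "home_team": ["AAA", "BBB", "AAA", "CCC", "BBB", "CCC", "AAA", "BBB"],
    "away_team": ["BBB", "CCC", "CCC", "AAA", "AAA", "BBB", "BBB", "CCC"],
    "home_score": [21, 14, 30, 10, 17, 24, 20, 13],
    "away_score": [14, 20, 27, 10, 3, 21, 23, 6],
})

# Part (a): 1 if the home team won, 0 otherwise.
toy["home_win"] = (toy["home_score"] > toy["away_score"]).astype("int")

# Estimated proportion, its approximate standard error, and a 95% interval.
p_hat = toy["home_win"].mean()
n = len(toy)
se = np.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Part (b): one indicator column per home team and per away team.
encoded = pd.get_dummies(
    toy[["home_win", "home_team", "away_team"]],
    columns=["home_team", "away_team"],
    dtype=int,
)

print(p_hat, ci)
print(encoded.columns.tolist())
```

Comparing the interval with 0.5 is one way to read it: an interval lying entirely above 0.5 would point toward a home field advantage, while one that contains 0.5 would not settle the question.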
In [ ]:

Now create a table that limits the rows to the games before the Super Bowl (i.e., full_season["week"] ...)

Notebook: homework05 (last checkpoint 04/12/2023)

Part (d)

There are several tuning parameters we could try to see if we get better performing trees. We will focus on the max_depth option, which limits how deep the tree can grow (the maximum number of successive splits between the root and a leaf). Here are several values to try. For each, create a tree using the training data and score it using the test data (hint: write a list comprehension that can fit and score in a single chained method call). Use a line plot to graph the scores versus the depths.

In [8]:

    depths = range(1, 50)
    scores = [d in depths]  # change this line

    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    /tmp/ipykernel_763833/2163036997.py in
          1 depths = range(1, 50)
    ----> 2 scores = [d in depths]  # change this line

    NameError: name 'd' is not defined

Also find the particular max depth value that has the best score. Here's a little helper code.

In [ ]:

    # uncomment when you are ready
    # best_depth = depths[np.argmax(scores)]
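The hint about fitting and scoring in one chained call can be sketched as follows. This is an illustration on synthetic data, not the homework's training and test tables: the arrays `X` and `y`, the split sizes, the shorter depth range, and the fixed random_state values are all assumptions made so the example is self-contained and repeatable.

```python
import numpy as np
import sklearn.tree as tr
from sklearn.model_selection import train_test_split

# Synthetic classification data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

depths = range(1, 11)

# For each depth: build a tree, fit it on the training data, and score it
# on the test data, all in one chained expression. fit() returns the fitted
# estimator, which is what makes the chaining possible.
scores = [
    tr.DecisionTreeClassifier(max_depth=d, random_state=0)
    .fit(X_train, y_train)
    .score(X_test, y_test)
    for d in depths
]

# Same pattern as the helper code: index of the best score mapped back to a depth.
best_depth = depths[np.argmax(scores)]
print(best_depth, max(scores))
```

np.argmax returns the first position of the maximum, so when several depths tie, the shallowest of them is reported.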
