Question: NFL Play-by-play Modeling and Classification

In [10]:
    import os
    import pandas as pd
    import numpy as np
    import seaborn as sb
    import statsmodels.api as sm
    import sklearn.tree as tr
    import sklearn.ensemble as ens
    import sklearn.linear_model as slm
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt

    pd.set_option('display.max_colwidth', None)
    pd.set_option('mode.chained_assignment', None)

In [11]:
    section = 100

In [12]:
    base = "/scratch/stats206s%sw23_class_root/stats206s%sw23_class/materials/data" % (section, section)
    nfl = pd.read_csv(os.path.join(base, "NFL_play_by_play_2022.csv.gz"))
    nfl.shape

Out[12]:
    (50147, 340)

These data record play-by-play information for all games in the 2022 National Football League (NFL) season. They were downloaded using the nflverse package for the R programming language (another statistics and data science environment), lightly edited, and saved in a tabular format for us to use in Python.

There are many measurements for each play, some of which are values computed by nflverse. Here's a brief list using the data dictionary.

In [13]:
    nfl_data_dictionary = pd.read_csv(os.path.join(base, "NFL_play_by_play_data_dictonary.csv"), index_col = "Field")
    nfl_data_dictionary.loc[["play_id", "game_id", "home_team", "away_team", "posteam", "defteam",
                             "yardline_100", "down", "ydstogo", "touchdown", "play_type"]]

Out[13]:

| Field | Description | Type |
|---|---|---|
| play_id | Numeric play id that when used with game_id and drive provides the unique identifier for a single play. | numeric |
| game_id | Ten digit identifier for NFL game. | character |
| home_team | String abbreviation for the home team. | character |
| away_team | String abbreviation for the away team. | character |
| posteam | String abbreviation for the team with possession. | character |
| defteam | String abbreviation for the team on defense. | character |
| yardline_100 | Numeric distance in the number of yards from the opponent's endzone for the posteam. | numeric |
| down | The down for the given play. | numeric |
| ydstogo | Numeric yards in distance from either the first down marker or the endzone in goal down situations. | numeric |
| touchdown | Binary indicator for if the play resulted in a TD. | numeric |
| play_type | String indicating the type of play: pass (includes sacks), run (includes scrambles), punt, field goal, kickoff, extra point, qb_kneel, qb_spike, no_play (timeouts and penalties), and missing for rows indicating end of play. | character |

Question 1

Part (a)

For this section, we will aggregate the individual plays into games. Use group by on "game_id" to aggregate the games. Include the following columns:

    {"home_score": "first",
     "away_score": "first",
     "week": "first",
     "home_team": "first",
     "away_team": "first",
     "roof": "first",
     "wind": "median",
     "temp": "median",
     "play_id": "size"}

Call the result games. Demonstrate using a plot that shows the number of games played each week. The season is composed of a regular season in which all teams play and post-season playoffs in which only some teams play. Using the plot, how many weeks are in a regular season?

In [ ]:

Part (b)

Some people think teams benefit from playing at home. Compute the difference between the home team score and the away team score and store it as a new column (call it "home_away_score"). Plot this new variable. Do you see evidence of this claim?

In [ ]:

Part (c)

Suppose these games represent a sample from all possible games that could have been played in 2022. Let X be the home and away teams' score difference. Test the hypothesis $H_0: E(X) = 0$ against $H_1: E(X) \neq 0$ at the 5% level, or create a 95% confidence interval for $E(X)$. What do you conclude about this hypothesis? Interpret it as evidence for or against the claim of home field advantage.
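The steps in Parts (a) through (c) can be sketched on a small made-up table. This is only an illustration under stated assumptions: the plays table below is hypothetical (it is not the 2022 data), only a subset of the requested columns is aggregated, and the interval uses the normal critical value 1.96 rather than a t quantile.

```python
import numpy as np
import pandas as pd

# Hypothetical toy play-by-play table (NOT the real 2022 data): one row per play.
plays = pd.DataFrame({
    "game_id":    ["g1", "g1", "g1", "g2", "g2", "g3", "g3", "g3", "g4", "g4"],
    "play_id":    [1, 2, 3, 1, 2, 1, 2, 3, 1, 2],
    "week":       [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "home_score": [24, 24, 24, 17, 17, 31, 31, 31, 10, 10],
    "away_score": [20, 20, 20, 21, 21, 14, 14, 14, 13, 13],
})

# One row per game. Game-level values repeat on every play, so "first" is enough;
# "size" applied to play_id counts the plays recorded for each game.
games = plays.groupby("game_id").agg(
    {"home_score": "first", "away_score": "first", "week": "first", "play_id": "size"}
)

# Part (b): home minus away score, and the number of games in each week.
games["home_away_score"] = games["home_score"] - games["away_score"]
games_per_week = games.groupby("week").size()

# Part (c): approximate 95% confidence interval for the mean score difference.
x = games["home_away_score"]
n = len(x)
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)   # estimated standard error of the mean
z = 1.96                          # assumption: normal approximation
ci = (mean - z * se, mean + z * se)

# H0: E(X) = 0 is rejected at the 5% level only if 0 lies outside the interval.
reject_h0 = not (ci[0] <= 0 <= ci[1])
```

With only four toy games the normal approximation is crude; on the real table, which has several hundred games, it is much more reasonable, and a t critical value could be substituted for 1.96.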
In [ ]:

Part (d)

One theory of home game advantage states that teams that play outdoors in cold weather are acclimated to cold weather, while teams that do not play outdoors will not perform as well in outdoor games. We will ask a slightly simpler question: is the average home and away difference in outdoor games larger than in indoor games?

To do this, we need to identify whether a game is played outdoors. Investigate the "roof" column and create a new column (call it "is_outdoors") that has the value True if the game is played outdoors and False otherwise.

In [ ]:

Use a box plot to explore whether games played outdoors have different home and away score differences than non-outdoor games.

In [ ]:

Part (e)

Perform a difference of means hypothesis test to test the hypothesis that the average score difference is the same for both outdoors and non-outdoors games against the alternative that it is different. At the 5% level (or using a 95% confidence interval), what do you conclude?

In [ ]:

Part (f)

Another way to perform this test is to use linear regression. If we write

$$E(Y \mid X = x) = a + bx$$

then the difference of means is

$$E(Y \mid X = 1) - E(Y \mid X = 0) = (a + b) - (a + b \cdot 0) = b$$

The hypothesis test will use a slightly different standard error calculation, but it will still be a valid way to test this hypothesis or get confidence intervals.

Use sm.OLS to perform a linear regression of "home_away_score" on "is_outdoors". You will need to convert the "is_outdoors" variable to a numeric version first. This can be done by using .astype('int') to create a new column of 0 and 1 values. Display the confidence intervals for each coefficient. For the is_outdoors coefficient, what do you see?

In [5]:
    # quick example
    ti = pd.Series([True, False, False, True])
    ti.astype("int")

Out[5]:
    0    1
    1    0
    2    0
    3    1
    dtype: int64

In [ ]:

Part (g)

If our theory is that outdoor games help the home team because of the weather, perhaps we can use measured temperature and wind to see if decreasing temperature and increasing wind increase the home team's score over the away team.

You will notice that there is some amount of missingness in the "temp" and "wind" columns. Create a new column that tracks whether either is missing for each game. Compute the conditional probability of missing either of these measurements for the different "roof" categories. What do you notice?

In [ ]:

Part (h)

Imputation is the process of filling in missing values with reasonable guesses. In this case, since we are missing measurements for the non-outdoors games, we will assume that the wind is zero. For the temperature, we will assume that most indoor games are warmer than outdoor games, but perhaps not all, and use the 90th quantile of the observed "temp" as our imputation value. The .fillna(VALUE, inplace = True) method can be used to update our table with the imputed values.

Create a scatter plot of the home and away score difference with wind and temperature, with temperature as the x-axis and wind as the size of the dot. Does this plot support the idea that temperature is an influential factor in home vs. away scores?

In [ ]:

Part (i)

Perform a multiple linear regression using "is_outdoors" (converted to 0 and 1), "wind", and "temp". Print out the parameters and 95% confidence intervals. For each factor, holding the others constant, would we reject the hypothesis that the conditional mean of the score difference is independent of the factor?
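The missingness check in Part (g) and the imputation rule in Part (h) can be sketched as follows. The table is a hypothetical stand-in for games, and the roof labels ("outdoors", "dome", "closed") are assumed for illustration; the real column should be inspected before relying on them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy games table (NOT the real data); roof labels are assumed.
games = pd.DataFrame({
    "roof": ["outdoors", "outdoors", "outdoors", "dome", "dome", "closed"],
    "temp": [35.0, 60.0, 75.0, np.nan, np.nan, np.nan],
    "wind": [12.0, 5.0, 3.0, np.nan, np.nan, np.nan],
})

# Part (g): True when either measurement is missing for a game.
games["weather_missing"] = games["temp"].isna() | games["wind"].isna()

# The mean of a boolean column within each group estimates
# P(missing | roof category).
p_missing_by_roof = games.groupby("roof")["weather_missing"].mean()

# Part (h): impute wind with 0 and temp with the 90th quantile of observed temps.
temp_fill = games["temp"].quantile(0.9)   # quantile() ignores NaN values
games["wind"] = games["wind"].fillna(0)
games["temp"] = games["temp"].fillna(temp_fill)
```

Assigning the filled column back is used here as an alternative to calling fillna with inplace = True on a selected column, which newer pandas releases may warn about; either form expresses the same imputation.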
In [ ]:

Part (j)

Review the results in this section. Write up one paragraph summarizing the results. What have we learned about home field advantage?

Question 2

Part (a)

In the previous question we focused on the difference in scores between the home and away team, but you only need to win by one point to claim victory. Create a new column called 'home_win' in the games table that is true if the home team won. We will be using it with several routines that require numeric values, so use .astype('int') on it right away.

Estimate the conditional probability of a home team win and create a 95% confidence interval for the population proportion (recall, the approximate standard error will be $\sqrt{p(1 - p)/n}$, where p is the estimated conditional probability and n is the number of games). Interpret the result as it pertains to the question of home field advantage.

In [ ]:

Part (b)

One aspect of this analysis we haven't taken into account is the particular abilities of some teams versus others. To use the teams in the analysis, we need to replace the single column of team names for the home team with a set of binary columns, one for each team. This way of encoding variables is sometimes called "dummy encoding" or "one-hot encoding". Here's an example:

In [6]:
    ex = pd.DataFrame({"letter": ['a', 'c', 'b', ..., 'b', 'c'], "number": [8, 7, -2, 3, 5, 9]})
    pd.get_dummies(ex)

Out[6]:
    number  letter_a  letter_b  letter_c
    ...

Create a table with the indicator if the home team won, the week, the wind, the temp, whether the game was outdoors, and the home and away team names. Use pd.get_dummies() to turn the home and away team names into dummy encoded columns. Call this table full_season.
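As a rough illustration of Parts (a) and (b), the sketch below builds a win indicator, an approximate 95% interval for the home win proportion, and dummy-encoded team columns. The five games are invented, the wind, temp, and is_outdoors columns are left out for brevity, and a tie is counted as "not a home win", which is an assumption rather than something the assignment specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy games table (NOT the real data), indexed by a made-up game id.
games = pd.DataFrame(
    {
        "home_team": ["KC", "PHI", "KC", "BUF", "PHI"],
        "away_team": ["PHI", "BUF", "BUF", "KC", "KC"],
        "home_away_score": [3, -7, 10, 4, -1],
        "week": [1, 2, 3, 4, 5],
    },
    index=["g1", "g2", "g3", "g4", "g5"],
)

# 1 if the home team outscored the away team, else 0 (ties count as 0 here).
games["home_win"] = (games["home_away_score"] > 0).astype("int")

# Estimated proportion of home wins and its approximate standard error
# sqrt(p (1 - p) / n); 1.96 is the normal critical value for 95% coverage.
p_hat = games["home_win"].mean()
n = len(games)
se = np.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# One 0/1 column per home team and per away team; other columns pass through.
full_season = pd.get_dummies(
    games[["home_win", "week", "home_team", "away_team"]],
    columns=["home_team", "away_team"],
    dtype=int,
)
```

The encoded column names follow the pattern home_team_KC and away_team_PHI, which is the naming the Super Bowl setup code later in the notebook relies on. With so few toy games the interval is very wide and can spill outside [0, 1]; the approximation is intended for samples the size of a full season.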
In [ ]:

Now create a table that limits the rows to the games before the Super Bowl (i.e., full_season["week"] [...]

Part (d)

There are several tuning parameters we could try to see if we get better performing trees. We will focus on the max_depth option, which controls how many splits are allowed in the tree. Here are several values to try. For each, create a tree using the training data and score it using the test data (hint: write a list comprehension that can fit and score in a single chained method call). Use a lineplot to graph the scores versus the depths.

In [8]:
    depths = range(1, 50)
    scores = [d in depths] # change this line

    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    /tmp/ipykernel_763833/2163036997.py in <module>
          1 depths = range(1, 50)
    ----> 2 scores = [d in depths] # change this line

    NameError: name 'd' is not defined

Also find the particular max depth value that has the best score. Here's a little helper code.

In [ ]:
    # uncomment when you are ready
    # best_depth = depths[np.argmax(scores)]

Part (d)

Fit a tree on the full X and pre_super["home_win"] using the best depth from the previous part.

In [ ]:

Provided you have created the full_season table as specified and have the column home_win recording if the home team won, the following code will set up two versions of the 2022 season final game between the Kansas City Chiefs and the Philadelphia Eagles. (In the data PHI is marked as home even though the game was played at a neutral site, so we will try it both ways, as if each team were home.)

In [ ]:
    # superbowl_kc_at_phi = full_season.loc[["2022_22_KC_PHI"]].drop(columns = "home_win")
    # superbowl_phi_at_kc = superbowl_kc_at_phi.copy()
    # superbowl_phi_at_kc["home_team_PHI"] = 0
    # superbowl_phi_at_kc["home_team_KC"] = 1
    # superbowl_phi_at_kc["away_team_PHI"] = 1
    # superbowl_phi_at_kc["away_team_KC"] = 0

Use the .predict() method to predict the outcome for each Super Bowl (1 means the home team is predicted to win).

In [ ]:

Kansas City won 38 to 35, so it was a very close game. Our classifier should probably have estimated close probabilities for the game. We can get the estimated probabilities using .predict_proba(). Was the game predicted to be close?

In [ ]:

Part (e)

As one alternative to our best decision tree, we will try an ens.RandomForestClassifier (with defaults). Fit the random forest on X_train and y_train and evaluate it using .score(X_test, y_test). How does it perform compared to the best decision tree?

In [ ]:

Part (f)

Refit the random forest on the entire data set and again predict Super Bowl results and probabilities. What differences do you see in the predictions/probabilities?

In [ ]:
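The depth sweep, the refit at the best depth, and the random forest comparison can be sketched on synthetic data. Everything about the data here is an assumption made for illustration: X and y are randomly generated rather than taken from full_season, the depth range is shortened to 1 through 10, and random_state is fixed only so the run is repeatable.

```python
import numpy as np
import sklearn.ensemble as ens
import sklearn.tree as tr
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table and the home_win column (NOT NFL data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit and score one tree per depth in a single chained call, as the hint suggests.
depths = range(1, 11)
scores = [
    tr.DecisionTreeClassifier(max_depth=d, random_state=0)
    .fit(X_train, y_train)
    .score(X_test, y_test)
    for d in depths
]
best_depth = depths[np.argmax(scores)]   # depth with the highest test accuracy

# Refit on all rows at the chosen depth, then compare with a default random forest.
best_tree = tr.DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)
forest = ens.RandomForestClassifier(random_state=0).fit(X_train, y_train)
forest_score = forest.score(X_test, y_test)

# Class probabilities for one row; columns follow best_tree.classes_ (here 0, 1).
proba = best_tree.predict_proba(X[:1])
```

A single tree often returns probabilities of exactly 0 or 1 because each leaf may contain only one class, while a forest averages over many trees and tends to give intermediate values. That difference is worth keeping in mind when comparing the two sets of Super Bowl probabilities.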
