Question:

Predicting Delayed Flights. The file FlightDelays.jmp contains information on all commercial flights departing the Washington, DC area and arriving at New York during January 2004. For each flight there is information on the departure and arrival airports, the distance of the route, the scheduled time and date of the flight, and so on. The variable that we are trying to predict is whether or not a flight is delayed. A delay is defined as an arrival that is at least 15 minutes later than scheduled.

Data preprocessing. Bin the scheduled departure time (CRS_DEP_TIME) into 8 bins. Binning avoids treating the departure time as a continuous predictor, which is sensible because delays are plausibly related to rush-hour periods. (Note that these data are not stored in JMP with a time format, so you'll need to explore the best way to bin them. Two options are (1) the formula editor and (2) the Make Binning Formula column utility.) Partition the data into training and validation sets.
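For readers who want to see the preprocessing logic outside JMP, here is a minimal Python sketch. It rests on assumptions that the exercise does not state: that CRS_DEP_TIME is an HHMM integer (for example, 1455 for 2:55 PM), that the 8 bins are equal-width 3-hour blocks, and that a 60/40 random split is acceptable. The function names `bin_dep_time` and `partition` are hypothetical and are not JMP features.

```python
import random


def bin_dep_time(crs_dep_time, n_bins=8):
    """Map a scheduled departure time to a bin index in 0..n_bins-1.

    Assumption: the time is an HHMM integer such as 1455 (2:55 PM).
    Bins are equal-width slices of the 24-hour day (3 hours each for 8 bins).
    """
    hour, minute = divmod(crs_dep_time, 100)
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError(f"not a valid HHMM time: {crs_dep_time}")
    minutes_since_midnight = hour * 60 + minute
    bin_width = 24 * 60 / n_bins
    return int(minutes_since_midnight // bin_width)


def partition(n_rows, train_fraction=0.6, seed=1):
    """Return a shuffled list of 'Training'/'Validation' labels, one per row.

    Assumption: a simple random split; the 60/40 ratio is illustrative only.
    """
    n_train = round(n_rows * train_fraction)
    labels = ["Training"] * n_train + ["Validation"] * (n_rows - n_train)
    random.Random(seed).shuffle(labels)
    return labels


# A 7:00 AM flight falls in the third block (06:00 to 08:59) under these assumptions.
morning_bin = bin_dep_time(700)
split_labels = partition(10)
```

Equal-width bins are only one choice. Bins aligned with known rush-hour periods, or bins with roughly equal numbers of flights, may be more useful, which is part of what the exercise asks you to explore.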

a. Fit a classification tree to the flight delay variable using all the relevant predictors (use the binned version of the departure time) and the validation column. Do not include DEP_TIME (actual departure time) in the model because it is unknown at the time of prediction (unless we are predicting delays after the plane takes off, which is unlikely).

i. How many splits are in the final model?

ii. How many variables are involved in the splits?

iii. Which variables contribute the most to the model?

iv. Which variables were not involved in any of the splits?

v. Express the resulting tree as a set of rules.

vi. If you needed to fly between DCA and EWR on a Monday at 7 AM, would you be able to use this tree to predict whether the flight will be delayed? What other information would you need? Is this information available in practice? What information is redundant?

b. Fit another tree, this time using the original scheduled departure time rather than the binned version. Save the formula for this model to the data table (we'll return to this in a future exercise).

i. Compare this tree to the original, in terms of the number of splits and the number of variables involved. What are the key differences?


ii. Does it make more sense to use the original scheduled departure time variable or the binned version? Why?
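Part a.v asks for the tree to be expressed as a set of rules. As a rough illustration of what that means, the Python sketch below walks a small tree and emits one IF-THEN rule per leaf. The tree itself, including the variable names `WEATHER` and `DEP_BIN`, the split values, and the predictions, is invented for illustration; it is not the tree JMP produces for this exercise, and `tree_to_rules` is a hypothetical helper.

```python
def tree_to_rules(node, conditions=()):
    """Return one 'IF ... THEN ...' rule per leaf of a binary tree.

    Each internal node is a dict with a 'test' string and 'yes'/'no' children;
    each leaf is a dict with a 'predict' label. This structure is an assumption
    made for the sketch, not a JMP export format.
    """
    if "predict" in node:
        antecedent = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {antecedent} THEN {node['predict']}"]
    test = node["test"]
    rules = tree_to_rules(node["yes"], conditions + (test,))
    rules += tree_to_rules(node["no"], conditions + (f"NOT ({test})",))
    return rules


# Entirely made-up tree, used only to show the traversal.
toy_tree = {
    "test": "WEATHER = 1",
    "yes": {"predict": "delayed"},
    "no": {
        "test": "DEP_BIN >= 5",
        "yes": {"predict": "delayed"},
        "no": {"predict": "ontime"},
    },
}

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each rule corresponds to one path from the root to a leaf, so the number of rules equals the number of leaves, which is one more than the number of splits in a binary tree.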

Related Book:

Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro

ISBN: 246377

1st Edition

Authors: Galit Shmueli, Peter C. Bruce, Mia L. Stephens, Nitin R. Patel