Question:
Project Purpose: Demonstrate your ability to apply modeling tools and methods to a data set. Ultimately, the project should be about building a model.

The first step is the identification of variables. In this course, we will study predictive models for predicting a Y variable. Y variables come in two types:

1) Continuous Y variable. This is a Y variable that takes on a range of values (e.g., $, weight, rating scale, temperature, winning %). For such a variable, we will use the regression model and regression trees.

2) Binary Y variable. This is a Y variable that takes on two values (yes/no). Examples might be whether a bank customer defaults on a loan or not, a student graduates or not, a customer returns or not, etc. For such a variable, we will use the logistic regression model.

Our coverage of the continuous Y variable will make up about 90% of the course. We will not get to the binary Y scenario until the very end of the course. So, I would like your project experience to be the building of models for a continuous Y variable (not a binary Y variable). Your project should, at a minimum, complete a regression model build; you can consider adding regression trees if your data set is not a time series. Do not include methods we have not covered. I will expect your model search to include some model selection approaches (such as AIC/BIC, data splitting, or cross-validation); hypothesis testing (F and t tests) can be discussed but should not be the sole basis for model selection.

Determine the Predictor Variables

After you have determined the target Y variable that you are trying to predict, the next step is to determine your "X" variables. X variables are the information that you think might be helpful in the prediction of the Y variable. A useful approach to this problem is to draw up a "wish" list of X variables using subject matter considerations. Consult the subject matter research and knowledgeable experts in the area.
For example, you might ask current brand managers how they forecast future sales, listening carefully to what factors they use in forming the forecasts. It is not necessary for your experts to understand regression analysis; you are merely milking them for information on the factors that might be predictive of your Y variable. Avoid the mistake of collecting many variables without thinking carefully about whether these variables could have any relevance - "I had that information, so I threw it in anyway." With this said, you should target a minimum of 5 predictor variables but shoot for many more than that. With software, an analysis of 20 predictor variables is no more difficult than an analysis of 5.

Note: If your Y variable and X variables are time-oriented (e.g., Y is monthly sales), then the time frequency should be the same for the Y and X variables (month vs. month, week vs. week, day vs. day, etc.). Don't mix time frequencies (e.g., don't collect annual Y data and try to predict the Y variable with monthly X variables).

Be careful that your X variables don't define (or basically define) your Y variable. One student looked at NBA data and picked points scored per game as Y, with X variables of three-pointers made per game (X1), two-pointers made per game (X2), and free throws made per game (X3). In my project feedback, I pointed out that there was no statistical modeling to be done here and nothing to learn, because Y is perfectly defined as follows:

Y = 3*X1 + 2*X2 + X3

Similarly, avoid looking at winning percentage as Y with points scored and points allowed as X variables. You will not learn much more than that teams that scored more than they allowed tended to win. Not a deep discovery!

Examples

Let me emphasize that the goal is to show me your ability to model a set of data. I am open to your data being from your company or from some personal interest (data gathered on your own or from the internet). Here are some examples of past projects:

Y variable: Monthly claims expenses of a particular insurance company. X variables: A variety of monthly economic indicators, such as medical CPI.

Y variable: Selling price of used Acuras. X variables: Mileage, age, transmission type, and a variety of accessory information.

Y variable: NBA teams' winning percentages. X variables: A slew of statistics (offensive rebounds, defensive rebounds, steals, FT %, etc.).

There are millions of possibilities. I have not had a student fail to come up with a data set to analyze.
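Returning to the NBA scoring example, the point that "there is no statistical modeling to be done" can be checked numerically. The sketch below is only an illustration of the idea, not part of the project workflow (the project itself must be done in JMP). The team numbers are simulated, and the variable names are made up for this example.

```python
import numpy as np

# Simulated per-game shooting numbers for 30 hypothetical teams (not real NBA data).
rng = np.random.default_rng(0)
n_teams = 30
three_pointers = rng.uniform(8, 16, n_teams)   # X1
two_pointers = rng.uniform(24, 34, n_teams)    # X2
free_throws = rng.uniform(14, 22, n_teams)     # X3

# Points per game is an accounting identity, not a statistical relationship.
points = 3 * three_pointers + 2 * two_pointers + free_throws

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n_teams), three_pointers, two_pointers, free_throws])
coef, *_ = np.linalg.lstsq(X, points, rcond=None)

# Residuals and R-squared of the fitted model.
residuals = points - X @ coef
r_squared = 1 - residuals @ residuals / np.sum((points - points.mean()) ** 2)

print(np.round(coef, 6))    # intercept, X1, X2, X3 coefficients
print(round(r_squared, 6))
```

Up to floating-point error, the fit returns an intercept of 0, slopes of 3, 2, and 1, and an R-squared of 1. The regression simply hands back the scoring rule, which is why such a choice of Y and X variables teaches nothing.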
Finance applications: Some students explore stock-related data. For such data, we generally look at the changes in prices relative to changes to prices rather than the prices per se. Looking at a single stock series and its changes is not sufficient.
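As a rough illustration of working with changes rather than price levels, the sketch below converts two price series into period-over-period changes. It is an illustration only (the project itself must be done in JMP); the numbers are made up, and the choice of a market index as the second series is an assumption for the example, not something the assignment prescribes.

```python
import numpy as np

# Made-up monthly closing values, for illustration only.
stock_price = np.array([100.0, 102.0, 101.0, 105.0, 107.0, 104.0])
market_index = np.array([2000.0, 2030.0, 2010.0, 2080.0, 2100.0, 2060.0])


def simple_changes(prices):
    """Period-over-period differences: p[t] - p[t-1]."""
    return np.diff(prices)


def percent_changes(prices):
    """Period-over-period relative changes: (p[t] - p[t-1]) / p[t-1]."""
    return np.diff(prices) / prices[:-1]


y = percent_changes(stock_price)    # candidate Y: changes in the stock
x = percent_changes(market_index)   # one candidate X: changes in another series

print(simple_changes(stock_price))
print(np.round(y, 4))
print(np.round(x, 4))
```

Each series loses its first observation when converted to changes, so six monthly prices yield five changes. The Y and X changes stay aligned month by month, consistent with the rule about not mixing time frequencies.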
Data Sources

If the data are from your company, then that is your source. If you are looking externally on the internet, there are tens of thousands of sites. I cannot claim to know where to find all data.

How Much Data?

In the software, each column represents a variable, and each row represents an observation. The number of rows you have in the data table represents the sample size "n." The question is how large "n" should be. In general, more is better. Most statistical software is designed to do "big data" analysis, so you could literally have millions of data points. I would hope that you get as much data as you can, especially if you are building a prediction model. You will learn that one way to arrive at a predictive model is to split the data between training and validation sets. This requires larger data sets, at least in the 100's of data points, preferably in the 1000's. This does not mean that smaller data sets can't be analyzed, as I often do in lectures. If you have an interesting application, then go for it.

My suggestion here is purely a rule of thumb. I would target a minimum of 30-50 observations. So, if you are dealing with a time series on monthly sales, then get at least 3-5 years' worth, which will give you 36-60 observations; this gives the opportunity to see a repeat of a given month for seasonal estimation. But 30 observations with many X variables is pushing the limits of multiple regression. One very rough rule is to have 30 observations and then 3-5 more observations for every additional X variable brought into the regression. If you are looking at sports data, then consider getting 2-3 seasons (I would avoid sports analysis over many seasons since rules have changed, styles have changed, etc.).

Software

The main software for this course is JMP. The project must be done in JMP as the primary analysis tool. It is fine to supplement JMP output with occasional Excel output.
The use of any other software such as R, Python, SAS, SPSS, or Minitab is not permitted. Your analysis must be purely your own product.

Alone or in a team?

Even though most students prefer individual projects, some students are interested in working with others. Hence, if students wish to pair up, voluntary project teams of no more than two students may be formed. I will not be pairing up students. I leave that process to the students themselves. Group projects of more than two students dilute the learning of data analysis skills.
Please understand that I need to stick firmly to this maximum limit. Group projects will be graded on the same basis as individual projects, and each individual will receive the group grade.

Housekeeping

Your final report is to be neat, clearly written, and concise. State clearly your objective, methods, and conclusion(s). Some weight will be given to the quality of report writing. Include the figures necessary for the reader to understand how and why you reached your conclusion(s). I suggest that your report contain, at a minimum, the following:

Introduction and statement of the problem being addressed.
Source of data.
Unambiguous explanation of what variable is being predicted and what variables are the potential predictor variables. Provide clear, objective definitions of the variables.
Presentation of the data and the data analysis (all relevant software output).
Conclusions and recommendations for further study.

It is critical that you show ALL output of your data journey (from initial plots to implementing model selection methods to model estimation to model diagnostics to model performance measures to final model). I expect comments with all the output. The comments do not have to be long; a few sentences telling the reader what you see and what your next steps are will do. Do not put output in an Appendix. Integrate the output of your data journey into your report write-up.

Finally, you are not graded on how well your model predicts. Some students will have models that explain a high % of the variation in the data, while others will find that their X's didn't do so well. You are not graded on this dimension. Data are the data. Let the data speak, whatever the results are. In the end, you are graded on the professionalism of your presentation, the completeness of the steps of your data journey, and your comments along the way. Wrap up the paper with a conclusion and next steps.

Remember that this is your project (not mine). My role is not to "do" your project.
So, have confidence in yourself to tell me a data story. So, please don't send me a draft of the project prior to submission asking, "Is it okay?"
Create your report in a Word document. Name your Word document with your last name, e.g., "Smithproject.doc". Also, I ask that you electronically submit the data file. Submit the data file as a CSV file (not an Excel file). This means that you will submit two files (no less, no more): (1) the report; and (2) the CSV data file. In summary: build a model!
