Question:

Project Purpose: Demonstrate your ability to apply modeling tools and methods to a data set. Ultimately, the project should be about building a model. The first step is the identification of variables. In this course, we will study predictive models for predicting a Y variable. Y variables come in two types:

1) Continuous Y variable. This is a Y variable that takes on a range of values (e.g., $, weight, rating scale, temperature, winning %). For such a variable, we will use the regression model and regression trees.

2) Binary Y variable. This is a Y variable that takes on two values (yes/no). Examples might be whether a bank customer defaults on a loan or not, a student graduates or not, a customer returns or not, etc. For such a variable, we will use the logistic regression model.

Our coverage of the continuous Y variable will make up about 90% of the course. We will not get to the binary Y scenario until the very end of the course. So, I would like your project experience to be the building of models for a continuous Y variable (not a binary Y variable). Your project should, at a minimum, complete a regression model build; you can consider adding regression trees if your data set is not a time series. Do not include methods we have not covered. I will expect your model search to include some model selection approaches (such as AIC/BIC, data splitting, or CV); hypothesis testing (F and t tests) can be discussed but should not be the sole basis for model selection.

Determine the Predictor Variables

After you have determined the target Y variable that you are trying to predict, the next step is to determine your "X" variables. X variables are the information that you think might be helpful in the prediction of the Y variable. A useful approach to this problem is to draw up a "wish" list of X variables using subject matter considerations. Consult the subject matter research and knowledgeable experts in the area. For example, you might ask current brand managers how they forecast future sales, listening carefully to what factors they use in forming the forecasts. It is not necessary for your experts to understand regression analysis; you are merely milking them for information on the factors that might be predictive of your Y variable. Avoid the mistake of collecting many variables without thinking carefully about whether these variables could have any relevance - "I had that information, so I threw it in anyway." With this said, you should target a minimum of 5 predictor variables but shoot for many more than that. With software, an analysis of 20 predictor variables is no more difficult than an analysis of 5 predictor variables.

Note: If your Y variable and X variables are time-oriented (e.g., Y is monthly sales), then the time frequency should be the same for the Y and X variables (month vs. month, week vs. week, day vs. day, etc.). Don't mix time frequencies (e.g., don't collect annual Y data and try to predict the Y variable with monthly X variables).

Be careful that your X variables don't define (or basically define) your Y variable. One student looked at NBA data and picked points scored per game as a Y and had X variables of three-pointers made per game (X1), two-pointers made per game (X2), and free throws made per game (X3). In my project feedback, I pointed out that there was no statistical modeling to be done here and there is nothing to learn because Y is perfectly defined as follows:

Y = 3*X1 + 2*X2 + X3

Similarly, avoid looking at winning percentage as a Y and points scored and points allowed as X variables. You will not learn much more than that teams that scored more than they allowed were teams that tended to win. Not a deep discovery!

Examples

Let me emphasize that the goal is to show me your ability to model a set of data. I am open to your data being from your company or some personal interest (data gathered on your own or from the internet). Here are some examples of past projects:

Y variable: Monthly claims expenses of a particular insurance company. X variables: Variety of monthly economic indicators such as medical CPI.

Y variable: Selling price of used Acuras. X variables: Mileage, age, transmission type, and a variety of accessory information.

Y variable: NBA teams' winning percentages. X variables: A slew of statistics (offensive rebounds, defensive rebounds, steals, FT %, etc.).

There are millions of possibilities. I have not had a student fail to come up with a data set to analyze.

Finance applications: Some students explore stock-related data. For such data, we generally look at the changes in prices or the relative changes in prices rather than the prices per se. Looking at a single stock series and its changes is not sufficient.
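To make the point about price changes concrete, here is a minimal sketch of how a price series might be turned into period-to-period changes and relative (percentage) changes. It assumes that is what the passage means by changes in prices; the function names and the price values are hypothetical and made up for illustration. Python is used only to show the arithmetic; the project itself must still be done in JMP.

```python
def price_changes(prices):
    """Period-to-period change: P[t] - P[t-1]."""
    return [curr - prev for prev, curr in zip(prices, prices[1:])]


def relative_changes(prices):
    """Relative change: (P[t] - P[t-1]) / P[t-1]."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


# Made-up closing prices for three consecutive periods (illustrative only).
prices = [100.0, 110.0, 99.0]

changes = price_changes(prices)      # one fewer value than the price series
returns = relative_changes(prices)   # same length as `changes`

print(changes)
print(returns)
```

Note that differencing drops one observation, so a series of n prices yields n - 1 changes. That matters when counting the sample size.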

Data Sources

If the data are from your company, then that is your source. If you are looking externally on the internet, there are tens of thousands of sites. I cannot claim to know where to find all data.

How Much Data?

In the software, each column represents a variable, and each row represents an observation. The number of rows you have in the data table represents the sample size "n." The question is how large should "n" be? In general, more is better. Most statistical software is designed to do "big data" analysis, so you could literally have millions of data points. I would hope that you get as much data as you can, especially if you are building a prediction model. You will learn that one way to arrive at a predictive model is to split the data between training and validation sets. This requires larger data sets, at least in the hundreds of data points, preferably in the thousands. This does not mean that smaller data sets can't be analyzed, as I often do in lectures. If you have an interesting application, then go for it.

My suggestion here is purely a rule of thumb. I would target a minimum of 30-50 observations. So, if you are dealing with a time series on monthly sales, then get at least 3-5 years' worth, which will give you 36-60 observations; this gives the opportunity to see a repeat of a given month for seasonal estimation. But 30 observations with many X variables is pushing the limits of multiple regression. One very rough rule is to have 30 observations and then 3-5 observations for every additional X variable brought into the regression. If you are looking at sports data, then consider getting 2-3 seasons (I would avoid sports analysis over many seasons since rules have changed, styles have changed, etc.).

Software

The main software for this course is JMP. The project must be done in JMP as the primary analysis tool. It is fine to supplement JMP output with occasional Excel output. The use of any other software such as R, Python, SAS, SPSS, or Minitab is not permitted. Your analysis must be purely your own product.

Alone or in a team?

Even though most students prefer individual projects, some students are interested in working with others. Hence, if students wish to pair up, voluntary project teams may be formed with as many as two students. I will not be pairing up students. I leave that process to the students themselves. Group projects of more than 2 students dilute the learning of data analysis skills.

Please understand that I need to stick firmly to this maximum limit. Group projects will be graded on the same basis as individual projects, and each individual will receive the group grade.

Housekeeping

Your final report is to be neat, clearly written, and concise. State clearly your objective, methods, and conclusion(s). Some weight will be given to the quality of report writing. Include necessary figures to permit the reader to understand how and why you reached your conclusion(s). I suggest that your report should, at a minimum, contain the following:

- Introduction and statement of the problem being addressed.
- Source of data.
- Unambiguous explanation of what variable is being predicted and what variables are the potential predictor variables. Provide clear, objective definitions of the variables.
- Presentation of the data and the data analysis (all relevant software output).
- Conclusions and recommendations for further study.

It is critical that you show ALL output of your data journey (from initial plots to implementing model selection methods to model estimation to model diagnostics to model performance measures to final model). I expect comments with all the output. The comments do not have to be long; a few sentences telling the reader what you see and what your next steps are will do. Do not put output in an Appendix. Integrate the output of your data journey into your report write-up.

Finally, you are not graded on how well your model predicts. Some students will have models that explain a high % of the data while others will find that their X's didn't do so well. You are not graded on this dimension. Data are the data. Let the data speak, whatever the results are. In the end, you are graded on the professionalism of your presentation, the completeness of the steps of your data journey, and your comments along the way. Wrap up the paper with a conclusion and next steps.

Remember that this is your project (not mine). My role is not to "do" your project. So, have confidence in yourself to tell me a data story. Please don't send me a draft of the project prior to submission asking, "Is it okay?"

Create the report in a Word document. Name your Word document with your last name, e.g., "Smithproject.doc". Also, I ask that you electronically submit the data file. Submit the data file as a CSV file (not an Excel file). This means that you will submit two files (no less, no more): (1) the report; and (2) the CSV data file. In summary: build a model!
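The brief asks for model selection approaches such as AIC/BIC rather than hypothesis tests alone. As a rough illustration of the idea, the sketch below compares an intercept-only model with a one-predictor regression using a common least-squares form of the criteria, AIC = n*ln(SSE/n) + 2k and BIC = n*ln(SSE/n) + k*ln(n), with additive constants dropped. Everything here is an assumption for illustration: the data are made up, the function names are hypothetical, and JMP may report a different variant (for example a small-sample correction), so the numbers should not be expected to match JMP output. Python is used only to show the arithmetic; the project itself must still be done in JMP.

```python
import math


def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sse(y, fitted):
    """Sum of squared errors between observed and fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def aic_bic(sse_value, n, k):
    """AIC and BIC for a least-squares model with k estimated
    coefficients (additive constants dropped)."""
    base = n * math.log(sse_value / n)
    return base + 2 * k, base + k * math.log(n)


# Made-up illustrative data: y rises by roughly 2 per unit of x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(y)

# Candidate 1: intercept only (predict the mean of y), k = 1.
mean_y = sum(y) / n
aic_null, bic_null = aic_bic(sse(y, [mean_y] * n), n, k=1)

# Candidate 2: intercept plus slope on x, k = 2.
b0, b1 = fit_simple_ols(x, y)
aic_line, bic_line = aic_bic(sse(y, [b0 + b1 * xi for xi in x]), n, k=2)

print("intercept-only:", aic_null, bic_null)
print("with x:        ", aic_line, bic_line)
```

Lower values are preferred under both criteria. Each added coefficient has to reduce SSE enough to offset its penalty, which is what guards against piling in X variables that do not help prediction.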
