Question:

For this chapter's exercise, you will compile your own data set based on people you know and the cars they drive, and then create a linear discriminant analysis, k-NN, and Naïve Bayes model of your data in order to predict categories for a scoring data set. Complete the following steps:
1) Open a new blank spreadsheet in OpenOffice Calc or another spreadsheet program of your choice. At the bottom of the spreadsheet there will be three default tabs labeled Sheet1, Sheet2, and Sheet3. Rename the first one Training and the second one Scoring. You can rename the tabs by double-clicking on their labels. You can delete or ignore the third default sheet.
2) On the training sheet, starting in cell A1 and going across, create attribute labels for six attributes: Age, Gender, Marital_Status, Employment, Housing, and Car_Type.
3) Copy each of these attribute names except Car_Type into the Scoring sheet.
4) On the Training sheet, enter values for each of these attributes for several people that you know who have a car. These could be family members, friends and neighbors, coworkers or fellow students, etc. Try to record at least 20 observations; 30 or more would be better. Enter married couples as two separate observations, so long as each spouse has a different vehicle. Use the following to guide your data entry:
a. For Age, you could enter the person's actual age in years, or you could group ages into buckets. For example, you could enter 10 for people aged 10-19, 20 for people aged 20-29, and so on.
b. For Gender, enter 0 for female and 1 for male.
c. For Marital_Status, use 0 for single, 1 for married, 2 for divorced, and 3 for widowed.
d. For Employment, enter 0 for student, 1 for full-time, 2 for part-time, and 3 for retired.
e. For Housing, use 0 for lives rent-free with someone else, 1 for rents housing, and 2 for owns housing.
f. For Car_Type, you can record data in a number of ways. This will be your label, or the attribute you wish to predict. You could record each person's car by make (e.g. Toyota, Honda, Ford), or you could record it by body style (e.g. Car, Truck, SUV). Be consistent in assigning classifications, and note that depending on the size of the data set you create, you won't want to have too many possible classifications, or your predictions in the scoring data set will be spread out too much. With small data sets containing only 20-30 observations, the number of categories should be limited to three or four. You might even consider using Japanese, American, and European as your Car_Type values.
5) Once you've compiled your Training data set, switch to the Scoring sheet in OpenOffice Calc. Repeat the data entry process for at least 20 people (more is better) that you know who do not have a car. You will use the training set to try to predict the type of car each of these people would drive if they had one.
6) Use the File > Save As menu option in OpenOffice Calc to save your Training and Scoring sheets as CSV files.
7) Either import your two CSV files into your RapidMiner repository, being sure to give them descriptive names, or read them into a new process using Read CSV.
8) If you have prepared your data well in OpenOffice Calc, you shouldn't have any missing or inconsistent data to contend with, so data preparation should be minimal. Rename the two Retrieve operators (or Read CSV operators) so you can tell the difference between your training and scoring data sets.
9) One necessary data preparation step is to add a Set Role operator and define the Car_Type attribute as your label.
10) Add a Linear Discriminant Analysis operator to your Training stream.
11) Apply your LDA model to your scoring data and run your model. Evaluate and report your results. Did you get any confidence percentages? Do the predicted Car_Types seem reasonable and consistent with your training data? Why or why not?
12) Change your model operator to k-NN, then to Naïve Bayes. Compare and contrast the outputs from the three modeling methodologies. Describe and discuss the differences.
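
The data layout described in steps 2 through 6 can also be sketched outside the spreadsheet. The Python below is only an illustration, not part of the exercise: it builds the Training and Scoring column lists, checks each row against the codes given in step 4, and produces CSV text. The helper names (validate_row, to_csv_text) and every data row are invented placeholders; your own observations replace them.

```python
import csv
import io

# Attribute labels from steps 2 and 3 of the exercise.
TRAINING_COLUMNS = ["Age", "Gender", "Marital_Status", "Employment", "Housing", "Car_Type"]
SCORING_COLUMNS = TRAINING_COLUMNS[:-1]  # every attribute except the label

# Allowed codes from step 4, items b through e.
ALLOWED_CODES = {
    "Gender": {0, 1},
    "Marital_Status": {0, 1, 2, 3},
    "Employment": {0, 1, 2, 3},
    "Housing": {0, 1, 2},
}


def validate_row(row, columns):
    """Return a list of problems found in one row (an empty list means OK)."""
    if len(row) != len(columns):
        return [f"expected {len(columns)} values, got {len(row)}"]
    problems = []
    for name, value in zip(columns, row):
        if name in ALLOWED_CODES and value not in ALLOWED_CODES[name]:
            problems.append(f"{name}={value!r} is not an allowed code")
    return problems


def to_csv_text(columns, rows):
    """Render a header row plus data rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented placeholder rows, NOT real observations.
training_rows = [
    [20, 0, 0, 0, 1, "Car"],
    [40, 1, 1, 1, 2, "Truck"],
    [60, 0, 3, 3, 2, "SUV"],
]
scoring_rows = [
    [10, 1, 0, 0, 0],
    [30, 0, 1, 2, 1],
]

for row in training_rows:
    assert validate_row(row, TRAINING_COLUMNS) == []
for row in scoring_rows:
    assert validate_row(row, SCORING_COLUMNS) == []

print(to_csv_text(TRAINING_COLUMNS, training_rows))
print(to_csv_text(SCORING_COLUMNS, scoring_rows))
```

Checking codes before export serves the same purpose as the careful data entry mentioned in step 8: inconsistent values are caught before the files reach RapidMiner.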
Please turn in your training data, scoring data and answer document.
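
To help interpret the k-NN output in step 12, here is a minimal sketch of the idea behind k-nearest neighbors, written in plain Python. It is not RapidMiner's implementation. It assumes unscaled Euclidean distance, k = 3, and confidences computed as the share of neighbor votes; the function name knn_predict and all rows are invented placeholders. Note that Euclidean distance treats coded attributes such as Marital_Status as if the codes were ordered quantities, which is a simplification.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Predict a label for `query` from its k nearest training rows.

    training: list of (features, label) pairs, features being numeric lists.
    Returns (predicted_label, confidences), where confidences maps each
    label seen among the neighbors to its share of the votes.
    """
    # Sort training rows by Euclidean distance to the query and keep k.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    confidences = {label: count / len(nearest) for label, count in votes.items()}
    predicted = max(confidences, key=confidences.get)
    return predicted, confidences


# Invented placeholder rows in the order Age, Gender, Marital_Status,
# Employment, Housing, with Car_Type as the label. NOT real observations.
training = [
    ([20, 0, 0, 0, 1], "Car"),
    ([25, 1, 0, 1, 1], "Car"),
    ([40, 1, 1, 1, 2], "Truck"),
    ([45, 1, 1, 1, 2], "Truck"),
    ([60, 0, 3, 3, 2], "SUV"),
]

# One hypothetical scoring row (no Car_Type, as on the Scoring sheet).
label, confidences = knn_predict(training, [22, 0, 0, 0, 1], k=3)
print(label, confidences)
```

Because Age spans a much larger numeric range than the coded attributes, it dominates the distance in this sketch. That is one reason the predictions from k-NN may differ from those of LDA and Naïve Bayes on the same data.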
