1. Introduction
You are given two real datasets which contain individual information in the US. In the
first real dataset, each individual is associated with 13 attributes and 1 additional
Boolean attribute called income indicating whether the individual had an income >
50K (per year) or not. The second real dataset has the same 13 attributes as the
first dataset but does not contain the attribute income. The
objective of this project is to predict whether each individual in the second dataset has
an income >50K or not.
There are three phases in this project: Phase 1, Phase 2 and Phase 3. In Phase 1, you
are required to generate an Excel file from two raw files together with attribute names.
In Phase 2, you are required to write a design report for this project. In Phase 3, you
are required to follow the design report in Phase 2, generate the predicted attribute
files for the second real dataset and write a final report.
2. Milestones
1. Phase 1
i. You are given two real datasets in TEXT format, training.txt and test.txt.
ii. File training.txt contains 13 attributes and 1 additional Boolean attribute.
iii. File test.txt contains 13 attributes only.
iv. Open these two TEXT files with MS Excel.
v. Save them in one MS Excel file where the content of training.txt is included in Sheet 1 and the content of test.txt is included in Sheet 2. Please re-name Sheet 1 as training and re-name Sheet 2 as test in MS Excel.
vi. Insert a row at the beginning of each of the two sheets of the Excel file where this row gives the attribute names specified in Section 3.
2. Phase 2
i. In Phase 2, you are required to write a design report for this project.
ii. You should list 5 possible data mining models you want to try.
iii. Note that a possible data mining model can be Decision Tree Classifier with a set of parameters and another possible data mining model can be Decision Tree Classifier with another set of parameters. Similarly, one of the possible data mining models can be Nearest Neighbor Classifier with a set of parameters.
3. Phase 3
i. In Phase 3, you are required to follow the design report in Phase 2, use the XLMiner software to predict attribute income for the second real dataset and write a final report.
ii. The final report should include the following.
a) All materials in your design report written in Phase 2 (i.e., the 5 possible data mining models).
b) A description of the XLMiner results for each of the 5 possible data mining models.
c) Two examples illustrating which attributes determine whether an individual has an income >50K or not, for each of the 5 possible data mining models.
d) Conclusions drawn from each of the 5 possible data mining models and an overall conclusion.
iii. In addition to the final report, you are required to generate 5 predicted attribute files for the second real dataset in TEXT file format. Note that each predicted attribute file corresponds to the output of a possible data mining model you proposed in Phase 2. The file format is described in Section 4.
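The note in Phase 2 that one classifier with two different parameter settings counts as two models can be pictured as a list of (classifier, parameters) pairs, one per predicted attribute file. The sketch below is only a planning aid written in Python and is not part of the XLMiner workflow. The parameter names and values (max_depth, k) are hypothetical, and the output file names follow Section 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """One candidate model: a classifier family plus one parameter setting."""
    classifier: str
    params: tuple  # sorted (name, value) pairs, so specs can be compared


def spec(classifier, **params):
    """Build a ModelSpec from keyword parameters."""
    return ModelSpec(classifier, tuple(sorted(params.items())))


# Hypothetical parameter names and values, for illustration only.
candidates = [
    spec("Decision Tree", max_depth=5),
    spec("Decision Tree", max_depth=10),
    spec("Nearest Neighbor", k=3),
    spec("Nearest Neighbor", k=7),
    spec("Nearest Neighbor", k=15),
]


def output_name(position):
    """File name for the model at 1-based `position` in the design report."""
    return f"predicted{position}.txt"


# Map each output file to the model that is meant to produce it.
plan = {output_name(i): model for i, model in enumerate(candidates, start=1)}
```

Because the two Decision Tree entries differ only in their parameters, they compare as different specs, which matches the way the assignment counts models.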
4. File format of Predicted Attribute File
In Phase 3, you are required to submit 5 predicted attribute files for the second dataset.
The files should be named predicted1.txt, predicted2.txt, predicted3.txt,
predicted4.txt and predicted5.txt, where predicted1.txt corresponds to the output
of the first data mining model proposed in Phase 2 and the other files have a similar
meaning. The file format of each file is shown as follows.
1st row: 1 or 0, where 1 means that the first individual in the second dataset
has an income >50K and 0 means that s/he does not.
2nd row: 1 or 0, where 1 means that the second individual in the second
dataset has an income >50K and 0 means that s/he does not.
...
Here is a sample file.
1
0
1
0
0
0
0
0
...
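If the predictions are exported from the tool as a list of labels, a file in this format could be produced and checked with a few lines of Python. This is a minimal sketch under stated assumptions, not a required part of the project: the function names are made up for illustration, a trailing newline after the last row is assumed to be acceptable, and accuracy is taken to mean the fraction of rows that match the answer file, which is an assumption about how the marking is done.

```python
from pathlib import Path


def write_predictions(path, predictions):
    """Write one label per row (1 or 0), in the order of the individuals in test.txt."""
    rows = []
    for label in predictions:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        rows.append(str(int(label)))
    # Assumption: a trailing newline after the last row is acceptable.
    Path(path).write_text("\n".join(rows) + "\n", encoding="ascii")


def read_predictions(path):
    """Read a predicted attribute file back into a list of 0/1 integers."""
    return [int(token) for token in Path(path).read_text(encoding="ascii").split()]


def accuracy(predicted, actual):
    """Fraction of rows whose predicted label equals the actual label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("both lists must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

For example, `write_predictions("predicted1.txt", [1, 0, 1, 0, 0, 0, 0, 0])` reproduces the eight sample rows shown above. Since the real answer file is held by the markers, `accuracy` would only be usable locally on rows whose income is already known, such as a held-out part of the training data.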
We have an answer file for the predicted attribute files. Among the 5 files you submit,
we will select the one with the highest accuracy as the final file for marking.
Data Specifications
There are 13 attributes in the first dataset and 1 additional Boolean attribute called
"income".