Question: Consider the dataset postoperative-patient-data simplified.arff available on Moodle. This dataset contains health-status attributes of post-operative patients in a hospital, with the target class being whether the patient should be discharged (S) or remain in the hospital (A). Additional documentation regarding these attributes appears in the arff file.
- Before you run the classifiers, use the WEKA visualization tool to analyze the data, and report briefly on the types of the different variables and on the variables that appear to be important. (4 marks)
- Run J48 (=C4.5, decision tree), Naïve Bayes and IBk (k-NN) to learn a model that predicts whether a patient should be discharged. Perform 10-fold cross validation, and analyze the results obtained by these algorithms as follows.
- Note: Click on the "Choose" bar to select relevant parameters. Explanations of the parameters you should try appear below. You should report on the performance of at least two variations of the operational parameters, e.g., minNumObj and unpruned for J48, and KNN and distanceWeighting for k-NN (the parameters debug and saveInstanceData are not operational). A programmatic sketch of setting these parameters appears after the parameter lists below.
J48
- binarySplits: whether you use binary splits on nominal attributes when building the trees.
- minNumObj: the minimum number of instances per leaf.
- unpruned: whether pruning is performed (try TRUE and FALSE).
- debug: if set to TRUE, the classifier may output additional information.
- saveInstanceData: whether to save the training data for visualization.
Naïve Bayes (parameter variations are not relevant to this lab)
k-NN (IBk) (under lazy in WEKA)
- KNN: the number of neighbours to use.
- crossValidate: whether leave-one-out cross-validation will be used to select the best k value between 1 and the value specified in the KNN parameter.
- distanceWeighting: specifies the distance weighting method used (when k > 1).
- debug: if set to TRUE, the classifier may output additional information.
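For reference, below is a minimal sketch of the same experiment driven through the WEKA Java API instead of the Explorer GUI. The ARFF file name, the assumption that the class is the last attribute, and the particular parameter values (minNumObj = 5, unpruned = TRUE, KNN = 3, inverse distance weighting) are illustrative choices, not values prescribed by the lab.

```java
// Sketch only: setting the J48 and IBk parameters named above via the WEKA Java API
// and running 10-fold cross-validation. File name and parameter values are assumptions.
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class PostOpLab {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("postoperative-patient-data_simplified.arff");
        data.setClassIndex(data.numAttributes() - 1);   // assumes the class (A/S) is last

        // J48 variation: unpruned tree with a larger minimum number of instances per leaf.
        J48 tree = new J48();
        tree.setUnpruned(true);
        tree.setMinNumObj(5);
        tree.setBinarySplits(false);

        // IBk variation: k = 3 neighbours with inverse-distance weighting.
        IBk knn = new IBk();
        knn.setKNN(3);
        knn.setDistanceWeighting(new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));

        for (weka.classifiers.Classifier c : new weka.classifiers.Classifier[] {tree, knn}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));   // 10-fold cross-validation
            System.out.println(c.getClass().getSimpleName());
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());             // confusion matrix
        }
    }
}
```

The Explorer GUI is entirely sufficient for the lab; the sketch only makes explicit which setter corresponds to which parameter in the "Choose" dialog.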
- (a) J48 (=C4.5)
- i. Examine the decision tree and indicate which are the main variables.
- ii. What is the accuracy of the decision tree? Explain the results in the confusion matrix.
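As a reminder (not specific to this dataset), the accuracy WEKA reports is the sum of the diagonal of the confusion matrix divided by the total number of instances. With the two classes A (remain) and S (discharge), and WEKA's convention of rows = actual class and columns = predicted class:

```latex
% Accuracy from the 2x2 confusion matrix (rows = actual class, columns = predicted class).
\[
\text{Accuracy} \;=\;
  \frac{n_{A \to A} + n_{S \to S}}
       {n_{A \to A} + n_{A \to S} + n_{S \to A} + n_{S \to S}}
\]
% n_{x -> y} = number of instances whose actual class is x and predicted class is y.
% The off-diagonal counts n_{A -> S} and n_{S -> A} are the two kinds of
% misclassification to comment on when explaining the confusion matrix.
```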
- (b) Naïve Bayes
- Explain the meaning of the "probability distributions" in the output, illustrating it with reference to the BP-STBL attribute.
- Calculate (by hand) the probability that a person with the following attribute values would be discharged.
- L-CORE = mid
- L-SURF = low
- L-O2 = good
- L-BP = high
- SURF-STBL = stable
- CORE-STBL = stable
- BP-STBL = mod-stable
- What is the probability that a person with these attributes will remain in hospital, and what is the probability that s/he will be discharged? What would the Naïve Bayes classifier predict for this person? (The structure of this calculation is sketched after this part.)
- What is the accuracy of the Naïve Bayes classifier? Explain the results in the confusion matrix.
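A sketch of the structure of the hand calculation in part (b): the class prior and the conditional probabilities are exactly the "probability distributions" read from WEKA's Naïve Bayes output, so no numbers are filled in here.

```latex
% Naive Bayes hand calculation: multiply the class prior by the conditional
% probability of each observed attribute value (all values read from WEKA's output).
\[
\mathrm{score}(S) \;=\; P(S)\,
  P(\text{L-CORE}{=}\text{mid} \mid S)\,
  P(\text{L-SURF}{=}\text{low} \mid S)\,
  \cdots\,
  P(\text{BP-STBL}{=}\text{mod-stable} \mid S)
\]
% Compute score(A) the same way using the A-conditional probabilities, then
% normalise the two scores to obtain the posteriors:
\[
P(S \mid \mathbf{x}) = \frac{\mathrm{score}(S)}{\mathrm{score}(S) + \mathrm{score}(A)},
\qquad
P(A \mid \mathbf{x}) = 1 - P(S \mid \mathbf{x}).
\]
% The classifier predicts whichever class has the larger posterior probability.
```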
(c) k-NN
- Find three instances in the dataset that are similar to the above patient, and use the Jaccard coefficient to calculate (by hand) the predicted outcome for this patient. Show your calculations. (One common formulation of the Jaccard coefficient for nominal attributes is sketched after this part.)
- What is the accuracy of the k-NN classifier for different values of k? Explain the results in the confusion matrix.
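One common way to apply the Jaccard coefficient to instances with purely nominal attributes (an assumption; use the definition from your course notes if it differs) is to treat each instance as the set of its attribute = value pairs:

```latex
% Jaccard similarity between two instances viewed as sets X, Y of attribute=value pairs.
\[
J(\mathbf{x}, \mathbf{y}) \;=\; \frac{|X \cap Y|}{|X \cup Y|} \;=\; \frac{k}{2m - k}
\]
% m = number of attributes compared (the seven listed above, if those are all the
%     predictive attributes in the simplified dataset),
% k = number of attributes on which the two instances take the same value.
% Rank the training instances by J against the query patient, take the three most
% similar ones, and predict by majority vote over their classes (optionally
% weighting each vote by its J value).
```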
3. Draw a table to compare the performance of J48, Naïve Bayes and IBk using the summary measures produced by WEKA. Which algorithm does better? Explain in terms of WEKA's summary measures. Can you speculate why?
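If the programmatic sketch above was used, the measures for this table can be read directly from each Evaluation object; the helper below (a sketch meant to drop into the earlier class) prints one table row per classifier.

```java
// Sketch: one row of the comparison table, read from an Evaluation object
// produced by the 10-fold cross-validation in the earlier sketch.
static void printSummaryRow(String name, Evaluation eval) {
    System.out.printf("%-12s acc=%.2f%%  kappa=%.3f  precision=%.3f  recall=%.3f  F=%.3f  ROC=%.3f%n",
            name,
            eval.pctCorrect(),              // correctly classified instances (%)
            eval.kappa(),                   // agreement beyond chance
            eval.weightedPrecision(),       // class-weighted precision
            eval.weightedRecall(),          // class-weighted recall
            eval.weightedFMeasure(),        // class-weighted F-measure
            eval.weightedAreaUnderROC());   // class-weighted ROC area
}
```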
