Part II: An application (75 marks)

2.1 Background on Credit Card Dataset

The data, "CreditCard Data.xls", is based on Yeh and Lien (2009). The dataset contains 30,000 observations and 23 explanatory variables. The response variable, Y, is a binary variable where "1" refers to default payment and "0" refers to non-default payment. The description of the 23 explanatory variables is as follows:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

X2: Gender (1 = male; 2 = female).

X3: Education (0 = unknown; 1 = graduate school; 2 = university; 3 = high school; 4 = others; 5 = unknown; 6 = unknown).

X4: Marital status (0 = unknown; 1 = married; 2 = single; 3 = others).

X5: Age (year).

X6-X11: History of past payment. The past monthly payment records (from April to September, 2005) were tracked as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -2 = no consumption; -1 = pay duly; 0 = the use of revolving credit; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.

X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
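The coding schemes above can be written down as lookup tables, which makes the integer codes easier to read when summarising the data. The following is a minimal sketch in Python, used here for illustration only (the tasks below ask for R). The names EDUCATION, MARRIAGE, decode and repayment_label are hypothetical, and folding codes 0, 5 and 6 of X3 into a single "unknown" label is an assumption based on the description of X3, not a required preprocessing step.

```python
# Lookup tables transcribed from the variable descriptions above.
EDUCATION = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARRIAGE = {1: "married", 2: "single", 3: "others"}


def decode(code, labels):
    """Map an integer code to its label; any code not listed is 'unknown'."""
    return labels.get(code, "unknown")


def repayment_label(code):
    """Describe a repayment-status code (X6-X11) on the -2..9 scale."""
    if code == -2:
        return "no consumption"
    if code == -1:
        return "pay duly"
    if code == 0:
        return "use of revolving credit"
    if 1 <= code <= 8:
        return f"payment delay for {code} month(s)"
    if code == 9:
        return "payment delay for nine months and above"
    raise ValueError(f"repayment status outside the documented scale: {code}")


print(decode(5, EDUCATION))   # codes 0, 5 and 6 are all described as unknown
print(decode(2, MARRIAGE))
print(repayment_label(2))
```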

2.2 Assessment Tasks

2.2.1 Data

(a) Select a random sample of 70% of the full dataset as the training data and retain the rest as test data. Provide the code and print out the dimensions of the training data. (5 marks)
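The split in (a) amounts to shuffling the row indices once and cutting at 70%. The following is a minimal sketch in Python for illustration only, since the brief expects R code. The function train_test_split_rows, the seed value and the placeholder rows are assumptions: the real rows would come from the spreadsheet, and the 24 columns stand for the 23 explanatory variables plus Y.

```python
import random


def train_test_split_rows(rows, train_fraction=0.7, seed=1):
    """Randomly partition rows into (train, test) without replacement."""
    rng = random.Random(seed)            # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Placeholder for the real data: 30,000 rows, 23 explanatory variables plus Y.
N_OBS, N_COLS = 30_000, 24
rows = [[0] * N_COLS for _ in range(N_OBS)]

train, test = train_test_split_rows(rows)
print("training dimensions:", (len(train), len(train[0])))
print("test dimensions:", (len(test), len(test[0])))
```

Fixing the seed matters here because the later tasks compare models on the same training and test sets.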

2.2.2 Tree Based Algorithms

(a) Use an appropriate tree-based algorithm to classify credible and non-credible clients. Specify any underlying assumptions. Justify your model choice as well as the hyper-parameters which are required to be specified in R. (10 marks)

(b) Display the model summary and discuss the relationship between the response variable and the selected features. (10 marks)

(c) Evaluate the performance of the algorithm on the training data and comment on the results. (5 marks)
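To make the idea behind a tree-based classifier concrete, the following sketch searches for the single threshold on one feature that minimises the weighted Gini impurity, which is the kind of split a CART-style tree makes at each node. It is written in Python for illustration only and is not a substitute for a fitted tree in R. The function names and the toy values (placed on the X6 repayment-status scale) are invented for the example and are not taken from the dataset.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    If no split improves on the unsplit node, the threshold is None.
    """
    n = len(labels)
    best = (None, gini(labels))
    # The largest value is skipped: splitting there leaves the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Toy data, invented for illustration: repayment status and default indicator.
x6 = [-1, 0, 0, 2, -2, 1, 3, 0, 2, -1]
y = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

threshold, impurity = best_threshold(x6, y)
print("split at X6 <=", threshold, "with weighted Gini", impurity)
```

On this toy sample the best split separates clients with a delay of two months or more from the rest. A full tree repeats the same search over all features within each resulting node, and hyper-parameters such as maximum depth or minimum node size decide when to stop.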

2.2.3 Support vector classifier

(a) Use an appropriate support vector classifier to classify the credible and non-credible clients. Justify your model choice as well as the hyper-parameters which are required to be specified in R. (10 marks)

(b) Display the model summary and discuss the relationship between the response variable and the selected features. (10 marks)

(c) Evaluate the performance of the algorithm on the training data and comment on the results. (5 marks)

2.2.4 Prediction

Apply your fitted models in 2.2.2 and 2.2.3 to make predictions on the test data. Evaluate the performance of the algorithms on the test data. Which model do you prefer? Are there any suggestions to further improve the performance of the algorithms? Justify your answers. (20 marks)
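When comparing the two models on the test data, accuracy alone can hide differences in how well defaults are detected, so it helps to read it alongside sensitivity and specificity from the confusion matrix. The following is a minimal sketch in Python for illustration only (the brief expects R). The function names are hypothetical, and the label vectors are invented toy values, not model output.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives, treating `positive` as default."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def summarise(actual, predicted):
    """Return accuracy, sensitivity and specificity for 0/1 labels."""
    c = confusion_counts(actual, predicted)
    total = sum(c.values())
    positives = c["tp"] + c["fn"]
    negatives = c["tn"] + c["fp"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / total,
        "sensitivity": c["tp"] / positives if positives else float("nan"),
        "specificity": c["tn"] / negatives if negatives else float("nan"),
    }


# Toy labels, invented for illustration: 1 = default, 0 = non-default.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
tree_pred = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
svc_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

tree_metrics = summarise(y_true, tree_pred)
svc_metrics = summarise(y_true, svc_pred)
print("tree:", tree_metrics)
print("svc: ", svc_metrics)
```

In this toy case both sets of predictions have the same accuracy, yet the first catches two of the three defaults and the second only one. That is the kind of difference worth reporting when justifying a preferred model.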
