Question: Gradient Boosting (2 + 4 = 6 marks)

In this question, we will explore the use of pre-processing methods and Gradient Boosting on the popular Lending Club dataset. You are provided with two files: loan_train.csv and loan_test.csv.

The dataset is almost as provided by the original source, and you may have to make the necessary changes to make it suitable for applying ML algorithms.

(If required, you can further divide loan_train.csv into a validation set for model selection.)
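A minimal sketch of holding out a validation set, assuming scikit-learn's train_test_split is acceptable for this. The 80/20 ratio, the random_state, and the small stand-in DataFrame (including the loan_amnt column) are illustrative assumptions; with the real data the frame would come from reading loan_train.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the contents of loan_train.csv; loan_amnt is a
# hypothetical placeholder feature.
train_df = pd.DataFrame({
    "loan_amnt": range(1000, 1100),
    "loan_status": ["Fully Paid"] * 80 + ["Charged off"] * 20,
})

# Hold out 20% for model selection; stratify so both classes keep
# the same proportions in each part.
fit_df, val_df = train_test_split(
    train_df,
    test_size=0.2,
    stratify=train_df["loan_status"],
    random_state=0,
)

print(len(fit_df), len(val_df))
```

Stratifying on loan_status is a design choice rather than a requirement: it keeps the minority class from being under-represented in the held-out part.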

Your efforts will be to pre-process the data appropriately, and then apply gradient boosting to classify whether a customer should be given a loan or not. The target attribute is in the column loan_status, which has values "Fully Paid", for which you can assign +1, and "Charged off", for which you can assign -1. The other records with loan_status value "Current" (in both train and test) are not relevant to this problem. You can see this link and this link to know more about the different attributes of the dataset (but please use the provided data, since there are several versions of the dataset online).
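The label assignment described above could be sketched as follows. The sample rows and the loan_amnt column are made up for illustration, and matching the status text case-insensitively is an assumption, made because the exact capitalisation in the provided files is not confirmed here.

```python
import pandas as pd

# Stand-in rows; the real data would come from loan_train.csv.
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged off", "Current", "Fully Paid"],
    "loan_amnt": [5000, 12000, 8000, 3000],
})

# Normalise whitespace and case before matching (an assumption about
# how the values are spelled in the file).
status = df["loan_status"].str.strip().str.lower()
label_map = {"fully paid": 1, "charged off": -1}

# Keep only the two relevant classes; "Current" rows are dropped.
keep = status.isin(list(label_map))
df = df.loc[keep].copy()
df["label"] = status[keep].map(label_map).astype(int)

print(df["label"].tolist())
```

The same filter and mapping would be applied to the test file before evaluation.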

Your tasks are to do the following:

(a) Pre-process the data as needed to apply the classifier to the training data (you are free to use pandas or other relevant libraries). Note that test data should not be used for pre-processing in any way, but the same pre-processing steps can be used on test data. Some steps to consider:

Check for missing values, and how you want to handle them (you can delete the records, or replace the missing value with mean/median of the attribute; this is a decision you must make. Please document your decisions/choices in the final submitted report.)

Check whether you really need all the provided attributes, and choose the necessary attributes. (You can employ feature selection methods, if you are familiar; if not, you can eyeball.)

Transform categorical data into binary features, and any other relevant columns to suitable datatypes.

Any other steps that help you perform better.
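One way the missing-value and categorical steps above might look in pandas, as a sketch only. The column names (annual_inc, term), the choice of median and mode imputation, and the helper name preprocess are all assumptions for illustration. The point it tries to show is that imputation values and the one-hot column layout are learned from the training data and then reused unchanged on the test data.

```python
import pandas as pd

# Hypothetical miniature train/test frames; column names are illustrative.
train = pd.DataFrame({
    "annual_inc": [50000.0, None, 72000.0, 64000.0],
    "term": ["36 months", "60 months", "36 months", None],
})
test = pd.DataFrame({
    "annual_inc": [None, 58000.0],
    "term": ["60 months", "12 months"],  # "12 months" never seen in training
})

num_cols = ["annual_inc"]
cat_cols = ["term"]

# Learn imputation values from the training data only.
medians = train[num_cols].median()
modes = train[cat_cols].mode().iloc[0]


def preprocess(df, columns=None):
    """Impute with training statistics, then one-hot encode categoricals."""
    out = df.copy()
    out[num_cols] = out[num_cols].fillna(medians)
    out[cat_cols] = out[cat_cols].fillna(modes)
    out = pd.get_dummies(out, columns=cat_cols, dtype=int)
    if columns is not None:
        # Force the training layout: unseen categories are dropped,
        # categories absent from this frame become all-zero columns.
        out = out.reindex(columns=columns, fill_value=0)
    return out


X_train = preprocess(train)
X_test = preprocess(test, columns=X_train.columns)

print(X_train)
print(X_test)
```

Deleting incomplete records instead of imputing is equally valid under the task statement; whichever is chosen should be noted in the report.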


(b) Apply gradient boosting using the function sklearn.ensemble.GradientBoostingClassifier for training the model. You will need to import sklearn, sklearn.ensemble, and numpy. Your effort will be focused on predicting whether or not a loan is likely to default.
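A minimal sketch of fitting the classifier, assuming the features have already been pre-processed into a numeric array. The data here is synthetic (generated with make_classification) and the hyperparameter values are illustrative starting points, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the pre-processed loan features.
X, y01 = make_classification(n_samples=300, n_features=8, random_state=0)
y = np.where(y01 == 1, 1, -1)  # +1 = "Fully Paid", -1 = "Charged off"

X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# Hyperparameter values here are illustrative only.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gb.fit(X_train, y_train)

y_pred = gb.predict(X_test)
print("test accuracy:", gb.score(X_test, y_test))
```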

Get the best test accuracy you can, and show what hyperparameters led to this accuracy. Report the precision and recall for each of the models that you built.
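A sketch of how several models could be compared while recording accuracy, precision and recall for each. The data is synthetic, the grid is deliberately tiny and purely illustrative, and treating +1 ("Fully Paid") as the positive class for precision and recall is an assumption that should be stated in the report either way.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed training data, labels in {-1, +1}.
X, y01 = make_classification(n_samples=400, n_features=8, random_state=0)
y = 2 * y01 - 1
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A deliberately small, illustrative grid.
grid = {"learning_rate": [0.05, 0.2], "max_depth": [2, 3]}

results = []
for lr, depth in product(grid["learning_rate"], grid["max_depth"]):
    model = GradientBoostingClassifier(
        n_estimators=50, learning_rate=lr, max_depth=depth, random_state=0
    )
    model.fit(X_fit, y_fit)
    pred = model.predict(X_val)
    results.append({
        "learning_rate": lr,
        "max_depth": depth,
        "accuracy": accuracy_score(y_val, pred),
        "precision": precision_score(y_val, pred, pos_label=1),
        "recall": recall_score(y_val, pred, pos_label=1),
    })

best = max(results, key=lambda r: r["accuracy"])
for row in results:
    print(row)
print("best:", best)
```

Here the comparison is done on a held-out validation split, with the test file reserved for the final reported numbers; that is one reasonable reading of the task, not the only one.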

In particular, study the effect of increasing the number of trees in the classifier.
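The effect of the number of trees can be read off a single fitted model, as in this sketch. It relies on GradientBoostingClassifier.staged_predict, which yields the predictions after each successive boosting stage; the synthetic data and the choice of 200 trees are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed loan data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(n_estimators=200, random_state=0)
gb.fit(X_fit, y_fit)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees,
# so one fit gives the whole accuracy-versus-trees curve.
fit_acc = [accuracy_score(y_fit, p) for p in gb.staged_predict(X_fit)]
val_acc = [accuracy_score(y_val, p) for p in gb.staged_predict(X_val)]

for n_trees in (1, 10, 50, 100, 200):
    print(n_trees, round(fit_acc[n_trees - 1], 3), round(val_acc[n_trees - 1], 3))
```

Comparing the training curve with the held-out curve is what shows whether adding trees keeps helping or only fits the training data more closely.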

Compare your final best performance (accuracy, precision, recall) against a simple decision tree built using information gain. (You can use sklearn's inbuilt decision tree function for this.)
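A sketch of the baseline comparison, assuming sklearn.tree.DecisionTreeClassifier with criterion="entropy" as the information-gain tree. The data is synthetic and the gradient boosting model uses default hyperparameters as a placeholder for whatever the tuned model turns out to be.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pre-processed loan data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # criterion="entropy" makes the tree split on information gain.
    "decision tree (entropy)": DecisionTreeClassifier(
        criterion="entropy", random_state=0
    ),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

summary = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    summary[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    }

for name, scores in summary.items():
    print(name, scores)
```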
