Question:
Dataset Description
You will get two datasets. train.csv is for training your model, and test.csv contains the clients for which you must make predictions. The submission must strictly follow the format shown in sample_submission.csv.
Files
- train.csv - the training set
- test.csv - the test set
Download the dataset:
https://drive.google.com/drive/folders/105jPIlN8sK-lprLibpC6iEMkfv2K135x?usp=sharing
(Note that the outcome has to be the class probabilities)
Columns
Client information
- id - client id (numeric)
- age - age of client (numeric)
- job - type of job (categorical: "admin.", "artisan", "entrepreneur", "housemaid", "management", "retired", "self-employed", "services", "student", "technician", "unemployed", "unknown")
- civil - marital status of client (categorical: "divorced", "married", "single", "unknown"; note: "divorced" means divorced or widowed)
- education - education of client (categorical: "4K", "6K", "K9", "K12", "illiterate", "apprenticeship", "university", "unknown")
- credit - has credit in default? (categorical: "no", "yes", "unknown")
- hloan - has housing loan? (categorical: "no", "yes", "unknown")
- ploan - has personal loan? (categorical: "no", "yes", "unknown")
Campaign details
- ctype - contact communication type (categorical: "cellular", "telephone")
- month - last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- day - last contact day of the week (categorical: "mon", "tue", "wed", "thu", "fri")
- ccontact - number of contacts performed during this campaign for this client (numeric, includes last contact)
- lcdays - number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
- pcontact - number of contacts performed before this campaign for this client (numeric)
- presult - outcome of the previous marketing campaign (categorical: "failure", "nonexistent", "success")
Socioeconomic indicators
- employment - employment variation rate, quarterly indicator (numeric)
- cprice - consumer price index, monthly indicator (numeric)
- cconf - consumer confidence index, monthly indicator (numeric)
- euri3 - euribor 3 month rate, daily indicator (numeric)
- employees - number of employees, quarterly indicator (numeric)
Outcome variable (target)
- outcome - has the client opened a savings account? (binary: 1 = "yes", 0 = "no")
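A minimal sketch of how a few of these columns might be prepared in R and turned into class probabilities. The toy data frame is invented purely for illustration (the real values come from train.csv); the helper name `prepare_clients`, the derived column `contacted_before`, and the choice of logistic regression are assumptions, not part of the assignment.

```r
# Toy rows that only mimic a few of the documented columns; values are invented.
toy <- data.frame(
  id      = 1:8,
  age     = c(25, 41, 37, 58, 33, 46, 29, 52),
  civil   = c("single", "married", "married", "divorced",
              "single", "married", "unknown", "married"),
  lcdays  = c(999, 3, 999, 6, 999, 999, 4, 999),
  outcome = c(0, 1, 0, 1, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: encode a categorical column as a factor and handle
# the lcdays sentinel (999 = client was not previously contacted).
prepare_clients <- function(df) {
  df$civil <- factor(df$civil,
                     levels = c("divorced", "married", "single", "unknown"))
  df$contacted_before <- as.integer(df$lcdays != 999)
  df$lcdays[df$lcdays == 999] <- NA  # avoid treating 999 as a real day count
  df
}

prepared <- prepare_clients(toy)

# Logistic regression is one possible model; type = "response" returns
# probabilities of class 1 rather than log-odds or hard labels.
fit  <- glm(outcome ~ age + contacted_before, data = prepared, family = binomial)
prob <- predict(fit, newdata = prepared, type = "response")
```

Keeping a separate indicator for the 999 sentinel is one way to stop a model from reading "not previously contacted" as an extremely long gap; other encodings are equally defensible.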
Model Evaluation
We will evaluate models using Area under ROC (AUC).
AUC is commonly used to compare how well classifiers rank positive cases above negative ones. The maximum value that can be achieved is 1 (a perfect classifier). An AUC of 0.5 means the model performs no better than a random classifier, and an AUC below 0.5 means it performs worse than a random one. You can see the details of the grading evaluation in the grading tab.
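To make the metric concrete, here is a minimal base-R sketch that computes AUC through the rank-sum (Mann-Whitney) identity: AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The function name `auc_rank` and the toy vectors are illustrative assumptions; the grader's exact implementation is not specified here.

```r
# Hypothetical helper: AUC from average ranks of the predicted scores.
auc_rank <- function(y, score) {
  stopifnot(length(y) == length(score), all(y %in% c(0, 1)))
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(score)  # average ranks, so tied scores count as one half
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

y <- c(0, 0, 1, 1)
auc_perfect  <- auc_rank(y, c(0.1, 0.2, 0.8, 0.9))  # 1: every positive outranks every negative
auc_reversed <- auc_rank(y, c(0.9, 0.8, 0.2, 0.1))  # 0: ranking is fully inverted
auc_constant <- auc_rank(y, c(0.5, 0.5, 0.5, 0.5))  # 0.5: scores carry no ranking information
```

Because only the ordering of the scores matters, any monotone rescaling of the predicted probabilities leaves the AUC unchanged.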
Submission for Models
Submission files must be .csv files. Every customer in the dataset has a unique customer ID in the Id column, which you can obtain from the test.csv file.
The file should contain a header and have the following format:
Id, outcome
103024, \hat{y}
where outcome is the predicted probability of being class 1 (opened savings account) and id is the customer ID. You can combine your prediction with the test set id values, for example, using the command:
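# Name the columns to match the required header, and skip row names
submission <- data.frame(Id = test$id, outcome = my.prediction)
write.csv(submission, file = "submission.csv", row.names = FALSE)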
Kaggle will match each prediction to its customer by id. This way, Kaggle can compute the score correctly even if you change the order of the test set. There is a submission example in the data section.
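As a quick check that row order does not matter, the sketch below writes a shuffled submission to a temporary file, reads it back, and confirms that aligning on the id recovers the original predictions. The ids and probabilities are invented, and aligning with `match()` is only an assumption about how an id-based grader could work.

```r
# Invented ids and probabilities, for illustration only.
test_ids      <- c(103024, 103025, 103026, 103027)
my.prediction <- c(0.12, 0.87, 0.45, 0.03)

submission <- data.frame(Id = test_ids, outcome = my.prediction)

# Shuffle the rows to mimic a reordered test set.
set.seed(1)
shuffled <- submission[sample(nrow(submission)), ]

# row.names = FALSE keeps the file to exactly two columns: Id and outcome.
path <- tempfile(fileext = ".csv")
write.csv(shuffled, file = path, row.names = FALSE)

# Read the file back and realign the rows by id.
read_back <- read.csv(path)
aligned   <- read_back$outcome[match(test_ids, read_back$Id)]

# Basic sanity checks before uploading.
stopifnot(identical(names(read_back), c("Id", "outcome")))
stopifnot(!anyDuplicated(read_back$Id))
stopifnot(all(read_back$outcome >= 0 & read_back$outcome <= 1))
```

The same three checks (header, unique ids, probabilities within [0, 1]) can be run on a real submission.csv before uploading.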