//using R

STAT 2450 Assignment 7 (32 points)

Problem : Surviving the Titanic (32 points)

Load the libraries

library("rpart")
library("tree")
library(ggplot2)
library(randomForest)

## randomForest 4.7-1

## Type rfNews() to see new features/changes/bug fixes.

##
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
##
##     margin

Load the data

mytrain=read.csv("https://mathstat.dal.ca/~fullsack/DATA/titanictrain.csv")
mytest=read.csv("https://mathstat.dal.ca/~fullsack/DATA/titanictest.csv")
mytitanic=rbind(mytest,mytrain)
nrec=nrow(mytitanic)

You will be using the column 'Survived' as the outcome in our models. This should be treated as a factor. All other columns are admissible as predictors of this outcome.

HINT-1: You can use the following template to split the data into folds, e.g. for cross-validation.

Randomly shuffle your data

yourData

Create 10 pre-folds of equal size

myfolds

use these pre-folds for cross-validation

for(i in 1:10){ # loop over each of 10 folds
# recover the indexes of fold i and define the indexes of the test set
testIndexes
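The right-hand sides of the template did not survive in this copy. A minimal sketch of one common way to complete it, assuming that `cut()` on the row positions is an acceptable way to form the folds. The data frame `yourData` below is a made-up stand-in, and `testData`, `trainData` and `foldSizes` are illustrative names that do not come from the assignment:

```r
# Hypothetical stand-in data: 50 rows, two columns
set.seed(1)
yourData <- data.frame(id = 1:50, x = rnorm(50))

# Randomly shuffle the rows
yourData <- yourData[sample(nrow(yourData)), ]

# Assign each row position to one of 10 folds of equal size
myfolds <- cut(seq_len(nrow(yourData)), breaks = 10, labels = FALSE)

foldSizes <- integer(10)
for (i in 1:10) {
  # Row positions belonging to fold i form the test set
  testIndexes <- which(myfolds == i)
  testData  <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  foldSizes[i] <- nrow(testData)
  # ... fit a model on trainData and evaluate it on testData here ...
}
print(foldSizes)
```

Because the rows are shuffled before the fold labels are attached, each fold is a random subset even though the labels themselves run in order.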

HINT-2: Use the following template to split data into a train and a test set of roughly the same size

set.seed(44182) # or use the recommended seed
trainindex=sample(1:nrec,nrec/2,replace=F)
mytrain=mydata[trainindex,] # training set
mytest=mydata[-trainindex,] # testing set = complementary subset of mydata

  1. Split 'mytitanic' into 5 pre-folds of equal size, stored in a variable called 'myfolds'

(2 points)

set.seed(2255) # shuffle mytitanic

  2. Use pre-fold number 3 to define a testing and a training set named 'mytest' and 'mytrain'

(2 points)

i=3 # fold number to use
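A minimal sketch of how the pre-folds and the fold-3 split could fit together, following the pattern of HINT-1. The small `mytitanic` data frame below is a hypothetical stand-in for the real data (which has more columns), and `testIndexes` is an illustrative name:

```r
# Hypothetical stand-in for 'mytitanic' (the real data comes from the CSV files)
set.seed(1)
mytitanic <- data.frame(Survived = factor(rbinom(100, 1, 0.4)),
                        Fare = round(rexp(100, 1 / 30), 2))
nrec <- nrow(mytitanic)

# Shuffle the rows, then label row positions with 5 equal-size pre-folds
set.seed(2255)
mytitanic <- mytitanic[sample(nrec), ]
myfolds <- cut(seq_len(nrec), breaks = 5, labels = FALSE)

i <- 3  # fold number to use
testIndexes <- which(myfolds == i)
mytest  <- mytitanic[testIndexes, ]   # fold 3 is held out for testing
mytrain <- mytitanic[-testIndexes, ]  # the other four folds are used for training
print(c(test = nrow(mytest), train = nrow(mytrain)))
```

With 5 folds, the test set holds about one fifth of the records and the training set the remaining four fifths.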

  3. Fit a Random Forest model to the 'mytrain' dataset. Use the column 'Survived' as a factor outcome. Require importance to be true and set the random seed to 523. (This is the 'trained model'). (2 points)

# Fitting Random Forest Classification to the training set 'mytrain'
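A minimal sketch of such a fit with the `randomForest` package. The training data here is simulated purely for illustration (the survival rule is made up), and the object name `rf` is an assumption rather than a name required by the assignment:

```r
library(randomForest)

# Hypothetical stand-in training set; the real 'mytrain' comes from the fold split
set.seed(1)
n <- 200
Sex <- factor(sample(c("female", "male"), n, replace = TRUE))
Fare <- round(rexp(n, 1 / 30), 2)
# Made-up rule: women and high-fare passengers survive more often
p <- ifelse(Sex == "female", 0.75, 0.2) + ifelse(Fare > 30, 0.1, 0)
mytrain <- data.frame(Survived = factor(rbinom(n, 1, p)), Sex = Sex, Fare = Fare)

# Fit the forest with Survived as a factor outcome and importance enabled
set.seed(523)
rf <- randomForest(Survived ~ ., data = mytrain, importance = TRUE)
print(rf)

# err.rate has one row per tree: the OOB error plus one column per class
tail(rf$err.rate, 1)
```

Since 'Survived' is a factor, `randomForest()` runs in classification mode; a numeric 0/1 outcome would be treated as regression instead. The fitted object can then be passed to `plot()`, `importance()` and `varImpPlot()`.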

  4. Plot the trained model results.
  • Has the OOB error rate roughly equilibrated with 50 trees?
  • Has the OOB error rate roughly equilibrated with 500 trees?
  • What is the stationary value of the OOB error rate?
  • Which of death or survival has the smaller prediction error? (4 points)

#

  5. Calculate the predictions on 'mytest', the misclassification error and the prediction accuracy. (2 points)

# Predicting survival on mytest
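The misclassification error and the accuracy both follow from a confusion table of predicted against actual labels. A minimal sketch in base R, where the two label vectors are hypothetical stand-ins for `mytest$Survived` and for the output of `predict()` on the trained model:

```r
# Hypothetical actual and predicted labels (stand-ins, not real results)
actual    <- factor(c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 0, 1, 0, 1, 0, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are the true outcomes
confusion <- table(predicted = predicted, actual = actual)
print(confusion)

# Correct predictions sit on the diagonal of the table
accuracy <- sum(diag(confusion)) / sum(confusion)
misclassError <- 1 - accuracy
print(c(accuracy = accuracy, misclassError = misclassError))
```

In this toy example 8 of the 10 labels agree, so the accuracy is 0.8 and the misclassification error is 0.2.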

  6. Print and plot the importance of predictors in the trained model. (2 points)

Now you are going to have a more direct look at predictors for the records in 'mytest'.

Tabulate the chances of survival by the column 'Title'. What do you conclude? (2 points) Which other predictor would have given you the same information? (1 point)

Are the predictors independent? (1 point)

What is the median fare of passengers? (1 point) Hint: use the column 'Fare'

Tabulate the survival according to the binary variable mytest$Fare

table(mytest$Fare

##
##          0  1
##   FALSE 43 46
##   TRUE  67 22
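The comparison inside the `table()` call is cut off in this copy. A minimal sketch of the kind of tabulation involved, assuming the binary variable marks fares below the median. That direction, the small `mytest` data frame and the names `medianFare`, `lowFare` and `fareTable` are all assumptions made for illustration:

```r
# Hypothetical stand-in for 'mytest' (made-up fares and outcomes)
mytest <- data.frame(
  Survived = c(0, 1, 0, 1, 0, 0, 1, 1),
  Fare     = c(7.25, 71.28, 8.05, 53.10, 26.00, 7.90, 13.00, 30.00)
)

medianFare <- median(mytest$Fare)

# Assumed binary split: TRUE when the fare is below the median
lowFare <- mytest$Fare < medianFare
fareTable <- table(lowFare, mytest$Survived)
print(fareTable)

# Row-wise proportions give the survival rate within each fare group
print(prop.table(fareTable, margin = 1))
```

Reading the table row by row, rather than by raw counts alone, makes the two fare groups comparable even when they differ in size.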

rm(mytrain,mytest) #mytrain

  1. Complete the code of the following function, which returns a vector of classification accuracies for