Learning from Disaster – The Random Forest Approach.


Kaggle update:

I’m up 1,311 spots from last week’s submission. Yay!

[Screenshot: Kaggle leaderboard position]

Having tried logistic regression the first time around, I moved on to decision trees and KNN. But unfortunately, those models performed horribly and had to be scrapped.

Random Forest seemed to be the buzzword around the Kaggle forums, so naturally I had to try it out next. I took a couple of days to read up on it and worked through a few examples on my own before taking another stab at the Titanic dataset.

The ‘caret’ package is a beauty. It also seems to be the most widely used package for supervised learning. I cannot get over how simple and consistent it makes predictive modelling. So far I have been able to do everything from data splitting, to data standardization, to model building, to model tuning – all using one package. And I am still discovering all that it has to offer. Pretty amazing stuff.

I will give you a super quick walk-through of how I applied the random forest algorithm and then go enjoy whatever’s left of my Sunday.

 

Read In The Data:

dat<-read.csv("C:/Users/Amita/Downloads/train (1).csv",header=T,sep=",",
     na.strings = c(""))
test <- read.csv("C:/Users/Amita/Downloads/test (1).csv",header=T,sep=",",
     na.strings = c(""))
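
The post doesn’t do this, but a quick structural check right after loading is a cheap sanity test:

dim(dat)    # the Kaggle training set should be 891 rows by 12 columns
str(dat)    # column names and types
head(dat)   # first few rows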

Check For Missing Values:

sapply(dat,function(x){sum(is.na(x))}) 

sapply(test,function(x){sum(is.na(x))})

[Screenshot: missing-value counts per column for the training and test sets]

The variable ‘Cabin’ has the most missing values and is quite beyond repair – so we’ll drop it. I also really don’t think ‘Name’ and ‘Ticket’ could possibly have any bearing on the odds of surviving, so we’ll drop those as well. (So reckless! :D)

‘Age’ has quite a few missing values too, but I have a hunch we’ll need it, so we’ll fill in the missing values there.

 

# fill missing Age values with the column mean
dat$Age[is.na(dat$Age)] <- mean(dat$Age, na.rm = TRUE)
dat <- dat[, -c(4, 9, 11)]    # drop Name, Ticket, Cabin

test$Age[is.na(test$Age)] <- mean(test$Age, na.rm = TRUE)
test <- test[, -c(3, 8, 10)]  # drop Name, Ticket, Cabin
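
A small aside of my own: the index-based drops above work, but dropping by column name is a little less brittle if the column order ever changes. An equivalent, name-based version would be:

dat  <- dat[,  !(names(dat)  %in% c("Name", "Ticket", "Cabin"))]
test <- test[, !(names(test) %in% c("Name", "Ticket", "Cabin"))]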

 

Next, we’ll split the complete training dataset ‘dat’ into two sub-datasets – one for building the model and one for testing it. Let’s go for a 60:40 split.

set.seed(50)
n<- nrow(dat)
shuffled <- dat[sample(n),]
traindat <- shuffled[1:round(0.6*n),]
testdat<- shuffled[(round(0.6*n) + 1):n,]
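
For what it’s worth, caret has its own helper for this kind of split. A minimal sketch (it needs library(caret), which we load in a moment); unlike a plain shuffle, it keeps the proportion of survivors roughly the same in both pieces:

# stratified 60:40 split, as an alternative to the manual shuffle above
idx      <- createDataPartition(dat$Survived, p = 0.6, list = FALSE)
traindat <- dat[idx, ]
testdat  <- dat[-idx, ]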

 

For this tutorial, we need to install the ‘caret’ package. I am not going to use the ‘randomForest’ package, but instead the ‘ranger’ package, which is supposed to provide a much faster implementation of the algorithm.

install.packages("caret")
install.packages("ranger")
library(caret)
library(ranger)

A little more cleaning, prompted by errors thrown along the way. Gotta remove all the remaining NAs.

sum(is.na(traindat))
sum(is.na(testdat))

# fill the missing Embarked values with 'C'
traindat$Embarked[is.na(traindat$Embarked)] <- "C"
testdat$Embarked[is.na(testdat$Embarked)]   <- "C"
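
Filling with ‘C’ is a judgment call; if you want to see how the ports are actually distributed before committing, a quick check (not in the original post) is:

# counts per embarkation port, including missing values
table(traindat$Embarked, useNA = "ifany")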

Convert the ‘Survived’ variable to a factor so that caret builds a classification instead of a regression model.

testdat$Survived<-factor(testdat$Survived)
traindat$Survived<-factor(traindat$Survived)

 

Build The Model:

model <- train(
 Survived ~.,
 tuneLength = 50,
 data = traindat, method ="ranger",
 trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE)
)

As you can see, we are doing a bunch of things in one statement.

The model being trained uses ‘Survived’ as the response variable and all the others as predictors, and the input dataset is ‘traindat’. The tuneLength argument to caret::train() tells train how many candidate models to explore along its default tuning grid. A higher tuneLength means more hyperparameter values get evaluated, which can give better results, but it also takes longer to run. caret supports many types of cross-validation; you specify the type and the number of folds with the trainControl() function, which you pass to the trControl argument of train(). Here we are asking for 5-fold cross-validation. verboseIter = TRUE just shows the progress of the algorithm as it runs.
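
If you prefer to control the grid yourself rather than rely on tuneLength, train() also accepts a tuneGrid argument. This is only a sketch, and the exact column names depend on your caret version; recent versions tune mtry, splitrule and min.node.size for the ‘ranger’ method:

# explicit hyperparameter grid instead of tuneLength (sketch)
myGrid <- expand.grid(
  mtry          = c(2, 3, 5, 7),
  splitrule     = "gini",
  min.node.size = c(1, 5, 10)
)

model_grid <- train(
  Survived ~ .,
  data = traindat,
  method = "ranger",
  tuneGrid = myGrid,
  trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE)
)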

[Screenshot: caret tuning results, average accuracy for each mtry value]

The table shows the different values of mtry along with their corresponding average accuracies. caret automatically picks the value of the hyperparameter ‘mtry’ that was most accurate under cross-validation (mtry = 5 in our case).

We can also plot the model to visually inspect the accuracies of the various mtry values; mtry = 5 has the highest average accuracy, at 81.6%.
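
In code, that inspection is just a few calls on the fitted train object:

model$results   # cross-validated accuracy for every mtry value tried
model$bestTune  # the hyperparameter combination caret selected
plot(model)     # accuracy plotted against mtry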

 

[Plot: cross-validated accuracy vs. mtry]

Make Predictions on ‘testdat’:

Let’s apply the model to predict survival on our held-out test dataset, ‘testdat’, which is 40% of the full training dataset.

pred <- predict(model,newdata=testdat[,-2])

#create confusion matrix
conf<- table(testdat$Survived,pred)

#compute accuracy
accuracy<- sum(diag(conf))/sum(conf)
accuracy

The accuracy comes out to 80.8%, pretty close to what we saw above.
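
If you want more than raw accuracy, caret’s confusionMatrix() reports sensitivity, specificity and kappa along with it in a single call:

# richer summary of the same predictions (optional)
confusionMatrix(pred, testdat$Survived)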

 

And finally,

Make Predictions on the Kaggle test dataset, ‘test’.

test$Survived <- predict(model, newdata = test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submissionrf.csv", row.names = FALSE)
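
One caveat, purely an assumption on my part since the post doesn’t mention it: the Kaggle test set usually contains a missing Fare value, and predict() can complain about NAs in newdata. If that happens, filling it the same way we handled Age, before the predict() call above, gets around it:

# assumption: impute the missing Fare in the Kaggle test set with the mean,
# mirroring the Age treatment, so predict() isn't tripped up by NAs
test$Fare[is.na(test$Fare)] <- mean(test$Fare, na.rm = TRUE)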

 

 

Get Result:

[Screenshot: Kaggle submission score]

77.5%, as opposed to last week’s score of 75.86%.

Not bad.

 

We’ll make it better next week.

Meanwhile, please feel free to leave any pointers for me in the comments section below. I am always game for guidance and feedback!

 

P.S. I have been really bad about uploading code to GitHub – but I’ll get around to it in a day or two and put up a link here – I promise!

 
