Top 16%: How to plateau using the ‘Feature Engineering’ approach.


Ladies and Gents! I am now placed in the top 16% of the leaderboard rankings with a score of 0.80383.

(Screenshot: Kaggle leaderboard position, 2016-10-14)

 

I have also plateaued horribly. No matter what other features I try to ‘engineer’, my score just won’t budge. It gets worse, sure, but never better. Bummer.

Everything pretty much remains the same as the previous post in terms of data reading and cleaning (except that ‘Name’ and ‘Cabin’ stay in this time). In this post, let’s look at what I did differently.

This attempt was a departure from applying the algorithms as-is and hoping for a better prediction (admit it, we’re all guilty). This time I incorporated the ‘human’ element – I even tried to recall scenes from the movie for that extra insight (still unfair how Rose hogged the entire wooden plank).

Some of the theories I considered (there’s a quick way to sanity-check these against the data; see the sketch after this list):

  • Women and children were given priority and evacuated first.
  • Mothers would look out for their children.
  • First class passengers were given priority over those in 2nd or 3rd class.
  • Women and children were probably given priority over males in every class.
  • Families travelling together probably had a better chance of survival since they’d try to stick together and help each other out.
  • Older people would have trouble evacuating and hence, would have lower odds of survival.
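
Before engineering anything, a few quick cross-tabs can confirm (or kill) these theories. A minimal sketch, assuming ‘dat’ is the cleaned training set from the previous post (with ‘Name’ and ‘Cabin’ retained this time):

# survival proportions by sex: were women prioritised?
prop.table(table(dat$Sex, dat$Survived), margin = 1)

# survival proportions by passenger class: did 1st class fare better?
prop.table(table(dat$Pclass, dat$Survived), margin = 1)

# survival proportions for children (Age <= 16) vs adults
prop.table(table(dat$Age <= 16, dat$Survived), margin = 1)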

 

Also, this time around, I played around with the ‘Name’ and ‘Cabin’ variables and that made a huge difference!

So what you need to do to plateau with an 80.4% prediction is as follows:

Identify the unique titles and create a new variable unique:

# check for all the unique titles: extract the text between the comma
# and the first period in each name
unique <- gsub(".*?,\\s(.*?)\\..*$", "\\1", dat$Name)

dat$unique <- unique

# collapse the rarer titles into broader groups
dat$unique[dat$unique %in% c("Mlle", "Mme")] <- "Mlle"
dat$unique[dat$unique %in% c("Capt", "Don", "Major", "Sir")] <- "Sir"
dat$unique[dat$unique %in% c("Dona", "Lady", "the Countess", "Jonkheer")] <- "Lady"

table(dat$unique) # check the distribution of the different titles

# convert the passenger's title to a factor
dat$unique <- factor(dat$unique)
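
To see what that regex is doing, here is its effect on a couple of sample names from the training data (the title is whatever sits between the comma and the first period):

gsub(".*?,\\s(.*?)\\..*$", "\\1", "Braund, Mr. Owen Harris") # "Mr"
gsub(".*?,\\s(.*?)\\..*$", "\\1", "Heikkinen, Miss. Laina")  # "Miss"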

Identify the children and create a new variable ischild:

dat$ischild <- factor(ifelse(dat$Age<=16,"Child","Adult"))

Identify the mothers and create a new variable isMother:

dat$isMother <- "Not Mother"
dat$isMother[dat$Sex == "female" & dat$Parch > 0 & dat$unique != "Miss"] <- "Mother"
dat$isMother <- factor(dat$isMother)

Uniquely identify the cabins (this variable leads to some overfitting, as the Kaggle scores below show):

# keep just the deck letter
dat$Cabin <- substr(dat$Cabin, 1, 1)
# lump the rare decks and the missing values together as "X"
dat$Cabin[dat$Cabin %in% c("F", "G", "T", NA)] <- "X"
dat$Cabin <- factor(dat$Cabin)
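
A quick cross-tab shows why this feature is thin: most passengers have no recorded cabin at all and land in the "X" bucket, which is a plausible reason the variable overfits. A minimal check:

# passenger counts per (recoded) deck letter, split by survival
table(dat$Cabin, dat$Survived)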

Compute the family size and create a new variable familysize:

dat$familysize <- dat$SibSp + dat$Parch + 1 # siblings/spouses + parents/children + the passenger

Use the ‘familysize’ variable and the surname of the passenger to designate the family size as “Small” or “Big” in the new variable unit:

pass_names <- dat$Name

# pull out the surname (everything before the comma)
extractsurname <- function(x){
  if (grepl(".*?,\\s.*?", x)) {
    gsub("^(.*?),\\s.*?$", "\\1", x)
  } else {
    NA_character_ # fall back if a name has no comma
  }
}

surnames <- vapply(pass_names, FUN = extractsurname, FUN.VALUE = character(1), USE.NAMES = FALSE)

# e.g. "3 Johnson": the family size followed by the surname
fam <- paste(as.character(dat$familysize), surnames, sep = " ")

# parse the size back out of the string numerically; a plain character
# comparison like substr(x, 1, 2) > 2 is lexicographic and would
# misclassify two-digit family sizes such as 11
famsize <- function(x){
  size <- as.numeric(gsub("^(\\d+)\\s.*$", "\\1", x))
  if (size > 2) "Big" else "Small"
}

unit <- vapply(fam, FUN = famsize, FUN.VALUE = character(1), USE.NAMES = FALSE)
dat$unit <- factor(unit)
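
Since famsize only ever looks at the size prefix of the fam string, a simpler equivalent (if you don’t need the surname for anything else) is:

# same result without the surname detour
dat$unit <- factor(ifelse(dat$familysize > 2, "Big", "Small"))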

 

Split the ‘dat’ dataset into train and test (60:40 split) and fit the random forest model.

n<- nrow(dat)
shuffled <- dat[sample(n),]

traindat <- shuffled[1:round(0.6*n),]
testdat<- shuffled[(round(0.6*n) + 1):n,]

dim(traindat)
dim(testdat)

require(caret)
require(ranger)
model <- train(
 Survived ~.,
 tuneLength = 50,
 data = traindat, method ="ranger",
 trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE)
)

pred <- predict(model, newdata = testdat[,-2]) # drop the Survived column (column 2)
conf <- table(testdat$Survived, pred)          # confusion matrix
accuracy <- sum(diag(conf))/sum(conf)
accuracy
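
caret can also produce these numbers (plus sensitivity, specificity and more) in a single call, which saves building the table by hand:

# accuracy plus per-class statistics in one shot
confusionMatrix(data = pred, reference = testdat$Survived)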

Using the model to predict survival (minus Cabin) gives us 83.14% accuracy on our test data ‘testdat’ and 80.34% on Kaggle.

Using the model to predict survival (with Cabin) gives us 83.71% accuracy on our test data ‘testdat’, which drops to around 79% on Kaggle.

Although I still haven’t tinkered with ‘Fare’, ‘Ticket’, and ‘Embarked’ (the urge to do so is strong), I think I’ll leave them alone for the time being – but I will be revisiting for that elusive ‘eureka’ moment!

You can find the code here.

 

Learning from Disaster – The Random Forest Approach.


Kaggle update:

I’m up 1,311 spots from last week’s submission. Yay!

(Screenshot: Kaggle leaderboard position, 2016-10-09)

Having tried logistic regression the first time around, I moved on to decision trees and KNN. But unfortunately, those models performed horribly and had to be scrapped.

Random Forest seemed to be the buzzword around the Kaggle forums, so I obviously had to try it out next. I took a couple of days to read up on it and worked out a few examples on my own before taking another stab at the Titanic dataset.

The ‘caret’ package is a beauty. It seems to be the most widely used package for supervised learning too. I cannot get over how simple and consistent it makes predictive modelling. So far I have been able to do everything from data splitting, to data standardization, to model building, to model tuning – all using one package. And I am still discovering all that it has to offer. Pretty amazing stuff.
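
As a small taste of that breadth, here is a minimal sketch of a split-and-standardize workflow done entirely with caret helpers (createDataPartition and preProcess are caret functions; the 80/20 ratio is just for illustration, assuming a data frame ‘dat’ with a factor outcome ‘Survived’):

library(caret)

# stratified 80/20 split on the outcome
idx <- createDataPartition(dat$Survived, p = 0.8, list = FALSE)
train80 <- dat[idx, ]
test20  <- dat[-idx, ]

# centre and scale the numeric columns, using statistics from the training split only
pp <- preProcess(train80, method = c("center", "scale"))
train80_std <- predict(pp, train80)
test20_std  <- predict(pp, test20)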

I will give you a super quick walk-through of how I applied the random forest algorithm and then go enjoy whatever’s left of my Sunday.

 

Read In The Data:

dat<-read.csv("C:/Users/Amita/Downloads/train (1).csv",header=T,sep=",",
     na.strings = c(""))
test <- read.csv("C:/Users/Amita/Downloads/test (1).csv",header=T,sep=",",
     na.strings = c(""))

Check For Missing Values:

sapply(dat,function(x){sum(is.na(x))}) 

sapply(test,function(x){sum(is.na(x))})

(Screenshot: counts of missing values per column in ‘dat’ and ‘test’)

The variable ‘Cabin’ has the most missing values and is quite beyond repair – so we’ll drop it. Also, I really don’t think ‘Name’ and ‘Ticket’ could possibly have any relation to the odds of surviving, so we’ll drop those as well. (So reckless! :D)

‘Age’ has quite a few missing values as well, but I have a hunch we’ll need that one. So we need to replace the missing values there.

 

# impute the missing ages with the mean age, then drop Name, Ticket and Cabin
dat$Age[is.na(dat$Age)] <- mean(dat$Age, na.rm = TRUE)
dat <- dat[, -c(4, 9, 11)]   # columns 4, 9, 11: Name, Ticket, Cabin

test$Age[is.na(test$Age)] <- mean(test$Age, na.rm = TRUE)
test <- test[, -c(3, 8, 10)] # columns 3, 8, 10: Name, Ticket, Cabin

 

Next, we’ll split the complete training dataset ‘dat’ into two sub-datasets which we shall use for testing our model. Let’s go for a 60:40 split.

set.seed(50)
n<- nrow(dat)
shuffled <- dat[sample(n),]
traindat <- shuffled[1:round(0.6*n),]
testdat<- shuffled[(round(0.6*n) + 1):n,]

 

For this tutorial, we need to install the ‘caret’ package. I am not going to use the ‘randomForest’ package, but instead the ‘ranger’ package, which is supposed to provide a much faster implementation of the algorithm.

install.packages("caret")
install.packages("ranger")
library(caret)
library(ranger)

A little more cleaning, prompted by errors thrown along the way. Gotta remove all the NAs.

sum(is.na(traindat))
sum(is.na(testdat))

# fill the missing Embarked values with "C"
traindat$Embarked[is.na(traindat$Embarked)] <- "C"
testdat$Embarked[is.na(testdat$Embarked)] <- "C"

Convert the ‘Survived’ variable to a factor so that caret builds a classification instead of a regression model.

testdat$Survived<-factor(testdat$Survived)
traindat$Survived<-factor(traindat$Survived)

 

Build The Model:

model <- train(
 Survived ~.,
 tuneLength = 50,
 data = traindat, method ="ranger",
 trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE)
)

As you can see, we are doing a bunch of things in one statement.

The model being trained uses ‘Survived’ as the response variable and all the others as predictors, with ‘traindat’ as the input dataset. The tuneLength argument to caret::train() tells train how many models to explore along its default tuning grid. A higher value of tuneLength means more accurate results, since more candidate models are evaluated, but it also means the search takes longer. caret supports many types of cross-validation; you specify the type and the number of folds with the trainControl() function, which you pass to the trControl argument of train(). In our statement, we are specifying 5-fold cross-validation. verboseIter = TRUE just shows the progress of the algorithm.
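
If you’d rather control the grid yourself, train() accepts an explicit tuneGrid in place of tuneLength. A minimal sketch (note: recent versions of caret expect the ranger method’s grid to contain mtry, splitrule and min.node.size; the values below are illustrative):

# hand-picked hyperparameter grid for method = "ranger"
grid <- expand.grid(
  mtry          = c(2, 4, 6),
  splitrule     = "gini",
  min.node.size = c(1, 5, 10)
)

model_grid <- train(
  Survived ~ .,
  data = traindat,
  method = "ranger",
  tuneGrid = grid,
  trControl = trainControl(method = "cv", number = 5)
)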

(Screenshot: cross-validation output – average accuracy for each candidate mtry value)

The table shows the different values of mtry along with their corresponding average accuracies. caret automatically picks the value of the hyperparameter ‘mtry’ that was the most accurate under cross-validation (mtry = 5 in our case).

We can also plot the model to visually inspect the accuracies of the various mtry values; mtry = 5 has the maximum average accuracy of 81.6%.
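
Producing that plot is a one-liner on the fitted train object:

plot(model) # cross-validated accuracy as a function of mtry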

 

(Plot: cross-validated accuracy versus mtry)

Make Predictions on ‘testdat’:

Let’s apply the model to predict survival on our test dataset, ‘testdat’, which is 40% of our whole training dataset.

pred <- predict(model,newdata=testdat[,-2])

#create confusion matrix
conf<- table(testdat$Survived,pred)

#compute accuracy
accuracy<- sum(diag(conf))/sum(conf)
accuracy

The accuracy is returned at 80.8%. Pretty close to what we saw above.

 

And finally,

Make Predictions on the Kaggle test dataset, ‘test’.

test$Survived <- predict(model, newdata = test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submissionrf.csv", row.names = FALSE)
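
One caveat: the sapply() check earlier flags a single missing Fare in the Kaggle test set, and predict() may error or silently drop that row when it hits the NA. If that bites, impute the value before predicting; median imputation is one reasonable choice:

# fill the lone missing Fare before predicting
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)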

 

 

Get Result:

(Screenshot: Kaggle submission result, 77.5%)

77.5% as opposed to last week’s score of 75.86%.

Not bad.

 

We’ll make it better next week.

Meanwhile, please feel free to leave any pointers for me in the comments section below. I am always game for guidance and feedback!

 

P.S. I have been really bad about uploading code to GitHub – but I’ll get around to it in a day or two and put up a link here – I promise!