Top 16% : How to plateau using the ‘Feature Engineering’ approach.


Ladies and Gents! I am now placed in the top 16% of the leaderboard rankings with a score of .80383.

 

I have also plateaued horribly. No matter what other features I try to ‘engineer’, my score just won’t budge. It gets worse, sure, but never better. Bummer.

Everything pretty much remains the same as the previous post in terms of data reading and cleaning. In this post, let’s look at what I did differently.

This attempt was a departure from applying the algorithms as is and hoping for a better prediction. (Admit it. We’re all guilty.) This time I incorporated the ‘human’ element. I even tried to recall scenes from the movie for that extra insight (still unfair how Rose hogged the entire wooden plank).

Some of the theories I considered:

  • Women and children were given priority and evacuated first.
  • Mothers would look out for their children.
  • First class passengers were given priority over those in 2nd or 3rd class.
  • Women and children were probably given priority over males in every class.
  • Families travelling together probably had a better chance of survival since they’d try to stick together and help each other out.
  • Older people would have trouble evacuating and hence, would have lower odds of survival.
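Theories like these can be sanity-checked against raw survival rates before engineering anything. A minimal sketch, assuming a data frame with the standard Kaggle columns ‘Survived’, ‘Sex’ and ‘Pclass’; the six rows below are made up purely so the snippet runs on its own, and the names `toy`, `rate_by_sex` and `rate_by_sex_class` are mine.

```r
# Toy stand-in for 'dat' (made-up rows, NOT real passenger records)
toy <- data.frame(
  Survived = c(1, 1, 0, 0, 1, 0),
  Sex      = c("female", "female", "male", "male", "female", "male"),
  Pclass   = c(1, 3, 3, 2, 2, 1)
)

# Share of survivors within each sex
rate_by_sex <- tapply(toy$Survived, toy$Sex, mean)

# Share of survivors for each sex/class combination
rate_by_sex_class <- tapply(toy$Survived, list(toy$Sex, toy$Pclass), mean)

rate_by_sex
rate_by_sex_class
```

On the real data, swap `toy` for `dat`. ‘Survived’ needs to be numeric 0/1 (not a factor) for `mean` to work.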

 

Also, this time around, I played around with the ‘Name’ and ‘Cabin’ variables and that made a huge difference!

So what you need to do to plateau with an 80.4% prediction is as follows:

Identify the unique titles and create a new variable unique:

# check for all the unique titles 

unique <- gsub(".*?,\\s(.*?)\\..*$","\\1",dat$Name)

dat$unique<- unique
dat$unique[dat$unique %in% c("Mlle","Mme")] <-"Mlle"
dat$unique[dat$unique %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
dat$unique[dat$unique %in% c('Dona', 'Lady', 'the Countess','Jonkheer')] <- 'Lady'

table(dat$unique) # check the distribution of different titles

# passenger’s title 
dat$unique <- factor(dat$unique)
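To see what the pattern pulls out, here is a small sketch on three names written in the "Surname, Title. Given names" layout that the Kaggle file uses. The vector names `sample_names` and `titles` are mine, for illustration only.

```r
# Three names in the "Surname, Title. Given names" layout
sample_names <- c("Braund, Mr. Owen Harris",
                  "Heikkinen, Miss. Laina",
                  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# Same pattern as above: keep whatever sits between ", " and the first "."
titles <- gsub(".*?,\\s(.*?)\\..*$", "\\1", sample_names)

# Fold the rare title into a broader group, as done for dat$unique
titles[titles %in% c("Dona", "Lady", "the Countess", "Jonkheer")] <- "Lady"
titles
```

The lazy `.*?` matters here: it stops at the first full stop, so a later full stop in the given names does not swallow the title.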

Identify the children and create a new variable ischild:

dat$ischild <- factor(ifelse(dat$Age<=16,"Child","Adult"))
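One caveat worth checking: `ifelse()` returns NA wherever ‘Age’ is missing, so this variable is only complete if the ages were filled in during cleaning. A small sketch with made-up ages (`ages` and `child_flag` are illustrative names):

```r
ages <- c(4, 16, 30, NA)   # made-up ages, including one missing value

child_flag <- ifelse(ages <= 16, "Child", "Adult")
child_flag                  # the missing age stays NA

# Count how many rows would end up without a label
sum(is.na(child_flag))
```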

Identify the mothers and create a new variable isMother:

dat$isMother<- "Not Mother"
# use the title column on dat rather than the free-standing 'unique' vector
dat$isMother[dat$Sex=="female" & dat$Parch>0 & dat$unique!="Miss"] <- "Mother"
dat$isMother<- factor(dat$isMother)

Uniquely identify the Cabins: This variable leads to somewhat of an overfit.

dat$Cabin <- substr(dat$Cabin,1,1)
# "" is included in case missing cabins were read in as empty strings rather than NA
dat$Cabin[dat$Cabin %in% c("F","G","T","",NA)] <- "X"
dat$Cabin<- factor(dat$Cabin)
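To see what this collapsing does, here is a sketch on a handful of cabin strings. The values are illustrative, and treating the empty string as a missing cabin is my assumption about how the file was read in.

```r
cabins <- c("C85", "", "E46", "G6", NA)   # illustrative values

deck <- substr(cabins, 1, 1)              # first letter = deck
deck[is.na(deck) | deck == ""] <- "X"     # unknown cabin
deck[deck %in% c("F", "G", "T")] <- "X"   # sparse decks folded into "X"
deck <- factor(deck)
table(deck)
```

Folding the rare decks into "X" keeps the factor from having levels with only a few passengers each, which is one plausible source of the overfit mentioned above.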

Compute the family size and create a new variable familysize :

dat$familysize <- dat$SibSp + dat$Parch + 1

Use the ‘familysize’ variable and the surname of the passenger to designate the family size as “Small” or “Big” in the new variable unit:

pass_names <- dat$Name
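extractsurname <- function(x){
  # names look like "Surname, Title. Given names"; keep the part before the comma
  if(grepl(",\\s",x)){
    sub("^(.*?),\\s.*$","\\1",x)
  }else{
    NA_character_ # no comma found; vapply needs a character(1) either way
  }
}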

surnames <- vapply(pass_names, FUN=extractsurname,FUN.VALUE = character(1),USE.NAMES = F)
fam<-paste(as.character(dat$familysize),surnames,sep=" ")


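famsize<- function(x){
 # x looks like "3 Smith"; pull out the leading number and compare it
 # numerically (comparing the text with 2 would be done as a string comparison)
 size <- as.numeric(sub("\\s.*$","",x))
 if(size > 2){
  "Big"
 }else{
  "Small"
 }
}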

unit <- vapply(fam, FUN=famsize,FUN.VALUE = character(1),USE.NAMES = F)
dat$unit <- unit
dat$unit <- factor(dat$unit)
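A general R gotcha that matters for rules like this one: when a character string is compared with a number, R converts the number to text and compares the two as strings, so a two-digit family size sorts before "2". The sketch below shows this, along with a vectorised alternative that works directly from a numeric family size. The values are made up, and this is one possible shortcut rather than the approach used above.

```r
# Character vs number: the number is coerced to "2" and compared as text
"11" > 2     # FALSE, because "1" sorts before "2"
11 > 2       # TRUE

# Vectorised labelling straight from a numeric family size (made-up values)
familysize <- c(1, 2, 3, 11)
unit <- factor(ifelse(familysize > 2, "Big", "Small"))
unit
```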

 

Split the ‘dat’ dataset into train and test (60:40 split) and fit the random forest model.

set.seed(42) # arbitrary seed (my addition) so the shuffle is reproducible
n<- nrow(dat)
shuffled <- dat[sample(n),]

traindat <- shuffled[1:round(0.6*n),]
testdat<- shuffled[(round(0.6*n) + 1):n,]

dim(traindat)
dim(testdat)

library(caret)  # train() and trainControl()
library(ranger) # random forest backend used by method = "ranger"
model <- train(
 Survived ~.,
 tuneLength = 50,
 data = traindat, method ="ranger",
 trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE)
)

# drop the outcome by name rather than assuming it sits in column 2
pred <- predict(model,newdata=testdat[,names(testdat) != "Survived"])
conf<- table(testdat$Survived,pred)
accuracy<- sum(diag(conf))/sum(conf)
accuracy
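For readers new to the last three lines: the confusion matrix cross-tabulates the actual labels against the predicted ones, and accuracy is the share of cases that land on its diagonal. A sketch with made-up labels (not output from the model above):

```r
# Made-up labels; both factors share the same levels so the table is square
actual    <- factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))

conf <- table(actual, predicted)          # rows = truth, columns = prediction
accuracy <- sum(diag(conf)) / sum(conf)   # correct predictions / all predictions
conf
accuracy
```

Keeping identical factor levels on both sides matters: if the model never predicts one class, `table()` on plain vectors would drop that column and `diag()` would no longer line up with the correct cells.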

Using the model to predict survival (minus Cabin) gives us 83.14% accuracy on our test data ‘testdat’ and 80.34% on Kaggle.

Using the model to predict survival (with Cabin) gives us 83.71% accuracy on our test data ‘testdat’, which drops to around 79% on Kaggle.

Although I still haven’t tinkered around with ‘Fare’, ‘Ticket’, and ‘Embarked’ (the urge to do so is strong), I think I’ll leave them alone for the time being. I will be revisiting for that elusive ‘eureka’ moment, though!

You can find the code here.

 
