The non-scholarly write-up: Logistic Regression with XGBoost.


This post is a long time coming.

UPDATE: I have inched my way up to the top 13% of the Titanic competition (starting out at the ‘top’ 85%, who’d a thunk it). I love persevering. :D

Anyway.

My last attempt involved XGBoost (Extreme Gradient Boosting), which did not beat my top score; it barely scraped past 77%. That being said, I thought it deserved a dedicated post, considering I have achieved great results with the algorithm on other Kaggle competitions.

In a nutshell, it

  • is a very, very fast version of the GBM,
  • needs parameter tuning, which can get pretty frustrating (but hey, patience is a virtue!),
  • supports cross validation,
  • is equipped to help find the variable importance, and
  • is robust to outliers and noisy data.

 

Cutting to the chase.

Step 1: Load libraries

require(xgboost)
require(Matrix)

Step 2: Read the datasets

dat <- read.csv("C:/Users/Amita/Downloads/train (1).csv", header = T, sep = ",",
                na.strings = c(""))
test <- read.csv("C:/Users/Amita/Downloads/test (1).csv", header = T, sep = ",",
                 na.strings = c(""))

Step 3:  Process the datasets

This is the same process as outlined in a previous blog post.
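For completeness, here is a rough sketch of the kind of clean-up that post covers. The exact columns dropped and imputation choices below are my assumptions for illustration, not a transcript of that post.

# --- assumed clean-up, mirroring common Titanic preprocessing ---
# drop the high-cardinality text columns that would blow up the dummy encoding
dat  <- dat[,  !(names(dat)  %in% c("Name", "Ticket", "Cabin"))]
test <- test[, !(names(test) %in% c("Name", "Ticket", "Cabin"))]

# impute missing values: median Age/Fare, most common port for Embarked
dat$Age[is.na(dat$Age)]     <- median(dat$Age,  na.rm = TRUE)
test$Age[is.na(test$Age)]   <- median(test$Age, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)
dat$Embarked[is.na(dat$Embarked)] <- "S"

The remaining character columns (Sex, Embarked) get treated as factors by the dummifying step further down, so they can stay as they are.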

Step 4:  Extract the response variable column

label <- dat$Survived
dat <- dat[,-2] # remove the 'Survived' response column from the training dataset

Step 5:  Combine the training and test datasets

combi <- rbind(dat,test)

Step 6: Create a sparse matrix to ‘dummify’ the categorical variables, i.e. convert all categorical variables to binary indicator columns

One thing to remember with XGBoost is that it ONLY works with numerical data types. So data type conversion is necessary before you proceed with model building.

data_sparse <- sparse.model.matrix(~.-1, data = as.data.frame(combi))
cat("Data size: ", data_sparse@Dim[1], " x ", data_sparse@Dim[2], " \n", sep = "")

If you’re familiar with the ‘caret’ package, it has a pretty cool dummyVars function that does exactly what we did above.

# dummify the data (requires the 'caret' package)
library(caret)
dummify <- dummyVars(" ~ .", data = combi)
finaldummy <- data.frame(predict(dummify, newdata = combi))

Here, dummyVars transforms all character and factor columns (the function never transforms numeric columns) and returns the entire data set.

 

Step 7: Divide the dummified data back into train and test

dtrain <- xgb.DMatrix(data = data_sparse[1:nrow(dat), ], label = label)
dtest <- xgb.DMatrix(data = data_sparse[(nrow(dat)+1):nrow(combi), ])

Step 8: Cross Validate

In order to evaluate how much the model overfits or underfits, we compute the cross-validation error.

set.seed(12345678) # for reproducibility

cv_model <- xgb.cv(data = dtrain,
 nthread = 8,  # number of threads allocated to the execution of XGBoost
 nfold = 5,  # the original data is divided into 5 equal random samples
 nrounds = 1000000, # number of iterations
 max_depth = 6, # maximum depth of a tree
 eta = 0.05, # controls the learning rate. 0 < eta < 1
 subsample = 0.70, #subsample ratio of the training instance. 
 colsample_bytree = 0.70, #subsample ratio of columns when constructing each tree
 booster = "gbtree", # gbtree or gblinear
 eval_metric = "error", #binary classification error rate
 maximize = FALSE, #maximize=TRUE means the larger the evaluation score the better
 early_stopping_rounds = 25, # training with a validation set will 
                             # stop if the performance keeps getting worse 
                             # consecutively for k rounds.
 objective = "reg:logistic", # logistic regression
 print_every_n = 10, # output is printed every 10 iterations
 verbose = TRUE) # print the output 

Everything you need to know about the xgb.cv parameters and beyond is answered here https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
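Once the cross-validation finishes, two pieces of the returned object are worth a look. A quick sketch; best_iteration and evaluation_log are the fields the xgboost R package attaches to the xgb.cv result when early stopping is used:

# round at which early stopping judged the test error to have stopped improving;
# this is what gets reused as 'nrounds' for the final model in Step 9
cv_model$best_iteration

# per-round mean and std of the train/test error across the 5 folds (a data.table)
tail(cv_model$evaluation_log)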

Step 9: Build model

temp_model <- xgb.train(data = dtrain,
 nthread = 8,
 nrounds = cv_model$best_iteration,
 max_depth = 6,
 eta = 0.05,
 subsample = 0.70,
 colsample_bytree = 0.70,
 booster = "gbtree",
 eval_metric = "error",
 maximize = FALSE,
 objective = "reg:logistic",
 print_every_n = 10,
 verbose = TRUE,
 watchlist = list(trainrep = dtrain))

Easy reference : https://rdrr.io/cran/xgboost/man/xgb.train.html

Step 10: Predict ‘Survived’ values.

prediction <- predict(temp_model,dtest)
prediction <- ifelse(prediction>0.5,1,0)
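To turn this into a Kaggle submission, all that is left is to pair the predictions with the passenger IDs and write a CSV. A minimal sketch, assuming the test set still carries its PassengerId column (the file name is just my choice):

# build the two-column file the Titanic competition expects
submission <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submission, "xgb_submission.csv", row.names = FALSE)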

 

Step 11: Check and plot the variable importance

Certain predictors drag down the performance of the model even though it makes complete sense gut-wise to keep them there. On a couple of occasions, variable importance has helped me decide the relevance of the predictors, which positively impacted the accuracy of my model (there is a quick sketch of that after the plot below).

importance <- xgb.importance(feature_names = data_sparse@Dimnames[[2]], 
              model = temp_model) #Grab all important features
xgb.plot.importance(importance) #Plot

[Variable importance plot produced by xgb.plot.importance]
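If the plot shows features contributing next to nothing, one way to act on it is to rebuild the DMatrix with only the stronger columns and re-run the cross-validation. A minimal sketch, where the 0.01 Gain cutoff is an arbitrary choice of mine:

# keep only the columns whose Gain clears a (hypothetical) cutoff of 0.01
keep <- importance$Feature[importance$Gain > 0.01]
dtrain_trimmed <- xgb.DMatrix(data = data_sparse[1:nrow(dat), keep, drop = FALSE],
                              label = label)
dtest_trimmed  <- xgb.DMatrix(data = data_sparse[(nrow(dat)+1):nrow(combi), keep, drop = FALSE])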

 

For everything XGBoost, I frequented this page and this page. Pretty thorough resources, IMHO.

Annnd, that’s pretty much it!

Go get ’em!