This post is a long time coming.
UPDATE: I have inched my way to the top 13% of the Titanic competition (starting out at the ‘top’ 85%, who’d a thunk it. I love persevering. :D)
My last attempt involved XGBoost (Extreme Gradient Boosting), which did not beat my top score; it barely scraped past 77%. That said, I thought it deserved a dedicated post, considering I have achieved great results with the algorithm in other Kaggle competitions.
In a nutshell, it
- is a very, very fast version of the GBM (gradient boosting machine),
- needs parameter tuning, which can get pretty frustrating (but hey, patience is a virtue!),
- supports cross-validation,
- is equipped to help find variable importance, and
- is robust to outliers and noisy data.
Cutting to the chase.
Step 1: Load libraries
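The library calls themselves aren’t shown here, but based on the functions used in the rest of the post (sparse.model.matrix, dummyVars and the xgb.* family), a minimal set would look something like this:

library(Matrix)  # sparse.model.matrix
library(caret)   # dummyVars
library(xgboost) # xgb.DMatrix, xgb.cv, xgb.train, xgb.importance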
Step 2: Read the datasets
dat <- read.csv("C:/Users/Amita/Downloads/train (1).csv", header = T, sep = ",", na.strings = c(""))
test <- read.csv("C:/Users/Amita/Downloads/test (1).csv", header = T, sep = ",", na.strings = c(""))
Step 3: Process the datasets
This is the same process as outlined in a previous blog post.
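If you haven’t read that post, here is a purely hypothetical sketch of the kind of cleaning it covers (imputing missing values and dropping free-text columns); the exact steps and values are in the earlier post:

# illustrative only; the real preprocessing is described in the earlier post
clean <- function(df) {
  df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)  # impute missing ages with the median
  df$Embarked[is.na(df$Embarked)] <- "S"                 # fill missing embarkation port with the most common value
  df[, !(names(df) %in% c("Name", "Ticket", "Cabin"))]   # drop free-text columns
}
dat  <- clean(dat)
test <- clean(test)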
Step 4: Extract the response variable column
label <- dat$Survived
dat <- dat[,-2] # remove the 'Survived' response column from the training dataset
Step 5: Combine the training and test datasets
combi <- rbind(dat,test)
Step 6: Create a sparse matrix to ‘dummify’ the categorical variables, i.e. convert all categorical variables to binary
One thing to remember with XGBoost is that it ONLY works with numerical data types, so data type conversion is necessary before you proceed with model building.
data_sparse <- sparse.model.matrix(~ . - 1, data = as.data.frame(combi))
cat("Data size: ", data_sparse@Dim[1], " x ", data_sparse@Dim[2], " \n", sep = "")
If you’re familiar with the ‘caret’ package, it has a pretty cool dummyVars function that does exactly what we did above.
# dummify the data
dummify <- dummyVars(" ~ .", data = combi)
finaldummy <- data.frame(predict(dummify, newdata = combi))
Here, dummyVars transforms all character and factor columns (the function never transforms numeric columns) and returns the entire data set.
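As a quick sanity check (assuming the combined Titanic data still has the Sex column), you can peek at how one categorical variable gets expanded into indicator columns:

head(finaldummy[, grep("^Sex", names(finaldummy))]) # the two-level 'Sex' variable becomes one 0/1 column per level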
Step 7: Divide the dummified data back into train and test
dtrain <- xgb.DMatrix(data = data_sparse[1:nrow(dat), ], label = label)
dtest <- xgb.DMatrix(data = data_sparse[(nrow(dat)+1):nrow(combi), ])
Step 8: Cross Validate
In order to evaluate how much the model overfits or underfits, we compute the cross-validation error.
set.seed(12345678) # for reproducibility
cv_model <- xgb.cv(data = dtrain,
                   nthread = 8,                # number of threads allocated to the execution of XGBoost
                   nfold = 5,                  # the original data is divided into 5 equal random samples
                   nrounds = 1000000,          # number of iterations
                   max_depth = 6,              # maximum depth of a tree
                   eta = 0.05,                 # controls the learning rate; 0 < eta < 1
                   subsample = 0.70,           # subsample ratio of the training instances
                   colsample_bytree = 0.70,    # subsample ratio of columns when constructing each tree
                   booster = "gbtree",         # gbtree or gblinear
                   eval_metric = "error",      # binary classification error rate
                   maximize = FALSE,           # maximize = TRUE means the larger the evaluation score the better
                   early_stopping_rounds = 25, # training stops if performance keeps getting worse for 25 consecutive rounds
                   objective = "reg:logistic", # logistic regression
                   print_every_n = 10,         # output is printed every 10 iterations
                   verbose = TRUE)             # print the output
Everything you need to know about the xgb.cv parameters and beyond is answered here https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
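Once xgb.cv stops early, the returned object records which round was best and how the error evolved round by round; these are standard fields on the xgb.cv result:

cv_model$best_iteration       # the boosting round with the best cross-validated error
head(cv_model$evaluation_log) # per-round mean and standard deviation of train/test error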
Step 9: Build model
temp_model <- xgb.train(data = dtrain,
                        nthread = 8,
                        nrounds = cv_model$best_iteration,
                        max_depth = 6,
                        eta = 0.05,
                        subsample = 0.70,
                        colsample_bytree = 0.70,
                        booster = "gbtree",
                        eval_metric = "error",
                        maximize = FALSE,
                        objective = "reg:logistic",
                        print_every_n = 10,
                        verbose = TRUE,
                        watchlist = list(trainrep = dtrain))
Easy reference: https://rdrr.io/cran/xgboost/man/xgb.train.html
Step 10: Predict ‘Survived’ values.
prediction <- predict(temp_model, dtest)
prediction <- ifelse(prediction > 0.5, 1, 0) # threshold the predicted probabilities at 0.5 to get 0/1 labels
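If you want to turn these 0/1 predictions into a Kaggle submission, something along these lines works (assuming the test set still has its PassengerId column; the file name is just a placeholder):

submission <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submission, "titanic_xgb_submission.csv", row.names = FALSE)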
Step 11: Check and plot the variable importance
Certain predictors drag down the performance of the model even though it makes complete sense, gut-wise, to keep them in. On a couple of occasions, variable importance has helped me decide the relevance of the predictors, which positively impacted the accuracy of my model.
importance <- xgb.importance(feature_names = data_sparse@Dimnames[[2]], model = temp_model) # grab all important features
xgb.plot.importance(importance) # plot
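The importance object is just a table, so you can also print it to see the numbers behind the plot; Feature, Gain, Cover and Frequency are the columns xgb.importance returns for tree boosters:

head(importance) # top features, ranked by Gain (each feature's contribution to the model)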
Annnd, that’s pretty much it!
Go get ’em!