THE NON-SCHOLARLY WRITE-UP : LeNet architecture applied to the MNIST dataset: 99% accuracy.


For about a week or so , I have been working on the ‘Digit Recognizer’ competition over at Kaggle. I started out with my favourite go-to algorithm: Random Forest, and eventually moved on to other implementations and variations including

  • KNN
  • KNN with PCA
  • XGBoost
  • deeplearning with H2O
  • GBM with H2O
  • Ensembling

And then I plateaued at 97.8% .

A quick google search not too different from ‘ improve score + Digit Recognizer +MNIST’, threw up a bunch of pages all of which seemed to talk about Neural Networks.I’m like huh? Isn’t that biology?

Sure is. Who’da thunk it!

Anyway ,I spent  considerable time pouring over a few AMAZING bookmark-able resources and implemented my first ConvNet (I feel so accomplished!).

The implementation in question is called the LeNet. One of the best convolutional networks is the LeNet architecture that is used to read zip codes, digits, etc.


The model consists of a convolutional layer followed by a pooling layer, another convolution layer followed by a pooling layer, and then two fully connected layers similar to the conventional multilayer perceptrons.

Step 1: Load libraries


Step 2: Read the datasets

These are available from the Kaggle ‘Digit Recognizer’ competition page here.

Here every image is represented as a single row. The pixel range for each image lies between 0 and 255.

trainorig <-read.csv("C:/Users/Amita/Downloads/train.csv",header=T,sep=",")
testorig <-  read.csv("C:/Users/Amita/Downloads/test.csv",header=T,sep=",")

Step 3: Convert the training and testing datasets into matrices


Step 4: Extract the labels

train.y<-train[,1] # labels

Step 5: Scale the data and transpose  the matrices since mxnet seems to prefer observations in columns instead of rows.


The transposed matrix contains data in the form npixel x nexample.

Step 6: Convert the matrices into arrays for lenet

train.array <- train.x
 dim(train.array) <- c(28, 28, 1, ncol(train.x))
 test.array <- test
 dim(test.array) <- c(28, 28, 1, ncol(test))

Each input x is a 28x28x1 array representing one image, where the first two numbers represent the width and height in pixels, the third number is the number of channels (which is 1 for grayscale images, 3 for RGB images).

Step 7: Configure the structure of the network

# Convolutional NN
 data <- mx.symbol.Variable('data')
 # first conv
 conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
 relu1 <- mx.symbol.Activation(data=conv1, act_type="relu")
 pool1 <- mx.symbol.Pooling(data=relu1, pool_type="max",
 kernel=c(2,2), stride=c(2,2))
 # second conv
 conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
 relu2 <- mx.symbol.Activation(data=conv2, act_type="relu")
 pool2 <- mx.symbol.Pooling(data=relu2, pool_type="max",
 kernel=c(2,2), stride=c(2,2))
 # first fullc
 flatten <- mx.symbol.Flatten(data=pool2)
 fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
 relu3 <- mx.symbol.Activation(data=fc1, act_type="relu")
 # second fullc
 fc2 <- mx.symbol.FullyConnected(data=relu3, num_hidden=10)
 # loss
 lenet <- mx.symbol.SoftmaxOutput(data=fc2)

Step 8: Train the model

 model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
 ctx=devices, num.round=20, array.batch.size=100,
 learning.rate=0.05, momentum=0.9, wd=0.00001,

Step 9: predict on the test dataset and calculate accuracy

preds <- predict(model, test.array) 
pred.label <- max.col(t(preds)) - 1

Step 10: Predict on the final test dataset and submit to Kaggle

# predict on the kaggle dataset 
 testorig <- as.matrix(testorig)
 testorig.array <- testorig
 dim(testorig.array) <- c(28, 28, 1, ncol(testorig))

 predictions <- data.frame(ImageId=1:nrow(testo), Label=predlabel)
write.csv(predictions, "CNN.csv",row.names=FALSE)

and *ba-dum-tsss* !!! a 0.99086 !

If anybody has any ideas on how to improve this score , please share! TIA!


The non-scholarly write-up : Logistic Regression with XGBoost.


This post is a long time coming.

 UPDATE: I have inched my way to the top 13% of the titanic competition (starting out at the ‘top’ 85%, who’d a thunk it. I love persevering. :D)


My last attempt involved XGBoost (Extreme Gradient Boosting) , which did not beat my top score – It barely scraped past a 77%. That being said, I thought it deserved a dedicated post considering I have achieved great results with the algorithm on other Kaggle competitions.

In a nutshell, it

  • is a very, very fast version of the GBM,
  • needs parameter tuning which can get pretty frustrating (But hey, patience is a virtue!)
  • Supports cross validation
  • is equiped to help find the variable importance
  • is robust to outliers and noisy data


Cutting to the chase.

Step 1: Load libraries


Step 2: Read the datasets

dat<-read.csv("C:/Users/Amita/Downloads/train (1).csv",header=T,sep=",",
              na.strings = c(""))
test <- read.csv("C:/Users/Amita/Downloads/test (1).csv",header=T,sep=",",
              na.strings = c(""))

Step 3:  Process the datasets

This is the same process as outlined in a previous blog post.

Step 4:  Extract the response variable column

label <- dat$Survived
dat <- dat[,-2] # remove the 'Survived' response column from the training dataset

Step 5:  Combine the training and test datasets

combi <- rbind(dat,test)

Step 6: Create a sparse matrix  to ‘dummify’ the categorical variables i.e. convert all categorical variables to binary

One thing to remember with XGBoost is that it ONLY works with numerical data types. So datatype conversion is necessary before you proceed with model building.

data_sparse <- sparse.model.matrix(~.-1, data =
cat("Data size: ", data_sparse@Dim[1], " x ", data_sparse@Dim[2], " \n", sep = "")

If you’re familiar with the ‘caret’ package, it has a pretty cool dummyVars function do this exactly what we did above.

# dummify the data
dummify <- dummyVars(" ~ .", data = combi)
finaldummy <- data.frame(predict(dummify, newdata = combi))

Here, dummyVars will transform all characters and factors columns (the function never transforms numeric columns) and return the entire data set.


Step 7: Divide the dummified back into train and test

dtrain <- xgb.DMatrix(data = data_sparse[1:nrow(dat), ], label = label)
dtest <- xgb.DMatrix(data = data_sparse[(nrow(dat)+1):nrow(combi), ])

Step 8: Cross Validate

In order to evaluate the overfit and underfit of the models,we compute cross validation error.

set.seed(12345678) # for reproducibility

cv_model <- = dtrain,
 nthread = 8,  # number of threads allocated to the execution of XGBoost
 nfold = 5,  # the original data is divided into 4 equal random samples
 nrounds = 1000000, # number of iterations
 max_depth = 6, # maximum depth of a tree
 eta = 0.05, # controls the learning rate. 0 < eta < 1
 subsample = 0.70, #subsample ratio of the training instance. 
 colsample_bytree = 0.70, #subsample ratio of columns when constructing each tree
 booster = "gbtree", # gbtree or gblinear
 eval_metric = "error", #binary classification error rate
 maximize = FALSE, #maximize=TRUE means the larger the evaluation score the better
 early_stopping_rounds = 25, # training with a validation set will 
                             # stop if the performance keeps getting worse 
                             # consecutively for k rounds.
 objective = "reg:logistic", # logistic regression
 print_every_n = 10, # output is printed every 10 iterations
 verbose = TRUE) # print the output 

Everything you need to know about the parameters and beyond is answered here

Step 9: Build model

temp_model <- xgb.train(data = dtrain,
 nthread = 8,
 nrounds = cv_model$best_iteration,
 max_depth = 6,
 eta = 0.05,
 subsample = 0.70,
 colsample_bytree = 0.70,
 booster = "gbtree",
 eval_metric = "error",
 maximize = FALSE,
 objective = "reg:logistic",
 print_every_n = 10,
 verbose = TRUE,
 watchlist = list(trainrep = dtrain))

Easy reference :

Step 10: Predict ‘Survived’ values.

prediction <- predict(temp_model,dtest)
prediction <- ifelse(prediction>0.5,1,0)


Step 11: Check  and plot the variable importance

Certain predictors drag down the performance of the model even though it makes complete sense gut-wise to keep them there. On a couple of occasions,variable importance has helped me decide the relevance of the predictors, which positively impacted the accuracy of my model.

importance <- xgb.importance(feature_names = data_sparse@Dimnames[[2]], 
              model = temp_model) #Grab all important features
xgb.plot.importance(importance) #Plot



For everything XGBoost, I frequented this page and this page . Pretty thorough resources, IMHO.

Annnd, that’s pretty much it!

Go get ’em!

Top 16% : How to plateau using the ‘Feature Engineering’ approach.


Ladies and Gents! I am now placed in the top 16% of the leaderboard rankings with a score of .80383.



I have also plateaued horribly. No matter what other features I try to ‘engineer’ my score just won’t budge. It get’s worse, sure, but never better. Bummer.

Everything pretty much remains the same as the previous post in terms of data reading and cleaning. In this post, let’s look at what I did differently.

This attempt was a departure from applying the algorithms as is and hoping for a better prediction (Admit it.We’re all guilty.) This time I incorporated the ‘human’ element – I even tried to recall scenes from the movie for that extra insight(Still unfair how Rose hogged the entire wooden plank).

Some of the theories I considered:

  • Women and children were given priority and evacuated first.
  • Mothers would look out for their children.
  • First class passengers were given priority over those in 2nd or 3rd class.
  • Women and children were probably given priority over males in every class.
  • Families travelling together probably had a better chance of survival since they’d try to stick together and help each other out.
  • Older people would have trouble evacuating and hence, would have lower odds of survival.


Also, this time around, I played around with the ‘Name’ and ‘Cabin’ variables and that made a huge diffference!

So what you need to do to plateau with an 80.4% prediction is as follows:

Identify the unique titles and create a new variable unique:

# check for all the unique titles 

unique <- gsub(".*?,\\s(.*?)\\..*$","\\1",dat$Name)

dat$unique<- unique
dat$unique[dat$unique %in% c("Mlle","Mme")] <-"Mlle"
dat$unique[dat$unique %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
dat$unique[dat$unique %in% c('Dona', 'Lady', 'the Countess','Jonkheer')] <- 'Lady'

table(dat$unique) # check the distribution of different titles

# passenger’s title 
dat$unique <- factor(dat$unique)

Identify the children and create a new variable isChild  :

dat$ischild <- factor(ifelse(dat$Age<=16,"Child","Adult"))

Identify the mothers and create a new variable isMother:

dat$isMother<- "Not Mother"
dat$isMother[dat$Sex=="female" & dat$Parch>0 & unique!="Miss"] <- "Mother"
dat$isMother<- factor(dat$isMother)

Uniquely identify the Cabins: This variable leads to somewhat of an overfit.

dat$Cabin <- substr(dat$Cabin,1,1)
dat$Cabin[dat$Cabin %in% c("F","G","T",NA)] <- "X"
dat$Cabin<- factor(dat$Cabin)

Compute the family size and create a new variable familysize :

dat$familysize <- dat$SibSp + dat$Parch + 1

Use the ‘familysize‘ variable and the surname of the passenger to designate the family size as “Small” or “Big” in the new variable unit :

pass_names <- dat$Name
extractsurname <- function(x){

surnames <- vapply(pass_names, FUN=extractsurname,FUN.VALUE = character(1),USE.NAMES = F)
fam<-paste(as.character(dat$familysize),surnames,sep=" ")

famsize<- function(x){
 if(substr(x,1,2) > 2){
 x <- "Big"
 x <- "Small"

unit <- vapply(fam, FUN=famsize,FUN.VALUE = character(1),USE.NAMES = F)
dat$unit <- unit
dat$unit <- factor(dat$unit)


Split the ‘dat’ dataset into train and test (60 : 40 split) and fit the randomforest model.

n<- nrow(dat)
shuffled <- dat[sample(n),]

traindat <- shuffled[1:round(0.6*n),]
testdat<- shuffled[(round(0.6*n) + 1):n,]


model <- train(
 Survived ~.,
 tuneLength = 50,
 data = traindat, method ="ranger",
 trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE)

pred <- predict(model,newdata=testdat[,-2])
conf<- table(testdat$Survived,pred)
accuracy<- sum(diag(conf))/sum(conf)

Using the model to predict survival (minus Cabin) gives us 83.14% accuracy on our test data’testdat’ and 80.34% on Kaggle.

Using the model to predict survival (with Cabin) gives us 83.71% accuracy on our test data’testdat’ which drops to around 79% on Kaggle.

Although, I still haven’t tinkered around with ‘Fare’, ‘Ticket’, and ‘Embarked’ (the urge to do so is strong), I think I’ll just leave it alone for the time being – but I will be revisiting for that elusive ‘eureka’ moment!

You can find the code here .