The non-scholarly write-up – Wordclouds in R: Oh, the possibilities!


Okay, so I am stoked to report that I can now build them pretty wordclouds! I am even more pleased with how easy the process is. There’s a whole array of plots you can play around with, including:

Commonality Cloud: Allows you to view words common to both corpora.

Comparison Cloud: Allows you to view words which are not common to both corpora.

Polarized Plot: A better take on the commonality cloud, allowing you to tell which corpus has a greater concentration of a particular word.

Visualized Word Network: Shows the network of words associated with a main word.
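The word network is the only one of the four that doesn’t get built later in this post, so here is a rough, hedged sketch using qdap’s word_associate() on a tiny made-up set of tweets (the example text below is purely illustrative):

library(qdap)
tweets <- c("hillary concedes the election tonight",
            "trump wins the election in a stunning upset",
            "supporters react to the election results on twitter")
word_associate(tweets, match.string = "election", network.plot = TRUE)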

Let’s jump right into it.

Step 1: Load libraries

require("tm") # the text mining package
require("qdap") # for qdap package's cleaning functions
require("twitteR") # to connect to twitter and extract tweets
require("plotrix") # for the pyramid plot

Step 2: Read in your choice of tweets

After connecting to twitter, I downloaded 5000 tweets each from a search of the keywords “hillary” and “trump”. And this was minutes after the 2016 US election results were declared. Twitter has never been so lit!

hillary<-searchTwitter("hillary",n=5000,lang = "en")
trump<- searchTwitter("trump",n=5000,lang="en")

 

Step 3: Write and apply functions to perform data transformation and cleaning

a) Function to extract text from the tweets, which get downloaded as a list. We do this using getText, which is an accessor method.

convert_to_text <- function(x){
 x$getText()
 }
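As an aside (a hedged alternative, not what this post uses), the twitteR package also provides twListToDF(), which flattens the whole list of status objects into a data frame with a text column:

hillary_df <- twListToDF(hillary)   # one row per tweet
head(hillary_df$text)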

b) Functions to process our tweets to remove duplicates (retweets) and URLs.

replacefunc <- function(x){
 gsub("https://(.*)", "", x) # strip URLs and anything that follows them
}

replace_dup <- function(x){
 gsub("^(rt|RT)(.*)", "", x) # blank out retweets, which start with "rt"/"RT"
}

c) Function to further clean the character vector, for example, to remove brackets, replace abbreviations and symbols with their word equivalents, and expand contractions.

clean_qdap <- function(x){
 x<- bracketX(x)
 x<- replace_abbreviation(x)
 x<- replace_contraction(x)
 x<- replace_symbol(x)
 x<-tolower(x)
 return(x)
}

d) Apply the above functions

hillary_text <- sapply(hillary,convert_to_text)
hillary_text1 <- hillary_text
hill_remove_url<- replacefunc(hillary_text1)
hill_sub <- replace_dup(hill_remove_url)
hill_indx <- which(hill_sub=="")
hill_sub_complete <- hill_sub[-hill_indx]

trump_text <- sapply(trump,convert_to_text)
trump_text1 <- trump_text
trump_remove_url<- replacefunc(trump_text1)
trump_sub <- replace_dup(trump_remove_url)
trump_indx <- which(trump_sub=="")
trump_sub_complete <- trump_sub[-trump_indx]

# encode to UTF-8 : capable of encoding all possible characters defined by unicode
trump_sub_complete <- paste(trump_sub_complete,collapse=" ")
Encoding(trump_sub_complete) <- "UTF-8"
trump_sub_complete <- iconv(trump_sub_complete, "UTF-8", "UTF-8",sub='') 
                         #replace non UTF-8 by empty space
trump_clean <- clean_qdap(trump_sub_complete)
trump_clean1 <- trump_clean

hill_sub_complete <- paste(hill_sub_complete,collapse=" ")
Encoding(hill_sub_complete) <- "UTF-8"
hill_sub_complete <- iconv(hill_sub_complete, "UTF-8", "UTF-8",sub='') 
                     #replace non UTF-8 by empty space
hillary_clean <- clean_qdap(hill_sub_complete)
hillary_clean1 <- hillary_clean

 

Step 4: Convert the character vectors to VCorpus objects

trump_corpus <- VCorpus(VectorSource(trump_clean1))
hill_corpus <- VCorpus(VectorSource(hillary_clean1))

 

Step 5: Define and apply a function to format the corpus object

clean_corpus <- function(corpus){
 corpus <- tm_map(corpus, removePunctuation)
 corpus <- tm_map(corpus, stripWhitespace)
 corpus <- tm_map(corpus, removeNumbers)
 corpus <- tm_map(corpus, content_transformer(tolower))
 corpus <- tm_map(corpus, removeWords, 
 c(stopwords("en"),"supporters","vote","election","like","even","get","will","can"
,"amp","still","just","will","now"))
 return(corpus)
}

trump_corpus_clean <- clean_corpus(trump_corpus)
hill_corpus_clean <- clean_corpus(hill_corpus)
  • Note: qdap cleaner functions can be used with character vectors, but tm functions need a corpus as input.
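To illustrate that note with a hedged, self-contained sketch (not part of the original pipeline), a qdap cleaner can be wrapped in content_transformer() so that tm_map() will accept it:

toy_corpus <- VCorpus(VectorSource("Dr. Smith vs. Mr. Jones won approx. 100 votes"))
toy_corpus <- tm_map(toy_corpus, content_transformer(replace_abbreviation))
content(toy_corpus[[1]]) # the qdap cleaner has run on the document's text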

Step 6: Convert the corpora into TermDocumentMatrix(TDM) objects

Tdmobjecthillary <- TermDocumentMatrix(hill_corpus_clean)
Tdmobjecttrump <- TermDocumentMatrix(trump_corpus_clean)

Step 7: Convert the TDM objects into matrices

Tdmobjectmatrixhillary <- as.matrix(Tdmobjecthillary)
Tdmobjectmatrixtrump <- as.matrix(Tdmobjecttrump)

 

Step 8: Sum rows and create term-frequency dataframe

Freq <- rowSums(Tdmobjectmatrixhillary)
Word_freq <- data.frame(term= names(Freq),num=Freq)

Freqtrump <- rowSums(Tdmobjectmatrixtrump)
Word_freqtrump <- data.frame(term= names(Freqtrump),num=Freqtrump)

Step 9: Prep for fancier wordclouds

# unify the corpora
cc <- c(trump_corpus_clean,hill_corpus_clean)

# convert to TDM
all_tdm <- TermDocumentMatrix(cc)
colnames(all_tdm) <- c("Trump","Hillary")

# convert to matrix
all_m <- as.matrix(all_tdm)


# Create common_words
common_words <- subset(all_m, all_m[, 1] > 0 & all_m[, 2] > 0)

# Create difference
difference <- abs(common_words[, 1] - common_words[, 2])

# Combine common_words and difference
common_words <- cbind(common_words, difference)

# Order the data frame from most differences to least
common_words <- common_words[order(common_words[, 3], decreasing = TRUE), ]

# Create top25_df
top25_df <- data.frame(x = common_words[1:25, 1], 
 y = common_words[1:25, 2], 
 labels = rownames(common_words[1:25, ]))

Step 10: It’s word cloud time!

 

a) The ‘everyday’ cloud

wordcloud(Word_freq$term, Word_freq$num, scale=c(3,0.5),max.words=1000, 
          random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, 
          colors=brewer.pal(5, "Blues"))

wordcloud(Word_freqtrump$term, Word_freqtrump$num, scale=c(3,0.5),max.words=1000,
          random.order=FALSE, rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(5, "Reds"))

[Wordcloud images for the Hillary and Trump tweets]

 

b) The Polarized pyramid plot

# Create the pyramid plot
pyramid.plot(top25_df$x, top25_df$y, labels = top25_df$labels, 
 gap = 70, top.labels = c("Trump", "Words", "Hillary"), 
 main = "Words in Common", laxlab = NULL, 
 raxlab = NULL, unit = NULL)

[Pyramid plot of the top 25 shared words]

 

c) The comparison cloud 

comparison.cloud(all_m, colors = c("red", "blue"),max.words=100)

[Comparison cloud]

d) The commonality cloud 

commonality.cloud(all_m, colors = "steelblue1",max.words=100)

[Commonality cloud]

 

We made it!  That’s it for this post, folks.

Coming up next: Mining deeper into text.

 

 

 

 


Top 16%: How to plateau using the ‘Feature Engineering’ approach.


Ladies and Gents! I am now placed in the top 16% of the leaderboard rankings with a score of 0.80383.

[Kaggle leaderboard screenshot]

 

I have also plateaued horribly. No matter what other features I try to ‘engineer’, my score just won’t budge. It gets worse, sure, but never better. Bummer.

Everything pretty much remains the same as the previous post in terms of data reading and cleaning. In this post, let’s look at what I did differently.

This attempt was a departure from applying the algorithms as is and hoping for a better prediction (admit it, we’re all guilty). This time I incorporated the ‘human’ element – I even tried to recall scenes from the movie for that extra insight (still unfair how Rose hogged the entire wooden plank).

Some of the theories I considered (with a quick check against the data right after this list):

  • Women and children were given priority and evacuated first.
  • Mothers would look out for their children.
  • First class passengers were given priority over those in 2nd or 3rd class.
  • Women and children were probably given priority over males in every class.
  • Families travelling together probably had a better chance of survival since they’d try to stick together and help each other out.
  • Older people would have trouble evacuating and hence, would have lower odds of survival.
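Before engineering features around these theories, here is a quick, hedged sanity check against the training rows. It is not part of the original write-up and assumes dat holds the usual Kaggle columns Survived, Sex, Pclass and Age:

prop.table(table(dat$Sex, dat$Survived), margin = 1)        # survival rate by sex
prop.table(table(dat$Pclass, dat$Survived), margin = 1)     # survival rate by passenger class
prop.table(table(dat$Age <= 16, dat$Survived), margin = 1)  # children (<= 16) vs adults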

 

Also, this time around, I played around with the ‘Name’ and ‘Cabin’ variables and that made a huge difference!

So what you need to do to plateau with an 80.4% prediction is as follows:

Identify the unique titles and create a new variable unique:

# check for all the unique titles 

unique <- gsub(".*?,\\s(.*?)\\..*$","\\1",dat$Name)

dat$unique<- unique
dat$unique[dat$unique %in% c("Mlle","Mme")] <-"Mlle"
dat$unique[dat$unique %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
dat$unique[dat$unique %in% c('Dona', 'Lady', 'the Countess','Jonkheer')] <- 'Lady'

table(dat$unique) # check the distribution of different titles

# passenger’s title 
dat$unique <- factor(dat$unique)

Identify the children and create a new variable ischild:

dat$ischild <- factor(ifelse(dat$Age<=16,"Child","Adult"))

Identify the mothers and create a new variable isMother:

dat$isMother <- "Not Mother"
dat$isMother[dat$Sex=="female" & dat$Parch>0 & dat$unique!="Miss"] <- "Mother"
dat$isMother <- factor(dat$isMother)

Uniquely identify the Cabins: This variable leads to somewhat of an overfit.

dat$Cabin <- substr(dat$Cabin,1,1)
dat$Cabin[dat$Cabin %in% c("F","G","T",NA)] <- "X"
dat$Cabin<- factor(dat$Cabin)

Compute the family size and create a new variable familysize:

dat$familysize <- dat$SibSp + dat$Parch + 1

Use the ‘familysize‘ variable and the surname of the passenger to designate the family size as “Small” or “Big” in the new variable unit:

pass_names <- dat$Name
extractsurname <- function(x){
  if(grepl(".*?,\\s.*?",x)){
  gsub("^(.*?),\\s.*?$","\\1",x)
 }
}

surnames <- vapply(pass_names, FUN=extractsurname,FUN.VALUE = character(1),USE.NAMES = F)
fam<-paste(as.character(dat$familysize),surnames,sep=" ")


famsize <- function(x){
 # x looks like "3 Smith"; pull out the leading family-size number
 size <- as.numeric(gsub("^(\\d+)\\s.*$", "\\1", x))
 if(size > 2){
  "Big"
 }else{
  "Small"
 }
}

unit <- vapply(fam, FUN=famsize,FUN.VALUE = character(1),USE.NAMES = F)
dat$unit <- unit
dat$unit <- factor(dat$unit)

 

Split the ‘dat’ dataset into train and test sets (60:40 split) and fit the random forest model.

n<- nrow(dat)
shuffled <- dat[sample(n),]

traindat <- shuffled[1:round(0.6*n),]
testdat<- shuffled[(round(0.6*n) + 1):n,]

dim(traindat)
dim(testdat)

require(caret)
require(ranger)
model <- train(
 Survived ~.,
 tuneLength = 50,
 data = traindat, method ="ranger",
 trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE)
)

pred <- predict(model,newdata=testdat[,-2])
conf<- table(testdat$Survived,pred)
accuracy<- sum(diag(conf))/sum(conf)
accuracy

Using the model to predict survival (minus Cabin) gives us 83.14% accuracy on our test data ‘testdat’ and 80.34% on Kaggle.

Using the model to predict survival (with Cabin) gives us 83.71% accuracy on our test data ‘testdat’, which drops to around 79% on Kaggle.

Although I still haven’t tinkered around with ‘Fare’, ‘Ticket’, and ‘Embarked’ (the urge to do so is strong), I think I’ll just leave them alone for the time being – but I will be revisiting for that elusive ‘eureka’ moment!

You can find the code here.

 

Stepwise Regression – What’s not to like?


 

Plenty, apparently.

Besides encouraging you not to think, it doesn’t exactly do a great job at what it claims to do. Given a set of predictors, there is no guarantee that stepwise regression will find the optimal combination. Many of my statistician buddies, whom I consult from time to time, have a gripe with it because it’s not sensitive to the context of the research. Seems fair.

 

I built an interactive Shiny app to evaluate results from stepwise regression (direction = “backward”) when applied to different predictors and datasets. What I observed was that the model performed well on the data at hand but much worse when subjected to cross-validation. After a lot of different random selections and testing, I eventually did find a model that worked well on both the fitted dataset and the cross-validation set, but it performed poorly when applied to new data. Therein lies at least most of the problem.

 

Shiny app can be found here.

The initial model was built to predict ‘Life Expectancy’ and it does, to a certain extent, do its job. But when generalized, it pretty much turned out to be a bit of an uncertainty-ridden damp squib. For example, predictions for other variables from the same dataset, such as ‘Population’, ‘Frost’ and ‘Area’, are nowhere close to the observed values. At the same time, the model did okay for variables such as ‘Illiteracy’ and ‘HS.Grad’.
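For context, here is a minimal sketch of the kind of backward-elimination fit described above, assuming (the post doesn’t state it explicitly) that the app works off R’s built-in state.x77 dataset, which contains exactly these variables:

states <- as.data.frame(state.x77)  # column names become syntactic: Life.Exp, HS.Grad, ...
full_model <- lm(Life.Exp ~ ., data = states)  # start from all predictors
step_model <- step(full_model, direction = "backward", trace = 0)  # drop terms by AIC
summary(step_model)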

Given all these drawbacks (and more!), people do find the motivation to use stepwise regression to produce a simpler model in terms of the number of coefficients. It does not necessarily find the optimal model, but it does give a hunch of the possible combination of predictors.

While no one would conclude a statistical study based on stepwise results or publish a paper with it, some might find uses for it, say, to verify models already created by software systems, or as an easy-to-use tool for initial exploratory data analysis (with all the necessary caveats in place!).

You win some, you lose some.

What do you think? Leave a comment!

p.s. You can find the (needs-to-be-cleaned-up) code for the Shiny app here.

 

 

 

Mixed-design ANOVA : 2 between-subject factors and 1 within-subject factor


Suppose you want to examine the impact of diet and exercise on pulse rate. To investigate these issues, you collect a sample of 18 individuals and group them according to their dietary preferences: meat eaters and vegetarians. You then divide each diet category into three groups, randomly assigning each group to one of three types of exercise: exercise1, exercise2, exercise3. In addition to these between-subjects factors, you want to include a single within-subjects factor in the analysis. Each subject’s pulse rate will be measured at three levels of exertion: intensity1, intensity2, intensity3.

So we have 3 factors to work with:

  • Two between-subjects (grouping) factors: dietary preference and exercise type.
  • One within-subjects factor: intensity (of exertion)

This is what our data looks like. Onwards, then!

1 112 166 215 1
1 111 166 225 1
1 89 132 189 1
1 95 134 186 2
1 66 109 150 2
1 69 119 177 2
2 125 177 241 1
2 85 117 186 1
2 97 137 185 1
2 93 151 217 2
2 77 122 178 2
2 78 119 173 2
3 81 134 205 1
3 88 133 180 1
3 88 157 224 1
3 58 99 131 2
3 85 132 186 2
3 78 110 164 2

After reading in the file,  we give the columns appropriate names.

library(readxl)    # read_excel()
library(reshape2)  # melt()
library(tidyr)     # separate()
# (ggplot2 is also needed later for the plots)
diet <- read_excel(path, col_names = FALSE)
names(diet) <- c("subject","exercise","intensity1","intensity2","intensity3","diet")


Then we convert ‘exercise’, ‘subject’ and ‘diet’ into factors.

diet$exercise<- factor(diet$exercise)
diet$diet<- factor(diet$diet)
diet$subject <- factor(diet$subject)

For repeated measures ANOVA, the data must be in the long form. We will use melt() from the reshape2 package to achieve this. We are now at one row per participant per condition.

diet.long <- melt(diet, id = c("subject","diet","exercise"), 
 measure = c("intensity1","intensity2","intensity3"), 
 variable.name="intensity")

At this point we’re ready to actually construct our ANOVA.

Our ANOVA call looks like this –

mod <- aov(value ~ diet*exercise*intensity + Error(subject/intensity) , 
data=diet.long)

The asterisk specifies that we want to look at the interaction between the three factors. But since this is a repeated measures design as well, we need to specify an error term that accounts for natural variation from participant to participant.

Running summary() on our ANOVA above yields the following results –

[summary() output of the mixed-design ANOVA]

The main conclusions we can arrive at are as follows:

  • There is a significant main effect of ‘diet’ on the pulse rate. We can conclude that a statistically significant difference exists between vegetarians and meat eaters on their overall pulse rates.
  • There is a statistically significant within-subjects main effect for intensity.
  • There is a marginally statistically significant interaction between diet and intensity. We’ll look at this later.
  • The type of exercise has no statistically significant effect on overall pulse rates.

 

Let’s plot the average pulse rate as explained by diet, exercise, and the intensity.

mean_pulse1<-with(diet.long,tapply(value,list(diet,intensity,exercise),mean))
mean_pulse1
mp1 <- stack(as.data.frame(mean_pulse1))
mp1<- separate(mp1,ind,c("Intensity","Exercise"))
mp1$Diet<- rep(seq_len(nrow(mean_pulse1)),ncol(mean_pulse1))
mp1$Diet <- factor(mp1$Diet,labels = c("Meat","Veg"))
mp1$Intensity<-factor(mp1$Intensity)
mp1$Exercise<-factor(mp1$Exercise)
ggplot(mp1,aes(Intensity,values,group=Diet,color=Diet)) + geom_line(lwd=1) + xlab("Intensity of the exercise") +
 ylab("Mean Pulse Rate") + ggtitle("Mean Pulse rate - \n Exercise Intensity vs Diet") + theme_grey()+
 facet_grid(Exercise ~.)

 

 

[Line plot of mean pulse rate vs intensity, coloured by diet and faceted by exercise]

 

The plot agrees with our observations from earlier.

 

 

UPDATE: Understanding the Results

Earlier we had rejected a null hypothesis and concluded that the change in mean pulse rate across intensity levels marginally depends upon dietary preference. Now we will turn our attention to the study of this interaction.

We begin by plotting an interaction plot as follows:

interaction.plot(mp1$Intensity, mp1$Diet, mp1$values , type="b", col=c("red","blue"), legend=F,
 lwd=2, pch=c(18,24),
 xlab="Exertion intensity", 
 ylab="Mean pulse rate ", 
 main="Interaction Plot")

legend("bottomright",c("Meat","Veg"),bty="n",lty=c(1,2),lwd=2,pch=c(18,24),col=c("red","blue"),title="Diet")

[Interaction plot: mean pulse rate by exertion intensity and diet]

We see that the mean pulse rate increases across exertion intensity (‘trials’): this is the within-subject effect.

Further, it’s clear that vegetarians have a lower average pulse rate than do meat eaters at every trial: this is the diet main effect.

The difference between the mean pulse rate of meat-eaters vs vegetarians is different at each exertion level. This is the result of the diet by intensity interaction.

The main effect for diet is reflected in the fact that meat-eaters had a mean pulse rate roughly 10 to 20 points higher than that for vegetarians.

The main effect of intensity is reflected in the fact that, for both diet groups, the mean pulse rate after jogging increased about 50 points beyond the rate after warm-up exercises, and increased another 55 points (approx.) after running.

The interaction effect of diet and intensity is reflected in the fact that the gap between the two dietary groups changes across the three intensities. But this change is not as significant as the main effects of diet and intensity.

 

 

 

That’s all, folks.

Did you find this article helpful? Can it be improved upon? Let me know!

Also, you can find the code here.

Until next time!

 

DataViz: Making life easier one plot at a time.


Data visualization has brought about a sweeping change to the way analysts work with data. And with data growing exponentially every second, data viz is going to continue to address the pressing need to explore data creatively, dig for deeper insights and address goals in an engaging manner. It’s a no-brainer that people prefer easy-to-digest visualizations of copious amounts of complex data to poring over spreadsheets or reports to draw informed conclusions.

Let’s work with this year’s Summer Olympics data as an example. Suppose one were curious about the country-wise rankings at Rio and were presented with a spreadsheet like the one below to make sense of things. Chances are that even after spending considerable time ploughing through it, the results may be inaccurate, coupled with other side effects including, but not limited to, an immediate loss of interest, no subsequent data retention and even annoyance. That’s a serious nay in my books.

[Screenshot of the raw country-wise medal tally spreadsheet]

On the other hand, meaningful data visualizations promote easy understanding of the data, better data retention and instant access to trends, anomalies, correlations and patterns. Take, for example, the visualization in Figure 1A – the data from the spreadsheet shown above has been cleaned, analyzed and visualized so that the results make sense almost immediately.

[Figure 1A: interactive visualization of the Rio 2016 medal tally]

Interactive plot here.

(Side note: It’s a real bummer that WordPress.com does not allow embedding from Shiny or Plotly servers. Please click the link to experience the interactivity. Also, plots may take slightly longer to load on slower internet connections, so please make allowance for that.)

In the quest to visualize and interact with diverse and, more often than not, complex data sets, one is supported by a plethora of data plotting techniques – right from simple data graphics to the more sophisticated and unusual ones. The type of visualization selected to represent the data, and whether or not interactivity and aesthetics are included, also has an enormous impact on whether the analysis is communicated accurately and meaningfully. To understand this, let’s look at a static choropleth vs an interactive choropleth.

The choropleth map uses a coloring scheme inside defined areas on a map to show value levels and indicate the average value of some property in those areas. Figure 2A uses a color encoding to communicate the country-wise medal tally from this year’s Rio Olympics. The biggest benefit of the map is the big-picture perspective. By using color density, it quickly illustrates which countries excelled at the games and which did not fare as well. However, the viewer cannot gain detailed information, such as, in this case, the number of medals per country.

[Figure 2A: static choropleth of the country-wise medal tally]

This can be addressed somewhat by the addition of a certain level of interactivity as shown in Figure 2B below.

[Figure 2B: interactive choropleth of the country-wise medal tally]

Interactive choropleth here.
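As an aside (not part of the original post), here is a hedged sketch of how an interactive choropleth along these lines could be put together with the plotly package in R; the country names and totals below are only illustrative placeholders:

library(plotly)
medals <- data.frame(country = c("United States", "Great Britain", "China", "Russia", "Germany"),
                     total   = c(121, 67, 70, 56, 42))
plot_ly(medals, type = "choropleth",
        locations = ~country, locationmode = "country names",
        z = ~total, colors = "Blues") %>%
  layout(title = "Total medals by country (illustrative subset)")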

 

Now, if we represent the same information using a bar plot, which works with discrete data, the key values are available at a glance, providing a deeper level of detail to the user (and as a bonus, there is no need to know where Côte d’Ivoire (Ivory Coast) is on the world map 😀).

[Interactive bar plot of the medal tally]

Interactive bar-plot here.

It is also interesting to note that effective manipulation and re-ordering of graph categories can emphasize certain effects. For example, the two bar plots below use and represent the same data, but which one would you use to find the country that placed 10th overall? (A rough sketch of the re-ordering trick follows the two plots.)

[Bar plot with unordered categories]

[Bar plot with categories ordered by medal count]
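Here is that re-ordering trick as a hedged ggplot2 sketch with a handful of illustrative medal totals (the real plots above were built from the full Rio data and hosted on Plotly):

library(ggplot2)
medals <- data.frame(country = c("USA", "CHN", "GBR", "RUS", "GER", "FRA"),
                     total   = c(121, 70, 67, 56, 42, 42))
# Default: categories appear in alphabetical order
ggplot(medals, aes(x = country, y = total)) + geom_bar(stat = "identity")
# reorder() sorts the bars by medal count, so rank questions are answered at a glance
ggplot(medals, aes(x = reorder(country, -total), y = total)) +
  geom_bar(stat = "identity") + xlab("country")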

Personally, I find that besides the ‘clean-the-data’ component of analytics, it’s the data visualization that keeps me coming back to perform independent data analysis. With SO many supporting languages and tools out there, it’s a wicked learning curve, but the possibility of revealing underlying patterns and stories is just too exciting and is what keeps me motivated and engaged. Or, if that doesn’t work, I just get really stoked by the prospect of how cool I am going to make the data look!

As per a Harvard Business Review article that I read recently, more data crosses the internet every second than was stored in the entire internet just 20 years ago. For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20 million filing cabinets’ worth of text. An exabyte is 1,000 times that amount, or one billion gigabytes. Would you rather read one billion gigabytes or see one gigabyte? Try wrapping your brain around that kind of data. Or let data viz help you with that. It’s no happy coincidence that data viz, with its ‘why read fast when you can visualize faster’ philosophy, is increasingly being embraced by companies across all sectors. It just works. Onwards, then!

 

 

I am happy to hear feedback and suggestions, so please feel free to leave a comment!

 

Data source: www.rio2016.com

All visualizations coded using R and hosted on Shiny & Plotly servers.

 

 

Analysing Whatsapp Data


 

This first project happened quite by accident when I stumbled upon the ‘export chat’ option on Whatsapp the other day. The file gets emailed to you as a messy .txt and you will just have to plough through cleaning the data until you find the version you’re willing to work with. Fun!

So let’s walk through it –

 

Loading the libraries.

The first step will be to load the libraries that you will need to read in and clean the data from the CSV file and to make your plots.

install.packages("ggplot2")
install.packages("tidyr")
install.packages("lubridate")
install.packages("readr")
install.packages("stringr")
install.packages("dplyr")

library(ggplot2)
library(tidyr)
library(lubridate)
library(readr)
library(stringr)
library(dplyr)

 

Importing the .txt file.

This is where you get an error right off the bat –


path1 <- file.path("C:","Users","Amita","Downloads","chat.txt")
whatsapp <- read.table(path1, header = FALSE, sep = " ")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
 line 1 did not have 7 elements

That’s because the .txt file isn’t formatted like a .csv or a proper table, i.e., not every row has the same number of elements. This is why, as a first step, you will need to use Excel to convert the file to .csv (open Excel -> open the .txt file -> set the file type to “Delimited” -> check separate by spaces and un-check “Tab” -> leave the data format at “General” and click “Finish” -> save this as a .csv file -> good to go!).

path<- file.path("C:","Users","Amita","Downloads","chat.csv")
chat<- read.csv(path,stringsAsFactors = FALSE,header=FALSE) 
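If you would rather skip the Excel detour, here is an untested sketch that reads the raw .txt straight into R and pads the ragged rows with NA so the result resembles the data frame read from the .csv:

raw_lines <- readLines(path1, encoding = "UTF-8")
fields <- strsplit(raw_lines, " ")             # split each line on spaces
n_max <- max(lengths(fields))                  # the widest row sets the column count
padded <- lapply(fields, function(x) c(x, rep(NA, n_max - length(x))))
chat_from_txt <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)  # analogous to 'chat'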

Cleaning the data.

Now, the data isn’t too bad to work with, but there are some obvious tweaks that need to be done to make the process easier. For example,

  • The date format will need to be changed.
  • Some rows do not have a timestamp but immediately start with text if a message has multiple paragraphs.
  • Some names have one name, two names, or just plain phone numbers depending on how they have been stored – there needs to be uniformity here.
  • There are a bunch of special characters that must go.
  • Columns must be renamed to be more meaningful.
  • Only the relevant columns must be kept. I ended up keeping only 6 out of the original 49.
  • Missing values must be deleted.

 

Formatting the date into yyyy-mm-dd: I used the lubridate package – it’s super easy to work with – but you may use whatever works better for you.

chat$V1 <- dmy(chat$V1)

Keeping only the relevant columns

chat1 <- chat[, 1:6]

Removing the NAs

chat1 <- chat1[complete.cases(chat1), ] # drop rows with missing values

 

Then I wanted to separate the newly formatted date column into Year, Month and Day, for which I used separate() from the tidyr package.

chat2 <- separate(chat1, V1, c("Year","Month","Day"))

Renaming the columns

names(chat2)<- c("Year","Month","Day","Time","AM/PM","Person")

 

The special characters had to go, and the names and numbers had to be replaced with ‘Personx’ for anonymity. This was done using a combination of the handy gsub() from base R and str_replace_all() from the stringr package.

gsub("[^[:alnum:][:blank:]+?&/\\]", "", chat2$Person)
gsub("['â'|'Â']", "", chat2$Person)
 gsub(":","",chat2$`AM/PM`)
str_replace_all(chat2$Person,"^.*2623949$","Person1")

. . .  and so on.

 

Plotting the data!

Let’s try plotting monthly data faceted by years –

 ggplot(chat3, aes(x = Month, fill = Person)) +
 stat_count(position = "dodge", show.legend = TRUE) +
 ggtitle("Whatsapp group Traffic") +
 ylab("# of messages") + xlab("month") +
 theme(plot.title = element_text(face = "italic")) + facet_grid(Year~.)

Looks like I am one of the least active members of the group (sorry guys! I will try to be better!) and my lovely Whatsapp-savvy friend (Person8) clocked the highest number of pings mid-May in 2015.

[Bar plot of monthly message counts per person, faceted by year]

I looked into a couple of other scenarios – the group’s monthly chat trends in the AM vs PM across the years, and the group’s hourly chat trend using a count of all the messages sent in the morning vs the evening. And along the way, I found that I needed to make a few more data transformations to drill down deeper. For example, for one analysis I needed to unite() the Date and Time columns; for another, I only wanted a subset of the columns, which I got by applying a combination of substr() and strptime() to extract data from specific positions. At one point, I played around with the date column, changing its format to POSIXlt. POSIXlt is really cool because it allows us to extract only the hours, weekdays or minutes from our timestamp using the “$” operator, like we would the elements of a list or the columns of a data frame.
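As a tiny illustration of that last point (not from the original post), a POSIXlt timestamp exposes its pieces through “$”:

ts <- strptime("2015-05-14 21:37:00", format = "%Y-%m-%d %H:%M:%S")  # strptime() returns POSIXlt
ts$hour   # 21
ts$wday   # day of the week, 0 = Sunday
ts$min    # 37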

 

This was a really engaging exercise for me because, besides depicting the frequency and usual time of activity, other explorations also showed me trends based on my geographical locations (India, Paris and Australia).

I am going to continue to muck around with this; in the meantime, the R code used to make this blog post is available here.

Bouquets and Brickbats welcome!