This first project happened quite by accident when I stumbled upon the ‘export chat’ option on WhatsApp the other day. The file gets emailed to you as a messy .txt, and you just have to plough through cleaning the data until you reach a version you’re willing to work with. Fun!
So let me walk you through it.
Loading the libraries.
The first step is to load the libraries you will need to read in the file, clean the data and make your plots.
install.packages("ggplot2")
install.packages("tidyr")
install.packages("lubridate")
install.packages("readr")
install.packages("stringr")
install.packages("dplyr")
library(ggplot2)
library(tidyr)
library(lubridate)
library(readr)
library(stringr)
library(dplyr)
Importing the .txt file.
This is where you get an error right off the bat –
path1 <- file.path("C:", "Users", "Amita", "Downloads", "chat.txt")
whatsapp <- read.table(path1, header = FALSE, sep = " ")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 1 did not have 7 elements
That’s because the .txt file isn’t formatted like a .csv or a proper table, i.e., not every row has the same number of elements. So, as a first step, you will need to use Excel to convert the file to .csv (open Excel -> open the .txt file -> set the file type to “Delimited” -> check the option to separate by spaces and un-check “Tab” -> leave the data format at “General” and click “Finish” -> save this as a .csv file -> good to go!)
path <- file.path("C:", "Users", "Amita", "Downloads", "chat.csv")
chat <- read.csv(path, stringsAsFactors = FALSE, header = FALSE)
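Going back to that error for a moment: here is a small sketch of why read.table() trips over a chat export. The sample lines are made up (they are not from my chat, and your export may be laid out differently), and `fill = TRUE` is just one base R option for ragged rows, not the Excel route I took above.

```r
# Hypothetical sample lines, written to a temporary file for illustration
sample_lines <- c(
  "12/05/2015, 9:15 AM - Person1: Hello there",
  "12/05/2015, 9:16 AM - Person2: Hi",
  "second paragraph of the same message"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Count the space-separated fields on each line: the counts differ,
# which is exactly what read.table() complains about
fields_per_line <- count.fields(tmp, sep = " ", quote = "")
print(fields_per_line)

# fill = TRUE pads the short rows with blank fields instead of stopping
raw <- read.table(tmp, header = FALSE, sep = " ", fill = TRUE,
                  quote = "", stringsAsFactors = FALSE)
print(dim(raw))
```

Every message has a different number of words, so splitting on spaces will always give rows of different lengths.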
Cleaning the data.
Now, the data isn’t too bad to work with, but there are some obvious tweaks that need to be done to make the process easier. For example,
- The date format will need to be changed.
- Some rows do not have a timestamp but start immediately with text, which happens when a message has multiple paragraphs.
- Some senders are stored under one name, some under two names, and some as plain phone numbers, depending on how they have been saved; there needs to be uniformity here.
- There are a bunch of special characters that must go.
- Columns must be renamed to be more meaningful.
- Only the relevant columns must be kept. I ended up keeping only 6 out of the original 49.
- Missing values must be deleted.
Formatting the date into yyyy-mm-dd. I used the lubridate package – it’s super easy to work with – you may use whatever works better for you.
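As a rough sketch of the date step with lubridate: the raw values below are invented, and the day-month-year order is an assumption (if your export writes the month first, `mdy()` would be the function to reach for).

```r
library(lubridate)

# Hypothetical raw dates; the trailing comma and day/month order are assumptions
raw_dates <- c("12/05/2015,", "03/11/2015,")

# Drop the trailing comma, then parse as day-month-year
clean_dates <- dmy(gsub(",", "", raw_dates))

# Date objects print as yyyy-mm-dd
print(clean_dates)
```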
Keeping only the relevant columns
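One way to do this is with `select()` from dplyr. The toy data frame and the column positions below are hypothetical; which of your V1 to V49 columns are worth keeping depends on how your file was split.

```r
library(dplyr)

# Toy stand-in for the imported data frame; column positions are hypothetical
chat_demo <- data.frame(V1 = "12/05/2015,", V2 = "9:15", V3 = "AM", V4 = "-",
                        V5 = "Person1:", V6 = "Hello", V7 = "there",
                        stringsAsFactors = FALSE)

# Keep only the date, time, AM/PM and sender columns
chat_small <- select(chat_demo, V1, V2, V3, V5)
print(names(chat_small))
```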
Removing the NAs
# Keep only the rows that have no missing values
chat <- chat[complete.cases(chat), ]
Then, I wanted to separate the newly formatted Date column into Year, Month and Day, for which I used ‘separate()’ from the tidyr package.
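A minimal sketch of that split, on made-up dates stored as text (if your column is a Date object, wrapping it in `as.character()` first is the safe route):

```r
library(tidyr)

# Toy dates already in yyyy-mm-dd form, stored as character
dates <- data.frame(Date = c("2015-05-12", "2015-11-03"),
                    stringsAsFactors = FALSE)

# Split on the hyphen into three new columns
split_dates <- separate(dates, Date, into = c("Year", "Month", "Day"), sep = "-")
print(split_dates)
```

The new columns come out as character, so "05" keeps its leading zero unless you convert it.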
Renaming the columns
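Renaming can be as simple as assigning to `names()` in base R. The toy data frame here is hypothetical; the new names match the ones used in the rest of this post.

```r
# Toy data frame with the default V-style names from read.csv(header = FALSE)
chat_demo <- data.frame(V1 = "2015-05-12", V2 = "9:15", V3 = "AM",
                        V5 = "Person1", stringsAsFactors = FALSE)

# Assign more meaningful names, in column order
names(chat_demo) <- c("Date", "Time", "AM/PM", "Person")
print(names(chat_demo))
```

A name like AM/PM is not a syntactic R name, so it needs backticks when you refer to it with `$`.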
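The special characters had to go, and the names and numbers had to be replaced with ‘Personx’ for anonymity. This was done using a combination of the handy ‘gsub()’ from base R and ‘str_replace_all()’ from the stringr package.
# Strip everything except letters, digits, blanks and a few allowed symbols
chat2$Person <- gsub("[^[:alnum:][:blank:]+?&/\\]", "", chat2$Person)
# Remove stray â/Â encoding debris (this class also matches ' and |)
chat2$Person <- gsub("['â'|'Â']", "", chat2$Person)
# Drop the colon from the AM/PM column
chat2$`AM/PM` <- gsub(":", "", chat2$`AM/PM`)
# Replace a phone number with an anonymous label
chat2$Person <- str_replace_all(chat2$Person, "^.*2623949$", "Person1")
. . . and so on.
Plotting the data!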
Let’s try plotting monthly data faceted by years –
ggplot(chat3, aes(x = Month, fill = Person)) +
  stat_count(position = "dodge", show.legend = TRUE) +
  ggtitle("Whatsapp group Traffic") +
  ylab("# of messages") +
  xlab("month") +
  theme(plot.title = element_text(face = "italic")) +
  facet_grid(Year~.)
Looks like I am one of the least active members of the group (sorry guys! I will try to be better!) and my lovely WhatsApp-savvy friend (Person8) clocked the highest number of pings mid-May in 2015.
I looked into a couple of other scenarios: the group’s monthly chat trends in the AM vs PM across the years, and the group’s hourly chat trend using a count of all the messages sent in the morning vs the evening. Along the way, I found that I needed to make a few more data transformations to drill down deeper. For example, for one of the analyses I needed to unite() the columns Date and Time; for another, I only wanted a subset of the columns, which I got by applying a combination of substr() and strptime() to extract data from specific positions. At one point, I played around with changing the format of the date column to POSIXlt. POSIXlt is really cool because it allows us to extract only the hours, weekdays or minutes from our timestamp using the “$” operator, like we would the elements of a list or the columns of a data frame.
This was a really engaging exercise for me because, besides depicting the frequency and usual time of activity, other explorations also showed me trends based on my geographical locations (India, Paris and Australia).
I am going to continue to muck around with this and, in the meantime, the R code used to make this blog post is available here.
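To give a feel for the unite() and POSIXlt steps, here is a small sketch on invented data. The 24-hour Time values are a simplifying assumption; with a separate AM/PM column you would unite that in too and parse with `%I` and `%p` instead of `%H`.

```r
library(tidyr)

# Toy data; the 24-hour Time values are a simplifying assumption
chat_demo <- data.frame(Date = c("2015-05-12", "2015-11-03"),
                        Time = c("21:15", "07:40"),
                        stringsAsFactors = FALSE)

# Glue Date and Time into a single DateTime column
stamped <- unite(chat_demo, "DateTime", Date, Time, sep = " ")

# strptime() returns a POSIXlt object
parsed <- strptime(stamped$DateTime, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Pull out individual components with $
parsed$hour   # hours: 21 and 7
parsed$min    # minutes: 15 and 40
parsed$wday   # day of the week, 0 = Sunday
```

Once the hour sits in its own vector, counting messages per hour is just a `table(parsed$hour)` away.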
Bouquets and Brickbats welcome!