The non-scholarly write-up – Wordclouds in R: Oh, the possibilities!


Okay, so I am stoked to report that I can now build them pretty wordclouds! I am even more pleased with how easy the process is. There’s a whole array of plots you can play around with, including:

Commonality Cloud: allows you to view words common to both corpora.

Comparison Cloud: allows you to view words which are not common to both corpora.

Polarized Plot: a better take on the commonality cloud, allowing you to tell which corpus has a greater concentration of a particular word.

Visualized Word Network: shows the network of words associated with a main word.

Let’s jump right into it.

Step 1: Load libraries

require("tm") # the text mining package
require("qdap") # for qdap's cleaning functions
require("twitteR") # to connect to Twitter and extract tweets
require("wordcloud") # for wordcloud(), comparison.cloud() and commonality.cloud()
require("RColorBrewer") # for the brewer.pal() colour palettes
require("plotrix") # for the pyramid plot

Step 2: Read in your choice of tweets

After connecting to Twitter, I downloaded 5000 tweets each from searches of the keywords “hillary” and “trump”. And this was minutes after the 2016 US election results were declared. Twitter has never been so lit!

hillary<-searchTwitter("hillary",n=5000,lang = "en")
trump<- searchTwitter("trump",n=5000,lang="en")
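For completeness, the connection step looks roughly like this; all four credential strings below are placeholders for the values from your own Twitter developer app, not real keys:

```r
library(twitteR)

# Placeholder credentials -- substitute your own app's keys
consumer_key    <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token    <- "your_access_token"
access_secret   <- "your_access_secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
```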

 

Step 3: Write and apply functions to perform data transformation and cleaning

a) Function to extract text from the tweets, which are downloaded as a list. We do this using getText, which is an accessor method.

convert_to_text <- function(x){
 x$getText()
 }

b) Functions to process our tweets to remove URLs and retweets.

replacefunc <- function(x){
 gsub("https://(.*)", "", x)
 
}


replace_dup <- function(x){
 gsub("^(rt|RT)(.*)", "", x)
 
}
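A quick sanity check of the two cleaners, run on a few invented sample tweets:

```r
# Invented sample tweets to illustrate replacefunc() and replace_dup()
tweets <- c("RT @someone: trump wins https://t.co/abc",
            "hillary concedes https://t.co/xyz",
            "election night!")

no_urls <- gsub("https://(.*)", "", tweets)   # what replacefunc() does
no_rts  <- gsub("^(rt|RT)(.*)", "", no_urls)  # what replace_dup() does
no_rts
```

The retweet collapses to an empty string, which is why the next step removes empty elements from the vector.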

c) Function to further clean the character vector: for example, to remove brackets, replace abbreviations and symbols with their word equivalents, and replace contractions with their fully expanded versions.

clean_qdap <- function(x){
 x<- bracketX(x)
 x<- replace_abbreviation(x)
 x<- replace_contraction(x)
 x<- replace_symbol(x)
 x<-tolower(x)
 return(x)
}

d) Apply the above functions

hillary_text <- sapply(hillary,convert_to_text)
hillary_text1 <- hillary_text
hill_remove_url<- replacefunc(hillary_text1)
hill_sub <- replace_dup(hill_remove_url)
hill_indx <- which(hill_sub=="")
hill_sub_complete <- hill_sub[-hill_indx]

trump_text <- sapply(trump,convert_to_text)
trump_text1 <- trump_text
trump_remove_url<- replacefunc(trump_text1)
trump_sub <- replace_dup(trump_remove_url)
trump_indx <- which(trump_sub=="")
trump_sub_complete <- trump_sub[-trump_indx]

# encode to UTF-8 : capable of encoding all possible characters defined by unicode
trump_sub_complete <- paste(trump_sub_complete,collapse=" ")
Encoding(trump_sub_complete) <- "UTF-8"
trump_sub_complete <- iconv(trump_sub_complete, "UTF-8", "UTF-8",sub='') 
                         #replace non UTF-8 by empty space
trump_clean <- clean_qdap(trump_sub_complete)
trump_clean1 <- trump_clean

hill_sub_complete <- paste(hill_sub_complete,collapse=" ")
Encoding(hill_sub_complete) <- "UTF-8"
hill_sub_complete <- iconv(hill_sub_complete, "UTF-8", "UTF-8",sub='') 
                     #replace non UTF-8 by empty space
hillary_clean <- clean_qdap(hill_sub_complete)
hillary_clean1 <- hillary_clean

 

Step 4: Convert the character vectors to VCorpus objects

trump_corpus <- VCorpus(VectorSource(trump_clean1))
hill_corpus <- VCorpus(VectorSource(hillary_clean1))

 

Step 5: Define and apply function to format the corpus objects

clean_corpus <- function(corpus){
 corpus <- tm_map(corpus, removePunctuation)
 corpus <- tm_map(corpus, stripWhitespace)
 corpus <- tm_map(corpus, removeNumbers)
 corpus <- tm_map(corpus, content_transformer(tolower))
 corpus <- tm_map(corpus, removeWords,
                  c(stopwords("en"), "supporters", "vote", "election", "like",
                    "even", "get", "will", "can", "amp", "still", "just", "now"))
 return(corpus)
}

trump_corpus_clean <- clean_corpus(trump_corpus)
hill_corpus_clean <- clean_corpus(hill_corpus)
  • Note: qdap's cleaning functions can be used on character vectors, but tm's functions need a corpus as input.

Step 6: Convert the corpora into TermDocumentMatrix(TDM) objects

Tdmobjecthillary <- TermDocumentMatrix(hill_corpus_clean)
Tdmobjecttrump <- TermDocumentMatrix(trump_corpus_clean)

Step 7: Convert the TDM objects into matrices

Tdmobjectmatrixhillary <- as.matrix(Tdmobjecthillary)
Tdmobjectmatrixtrump <- as.matrix(Tdmobjecttrump)

 

Step 8: Sum rows and create term-frequency dataframe

Freq <- rowSums(Tdmobjectmatrixhillary)
Word_freq <- data.frame(term= names(Freq),num=Freq)

Freqtrump <- rowSums(Tdmobjectmatrixtrump)
Word_freqtrump <- data.frame(term= names(Freqtrump),num=Freqtrump)
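Before plotting, it helps to eyeball the dominant terms. A sketch with a toy frequency vector standing in for the real rowSums() output:

```r
# Toy frequencies standing in for rowSums(Tdmobjectmatrixhillary)
Freq <- c(trump = 120, hillary = 95, election = 60, night = 10)
Word_freq <- data.frame(term = names(Freq), num = Freq,
                        stringsAsFactors = FALSE)

# sort descending to see which words will dominate the cloud
head(Word_freq[order(-Word_freq$num), ])
```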

Step 9: Prep for fancier wordclouds

# unify the corpora
cc <- c(trump_corpus_clean,hill_corpus_clean)

# convert to TDM
all_tdm <- TermDocumentMatrix(cc)
colnames(all_tdm) <- c("Trump","Hillary")

# convert to matrix
all_m <- as.matrix(all_tdm)


# Create common_words
common_words <- subset(all_m, all_m[, 1] > 0 & all_m[, 2] > 0)

# Create difference
difference <- abs(common_words[, 1] - common_words[, 2])

# Combine common_words and difference
common_words <- cbind(common_words, difference)

# Order the data frame from most differences to least
common_words <- common_words[order(common_words[, 3], decreasing = TRUE), ]

# Create top25_df
top25_df <- data.frame(x = common_words[1:25, 1], 
 y = common_words[1:25, 2], 
 labels = rownames(common_words[1:25, ]))
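Run on a toy two-column matrix (invented counts standing in for all_m), the subset/difference/order pipeline above is easy to follow:

```r
# Invented counts standing in for all_m (columns: Trump, Hillary)
m <- matrix(c(5, 3,
              2, 0,
              9, 4), ncol = 2, byrow = TRUE,
            dimnames = list(c("win", "email", "night"), c("Trump", "Hillary")))

common <- subset(m, m[, 1] > 0 & m[, 2] > 0)  # "email" drops out: 0 in one column
difference <- abs(common[, 1] - common[, 2])
common <- cbind(common, difference)
common[order(common[, 3], decreasing = TRUE), ]  # "night" (|9 - 4| = 5) ranks first
```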

Step 10: It’s word cloud time!

 

a) The ‘everyday’ cloud

wordcloud(Word_freq$term, Word_freq$num, scale=c(3,0.5),max.words=1000, 
          random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, 
          colors=brewer.pal(5, "Blues"))

wordcloud(Word_freqtrump$term, Word_freqtrump$num, scale=c(3,0.5),max.words=1000,
          random.order=FALSE, rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(5, "Reds"))

[Image: word clouds for the “hillary” (blues) and “trump” (reds) corpora]

 

b) The Polarized pyramid plot

# Create the pyramid plot
pyramid.plot(top25_df$x, top25_df$y, labels = top25_df$labels, 
 gap = 70, top.labels = c("Trump", "Words", "Hillary"), 
 main = "Words in Common", laxlab = NULL, 
 raxlab = NULL, unit = NULL)

[Image: polarized pyramid plot of the top 25 common words]

 

c) The comparison cloud 

comparison.cloud(all_m, colors = c("red", "blue"),max.words=100)

[Image: comparison cloud]

d) The commonality cloud 

commonality.cloud(all_m, colors = "steelblue1",max.words=100)

[Image: commonality cloud]

 

We made it!  That’s it for this post, folks.

Coming up next: Mining deeper into text.

 

 

 

 


Linear mixed-effects model with bootstrapping.


Dataset here.

We are going to perform a linear mixed effects analysis of the relationship between height and treatment of trees, as studied over a period of time.

Begin by reading in the data and making sure that the variables have the appropriate datatype.

tree<- read.csv(path, header=T,sep=",")
tree$ID<- factor(tree$ID)
tree$plotnr <- factor(tree$plotnr)

Plot the initial graph to get a sense of the relationship between the variables.

library(lattice)
bwplot(height ~ Date|treat,data=tree)

[Figure: box plots of height by date, for each treatment]

There seems to be a linear-ish relationship between height and time: the height increases over time. The effect of the treatments (I, IL, C, F) is not clear at this point.

Based on the linear nature of the relationship, we can model time as numeric, i.e., number of days.

tree$Date<- as.Date(tree$Date)
tree$time<- as.numeric(tree$Date - min(tree$Date))
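A minimal sketch of that conversion with made-up dates:

```r
# Made-up measurement dates; time = days elapsed since the first one
d <- as.Date(c("2010-05-01", "2010-06-01", "2010-08-15"))
time <- as.numeric(d - min(d))
time  # 0 31 106
```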

The data was collected from the same plots on multiple dates, making this a candidate for repeated measures.

Let’s visually check for the by-plot variability.

boxplot(height ~ plotnr,data=tree)

[Figure: box plots of height by plot number]

The variation between plots isn’t big, but there are still noticeable differences which will need to be accounted for in our model.

 

 

 

Let’s fit two models: one with interaction between the predictors and one without.

library(lme4)
library(lmerTest)

#fit a model

mod <- lmer(height ~ time + treat + (1|plotnr),data=tree) #without interaction
mod.int <- lmer(height ~ time * treat + (1|plotnr),data=tree) #with interaction

Upon comparing the two, it’s evident that the model with the interaction performs better at predicting responses.

> anova(mod,mod.int)
refitting model(s) with ML (instead of REML)
Data: tree
Models:
object: height ~ time + treat + (1 | plotnr)
..1: height ~ time * treat + (1 | plotnr)
       Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)
object  7 22654 22702 -11320    22640
..1    10 20054 20122 -10017    20034 2605.9      3  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So far we have a random-intercept model; we should also check whether adding a random slope makes the model better. We do this by adding a random slope and comparing the result with the random-intercept model from before.

mod.int.r <- lmer(height ~ time * treat + (treat|plotnr),data=tree)
anova(mod.int,mod.int.r)

> anova(mod.int,mod.int.r) # we don't need the random slope
refitting model(s) with ML (instead of REML)
Data: tree
Models:
object: height ~ time * treat + (1 | plotnr)
..1: height ~ time * treat + (treat | plotnr)
       Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)
object 10 20054 20122 -10017    20034
..1    19 20068 20197 -10015    20030 3.6708      9     0.9317   # AIC goes up

Nope. Don’t need the slope. Dropping it.

Continuing with mod.int, let’s determine the p-values using Anova() from the car package.

library(car)
Anova(mod.int)

[Output: Anova table for mod.int]

We can see that the main effects of time and treat are highly significant, as is the interaction between the two.

Running a summary() spews out the following log.

[Output: summary(mod.int) coefficient table]

As seen, time:treatF is not significantly different from the reference level, time:treatC; that is, there is no difference between the F treatment and the C treatment in terms of their interaction with time. Treatments I and IL are significantly different from treatment C in terms of their interaction with time. Although there is a significant main effect of treatment, none of its levels are actually different from the first level.

Graphically represented, the effect of time on height, by treatment, would look like this.

[Figure: height over time, by treatment]

 

Let’s check the data for adherence to model assumptions.

The plot of residuals versus fitted values doesn’t reflect any obvious pattern in the residuals. So the assumption of linearity has not been violated, and the ‘blob’-like nature of the plot suggests the preservation of homoskedasticity (one day, I shall pronounce this right).

 

[Figure: residuals versus fitted values]

We’ll check for the assumption of normality of residuals by plotting a histogram. Looks okay to me.

[Figure: histogram of residuals]
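Both diagnostics are one-liners. Sketched here with a plain lm() on R’s built-in cars data, since the idea is identical; for the mixed model you would plot fitted(mod.int) against resid(mod.int):

```r
# Residuals-vs-fitted and residual-histogram diagnostics, on built-in data
fit <- lm(dist ~ speed, data = cars)

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # a patternless 'blob' around this line is what we want

hist(resid(fit))        # roughly bell-shaped supports the normality assumption
```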

There are several methods for checking for multicollinearity. Here we’ll follow the simplest one, which is to consult the correlation matrix in the output of summary(mod.int) from before. Since none of the correlations between predictors approach 1, we can say multicollinearity is not a problem.

 

Moving on.

Bootstrapping is a great resampling technique for arriving at CIs when dealing with mixed-effects models, where the distribution of many of their outputs is not clearly known. While one would want as high a number of resamples as possible to get tighter CIs, we also need to make allowance for the longer associated processing times.

Here, we want predictions for the control (C) treatment. We create a test dataset, newdat, which goes as input to predict().

newdat <- subset(tree,treat=="C")
bb <- bootMer(mod.int, FUN=function(x)predict(x, newdat, re.form=NA),
 nsim=999)

#extract the quantiles and the fitted values.
lci <- apply(bb$t, 2, quantile, 0.025)   
uci <- apply(bb$t, 2, quantile, 0.975)   
pred <- predict(mod.int,newdat,re.form=NA)

 

Plotting the confidence intervals.

library(scales)
palette(alpha(c("blue","red","forestgreen","darkorange"),0.5))
plot(height~jitter(time),col=treat,data=tree[tree$treat=="C",],pch=16)
lines(pred~time,newdat,lwd=2,col="orange")
lines(lci~time,newdat,lty=2,col="orange")
lines(uci~time,newdat,lty=2,col="orange")

[Figure: bootstrap confidence intervals around the fitted line for treatment C]

 

 

 

Conclusion

Using R and lme4 (Bates, Maechler & Bolker, 2012), we performed a linear mixed-effects analysis of the relationship between height and treatment of trees, as studied over a period of time. As fixed effects, we entered time and treatment (with an interaction term) into the model. As random effects, we had intercepts for plotnr (plot numbers). Visual inspection of residual plots did not reveal any obvious deviations from homoscedasticity or normality. P-values were obtained by likelihood ratio tests via anova of the full model with the effect in question against the model without it.

 

 

… and it’s a wrap, folks!

Thanks again for stopping by, and again, I’d be happy to hear your comments and suggestions. Help me make this content better! 🙂

 

 

 

Chick Weight vs Diet – A case for one-way ANOVA.


The chick1 dataset is a data frame consisting of 578 rows and 4 columns (“Weight”, “Time”, “Chick” and “Diet”) which represents the progression of weight of several chicks. The little chicklings are each given a specific diet. There are four types of diet, and the farmer wants to know which one fattens the chicks the fastest.

It’s time to do some exploratory statistics on the data frame!

Once the data has been read in using read.csv, we perform a few operations, shown below, to make the data ggplot2-ready. The data itself is pretty clean and doesn’t require much work.

 

chick1$Weight<-as.numeric(chick1$Weight)
chick1$Time<- as.numeric(chick1$Time)

chick1$Chick<-factor(chick1$Chick)

chick1$Diet<-factor(chick1$Diet)

chick1<-chick1[-579,] #remove the one NA
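Incidentally, if you don’t have the csv handy, R ships an equivalent built-in dataset, datasets::ChickWeight, with the same 578 rows and the same four columns (note the lowercase weight there):

```r
# Built-in equivalent of the chick1 csv
data(ChickWeight)
str(ChickWeight)  # 578 obs. of 4 variables: weight, Time, Chick, Diet
```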

And now for the ggplot-ing .

 

Use ggplot() to map Time to x and Weight to y within the aes() function. Add col=Diet to colour the lines by the diet fed to the chicks.

Add geom_line() at the end to draw the lines. To draw one line per chick, add group=Chick to the aes() of geom_line().

Finally, add geom_smooth() to draw a smoothing curve that shows the general trend of the data. The gray area around the curve is a confidence interval, suggesting how much uncertainty there is in this smoothing curve.

ggplot(chick1,aes(x=Time,y=Weight,col=Diet)) + geom_line(aes(group=Chick)) + 
geom_smooth(lwd=2)

[Figure: chick weight over time, by diet, with smoothing curves]

If we want to turn off the confidence interval, we can add an option to geom_smooth(); specifically se=FALSE, where “se” stands for “standard error”.

ggplot(chick1,aes(x=Time,y=Weight,col=Diet)) + geom_line(aes(group=Chick)) + 
geom_smooth(lwd=2,se=FALSE)

 

[Figure: the same plot without confidence intervals]

The visual clearly shows that Diet 3 fattens up the chicks the fastest.

Now we have 4 groups of the independent variable i.e. Diet and one dependent variable. This is a good case to be analysed by ANOVA (Analysis of Variance). ANOVA is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. In one-way ANOVA the data is sub-divided into groups based on a single classification factor.

Our assumptions here are:
# Dependent variable: Weight
# Independent variable: Diet 1, 2, 3, 4 (treatment)
# Null hypothesis H0: all diets lead to equal weight gain
# Alternative hypothesis H1: not all diets lead to equal weight gain

When we conduct an analysis of variance, the null hypothesis is that there is no difference in treatment means; once the null hypothesis is rejected, the question becomes which treatment differs.


Next, we need to test whether the assumption of homogeneity of variance holds; it must for the results of ANOVA to be valid. This is done using Levene’s test.

library(car) # for leveneTest()
leveneTest(chick1$Weight~chick1$Diet)

The test gives a high F-value of 9.6 and a low p-value; this means the data does not show homogeneity of variance, i.e., we have violated an assumption of ANOVA and need to take steps to rectify this. We do so with a logarithmic transformation.

chick2<-chick1
chick2$Weight<-log(chick2$Weight)

Re-running Levene’s test on the transformed data,

leveneTest(chick2$Weight~chick2$Diet)

gives a p-value of 0.049 ≈ 0.05. A value of p = 0.05 is probably good enough to consider the homogeneity-of-variance assumption met. Now we can perform the ANOVA knowing that the results will hold true.

aov2<-aov(chick2$Weight~chick2$Diet)
summary(aov2)

The results show an F-value of 8.744 and a p-value of 1.12e-05, indicating a significant effect. The between-group variance is 8.744 times the within-group variance, i.e., the variation of mean weight among the diets is much larger than the variation of weight within each diet, and our p-value is less than 0.05 (the usual scientific standard). Hence we reject the null hypothesis and accept the alternative hypothesis H1: there is a significant relationship between diet and weight gain.

 

 

The F-test showed a significant effect somewhere among the groups. However, it did not tell us which pairwise comparisons are significant. This is where post-hoc tests come into play. They help you find out which groups differ significantly from one another and which do not. More formally, post-hoc tests allow for multiple pairwise comparisons without inflating the type I error rate (i.e. rejecting the null hypothesis when it is actually true).

In R you can use Tukey’s procedure via the TukeyHSD() function, whose input is the fitted aov object.

tukey<-TukeyHSD(aov2)
> tukey
 Tukey multiple comparisons of means
 95% family-wise confidence level

Fit: aov(formula = chick2$Weight ~ chick2$Diet)

$`chick2$Diet`
           diff      lwr      upr     p adj
2-1 0.15278600 -0.01463502 0.3202070 0.0879198
3-1 0.28030756 0.11288653 0.4477286 0.0001109
4-1 0.26747943 0.09914285 0.4358160 0.0002824
3-2 0.12752155 -0.06293542 0.3179785 0.3114949
4-2 0.11469342 -0.07656886 0.3059557 0.4112309
4-3 -0.01282813 -0.20409041 0.1784342 0.9981647

From the table above (looking at “diff” and “p adj” columns) we can see which diets have significant differences in weight from others. For example we can conclude that:

  • There is no significant difference in weight between groups 3-2 (p=0.311), groups 4-2 (p=0.411), groups 4-3 (p=0.998) and groups 2-1 (p=0.088).
  • There is a significant difference in weight between groups 3-1 (p=0.0001109) and 4-1 (p=0.0002824).

Finally, the results obtained above can also be visualized by plotting the “tukey” object in R. Significant differences are the ones whose intervals do not cross the zero value.

plot(tukey) # plot the confidence intervals.

[Figure: Tukey HSD confidence intervals for the pairwise diet comparisons]

Each line represents the mean difference between two groups with the corresponding confidence interval. Whenever the confidence interval doesn’t include zero (the vertical line), the difference between the two groups is significant.

I hope you found it helpful and as always, comments are welcome!

Thanks for stopping by!