Stepwise Regression – What’s not to like ?



Plenty, apparently.

Besides encouraging you not to think , it doesn’t exactly do a great job at what it claims to do. Given a set of predictors, there is no guarantee that stepwise regression will find the optimal combination. Many of my statisticians buddies , whom I consult from time to time,  have a  gripe with it because it’s not  sensitive to the context of the research. Seems fair.


I built an interactive Shiny app to evaluate results from Stepwise regression (direction = “backward) when applied to different predictors and datasets. What I observed during model building and cross-validation was that the model performed better on the data at hand but performs much worse when subjected to cross-validation.After a lot of different random selections and testing, I eventually did find a model that worked well on both the fitted dataset and the cross-validation set, but it performed poorly when applied to new data.Therein lies at least most of the problem.


Shiny app can be found here.

The initial model was built to predict the ‘Life Expectancy’ and it does, to a certain extent, do it’s job . But when generalized, it pretty much turned out to be a bit of an uncertainty ridden damp squib. For example, predictions for the variables from the same dataset , such as, ‘Population’ ,’Frost’, ‘Area’ are nowhere close to the observed values. At the same time, the model did okay for variables such as ‘Illiteracy’ and ‘HS.Grad’.

Given all these drawbacks ( and more! ), people do find the motivation to use stepwise regression to produce a simpler model in terms of number of coefficients. It does not necessarily find the optimal model, but it does give a hunch of the possible combination of predictors.

While no one would conclude a statistical study based on stepwise results or publish a paper with it, some might find uses for it, say, to  verify models already created by software systems. Or as an easy-to-use tool for initial exploratory data analysis (with all the necessary caveats in place !) .

You win some, you lose some.

What do you think ? Leave a comment!

p.s. You can find the (needs-to-be-cleaned-up) code for the Shiny app here.





Mixed-design ANOVA : 2 between-subject factors and 1 within-subject factor


Suppose you want to examine the impact of diet and exercise on pulse rate. To investigate these issues, you collect a sample of 18 individuals and group them according to their dietary preferences: meat eaters and vegetarians. You then divide each diet category into three groups, randomly assigning each group to one of three types of exercise: exercise1, exercise2, exercise 3.  In addition to these between-subjects factors, you want to include a single within-subjects factor in the analysis. Each subject’s pulse rate will be measured at three levels of exertion: intensity1, intensity2, intensity3.

So we have 3 factors to work with:

  • Two between-subjects (grouping) factors: dietary preference and exercise type.
  • One within-subjects factor : intensity (of exertion)

This is what our data looks like. Onwards, then!

1 112 166 215 1
1 111 166 225 1
1 89 132 189 1
1 95 134 186 2
1 66 109 150 2
1 69 119 177 2
2 125 177 241 1
2 85 117 186 1
2 97 137 185 1
2 93 151 217 2
2 77 122 178 2
2 78 119 173 2
3 81 134 205 1
3 88 133 180 1
3 88 157 224 1
3 58 99 131 2
3 85 132 186 2
3 78 110 164 2

After reading in the file,  we give the columns appropriate names.

diet <- read_excel(path,col_names=F)
names(diet) <- c("subject","exercise","intensity1","intensity2","intensity3",

Then we convert ‘exercise’,’subject’ and ‘diet’ into factors .

diet$exercise<- factor(diet$exercise)
diet$diet<- factor(diet$diet)
diet$subject <- factor(diet$subject)

For repeated measures ANOVA, the data must be in the long form . We will use the melt() form the reshape2 package to achieve this. We are now at one row per participant per condition.

diet.long <- melt(diet, id = c("subject","diet","exercise"), 
 measure = c("intensity1","intensity2","intensity3"),"intensity")

At this point we’re ready to actually construct our ANOVA.

Our anova looks like this –

mod <- aov(value ~ diet*exercise*intensity + Error(subject/intensity) , 

The asterisk specifies that we want to look at the interaction between the three factors. But since this is a repeated measures design as well, we need to specify an error term that accounts for natural variation from participant to participant.

Running a summary() on our anova above  yields the following results –


The main conclusions we can arrive at are as follows:

  • There is a significant main effect of ‘diet’ on the pulse rate. We can conclude that a statistically significant difference exists between vegetarians and meat eaters on their overall pulse rates.
  • There is a statistically significant within-subjects main effect for intensity.
  • There is a marginally statistically significant interaction between diet and intensity. We’ll look at this later.
  • The type of exercise has no statistically significant effect on overall pulse rates.


Let’s plot the average pulse rate as explained by diet, exercise, and the intensity.

mp1 <- stack(
mp1<- separate(mp1,ind,c("Intensity","Exercise"))
mp1$Diet<- rep(seq_len(nrow(mean_pulse1)),ncol(mean_pulse1))
mp1$Diet <- factor(mp1$Diet,labels = c("Meat","Veg"))
ggplot(mp1,aes(Intensity,values,group=Diet,color=Diet)) + geom_line(lwd=1) + xlab("Intensity of the exercise") +
 ylab("Mean Pulse Rate") + ggtitle("Mean Pulse rate - \n Exercise Intensity vs Diet") + theme_grey()+
 facet_grid(Exercise ~.)





The plot agrees with our observations from earlier.



UPDATE: Understanding the Results

Earlier we had rejected a null hypothesis and concluded that change in mean pulse rate across intensity levels marginally depends upon dietary preference. Now ,we will turn our attention to the study of this interaction.

We begin by plotting an interaction plot as follows:

interaction.plot(mp1$Intensity, mp1$Diet, mp1$values , type="b", col=c("red","blue"), legend=F,
 lwd=2, pch=c(18,24),
 xlab="Exertion intensity", 
 ylab="Mean pulse rate ", 
 main="Interaction Plot")



We see that the mean pulse rate increases across exertion intensity(‘trials’) : this is the within-subject effect.

Further, it’s clear that vegetarians have a lower average pulse rate than do meat eaters at every trial: this is the diet main effect.

The difference between the mean pulse rate of meat-eaters vs vegetarians is different at each exertion level. This is the result of the diet by intensity interaction.

The main effect for diet is reflected in the fact that meat-eaters had a mean pulse rate roughly 10 to 20 points higher than that for vegetarians.

The main effect of intensity is reflected in the fact  for both diet groups, the mean pulse rate after jogging increased about 50 points beyond the rate after warmup exercises, and increased another  55 points (approx.) after running.

The interaction effect of diet and intensity is reflected in the fact that the gap between the two dietary groups changes across the three intensities. But this change is not as significant as the main effects of diet and intensity.




That’s all,folks.

Did you find this article helpful?  Can it be improved upon ? Let me know!

Also, you can find the code here.

Until next time!


Multiple Regression (sans interactions) : A case study.



state.x77 – Standard built-in dataset with 50 rows and 8 columns giving the following statistics in the respective columns.

Populationpopulation estimate as of July 1, 1975

Incomeper capita income (1974)

Illiteracy: illiteracy (1970, percent of population)

Life Exp: life expectancy in years (1969–71)

Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)

HS Grad: percent high-school graduates (1970)

Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city

Area: land area in square miles




Conduct multiple regression to arrive at an optimal model which is able to predict the life expectancy .



We begin this exercise by running a few standard diagnostics on the data including by not limited to  head(), tail() , class(), str(), summary() ,names() etc.

The input dataset turns out to be a matrix which will need to be coerced into a data frame using

Next,we’ll do a quick exploratory analysis on our data to examine the variables for outliers and distribution before proceeding. One way to do this is to plot qqplots for all the variables in the dataset.

for(i in 1:ncol(st)){



It’s a good idea to check the correlation between the variables to get a sense of the dependencies . We do this here using the cor(dataframe) function.


We can see that life expectancy appears to be moderately to strongly related to “Income” (positively), “Illiteracy” (negatively), “Murder” (negatively), “HS.Grad” (positively), and “Frost” (no. of days per year with frost, positively).


We can also view this by plotting our correlation matrix using pairs(dataframe)  which corroborates the cor.matrix  as seen above.


As a starting point for multiple regression would be to build a “full” model which will have Life Expectancy as the response variable and all other variables as explanatory variables. The summary of that linear model will be our “square one” and we will proceed to whittle it down until we reach our optimal model.

> model1 = lm(Life.Exp ~ ., data=st)
> summary(model1)

lm(formula = Life.Exp ~ ., data = st)

 Min         1Q     Median   3Q       Max 
-1.47514 -0.45887 -0.06352 0.59362 1.21823 

             Estimate Std. Error t value Pr(>|t|) 
(Intercept) 6.995e+01  1.843e+00  37.956  < 2e-16 ***
Population  6.480e-05  3.001e-05   2.159   0.0367 * 
Income      2.701e-04  3.087e-04   0.875   0.3867 
Illiteracy  3.029e-01  4.024e-01   0.753   0.4559 
Murder     -3.286e-01  4.941e-02  -6.652   5.12e-08 ***
HS.Grad     4.291e-02  2.332e-02    1.840  0.0730 . 
Frost      -4.580e-03  3.189e-03   -1.436  0.1585 
Area       -1.558e-06  1.914e-06   -0.814  0.4205 
Density    -1.105e-03  7.312e-04   -1.511  0.1385 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7337 on 41 degrees of freedom
Multiple R-squared: 0.7501, Adjusted R-squared: 0.7013 
F-statistic: 15.38 on 8 and 41 DF, p-value: 3.787e-10

We see that this 70% of the variation in Life Expectancy is explained by all the variables together (Adjusted R-squared).

Let’s see if this can be made better. We’ll try to do that by removing one predictor at a time from model1, starting with “Illiteracy” (p-value = 0.4559 i.e. least significance) , until we have a model with all significant predictors.

> model2 = update(model1, .~. -Illiteracy)
> summary(model2)

lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad + 
 Frost + Area + Density, data = st)

 Min 1Q Median 3Q Max 
-1.47618 -0.38592 -0.05728 0.58817 1.42334 

              Estimate Std. Error t value Pr(>|t|) 
(Intercept)  7.098e+01  1.228e+00   57.825   < 2e-16 ***
Population   5.675e-05  2.789e-05     2.034    0.0483 * 
Income       1.901e-04  2.883e-04     0.659    0.5133 
Murder       -3.122e-01 4.409e-02     -7.081    1.11e-08 ***
HS.Grad       3.652e-02 2.161e-02      1.690    0.0984 . 
Frost        -6.059e-03 2.499e-03      -2.425    0.0197 * 
Area         -8.638e-07 1.669e-06      -0.518     0.6075 
Density       -8.612e-04 6.523e-04       -1.320   0.1939 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7299 on 42 degrees of freedom
Multiple R-squared: 0.7466, Adjusted R-squared: 0.7044 
F-statistic: 17.68 on 7 and 42 DF, p-value: 1.12e-10


Next, Area must go (p-value = 0.6075!)


> model3 = update(model2, .~. -Area)
> summary(model3)

lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad + 
 Frost + Density, data = st)

 Min 1Q Median 3Q Max 
-1.49555 -0.41246 -0.05336 0.58399 1.50535 

            Estimate Std. Error t value Pr(>|t|) 
(Intercept)  7.131e+01 1.042e+00 68.420 < 2e-16 ***
Population   5.811e-05 2.753e-05 2.110 0.0407 * 
Income       1.324e-04 2.636e-04 0.502 0.6181 
Murder      -3.208e-01 4.054e-02 -7.912 6.32e-10 ***
HS.Grad      3.499e-02 2.122e-02 1.649 0.1065 
Frost       -6.191e-03 2.465e-03 -2.512 0.0158 * 
Density     -7.324e-04 5.978e-04 -1.225 0.2272 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7236 on 43 degrees of freedom
Multiple R-squared: 0.745, Adjusted R-squared: 0.7094 
F-statistic: 20.94 on 6 and 43 DF, p-value: 2.632e-11

We will continue to throw out the predictors in this fashion . . . . .

> model4 = update(model3, .~. -Income)
> summary(model4)
> model5 = update(model4, .~. -Density)
> summary(model5)

. . . .  until we reach this point

> summary(model5)

lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, 
 data = st)

 Min 1Q Median 3Q Max 
-1.47095 -0.53464 -0.03701 0.57621 1.50683 

              Estimate Std. Error t value Pr(>|t|) 
(Intercept)  7.103e+01 9.529e-01 74.542   < 2e-16 ***
Population   5.014e-05 2.512e-05 1.996     0.05201 . 
Murder      -3.001e-01 3.661e-02 -8.199    1.77e-10 ***
HS.Grad      4.658e-02 1.483e-02 3.142     0.00297 ** 
Frost       -5.943e-03 2.421e-03 -2.455    0.01802 * 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.7126 
F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12

Now, here we have “Population” with a borderline p-value of 0.052 . Do we throw it out ? Do we keep it ? One way to know is to test for interactions.

> interact= lm(Life.Exp ~ Population * Murder * HS.Grad * Frost,data=st)
> anova(model5,interact)
Analysis of Variance Table

Model 1: Life.Exp ~ Population + Murder + HS.Grad + Frost
Model 2: Life.Exp ~ Population * Murder * HS.Grad * Frost
 Res.Df RSS Df Sum of Sq F Pr(>F) 
1 45 23.308 
2 34 14.235 11 9.0727 1.9699 0.06422 .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

P-value of 0.06422 indicates that there is no significant interaction between the terms so we can drop “Population” as it doesn’t really contribute to anything at this point.BUT, if we drop “Population”, the adj R-squared drops a little as well . So as a trade-off, we’ll just keep it. 

So there you have it,folks.Our minimal optimal model with statistically significant slopes.


Run model diagnostics

par(mfrow=c(2,2)) # to view all 4 plots at once


1. Residuals vs Fitted

We find equally spread residuals around a horizontal line without distinct patterns so that is a that is a good indication we don’t have non-linear relationships.

2. Normal Q-Q plot

#This plot shows if residuals are normally distributed. Our residuals more or less follow a straight line well, so that’s an encouraging sign.

3. Scale-location
This plot shows if residuals are spread equally along the ranges of predictors. This is how we can check the assumption of equal variance (homoscedasticity). It’s good that we can see a horizontal(-ish) line with equally (randomly) spread points.

4. Residuals vs Leverage

This plot helps us locate the influential cases that have an effect on the regression line.We look for subjects outside the dashed line , the Cook’s distance.The regression results will be altered if we exclude those cases.
‘Full’ model VS ‘Optimal’ model

> anova(model1,model5)
Analysis of Variance Table

Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
 Frost + Area + Density
Model 2: Life.Exp ~ Population + Murder + HS.Grad + Frost
 Res.Df RSS Df Sum of Sq F Pr(>F)
1 41 22.068 
2 45 23.308 -4 -1.2397 0.5758 0.6818

The p-value of 0.6818 suggests that we haven’t lost any significance from when we started, so we’re good.

Onto some cross-validation

We will perform a 10-fold cross validation (m=10 in CVlm function).

A comparison of the cross- validation plots shows that model5 performs the best out of all other models.The lines of best fit are relatively parallel(-ish) and most big symbols are close to the lines indicating a decent accuracy of prediction.Also, the mean squared error is small , 0.6 .



cvResults <- suppressWarnings(CVlm(data=st, form.lm=model5, m=10, dots=FALSE, seed=29, legend.pos="topleft", printit=FALSE));

mean_squared_error <- attr(cvResults, 'ms')



The relative importance of predictors

The relaimpo package provides measures of relative importance for each of the predictors in the model.


 # Bootstrap Measures of Relative Importance (1000 samples) 
 boot <- boot.relimp(model5, b = 1000, type = c("lmg", 
 "last", "first", "pratt"), rank = TRUE, 
 diff = TRUE, rela = TRUE)

# print result
 plot(booteval.relimp(boot,sort=TRUE)) # plot result


Yup. ‘Murder’ is the most important predictor of Life Expectancy, followed by ‘HS.Grad’ and ‘Frost. ‘Population’  sits smug at the other end of the spectrum with marginal effect and importance.

Well, looks like this is as good as it is going to get with this model.

I will leave you here, folks. Please feel free to work through various datasets (as I will) to get a firm grasp on things.

Did you find this article helpful?  Can it be improved upon ? Please leave me a comment !

Until next time!











An Investigation for Determining the Optimum Length of Chopsticks : A case for Single-Factor Repeated Measures ANOVA.


In the pursuit to determine the optimum length of chopsticks, two laboratory studies were conducted, using a randomised complete block design, to evaluate the effects of the length of the chopsticks on the food-serving performance of adults and children. Thirty-one male junior college students and 21 primary school pupils served as subjects for the experiment.The response was recorded in terms of food-pinching efficiency (number of peanuts picked, and placed in a cup).

Data source

The data was read in using read.csv() and allotted appropriate column names.

stick<-read.csv(file.choose(),header = TRUE,stringsAsFactors = FALSE)


Next, we’ll convert  the ‘Length’ and ‘Individual’ to factors and the dependent variable ‘Efficiency’ to numeric to carry out our repeated measures ANOVA analysis.

chop$Length<- factor(chop$Length)

Let’s use the tapply() to find out the mean pinching efficiency grouped by the chopstick length and plot the result.


plot(group_mean,type="p",xlab="Chopstick Length",ylab="Mean Efficiency",
 main="Average Food Pinching Efficiency \n by Chopstick Length",col="green",pch=16)



Visually, we can see that the efficiency grows steadily from 180 mm through to the 240  and then falling sharply at 270 mm and despite a slight increase at 300,it continues to fall at 330.

Before we can perform the repeated measures ANOVA, we will need to check for the Sphericity which assumes the homogeneity of variance among differences between all possible pairs of groups. This is done by the Mauchly’s test.

chop1<- cbind(chop$Efficiency[chop$Length==180],

mauchly.test (lm (chop1 ~ 1), X = ~1)

> mauchly.test (lm (chop1 ~ 1), X = ~1)

           Mauchly's test of sphericity
           Contrasts orthogonal to

data: SSD matrix from lm(formula = chop1 ~ 1)
W = 0.43975, p-value = 0.05969

Since, (thankfully!) the p-value is >0.05, the homogeneity of variance assumption holds and no further adjustment is necessary.

We can now get on with the repeated measures ANOVA.

aov.chop = aov(Efficiency~Length + Error(Individual/Length),data=chop)

Error: Individual
 Df Sum Sq Mean Sq F value Pr(>F)
Residuals 30 2278 75.92 

Error: Individual:Length
         Df Sum Sq Mean Sq F value Pr(>F) 
Length    5   106.9 21.372   5.051 0.000262 ***
Residuals 150 634.6 4.231 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The F-ratio is 5.051 and p-value is 0.000262 which means there is significant effect of the length of the chopstick on the eating efficiency.

Since, the F-ratio is significant , we can now process the post-hoc analysis to perform pair-wise comparisons.

1. Holm’s adjustment


> with(chop,pairwise.t.test(Efficiency,Length,paired=T))

 Pairwise comparisons using paired t tests 

data: Efficiency and Length 

      180 210   240   270    300 
210 1.000 - - - - 
240 0.323 1.000 - - - 
270 1.000 0.034 0.035 - - 
300 1.000 1.000 0.198 1.000 - 
330 0.630 0.048 8.1e-05 1.000 0.435

P value adjustment method: holm

2. Bonferroni adjustment


> with(chop,pairwise.t.test(Efficiency,Length,paired=T,p.adjust.method="bonferroni"))

 Pairwise comparisons using paired t tests 

data: Efficiency and Length 

       180 210 240 270 300 
210 1.000 - - - - 
240 0.485 1.000 - - - 
270 1.000 0.036 0.040 - - 
300 1.000 1.000 0.269 1.000 - 
330 1.000 0.060 8.1e-05 1.000 0.725

P value adjustment method: bonferroni

As we can see, the Bonferroni adjustment inflates the p-values as compared to Holm’s adjustment.

3. Treatment-by-Subjects

Another method which can be employed is the Treatment-by-Subjects i.e. as a 2 factor design without interaction , without replication. The advantage of this approach is that it allows us to employ the TukeyHSD() post-hoc .


aov.chop1 = aov(Efficiency ~ Length + Individual, data=chop)

> summary(aov.chop1)
         Df   Sum Sq Mean Sq     F value Pr(>F) 
Length     5   106.9       21.37 5.051    0.000262 ***
Individual 30 2277.5       75.92 17.944   < 2e-16 ***
Residuals 150 634.6 4.23 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


> tukey
 Tukey multiple comparisons of means
 95% family-wise confidence level

Fit: aov(formula = Efficiency ~ Length + Individual, data = chop)

           diff        lwr       upr        p adj
210-180 0.54870968 -0.9595748 2.05699418 0.8999148
240-180 1.38774194 -0.1205426 2.89602644 0.0904885
270-180 -0.61129032 -2.1195748 0.89699418 0.8503866
300-180 0.03290323 -1.4753813 1.54118773 0.9999999
330-180 -0.93548387 -2.4437684 0.57280063 0.4749602
240-210 0.83903226 -0.6692522 2.34731676 0.5959492
270-210 -1.16000000 -2.6682845 0.34828450 0.2346843
300-210 -0.51580645 -2.0240910 0.99247805 0.9213891
330-210 -1.48419355 -2.9924781 0.02409096 0.0565555
270-240 -1.99903226 -3.5073168 -0.49074775 0.0025803
300-240 -1.35483871 -2.8631232 0.15344579 0.1053005
330-240 -2.32322581 -3.8315103 -0.81494130 0.0002412
300-270 0.64419355 -0.8640910 2.15247805 0.8199855
330-270 -0.32419355 -1.8324781 1.18409096 0.9893780
330-300 -0.96838710 -2.4766716 0.53989741 0.4349561

By studying the difference between the means and the Tukey adjusted p-value, the most significant diff is between the groups 240-330 , with a mean difference of 2.32.



The results show that the food-pinching performance is significantly affected by the length of the chopsticks, and that chopsticks of about 240 mm long performed the best.



Beauty is talent: A case for 2-way ANOVA.





Landy and Sigall (1974). “Beauty is Talent: Task Evaluation as a Function of the Performer’s Physical Attraction,” Journal of Personality and Social Psychology, 29:299-304.

60 male undergraduates read an essay supposedly written by a female college freshman. They then evaluated the quality of the essay and the ability of its writer on several dimensions. By means of a photo attached to the essay, 20 evaluators were led to believe that the writer was physically attractive and 20 that she was unattractive. The remaining evaluators read the essay without any information about the writer’s appearance. 30 evaluators read a version of the essay that was well written while the other evaluators read a version that was poorly written. Significant main effects for essay quality and writer attractiveness were predicted and obtained.


Two Factor experiment to assess if appearance effects  judgment of student’s essay.             Appearance (Attractive/Control/Unattractive) and Essay Quality (Good/Poor) are factors (Pictures given with essay, except in control).


Grade on essay. Data simulated to match their means and standard deviations.

Essay Quality 8 /* 1=Good, 2=Poor */
Student Attactiveness 16 /* 1= Attractive, 2=Control, 3=Unattractive */
Score 18-24


When we import halo.dat using read.delim() ,we get a single  column V1 containing all the data values. The separation of V1 into 3 distinct columns and other subsequent transformations are done using functions from the very handy dplyr package, as shown below.

hal<-read.delim(file.choose(),header = FALSE,stringsAsFactors = FALSE)
halo<- select(halo,2:4)

Next we’ll convert the 2 independent variables (Essay Quality, Attractiveness) to factors and the dependent variable (Score) to numeric to carry out our ANOVA analysis.

halo$attractiveness <- factor(halo$attractiveness)

We can test 3 hypotheses:

1. Are scores higher for more attractive students ?
2. Are scores higher for higher quality essay?
3. Whether student attractiveness and essay quality interact resulting in higher scores?

Let’s start by look at the means of our groups. Here, we will use the tapply() function to generate a table of means and use the results to plot a graph.


# Plot a bar-plot
barplot(halo_mean, beside = TRUE, 
 col = c("orange","blue","green"), 
 main = "Mean Scores", 
 xlab = "Essay Quality", 
 ylab = "Score")

# Add a legend
legend("topright", fill = c("orange","blue","green"), 
 title = "Attractiveness",c("1","2","3"),

> halo_mean
    1       2
1 17.900 14.899
2 17.900 13.400
3 15.499 8.701


Visually, we can see that for a good quality essay,the students were evaluated most favorably when they were attractive or when their appearance was unknown , least when they were unattractive.For a poor grade essay,the students were evaluated most favorably when they were attractive, least when they were unattractive, and intermediately when their appearance was unknown.

Let’s explore and further refine our study by performing a 2-way ANOVA. But before we do that, we need to test the homogeneity of the variance assumption.The assumption must hold for the results of an ANOVA analysis to be valid.


The result gives us a p-value of 0.2989 which means that the assumption of homogeneity of variance holds. There is not a significant difference in the variances across the groups.

And now for the ANOVA.


> summary(aov_halo)
                                    Df Sum Sq Mean Sq F value Pr(>F) 
halo$quality                         1   340.8  340.8 17.231   0.000118 ***
halo$attractiveness                  2   211.0 105.5  5.335    0.007687 ** 
halo$quality:halo$attractiveness     2    36.6  18.3  0.925    0.402832 
Residuals                           54  1067.9  19.8 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The very low p-values of quality and attractiveness indicate high significance of these 2 factors independently to drive the higher scores. But the high p-value of the interaction shows that both together do not drive higher scores.Since,we do not have a significant interaction effect , we do not need to follow it up to see exactly where the interaction is coming from.


We can now proceed with Post-hoc analysis to test the main effect pairwise comparisons via the following  methods:

  1. TukeyHSD
  2. pairwise.t.test()
  3. Bonferroni Adjustment
  4. Holm’s adjustment


Pair -wise comparison using TukeyHSD

> TukeyHSD(aov_halo,"halo$quality")
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = halo$scoretotal ~ halo$quality * halo$attractiveness)
      diff      lwr        upr     p adj
2-1 -4.766333 -7.068384 -2.464282 0.0001182

By studying the difference between the means and the Tukey adjusted p-value, ,GoodQ_mean-PoorQ_mean = 4.77 | p-value = 0.0001182
The mean score is higher for more good quality essays than for poor quality essays.

> TukeyHSD(aov_halo,"halo$attractiveness")
 Tukey multiple comparisons of means
 95% family-wise confidence level
Fit: aov(formula = halo$scoretotal ~ halo$quality * halo$attractiveness)
      diff      lwr      upr      p adj
2-1 -0.7495 -4.138616 2.6396163 0.8555096
3-1 -4.2995 -7.688616 -0.9103837 0.0095632
3-2 -3.5500 -6.939116 -0.1608837 0.0381079

By studying the difference between the means and the Tukey adjusted p-value, ,
Attractive_mean – Control_mean = 0.75 | p-value = 0.8555096
Attractive_mean – Unattractive_mean = 4.99 | p-value = 0.0095632
Control_mean – Unattractive_mean = 3.55 | p-value = 0.0381079

The mean score is higher for more attractive people than for unattractive people (The halo affect).


Pairwise comparison using pairwise.t.test()

pairwise.t.test(halo$scoretotal, halo$attractiveness, p.adj = "none")
> pairwise.t.test(halo$scoretotal, halo$attractiveness, p.adj = "none")

 Pairwise comparisons using t tests with pooled SD 

data: halo$scoretotal and halo$attractiveness 

   1       2 
2 0.6397 - 
3 0.0091 0.0297

P value adjustment method: none 

With no adjustments, the Attractive-Unattractive (1-3) and Control-Unattractive (2-3) comparison are statistically significant,whereas, the Attractive-Control (1-2) comparison is not.This suggests that both the attractive and control groups scored superior to the Unattractive group, but that there is insufficient statistical support to distinguish between the Attractive and Control groups.


pairwise.t.test(halo$scoretotal, halo$quality, p.adj = "none")
> pairwise.t.test(halo$scoretotal, halo$quality, p.adj = "none")

 Pairwise comparisons using t tests with pooled SD 

data: halo$scoretotal and halo$quality 

2 0.00027

P value adjustment method: none 

With no adjustments, the Good Quality-Poor Quality (1-2) comparison is statistically significant. This suggests that good quality essays scored superior as compared to the poor quality essay group.


Pairwise comparison using Bonferroni adjustment

The Bonferroni adjustment simply divides the Type I error rate (.05) by the number of tests (in this case, three for attractiveness and 2 for essay quality). Hence, this method is often considered overly conservative. The Bonferroni adjustment can be made using p.adj = “bonferroni” in the pairwise.t.test() function.

pairwise.t.test(halo$scoretotal,halo$attractiveness,p.adj = "bonferroni")
> pairwise.t.test(halo$scoretotal,halo$attractiveness,p.adj = "bonferroni")

 Pairwise comparisons using t tests with pooled SD 

data: halo$scoretotal and halo$attractiveness 

 1       2 
2 1.000 - 
3 0.027 0.089

P value adjustment method: bonferroni 

Using the Bonferroni adjustment, only the Attractive-Unattractive (1-3) group comparison is statistically significant. This suggests that the attractive group is treated superior to the Unattractive group, but that there is insufficient statistical support to distinguish between the Control-Unattractive (2-3) and the the Attractive-Control (1-2) group comparisons.
Notice that these results are more conservative than with no adjustment.

pairwise.t.test(halo$scoretotal,halo$quality,p.adj = "bonferroni")
> pairwise.t.test(halo$scoretotal,halo$quality,p.adj = "bonferroni")

 Pairwise comparisons using t tests with pooled SD 

data: halo$scoretotal and halo$quality 

2 0.00027

P value adjustment method: bonferroni 

Using the Bonferroni adjustment, the Good Quality-Poor Quality (1-2) comparison is statistically significant.This suggests that good quality essays scored superior as compared to the poor quality essay group.


Pairwise comparison using Holm’s adjustment

The Holm adjustment sequentially compares the lowest p-value with a Type I error rate that is reducedfor each consecutive test. In our case of attractiveness, this means that our first p-value is tested at the .05/3 level (.017),second at the .05/2 level (.025), and third at the .05/1 level (.05).This method is generally considered superior to the Bonferroni adjustment and can be employed using p.adj = “holm” in the pairwise.t.test() function.

pairwise.t.test(halo$scoretotal,halo$attractiveness,p.adj = "holm")
> pairwise.t.test(halo$scoretotal,halo$attractiveness,p.adj = "holm")

 Pairwise comparisons using t tests with pooled SD 

data: halo$scoretotal and halo$attractiveness 

   1     2 
2 0.640 - 
3 0.027 0.059

P value adjustment method: holm 

Using the Holm procedure, our results are almost identical to using no adjustment.



Based on our statistical tests, we can conclude that

1. The students were evaluated most favorably when they were attractive , least when they were unattractive.
2. Good quality essays scored higher than poor quality ones.
3. There is no statistical proof to show that student attractiveness and essay quality interact resulting in higher scores.

Chick Weight vs Diet – A case for one-way ANOVA.


The chick1 dataset is a data frame consisting  of 578 rows and 4 columns  “weight” “Time” “Chick” & “Diet”  which represents the progression of weight of several chicks. The little chicklings are each given a specific diet. There are four types of diet and the farmer wants to know which one fattens the chicks the fastest.

It’s time to do some exploratory statistics on the data frame!

Once the data has been read in using read.csv, we perform a few operations as shown below to make the data ggplot2 ready. The data by itself is pretty clean and doesn’t require much work .


chick1$Time<- as.numeric(chick1$Time)



chick1<-chick1[-579,] #remove the one NA

And now for the ggplot-ing .


Use ggplot() to map Time to x and Weight to y within the aes() function.Add col=diet to highlight the diet fed to the chicks.

Add geom_line() at the end to draw the lines.To draw one line per chick, add group=chick to the aes() of geom_line()

Finally, add the geom_smooth  to add a smoothing curve that shows the general trend of the data.The gray area around the curve is a confidence interval, suggesting how much uncertainty there is in this smoothing curve.

ggplot(chick1,aes(x=Time,y=Weight,col=Diet)) + geom_line(aes(group=Chick)) + 


If we want to turn off the confidence interval, we can add an option to the geom_smooth later; specifically “se=FALSE”, where “s.e.” stands for “standard error.”

ggplot(chick1,aes(x=Time,y=Weight,col=Diet)) + geom_line(aes(group=Chick)) + 



The visual clearly shows that Diet 3 fattens up the chicks the fastest.

Now we have 4 groups of the independent variable i.e. Diet and one dependent variable. This is a good case to be analysed by ANOVA (Analysis of Variance). ANOVA is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. In one-way ANOVA the data is sub-divided into groups based on a single classification factor.

Our assumptions here are :
# Dependent variable:Weight
# Independent variable : Diet 1,2,3,4  (treatment)
# Null hypothesis H0: all diets lead to equal weight gain
# Alternative hypothesis H1: All diets do not lead to equal weight gain

When we conduct an analysis of variance, the null hypothesis considered is that there is no difference in treatments mean, so once rejected the null hypothesis, the question is what treatment differs

ggplot(chick1,aes(x=Time,y=Weight,col=Diet)) + geom_line(aes(group=Chick)) + 

Next, we need to test whether or not the assumption of homogeneity of variance holds true . The latter must hold true for the results of ANOVA to be valid. This is done using the Levene’s test.


The test gives a high F-value of 9.6 and a low P value ,this means the data does not show homogeneity of variance i.e. we have violated principles of ANOVA and need to take steps to rectify this.We do this by logarithmic adjustment.


With the log transformation , it gives a p value of 0.049 ~ 0.05 .A value of p = .05 is probably good enough to consider the homogeneity of variance assumption to be met . Now we can perform ANOVA knowing that the results would hold true.


The results show the f-value as 8.744 and p value as 1.12e-05  showing significant effect . As a result we have 8.744 times between group variance as within group variance. i.e.the variation of weight means among different diets is much larger than the variation of weight within each diet, and our p-value is less than 0.05 (as suggested by normal scientific standard). Hence we can conclude that for our confidence interval we accept the alternative hypothesis H1 that there is a significant relationship between diets and weight gain.



The F-test showed a significant effect somewhere among the groups. However, it did not tell us which pairwise comparisons are significant. This is where post-hoc
tests come into play. They help you to find out which groups differ significantly from one other and which do not. More formally, post-hoc tests allow for multiple pairwise comparisons without inflating the type I error (i.e. rejecting the type-1 error when it is valid).

In R you can use Tukey’s procedure via the TukeyHSD() function. Input to TukeyHSD() is anova.

> tukey
 Tukey multiple comparisons of means
 95% family-wise confidence level

Fit: aov(formula = chick2$Weight ~ chick2$Diet)

           diff      lwr      upr     p adj
2-1 0.15278600 -0.01463502 0.3202070 0.0879198
3-1 0.28030756 0.11288653 0.4477286 0.0001109
4-1 0.26747943 0.09914285 0.4358160 0.0002824
3-2 0.12752155 -0.06293542 0.3179785 0.3114949
4-2 0.11469342 -0.07656886 0.3059557 0.4112309
4-3 -0.01282813 -0.20409041 0.1784342 0.9981647

From the table above (looking at “diff” and “p adj” columns) we can see which diets have significant differences in weight from others. For example we can conclude that:

  • There is no significant difference in weight between groups 3-2 (p=0.311) ,groups 4-2 (p=0.411) , groups 4-3 (p=0.998)  & groups 2-1 (p=0.0878).
  • There is a significant difference in weight between groups  3-1 ( p=0.0001109) and 4-1 (p=0.0002824).

Finally, the results obtained above can also be visualized  by plotting the the “tukey” object in R. Significant differences are the ones which not cross the zero value.

plot(tukey) # plot the confidence intervals.


Each line represents the mean difference between both groups with the according confidence interval. Whenever the confidence interval doesn’t include zero (the vertical line), the difference between both groups is significant.

I hope you found it helpful and as always, comments are welcome!

Thanks for stopping by!

DataViz : Making life easier one plot at a time.


Data visualization has brought about a sweeping change to the way analysts work with data. And with the data growing exponentially every second, data viz is going to continue to address the pressing need to be able to explore the data creatively, dig for deeper insights and address goals in an engaging manner. It’s a no-brainer that people prefer easy-to-digest visualizations of copious amounts of complex data than to pore over spreadsheets or reports to draw informed conclusions.

Let’s work with this year’s Summer Olympics data as an example. Suppose one were curious about the country-wise rankings at Rio and is presented with the spreadsheet like the one below to make sense of things. Chances are that even after having spent considerable time ploughing through it, results may be inaccurate coupled with other side effects, including but not limited to, an immediate loss of interest, no subsequent data retention and annoyance,even.That’s a serious nay in my books.


On the other hand, meaningful data visualizations promote the easy understanding of data, better data retention and the instant access to trends,anomalies,correlations and patterns. Take for example the visualization in Figure 1A – The data from the spreadsheet shown above has been cleaned, analyzed and visualized so that the results can make sense almost immediately.


Interactive plot here.

(Side note : It’s a real bummer that does not allow embedding from Shiny or Plotly servers. Please click the link to experience the interactivity. Also, plots may take slightly longer to load for slower internet connections so please make allowance for that.)

In the quest for visualizing and interacting with diverse and more often than not, complex data sets, one is supported by a plethora of data plotting techniques – right from the simple data graphics to the more sophisticated and unusual ones.The type of visualizations selected to represent the data and whether or not interactivity and aesthetics are included also has an enormous impact on whether or not the analysis is being communicated accurately and meaningfully. To understand this, let’s look at a static chloropleth vs an interactive chloropleth.

The choropleth map uses a coloring scheme inside defined areas on a map in order to show value levels and indicate the average values of some property in those areas.Figure 2A uses a color encoding to communicate the country-wise medal tally from this year’s Rio Olympics.The biggest benefit of the map is the big picture perspective.By using color density, it quickly illustrates which countries excelled at the games and the ones that did not fare as well. However, the viewer can not gain detailed information, such as in this case, the number of medals per country.


Figure 2A

This can be addressed somewhat by the addition of a certain level of interactivity as shown in Figure 2B below.


Figure 2B

Interactive chloropleth here.


Now, if we represent the same information using a bar plot, which works with discrete data, the key values are available at a glance, providing a deeper level of detail to the user (And as a bonus, there is no need to know where Côte d’Ivoire (Ivory Coast) is on the world map 😀 )


Interactive bar-plot here.

It is also interesting to note that effective manipulations and re-ordering of graph categories can emphasize certain effects. For e.g. the two bar-plots below use and represent the same data, but which one would you use to find the country that stood 10th overall?



Personally, I find that besides the ‘clean-the-data’ component of analytics, it’s the data visualization that keeps me coming back to perform independent data analysis. With SO many supporting languages and tools out there, it’s a wicked learning curve, but the possibility of revealing underlying patterns and stories is just too exciting and is what keeps me motivated and engaged.Or, if that doesn’t work,I just get really stoked by the prospect of  how cool I am going to make the data look!

As per a Harvard Business Review article that I read recently, more data crosses the internet every second than were stored in the entire internet just 20 years ago.For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20 million filing cabinets’ worth of text. An exabyte is 1,000 times that amount, or one billion gigabytes. Would you rather read one billion gigabytes or see one gigabytes? Try wrapping your brain around that kind of data.Or let data viz help you with that. It’s no happy coincidence that  data viz with it’s ‘why read fast when you can visualize faster’ philosophy is increasingly being embraced by companies across all sectors. It just works. Onwards, then!



I am happy to hear feedback and suggestions, so please feel free to leave a comment!


Data source:

All visualizations coded using R and hosted on Shiny & Plotly servers.