Analysis of Variance

library( tidyverse )
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

We load some example data found on the web

medley <- 
   read_csv( url( "https://raw.githubusercontent.com/aurielfournier/biometry_materials/master/20160225_biometry_lab_6_anova_assumptions/medley.csv" ) )[,1:3]
## Parsed with column specification:
## cols(
##   STREAM = col_character(),
##   ZINC = col_character(),
##   DIVERSITY = col_double(),
##   ZNGROUP = col_double(),
##   RESID1 = col_double(),
##   PREDICT1 = col_double(),
##   RESID2 = col_double(),
##   PREDICT2 = col_double()
## )
colnames(medley) <- str_to_lower( colnames(medley) )   

medley

This is data from Medley and Clement, Responses of Diatom Communities to heavy metals in streams: the influence of longitudinal variation (Ecological Applications, 8:631 (1998)). I have taken the idea to use this as example from Quinn and Keough’s book Experimental design and data analysis for biologists (Cambridge University Press, 2002).

Medley and Clement have measured, for different sites along five streams, the concentration of the zinc and the Shannon-Wiener diversity of diatom species. Do high levels of zinc cause a reduction in species diversity?

In order to discuss one-way anova, let’s only use one column, the zinc concentration, for now, and ignore the stream column.

As a fisrt step, we should plot the data

ggplot( medley ) +
   geom_point( aes( x = zinc, y = diversity ) )

Let’s bring the x axis in the correct order and then try again

medley$zinc <- fct_relevel( medley$zinc, 
   c( "BACK", "LOW", "MED", "HIGH" ) )

library( ggbeeswarm )

ggplot( medley ) +
   geom_point( aes( x = zinc, y = diversity ) )

In most papers, however, you will see this

ggplot( medley ) +
   geom_boxplot( aes( x = zinc, y = diversity ) )

or even this

medley %>%
   group_by( zinc ) %>%
   summarise(
      mean = mean( diversity ),
      sem = sd( diversity ) / sqrt( n() ) ) %>%
   rename( diversity = mean ) %>%
ggplot +
   geom_bar( aes( x = zinc, y = diversity, width = .7 ), stat="identity" ) +
   geom_errorbar( aes( x = zinc, ymin = diversity-sem, ymax = diversity+sem ), 
      width = .3, stat="identity" ) 
## Warning: Ignoring unknown aesthetics: width

Don’t hide your data, so don’t use barcharts.

Maybe combine boxplots and beeswarm plots:

ggplot( medley, aes( x = zinc, y = diversity ) ) +
   geom_boxplot( color = "gray40" ) + geom_point()

Back to the original question.

Is there a difference in species diversity between the four groups?

We could make t tests:

t.test( 
   medley %>% filter( zinc == "BACK" ) %>% pull(diversity),
   medley %>% filter( zinc == "MED" ) %>% pull(diversity) )
## 
##  Welch Two Sample t-test
## 
## data:  medley %>% filter(zinc == "BACK") %>% pull(diversity) and medley %>% filter(zinc == "MED") %>% pull(diversity)
## t = 0.33233, df = 14.88, p-value = 0.7443
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4319547  0.5913992
## sample estimates:
## mean of x mean of y 
##  1.797500  1.717778

But do we really want to compare each group against every other group? We will only get a multiple testing problem.

Let’s make a short detour to see why this is not a good idea.

We have four groups, and, say 30 measurement values, taken completely at random, with no connection between measurement values and group. We do all 6 pair-wise comparisons with t tests. How often do we get a significant result? Run the following code a couple of times.

tibble(
   group = sample( c( "A", "B", "C", "D" ), 30, replace=TRUE ),
   value = rnorm( 30, mean=10, sd=3 ) ) ->
      simdata

expand( simdata, group1=group, group2 = group ) %>%
   filter( group1 < group2 ) %>%
   group_by_all() %>%
   summarise( pval = t.test( 
      simdata %>% filter( group == group1 ) %>% pull(value),
      simdata %>% filter( group == group2 ) %>% pull(value) )$p.value ) %>%
   mutate( signif = pval < 0.05 )

So, multiple pairwise tests are not a good idea. How can we compare all four groups at once?

Sneak preview: We use one-way ANOVA:

anova( lm( diversity ~ zinc, medley ) )

As we can see, there is a signifcant p value. But what do all the other numbers mean?

To understand, let’s break down the ANOVA calculation in its piece and to it “on foot”.

In analysis of variance, or ANOVA, we see how much variance the groups explain. FIrst, a graphical representation of the idea

cowplot::plot_grid( 
   ggplot( medley ) + geom_jitter( aes( x = "all", y = diversity ), width=.1 ) +
       ylim(0.2,3.3),
   ggplot( medley ) + geom_jitter( aes( x = zinc, y = diversity ),  width=.1 ) +   
       ylim(0.2,3.3),
   rel_widths = c( 1, 3 ) )

The total variance of our diversity measurements is

var( medley$diversity )
## [1] 0.2752431

Remember that to get to this number, we calculate the squared deviations from the mean and then divide by the number of samples (minus 1):

( medley$diversity - mean(medley$diversity) )^2
##  [1] 3.316405e-01 1.972405e-01 2.960640e-01 5.493426e-03 3.460208e-05
##  [6] 1.132346e+00 1.266522e-01 8.172872e-02 4.278699e-01 2.458993e-01
## [11] 1.647405e-01 2.559170e-01 1.338699e-01 4.238754e-02 3.455225e-02
## [16] 7.125346e-01 8.650519e-02 2.360817e-01 1.846401e-02 3.455225e-02
## [21] 1.061993e-01 6.045813e-02 1.647405e-01 4.704346e-01 6.975813e-02
## [26] 1.050522e-01 3.122837e-03 1.290229e+00 2.693460e-02 8.725758e-01
## [31] 7.994464e-01 1.164014e-03 5.099640e-01 3.836990e-02
sum( ( medley$diversity - mean(medley$diversity) )^2 ) / ( nrow(medley) - 1 )
## [1] 0.2752431

For the moment, let’s ignore the denominator and focus on the numerator of the variance, the sum of squares (SS or SumSq). Here we calculate the sum of squares once more, now without dividing by ( nrow(medley) - 1 ).

medley %>%
   summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) 

We call this quantity the total sum of squares (TSS).

In the TSS, we have subtracted from each value the mean over all samples. What happens if we take the mean separately for each of the four groups?

medley %>%
   group_by( zinc ) %>%
   summarise( ss = sum( ( diversity - mean(diversity) )^2 ) )

Let’s sum this up:

medley %>%
   group_by( zinc ) %>%
   summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
   summarise( ss = sum(ss) )

Now, we get a lower sum of squares! Why? Because the means in the individual groups can be closer to the values than the overall mean. This can be because the values scatter around different, groups-specific true means, or simply by chance.

First, some terminology: We call this second sum the residual sum of squares (RSS), because it is what is left over after we make use of our knowledge of the differences in zinc concentyration. The ratio, RSS/TSS, is called the fraction of variance unexplained. It seems that here, about 72% of the total varaince is not “explained” by the zinc concentration bin while 28% of the total variance is “explained” by it.

Why the quotation marks? Because even noise can seem to explain something: Let’s try the same on our simulation data.

First the total sum of squares (TSS):

simdata %>%
   summarise( ss = sum( ( value - mean(value) )^2 ) )

Now the residual sum of squares (RSS):

simdata %>%
   group_by( group ) %>%
   summarise( ss = sum( ( value - mean(value) )^2 ) ) %>%
   summarise( ss = sum( ss ) )

How do we know what fraction of variance we can expect to see “explained”" by chance? For our simulation, we can just run it many times:

replicate( 3000, {

   tibble(
      group = sample( c( "A", "B", "C", "D" ), 30, replace=TRUE ),
      value = rnorm( 30, mean=10, sd=3 ) ) ->
        simdata
   
   simdata %>%
      summarise( ss = sum( ( value - mean(value) )^2 ) ) %>%
      pull( ss ) -> 
         tss
   
   simdata %>%
      group_by( group ) %>%
      summarise( ss = sum( ( value - mean(value) )^2 ) ) %>%
      summarise( ss = sum( ss ) ) %>%
      pull( ss ) -> 
         rss
   
   rss/tss
}) -> frac_unexpl


hist( frac_unexpl, 30 )

Instead of looking at this unexplained fraction of the TSS, it is more common to look at the so-called F statistic

Fstat <- ( 1 - frac_unexpl ) / frac_unexpl * 26/3

This is the ratio of explained to unexplained variance. Why the 26/3? Because we have used sum of squares but the F statistic is calculated on variances. So, we need to reintroduce the \((N-1)\) denominator – only that now it is \((N-4)\), because we had \(N=30\) values, and subtracted \(K=4\) means for the \(K\) groups instead of the usual 1 mean. This is called the number of residual degrees of freedom, while \(N-1\) is the total number of degrees of freedom, and the differnce between them, \(K-1=-3\) is the number of degrees of freedom associated with the grouip factor. If this sopunds complicated, don’t bother – we need this only for historical reasons, i.e., in order to be able to compare with standard tables.

Here is the same histogram, but now for F:

hist( Fstat, 100, freq=FALSE )
fg <- seq( 0, 10, length.out=100 )
lines( fg, df( fg, 3, 26 ), col="magenta" )

The point of the F statistic is that there is a formula for its distribution under the null hypothesis (i.e., under the assumptions of our null simulation). So, we do not actually need to perform the simulation (which takes time, and would have taken way too much time before we had fast computers), we can simply look things up.

Coming back to the zinc data: There, our TSS and RSS values are (see above):

medley %>%
   summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
   pull(ss) ->
tss   

medley %>%
   group_by( zinc ) %>%
   summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
   summarise( ss = sum(ss) ) %>%
   pull(ss) ->
rss  

c( tss = tss, rss = rss, unexpl_frac = rss/tss)
##         tss         rss unexpl_frac 
##   9.0830235   6.5164111   0.7174275

Also, we have \(N=34\) samples and \(K=4\) groups, and hence \(K-1=3\) degrees of freedom associated with the group (zinc) and \(N-K=30\) remaining degrees of freedom. Our F value is hence

(tss-rss) / rss * 30/3
## [1] 3.93869

The probability of explaining more variance by chance than what zinc seems to explain here, i.e., of gettin a lower RSS or a higher F value by chance, is, according to the F distribution formula

pf( (tss-rss) / rss * 30/3, 3, 30, lower.tail=FALSE )
## [1] 0.01755956

All this lengthy computation can be done automatically with a single call to the anova function:

anova( lm( diversity ~ zinc, medley ) )

Permutation test

medley %>%
   summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
   pull ->
     tss

medley %>%
   group_by( zinc ) %>%
   summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
   summarise( ss = sum(ss) ) %>%
   pull ->
rss

replicate( 1000, {
   medley %>%
   mutate( zinc = sample( zinc ) ) %>%
   group_by( zinc ) %>%
   summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
   summarise( ss = sum(ss) ) %>%
   pull } ) -> 
rss_perm

hist( rss_perm, 30 )
abline( v = rss, col="green" )
abline( v = tss, col="purple" )

table( rss_perm < rss)
## 
## FALSE  TRUE 
##   986    14
( sum( rss_perm < rss ) + 1 ) / ( length(rss_perm) + 1 )
## [1] 0.01498501

Two-way anova

So far, we have ignored the “stream” column. Let’s plot it:

ggplot( medley ) +
   geom_point( aes( x = zinc, y = diversity, col = stream ) ) +
   scale_color_brewer( palette="Set3" )

Here is how a two-way ANOVA looks:

anova( lm( diversity ~ stream + zinc, medley ) )

Details to follow.