Reminder: Fisher test

To connect to the previous lecture, one more example:

Of 20 mice with a genetic predisposition for cancer, 10 get a novel drug in their chow, which supposedly protects against cancer, and 10 get control chow. After 12 months, we check for tumours and get this result:

tbl <- 
   rbind( c( 2, 8 ), c( 6, 4 ) )

dimnames(tbl) <- list(
   chow = c( "ctrl", "drug" ),
   tumours = c( "no", "yes" ) )

tbl
##       tumours
## chow   no yes
##   ctrl  2   8
##   drug  6   4

Of the mice that got the prophylactic drug, only half as many grew tumours.

Does this mean that the drug has a protective effect? Let’s try a Fisher test.

fisher.test( tbl )
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tbl
## p-value = 0.1698
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.01252647 1.65925396
## sample estimates:
## odds ratio 
##  0.1841181

Let’s also try the same with a simulation, just to compare:

Our null hypothesis is that the drug has no effect. Then 12 of the 20 mice got cancer just because of their genetics, i.e., the probability of getting cancer is (under the null hypothesis) somewhere close to

p0_cancer <- 12 / 20
p0_cancer
## [1] 0.6

For 10 random mice, we expect 6 to get cancer. In practice, it could look like this, for example

runif( 10 ) < .6
##  [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

(runif(n) returns n random numbers, drawn uniformly from the interval [0, 1]. Each is below 0.6 with probability 0.6.)
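
(As an aside: the number of TRUE values in such a draw follows a binomial distribution, so we could also get the count in a single step with rbinom. We stick with runif below, to keep the individual mice visible.)

rbinom( 1, size=10, prob=0.6 )   # one draw of the number of tumour-bearing mice among 10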

To get a row of the contingency table, we use

has_tumour <- runif( 10 ) < 0.6
has_tumour
##  [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
c( no = sum(!has_tumour), yes = sum(has_tumour) )
##  no yes 
##   3   7

Under the null hypothesis, the drug has no effect, so we can do the same twice to get the two rows:

ctrl_has_tumour <- runif( 10 ) < 0.6
drug_has_tumour <- runif( 10 ) < 0.6
tbl0 <- rbind( 
   ctrl = c( no = sum(!ctrl_has_tumour), yes = sum(ctrl_has_tumour) ),
   drug = c( no = sum(!drug_has_tumour), yes = sum(drug_has_tumour) ) )
tbl0
##      no yes
## ctrl  6   4
## drug  5   5

We need a test statistic. Let’s use the odds ratio. For each row, the “odds” of getting cancer are the “yes” count divided by the “no” count. The odds ratio is simply the ratio of the odds in the drug row to the odds in the control row, i.e., we calculate a ratio of ratios, involving all four counts, as follows:

( tbl0["drug","yes"] / tbl0["drug","no"] ) / 
   ( tbl0["ctrl","yes"] / tbl0["ctrl","no"] )
## [1] 1.5

What was the odds ratio in our original data?

odds_ratio <- 
   ( tbl["drug","yes"] / tbl["drug","no"] ) / 
   ( tbl["ctrl","yes"] / tbl["ctrl","no"] )
odds_ratio
## [1] 0.1666667

Let’s get a null distribution of odds ratios by repeating the above steps 100,000 times.

odds_ratio_null <-
   replicate( 100000, {
      ctrl_has_tumour <- runif( 10 ) < 0.6
      drug_has_tumour <- runif( 10 ) < 0.6
      tbl0 <- rbind( 
         ctrl = c( no = sum(!ctrl_has_tumour), yes = sum(ctrl_has_tumour) ),
         drug = c( no = sum(!drug_has_tumour), yes = sum(drug_has_tumour) ) )
      ( tbl0["drug","yes"] / tbl0["drug","no"] ) / 
         ( tbl0["ctrl","yes"] / tbl0["ctrl","no"] )
} )

A histogram of these

hist( odds_ratio_null, breaks=100 )

That was a bad histogram: on a linear axis, the ratios are heavily skewed.

Always use a logarithm when visualizing ratios

hist( log2( odds_ratio_null ), 100 )
abline( v = log2(odds_ratio), col="green" )

Now it’s symmetric. And we’ve added a green line to mark the actual value of the odds ratio.

What’s our p value from this analysis? It’s the fraction of values in the null distribution that lie further out than the actual value. (The na.rm=TRUE is needed because the simulated odds ratio is occasionally undefined when zero counts occur in both rows.)

mean( abs(log2(odds_ratio_null)) > abs(log2(odds_ratio)), na.rm=TRUE )
## [1] 0.06073182

Continuous data: The t test

Let’s load the NHANES data again:

library( tidyverse )
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
inner_join( 
    haven::read_xpt( "NHANES/DEMO_I.XPT" ), 
    haven::read_xpt( "NHANES/BMX_I.XPT" ), 
    by="SEQN" ) %>%
  select( 
    subjectId = SEQN,
    age = RIDAGEYR,
    sex = RIAGENDR,
    height = BMXHT,
    weight = BMXWT,
    ethnicity = RIDRETH3,
    bornInUS = DMDBORN4
     ) %>%
  mutate(
    sex = fct_recode( factor(sex), male="1", female="2" ),
    ethnicity = fct_recode( factor(ethnicity), mexican="1", hispanic="2",
       white="3", black="4", asian="6", other="7" ),
    bornInUS = fct_recode( factor(bornInUS), yes="1", no="2" )
  ) -> 
    nhanes

nhanes

Last time, we noticed that Mexicans born in the US tend to be taller than those born outside the US:

nhanes %>%
   filter( age>20, sex=="male", ethnicity=="mexican",  !is.na(height) ) %>%
ggplot + 
   geom_density( aes( x=height, col=bornInUS ) )

Let’s first visualize this differently, with a beeswarm plot

library(ggbeeswarm)

nhanes %>%
   filter( age>20, sex=="male", ethnicity=="mexican",  !is.na(height) ) %>%
ggplot + 
   geom_beeswarm( aes( y=height, x=bornInUS ) )

Now, let’s extract the heights into two variables, to have some example data for discussing t tests

nhanes %>%
   filter( age>20, sex=="male", ethnicity=="mexican", bornInUS=="yes", !is.na(height) ) %>%
   pull( height ) ->
      height_A

nhanes %>%
   filter( age>20, sex=="male", ethnicity=="mexican", bornInUS=="no", !is.na(height) ) %>%
   pull( height ) ->
      height_B

Is the difference seen above significant?

t.test( height_A, height_B )
## 
##  Welch Two Sample t-test
## 
## data:  height_A and height_B
## t = 5.3311, df = 335.76, p-value = 1.795e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.459991 5.336814
## sample estimates:
## mean of x mean of y 
##  170.6442  166.7458

Would we also have seen this if we had measured only 50 people from each group?

smplA <- sample( height_A, 50 )
smplB <- sample( height_B, 50 )

t.test( smplA, smplB )
## 
##  Welch Two Sample t-test
## 
## data:  smplA and smplB
## t = 2.7229, df = 91.505, p-value = 0.00775
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.090827 6.973173
## sample estimates:
## mean of x mean of y 
##   170.936   166.904

Actually, there are two variants of the t test. The previous one (Welch’s version of Student’s t test, var.equal=FALSE, the default) estimates the standard deviation of each group separately; the next one (Student’s original version of the t test, var.equal=TRUE) averages over both groups (a pooled estimate):

t.test( smplA, smplB, var.equal=TRUE )
## 
##  Two Sample t-test
## 
## data:  smplA and smplB
## t = 2.7229, df = 98, p-value = 0.007662
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.093436 6.970564
## sample estimates:
## mean of x mean of y 
##   170.936   166.904

But they give nearly the same result here, so this distinction does not matter much.
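
As a quick cross-check, we can also compute Welch’s t statistic ourselves from its definition: the difference in means, divided by the standard error sqrt( var(A)/n_A + var(B)/n_B ). (With equal group sizes, this coincides with the pooled version, which is why both tests reported t = 2.7229.)

( mean(smplA) - mean(smplB) ) /
   sqrt( var(smplA)/length(smplA) + var(smplB)/length(smplB) )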

Let’s do this “by hand” and try to recreate what R’s t.test function did.

First, we need a test statistic. We use the difference in means, divided by the average standard deviation (more precisely: by the square root of the average of the variances, which is nearly the same value).

t <- ( mean(smplB) - mean(smplA) ) / sqrt( ( var(smplA) + var(smplB) )/2 ) 
t
## [1] -0.544577

And we need a null hypothesis. Ours is: there is no difference between the two groups. All the measured height values come from the same distribution, namely a normal distribution with mean mu0 (which is just the average over all the data, simply mixing the groups) and standard deviation sd0, which is taken from the average of the variances of the two groups. (Both groups have nearly the same standard deviation, so it does not matter much how exactly we pick a value somewhere in the middle between the two.)

mu0 <- ( mean( smplA )  +  mean( smplB ) ) / 2
sd0 <- sqrt( (  var( smplA )  +   var( smplB ) ) / 2 )
mu0
## [1] 168.92
sd0
## [1] 7.403911

Our null hypothesis is that the 50 values of each group are drawn from a normal distribution with this mean and standard deviation. The rnorm function gives us random draws from a normal distribution. So here is one potential set of values for our 50 height measurements under the null hypothesis

rnorm( 50, mu0, sd0 )
##  [1] 172.1991 162.8785 164.5749 163.7004 168.7946 166.5366 175.6744
##  [8] 176.9069 177.8598 168.9133 172.1413 173.1644 183.6032 171.3583
## [15] 164.8574 163.7751 178.4295 173.3023 177.2245 181.0457 161.3619
## [22] 181.5827 167.2446 171.7691 164.4931 180.2527 173.4522 175.3079
## [29] 184.1703 171.9276 170.1616 157.5801 170.3548 166.9666 165.5429
## [36] 168.9853 156.4090 163.4190 183.5148 177.3265 160.6973 159.0892
## [43] 167.2491 161.8902 170.1868 188.3512 163.7815 167.7994 174.3908
## [50] 171.1133

We do this twice to get two null samples of 50 people each and then calculate our null test statistic

smplA0 <- rnorm( 50, mu0, sd0 )
smplB0 <- rnorm( 50, mu0, sd0 )
t0 <- ( mean(smplB0) - mean(smplA0) ) / sqrt( ( var(smplA0) + var(smplB0) )/2 )
t0
## [1] -0.1260606

Do this many times to get a null distribution of our test statistic

t0 <- replicate( 100000, {
   smplA0 <- rnorm( 50, mu0, sd0 )
   smplB0 <- rnorm( 50, mu0, sd0 )
   ( mean(smplB0) - mean(smplA0) ) / sqrt( ( var(smplA0) + var(smplB0) )/2 )
})

Here is the histogram of our null distribution

hist( t0, breaks=100 )
abline( v=t, col="green")
abline( v=-t, col="green", lty="dashed")

How many t0 values are more extreme than the t value from our real data, i.e., how many of the 100,000 values lie outside the green lines?

table( abs(t0) > abs(t) )
## 
## FALSE  TRUE 
## 99260   740
mean( abs(t0) > abs(t) )
## [1] 0.0074

That closely matches the p value that Student’s t test gave above.

What Student’s t test does differently is that it does not get the t0 histogram from random draws but from an explicit formula: the t distribution.

Here is the histogram from before, with the theoretical t distribution (with 98 degrees of freedom) superimposed in blue. The factor sqrt(25) = sqrt(n/2), for n = 50 per group, rescales our statistic to the conventional t statistic, which divides the difference in means not by sd0 but by the standard error sd0 * sqrt(2/n).

hist( t0 * sqrt(25), breaks=100, freq=FALSE )

xg <- seq( -5, 5, length.out = 1000 )
lines( xg, dt( xg, 98), col="blue" )
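
With this formula, we no longer need the simulation at all: the p value can be computed directly with pt, the cumulative distribution function of the t distribution. Up to rounding, this reproduces the p value of 0.007662 that t.test( smplA, smplB, var.equal=TRUE ) reported above:

2 * pt( -abs( t * sqrt(25) ), df=98 )   # two-sided p value from the t distribution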

A few remarks for those interested in details

Alternative: A permutation test

Alternatively, we could also have permuted the group labels: we consider each of the 100 height values as labelled with either “A” or “B” to mark the group it came from. Now, instead of drawing new values from a normal distribution, as we did before, we simply permute the labels: we keep the values but shuffle which 50 of them are assigned to group A and which to group B, and then calculate a null distribution of our test statistic as before.

Here is the code for this

t0b <- replicate( 100000, {
   smpl_AB0 <- sample( c( smplA, smplB ) )
   smplA0 <- smpl_AB0[1:50]
   smplB0 <- smpl_AB0[51:100]
   ( mean(smplB0) - mean(smplA0) ) / sqrt( ( var(smplA0) + var(smplB0) )/2 )
} )  
mean( abs(t0b) > abs(t) ) 
## [1] 0.00719

The p value is remarkably similar.

The difference is:

- Student’s t test is a parametric test: it makes a specific assumption about the distribution that the values are drawn from, namely that they are normally distributed. The null hypothesis then concerns only the parameters of this distribution, namely its mean and standard deviation.
- The permutation test is a non-parametric test: it does not need the assumption that the data are normal, because to get the null distribution of the test statistic, it does not draw from a specific distribution with estimated parameters. Rather, it uses only the data we have and merely changes their order. This is safer, as it makes fewer assumptions, but it is computationally intensive, because many permutations have to be tried; there is no closed-form expression like the one for the t distribution. Hence, permutation tests have become a practical option only with the advent of fast computers.

Another kind of test

We will now look at a scenario which is usually discussed without mention of any “null hypothesis” or p value.

You are administered a lab test that detects some dangerous disease. The lab test has

- a sensitivity of 98%: if a patient has the disease, the test will give a positive result with 98% probability
- a specificity of 95%: if the patient does not have the disease, the test will give a negative result with 95% probability

Your test is positive. How worried should you be?

Very worried, supposedly. After all, the test is 98% sensitive.

In reality, this depends on the prevalence of the disease!

Let’s say the test is administered to 10,000 subjects, and 1% of them (i.e., 100) have the disease. This fraction is called the prevalence of the disease among the test subjects.

Of the 100 subjects with the disease, 98% (i.e., 98) get a positive test result. Of the 9,900 healthy subjects, 5% (i.e., 495) also get a (false) positive result. So, we have 98 + 495 = 593 subjects with a positive result, of which only 98 (about 17%) actually have the disease.
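
The same bookkeeping in R, just to check the numbers (the variable names are ours):

n           <- 10000
prevalence  <- 0.01
sensitivity <- 0.98
specificity <- 0.95

true_pos  <- n * prevalence * sensitivity            # 98 diseased subjects test positive
false_pos <- n * (1-prevalence) * (1-specificity)    # 495 healthy subjects test positive

true_pos + false_pos                  # 593 positive results in total
true_pos / ( true_pos + false_pos )   # about 0.17: share of positives with the disease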

Before taking the test, you believed your risk of having the disease to be 1%; now you have reason to believe it to be 17%. Is this helpful to know?

Bayes theorem

A mathematical way of writing down the calculation we just made is this:

We first define events:

- D: Subject has the disease
- H: Subject is healthy
- P: Test result is positive
- N: Test result is negative

Now we can write the test’s sensitivity and specificity as conditional probabilities:

- Sensitivity = Pr( P | D )
- Specificity = Pr( N | H )

Read Pr( P | D ) as: the probability of P, given that D is the case (i.e., the probability that the test is positive if the patient has the disease).

The prevalence is simply: Prevalence = Pr( D )

And we have, of course: Pr( H ) = 1 - Pr( D )

We noticed that 593 of the 10,000 subjects have a positive result. How did we calculate this probability? We did:

Pr( P ) = Pr( P | D ) * Pr( D ) + Pr( P | H ) * Pr( H )

Then, we used this to calculate:

Pr( D | P ) = Pr( P | D ) * Pr( D ) / Pr( P )

This last formula is Bayes theorem. It allowed us to calculate the probability that the patient has the disease given that the test is positive from the probability that the test is positive given that the patient has the disease, i.e., to swap the event and the condition in the probability. The crucial piece of information we needed was the prevalence. For a different prevalence, we would have got a very different result!
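
To see the effect of the prevalence, here is the calculation once more, wrapped into a small helper function (bayes_ppv is just our name for it) and evaluated for our prevalence of 1% and, hypothetically, for 20%:

bayes_ppv <- function( prevalence, sensitivity=0.98, specificity=0.95 ) {
   pr_pos <- sensitivity * prevalence + ( 1 - specificity ) * ( 1 - prevalence )  # Pr( P )
   sensitivity * prevalence / pr_pos                                              # Pr( D | P )
}

bayes_ppv( 0.01 )   # about 0.17, as before
bayes_ppv( 0.20 )   # about 0.83: same test, very different conclusion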

Specificity and p value

Let’s look at the same scenario using Fisher’s hypothesis-testing framework. We have a patient and suspect that he might have the disease. Our null hypothesis, however, is that all is normal and the patient is healthy. The lab test produces a readout, and if the readout is above some threshold, we call the test positive. The probability that the readout is above the threshold even though the null hypothesis is true is what Fisher would call the p value (if the observed value were right on the threshold). Above, this probability was one minus the specificity, i.e., 5%.

To write the p value in the conditional probability notation:

p value = Pr( test statistic has the observed value or a more extreme one | null hypothesis is true )

What we would actually like to know is

Pr( null hypothesis is not true | test statistic has the observed value [or more extreme] )

As we have just seen, we can convert the one into the other with Bayes theorem if we have the piece of information corresponding to the prevalence. Before, this was Pr( patient has the disease ); now, it is Pr( null hypothesis is not true ).

Hence, to know how likely it is that our result is wrong, we would need to know how common it is for scientists in our line of research to pursue a wrong hypothesis. This is obviously impossible to know exactly. If we at least have a guess, we can better judge what a p value of p = 0.05 might mean. But as we usually don’t, we have to make do with the somewhat awkward p value, while in a case like the lab-test scenario, things are much easier.
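
To illustrate with made-up numbers: suppose that in some field only 10% of the hypotheses that get tested are actually true, that a typical study has 80% power (the analogue of the test’s sensitivity), and that significance is declared at p < 0.05 (so that 0.05 plays the role of one minus the specificity). The same calculation as for the lab test then gives:

pr_true <- 0.10   # assumed: fraction of tested hypotheses that are actually true
power   <- 0.80   # assumed: probability of a significant result if the hypothesis is true
alpha   <- 0.05   # probability of a (falsely) significant result under the null

pr_signif <- power * pr_true + alpha * ( 1 - pr_true )
power * pr_true / pr_signif   # Pr( hypothesis true | significant result ): 0.64 here

Under these (invented) numbers, roughly a third of all significant results would be wrong.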