In the early 1930s, at a summer party of the [Rothamsted Research Station](https://en.wikipedia.org/wiki/Rothamsted_Research) in England, a discussion took place about tea, namely whether it makes a difference to the taste whether one pours the milk or the tea into the cup first. A lady claimed that she could tell the difference, which was not believed. See Wikipedia’s account of the anecdote, which is considered the occasion on which R. A. Fisher came up with the idea of the null hypothesis and the p value.
A randomized trial was set up to test her claim: Eight cups were taken, into four of them tea was poured first (“T”) and then milk, into the other four milk first (“M”), then tea.
We represent this in R as a vector:
cups <- c( "T", "T", "T", "T", "M", "M", "M", "M" )
We can use the sample function to randomly shuffle (“permute”) the order of the cups:
sample( cups )
## [1] "M" "M" "M" "M" "T" "T" "T" "T"
Every time you call sample you get a random permutation:
sample( cups )
## [1] "M" "M" "T" "M" "T" "M" "T" "T"
sample( cups )
## [1] "T" "T" "T" "M" "M" "T" "M" "M"
Presented with these eight cups in random order, can the lady pick the four cups which had the tea poured in first? If so, does this constitute sufficient evidence that we should believe her claim that she can taste a difference? What if she only gets three of the four cups right?
R. A. Fisher, who was present at the party, argued that we have to calculate the probability that she could achieve this result by chance, by just picking four cups at random and thus making a lucky guess.
So, he started by formulating what he called the “null hypothesis”: The lady’s pick of four cups is entirely random (because she cannot taste a difference).
Let R select four cups at random. Again, we use the sample function, which now picks a “random sample” of 4 elements from our vector cups with 8 elements:
s <- sample( cups, 4 )
s
## [1] "M" "T" "M" "M"
How many of the cups in this sample s have the tea first?
sum( s == "T" )
## [1] 1
We now want to see how often it happens that one gets 3 or more cups with tea first. Does s contain three or more “T” cups?
sum( s == "T" ) >= 3
## [1] FALSE
Let’s repeat these steps 100 times
replicate( 100,
sum( sample( cups, 4 ) == "T" ) >= 3 )
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [12] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [23] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [67] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [89] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
What fraction of these results were TRUE?
Let’s do it again, this time even 10,000 times, and use the mean function to get the fraction of TRUE results:
mean( replicate( 10000, sum( sample( cups, 4 ) == "T" ) >= 3 ) )
## [1] 0.2495
Were 10,000 samples enough to get a stable estimate of the probability? Let’s try again:
mean( replicate( 10000, sum( sample( cups, 4 ) == "T" ) >= 3 ) )
## [1] 0.242
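A rough way to judge this stability: the standard error of a proportion estimated from n simulations is \(\sqrt{p(1-p)/n}\). With p around 0.24 and n = 10,000, this comes out at about 0.004, so the two runs above agree to within the expected Monte Carlo noise:

```r
# Standard error of a proportion estimated from n simulation runs:
p <- 0.24
n <- 10000
sqrt( p * (1 - p) / n )
## [1] 0.004270831
```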
Yes, we got the same first two digits. It seems that 24% is a good answer to the question: What is the probability of getting at least 3 cups right?
And how often does one get all four right?
mean( replicate( 10000, sum( sample( cups, 4 ) == "T" ) == 4 ) )
## [1] 0.0157
The lady did in fact get all four cups right. The probability of her managing to do so under the assumption of the null hypothesis (that she was guessing blindly) is less than 2%. The value we just obtained is what we call the p value associated with the outcome of our experiment.
We asked our computer to test 10,000 random samples to get the p value. But in 1930, Fisher did not have a computer, and probably also wouldn’t have had the patience to make thousands of tries. How did he get the p value?
If we randomly rearrange (“permute”) the eight cups, there are only 70 possible ways to permute them. One way to see that is to ask R to list them all. (You need to install the combinat package to run this command.)
sapply( unique( combinat::permn( cups ) ), paste, collapse=" " )
## [1] "T T T T M M M M" "T T T M T M M M" "T T M T T M M M"
## [4] "T M T T T M M M" "M T T T T M M M" "M T T T M T M M"
## [7] "T M T T M T M M" "T T M T M T M M" "T T T M M T M M"
## [10] "T T M M T T M M" "T M T M T T M M" "M T T M T T M M"
## [13] "M T M T T T M M" "T M M T T T M M" "M M T T T T M M"
## [16] "T T T M M M T M" "T T M T M M T M" "T M T T M M T M"
## [19] "M T T T M M T M" "T T M M T M T M" "T M T M T M T M"
## [22] "M T T M T M T M" "M T M T T M T M" "T M M T T M T M"
## [25] "M M T T T M T M" "M M T T M T T M" "M T M T M T T M"
## [28] "M T T M M T T M" "T M T M M T T M" "T M M T M T T M"
## [31] "T T M M M T T M" "T M M M T T T M" "M T M M T T T M"
## [34] "M M T M T T T M" "M M M T T T T M" "M M T T T M M T"
## [37] "M T M T T M M T" "M T T M T M M T" "M T T T M M M T"
## [40] "M M T T M T M T" "M T M T M T M T" "M T T M M T M T"
## [43] "M T M M T T M T" "M M T M T T M T" "M M M T T T M T"
## [46] "T M M T T M M T" "T M M T M T M T" "T M M M T T M T"
## [49] "T M T M T M M T" "T M T M M T M T" "T M T T M M M T"
## [52] "T T M M T M M T" "T T M T M M M T" "T T M M M T M T"
## [55] "T T T M M M M T" "T T M M M M T T" "T M T M M M T T"
## [58] "M T T M M M T T" "M T M T M M T T" "T M M T M M T T"
## [61] "M M T T M M T T" "T M M M T M T T" "M T M M T M T T"
## [64] "M M T M T M T T" "M M M T T M T T" "M M T M M T T T"
## [67] "M T M M M T T T" "M M M T M T T T" "T M M M M T T T"
## [70] "M M M M T T T T"
You can also arrive at the number 70 by calculating the binomial coefficient \(\binom{8}{4}\), as we have 4 “T” cups among our 8 tea cups. (See the Wikipedia article on combinations if you want to know what that means.)
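R can compute this binomial coefficient directly with the choose function:

```r
# Number of ways to place the 4 "T" cups among the 8 positions:
choose( 8, 4 )
## [1] 70
```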
Let’s say we always pick the first 4 cups: How many of the 70 permutations above have “T” at the first four places? Only one. Hence the probability to get all four cups correct is 1 in 70. The exact value of the p value we tried to estimate above by random tries is 1/70:
1/70
## [1] 0.01428571
There are 16 permutations that have 3 “T” cups and 1 “M” cup among the first four. So, the probability to get exactly 3 cups right is 16/70, and to get at least 3 right is 17/70. Hence, if the lady had gotten only 3 cups right, the p value for that outcome would have been
17/70
## [1] 0.2428571
The general way to calculate this is the hypergeometric distribution. For example, the probability to get exactly 3 cups right (16/70) is
16/70
## [1] 0.2285714
dhyper( 3, 4, 4, 4 )
## [1] 0.2285714
The dhyper function is called thus:
dhyper( number of cups correctly said to have milk first, number of cups with milk first, number of cups with tea first, number of cups said to have milk first )
If you look at its help page, you will read the same, but talking about black and white balls rather than tea and milk.
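The cumulative counterpart, phyper, can give the tail probability in one call; with lower.tail=FALSE it returns the probability of a result strictly greater than the first argument, which reproduces the 17/70 from above:

```r
# P( at least 3 correct ) = P( X > 2 ) for the hypergeometric X:
phyper( 2, 4, 4, 4, lower.tail=FALSE )
## [1] 0.2428571
```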
This kind of problem is quite common, and so, Fisher turned his approach into a general method, called “Fisher’s test”:
We first set up what is called a “contingency table”, with rows for the actual truth and columns for the guess:
           |          guess          |
   truth   | tea first   milk first  | Sum
-----------+-------------------------+------
tea first  |     3            1      |  4
milk first |     1            3      |  4
-----------+-------------------------+------
   Sum     |     4            4      |  8
The margins are just the sums; we don’t need them. We put just the core part into a variable, using rbind (row bind) to enter the two matrix rows and bind them together into a matrix:
m <- rbind( c(3,1), c(1,3) )
m
## [,1] [,2]
## [1,] 3 1
## [2,] 1 3
Now, we can pass this to the Fisher test function:
fisher.test( m, alternative = "greater" )
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3135693 Inf
## sample estimates:
## odds ratio
## 6.408309
And, again, we get our p value of 17/70.
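fisher.test returns an object of class "htest"; if you only need the number, you can extract the p value from it:

```r
# The contingency table from above, and just the p value of the test:
m <- rbind( c(3,1), c(1,3) )
fisher.test( m, alternative = "greater" )$p.value
## [1] 0.2428571
```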
A similar example: A die is thrown 60 times. We would expect to see a six 10 times, but we see it 15 times. Is the die loaded?
Our null hypothesis: The die is fair, i.e., the probability for a six is 1/6.
Then, the probability to see \(k\) sixes among the 60 throws is given by the binomial distribution. Instead of copying the formula from Wikipedia, we simply ask R what the probability is to see exactly 15 sixes:
dbinom( 15, 60, 1/6 )
## [1] 0.03093418
In our experiment to see whether the die is loaded, we give the outcome by counting how many sixes we have. The number of sixes is our test statistic. The test statistic can take on 61 possible values, the numbers from 0 to 60. In R, we write this with a colon:
0:60
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [24] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## [47] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
For each value, dbinom can give us the probability of obtaining this value under the assumption of our null hypothesis (namely, that the die is fair and the probability for a six is 1/6):
dbinom( 0:60, 60, 1/6 )
## [1] 1.774701e-05 2.129641e-04 1.256488e-03 4.858422e-03 1.384650e-02
## [6] 3.101617e-02 5.686297e-02 8.773144e-02 1.162442e-01 1.343266e-01
## [11] 1.370131e-01 1.245574e-01 1.017219e-01 7.511768e-02 5.043616e-02
## [16] 3.093418e-02 1.740047e-02 9.007304e-03 4.303490e-03 1.902595e-03
## [21] 7.800641e-04 2.971673e-04 1.053593e-04 3.481438e-05 1.073443e-05
## [26] 3.091517e-06 8.323315e-07 2.096242e-07 4.941143e-08 1.090459e-08
## [31] 2.253615e-09 4.361836e-10 7.905828e-11 1.341595e-11 2.130769e-12
## [36] 3.165713e-13 4.396824e-14 5.703988e-15 6.904828e-16 7.790062e-17
## [41] 8.179565e-18 7.980064e-19 7.220058e-20 6.044699e-21 4.670904e-22
## [46] 3.321532e-23 2.166216e-24 1.290512e-25 6.990273e-27 3.423807e-28
## [51] 1.506475e-29 5.907746e-31 2.044989e-32 6.173551e-34 1.600550e-35
## [56] 3.492110e-37 6.235910e-39 8.752155e-41 9.053953e-43 6.138273e-45
## [61] 2.046091e-47
Let’s plot this
plot(
x = 0:60,
y = dbinom( 0:60, 60, 1/6 ) )
The probability to then see 15 or more sixes is obtained by calculating dbinom for all the values from 15 to 60 and summing them up:
sum( dbinom( 15:60, 60, 1/6 ) )
## [1] 0.06478038
There is also the pbinom function, which can tell us the probability to see 14 or fewer sixes. We should hence get the same value this way:
1 - pbinom( 14, 60, 1/6 )
## [1] 0.06478038
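Equivalently, pbinom has a lower.tail argument; with lower.tail=FALSE it directly returns the upper-tail probability P(X > 14), so the subtraction from 1 is not needed:

```r
# Upper-tail probability P( X > 14 ) = P( X >= 15 ):
pbinom( 14, 60, 1/6, lower.tail=FALSE )
## [1] 0.06478038
```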
Did we use these functions correctly? Is the binomial distribution the right one to use? To be sure, let’s simulate:
mean( replicate( 100000, sum( sample( 1:6, 60, replace=TRUE ) == 6 ) >= 15 ) )
## [1] 0.06539
We get approximately the same p value.
How did this complicated command work? To understand it, read it from inside out. Start with the middle part:
sample( 1:6, 60, replace=TRUE )
## [1] 2 4 3 3 5 4 4 3 2 1 5 2 3 5 3 2 2 5 5 3 4 3 1 2 4 1 4 2 1 6 6 6 4 1 5
## [36] 5 6 4 2 3 1 3 5 2 6 5 4 6 4 5 5 1 5 4 3 2 4 5 3 6
This gives us 60 throws of the die. The replace=TRUE tells sample that it can report the same number several times, i.e., that it should pick among each of the six faces of the die (1:6) anew for each throw, rather than removing from the pool the numbers that have already been picked.
Then, we check how many sixes there are in such a sample of 60 throws:
sum( sample( 1:6, 60, replace=TRUE ) == 6 )
## [1] 5
We do this many times, to see how many sixes one might see
replicate( 100, sum( sample( 1:6, 60, replace=TRUE ) == 6 ) )
## [1] 9 4 9 5 15 12 12 11 9 10 9 11 10 7 8 6 8 14 7 12 7 12 8
## [24] 9 12 11 11 7 6 9 10 15 14 13 10 8 6 9 5 10 10 10 13 9 9 12
## [47] 11 8 12 8 12 11 14 9 18 8 10 10 10 13 11 6 13 12 6 8 15 12 14
## [70] 8 9 11 8 10 7 9 11 10 10 12 9 6 5 12 5 5 7 14 11 10 10 7
## [93] 4 5 6 12 11 10 10 11
We then check for each outcome whether it is 15 or more and get the fraction of TRUE results:
mean( replicate( 10000, sum( sample( 1:6, 60, replace=TRUE ) == 6 ) >= 15 ) )
## [1] 0.0676
This is indeed the same as
sum( dbinom( 15:60, 60, 1/6 ) )
## [1] 0.06478038
or, if we use the binomial test function
binom.test( 15, 60, 1/6, alternative = "greater" )
##
## Exact binomial test
##
## data: 15 and 60
## number of successes = 15, number of trials = 60, p-value = 0.06478
## alternative hypothesis: true probability of success is greater than 0.1666667
## 95 percent confidence interval:
## 0.1608644 1.0000000
## sample estimates:
## probability of success
## 0.25
What if we see a six 25 times in our 60 throws?
sum( dbinom( 25:60, 60, 1/6 ) )
## [1] 4.196574e-06
If our opponent got 15 sixes (p=0.065), we might believe that he was lucky, but if he got 25 sixes (p=0.0000042), we would be justified in being quite sure that he cheated.
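The binomial test function gives the same tail probability for this more extreme outcome; here we extract just the p value:

```r
# One-sided binomial test: P( at least 25 sixes ) under a fair die
binom.test( 25, 60, 1/6, alternative = "greater" )$p.value
## [1] 4.196574e-06
```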
Here are the p values for various outcomes
library(tidyverse)
tibble( min_num_of_sixes = 0:60 ) %>%
mutate( probability = pbinom( min_num_of_sixes, 60, 1/6, lower.tail=FALSE ) ) %>%
ggplot() +
geom_point( aes( x = min_num_of_sixes, y = probability) ) +
scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
When do we start suspecting that the die might be loaded? When are we sure?
Is \(p \le 0.05\) really a good threshold? Does this depend on the stakes? Or on our prior beliefs?
See, e.g., xkcd #1132
One more example:
We have two sets of numbers:
x <- c(0.32, 0.39, 0.11, 0.6, 0.94, 0.27, 0.19, 0.66, 0.91, 0.41)
y <- c(0.3, 0.77, 0.09, 0.76, 0.62, 0.19, 0.07, 0.89, 0.92, 0.39)
They seem correlated
plot( x, y, asp=1 )
The Pearson correlation coefficient is
cor( x, y )
## [1] 0.8093415
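As a quick sanity check, the Pearson coefficient is just the covariance scaled by the two standard deviations, so we can recompute it by hand:

```r
x <- c(0.32, 0.39, 0.11, 0.6, 0.94, 0.27, 0.19, 0.66, 0.91, 0.41)
y <- c(0.3, 0.77, 0.09, 0.76, 0.62, 0.19, 0.07, 0.89, 0.92, 0.39)

# Pearson correlation: covariance divided by the product of the SDs
cov( x, y ) / ( sd(x) * sd(y) )
## [1] 0.8093415
```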
Could such a high correlation coefficient arise by chance?
If we shuffle one of the vectors, we should destroy the correlation
cor( x, sample(y) )
## [1] 0.02541946
But it won’t be zero. What values should we expect to arise by chance if we pair the values at random?
hist( replicate( 10000, cor( x, sample(y) ) ) )
How many of these are above the correlation value we have seen above?
mean( abs( replicate( 10000, cor( x, sample(y) ) ) ) > cor( x, y ) )
## [1] 0.0048
Note the abs, which removes the minus signs. We check whether the shuffled correlations are more extreme than the real one to get a p value.
R has, again, a way to calculate this number without having to try out thousands of correlations:
cor.test( x, y )
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 3.8975, df = 8, p-value = 0.004561
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3664537 0.9532223
## sample estimates:
## cor
## 0.8093415
Further material (not tidied up yet)
# See xkcd #882 (jelly beans)
# Sally Clark case (SIDS: 1/8500)
# Now continuous data
# Load NHANES again
library( tidyverse )
inner_join(
haven::read_xpt( "NHANES/DEMO_I.XPT" ),
haven::read_xpt( "NHANES/BMX_I.XPT" ),
by="SEQN" ) %>%
select(
subjectId = SEQN,
age = RIDAGEYR,
sex = RIAGENDR,
height = BMXHT,
weight = BMXWT,
ethnicity = RIDRETH3,
bornInUS = DMDBORN4,
) %>%
mutate(
sex = fct_recode( factor(sex), male="1", female="2" ),
ethnicity = fct_recode( factor(ethnicity), mexican="1", hispanic="2",
white="3", black="4", asian="6", other="7" ),
bornInUS = fct_recode( factor(bornInUS), yes="1", no="2" ),
) ->
nhanes
nhanes %>%
filter( age>20, sex=="male", ethnicity=="mexican", !is.na(height) ) %>%
ggplot +
geom_density( aes( x=height, col=bornInUS ) )
library(ggbeeswarm)
nhanes %>%
filter( age>20, sex=="male", ethnicity=="mexican", !is.na(height) ) %>%
ggplot +
geom_beeswarm( aes( y=height, x=bornInUS ) )
nhanes %>%
filter( age>20, sex=="male", ethnicity=="mexican", bornInUS=="yes", !is.na(height) ) %>%
pull( height ) ->
height_A
nhanes %>%
filter( age>20, sex=="male", ethnicity=="mexican", bornInUS=="no", !is.na(height) ) %>%
pull( height ) ->
height_B
t.test( height_A, height_B )
smplA <- sample( height_A, 20 )
smplB <- sample( height_B, 20 )
t.test( smplA, smplB )
mean( sample( height_B, 20 ) ) - mean( sample( height_A, 20 ) )
smplA <- sample( height_A, 50 )
smplB <- sample( height_B, 50 )
t.test( smplA, smplB )
t.test( smplA, smplB, var.equal=TRUE )
# our test statistic is
t <- ( mean(smplB) - mean(smplA) ) / ( ( sd(smplA) + sd(smplB) )/2 )
# common mean
mu0 <- mean( c( smplA, smplB ) )
mu0
# SDs
sd0 <- ( sd( smplA ) + sd( smplB ) )/2
# Null hypothesis: The mean is the same in both groups
# For example, like this:
smplA0 <- rnorm( 50, mu0, sd0 )
smplB0 <- rnorm( 50, mu0, sd0 )
# test statistic for this null sample
c( mean(smplB0) - mean(smplA0) ) / ( ( sd(smplA0) + sd(smplB0 ) )/2 )
# What is the null distribution of our test statistic?
t0 <- replicate( 10000, {
smplA0 <- rnorm( 50, mu0, sd0 )
smplB0 <- rnorm( 50, mu0, sd0 )
c( mean(smplB0) - mean(smplA0) ) / ( ( sd(smplA0) + sd(smplB0 ) )/2 )
} )
hist(t0)
t
table( abs(t0) > abs(t) )
mean( abs(t0) > abs(t) )
# Alternative: Permutation test
t0 <- replicate( 10000, {
smpl_AB0 <- sample( c( smplA, smplB ) )
smplA0 <- smpl_AB0[1:50]
smplB0 <- smpl_AB0[51:100]
c( mean(smplB0) - mean(smplA0) ) / ( ( sd(smplA0) + sd(smplB0 ) )/2 )
} )
mean( abs(t0) > abs(t) )