Differential Expression Analysis

In the previous lesson, we have preprocessed the RNA-Seq data from the “airway” study (Himes et al.: RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. PLoS One, 2014, 9: using the Subread aligner. We have used the featureCount tool from the Subread package to count how many reads map to each of genes in each sample. We have saved the result in the file featureCount.out.

Load count matrix and run table

Let’s first load this file. It is a data table with tab-separated values (TSV), i.e., TAB characters are used to mark the column boundaries. There are some extra lines at the top, marked with the hash sign (#), that are not part of the table. We tell the tidyverse read function to ignore them:

library( tidyverse )

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

read_tsv( "RNASeq/airway/featureCounts.out", comment = "#" )

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Geneid = col_character(),
##   Chr = col_character(),
##   Start = col_character(),
##   End = col_character(),
##   Strand = col_character()
## )

## See spec(...) for full column specifications.

We only want the actual data columns, which all contain the string bam in their columns name, we want to use the gene IDs as row names, and this should be stored not as a data frame or tibble, but as a matrix:

read_tsv( "RNASeq/airway/featureCounts.out", comment = "#" ) %>%
  column_to_rownames( "Geneid" ) %>%
  select( starts_with("alignment") ) %>%
  as.matrix ->
     countsMatrix

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Geneid = col_character(),
##   Chr = col_character(),
##   Start = col_character(),
##   End = col_character(),
##   Strand = col_character()
## )

## See spec(...) for full column specifications.

countsMatrix[1:5,1:5]

##                 alignments/SRR1039508.bam alignments/SRR1039509.bam
## ENSG00000223972                         2                         3
## ENSG00000227232                        29                        27
## ENSG00000278267                         1                         2
## ENSG00000243485                         1                         1
## ENSG00000284332                         0                         0
##                 alignments/SRR1039510.bam alignments/SRR1039511.bam
## ENSG00000223972                         0                         0
## ENSG00000227232                        38                        21
## ENSG00000278267                         0                         3
## ENSG00000243485                         0                         1
## ENSG00000284332                         0                         0
##                 alignments/SRR1039512.bam
## ENSG00000223972                         1
## ENSG00000227232                        34
## ENSG00000278267                         1
## ENSG00000243485                         0
## ENSG00000284332                         0

Next, we simplify the column names:

colnames(countsMatrix) %>% 
   str_remove( "alignments/" ) %>% 
   str_remove( ".bam" ) -> 
      colnames(countsMatrix)

countsMatrix[1:5,1:5]

##                 SRR1039508 SRR1039509 SRR1039510 SRR1039511 SRR1039512
## ENSG00000223972          2          3          0          0          1
## ENSG00000227232         29         27         38         21         34
## ENSG00000278267          1          2          0          3          1
## ENSG00000243485          1          1          0          1          0
## ENSG00000284332          0          0          0          0          0

What do the sample IDs mean? To see, we load the run table that we have downloaded from ENA last time:

read_tsv( "RNASeq/airway/run_table.txt" )

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   AvgSpotLen = col_double(),
##   MBases = col_double(),
##   MBytes = col_double(),
##   InsertSize = col_double(),
##   LoadDate = col_date(format = ""),
##   ReleaseDate = col_date(format = "")
## )

## See spec(...) for full column specifications.

We select only the columns we need, and parse the treatment column

read_tsv( "RNASeq/airway/run_table.txt") %>%
  select( run = Run, cell_line, treatment ) %>%
  mutate( albuterol = str_detect( treatment, "Albuterol" ) ) %>%
  mutate( dexamethasone = str_detect( treatment, "Dexamethasone" ) ) ->
     run_table

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   AvgSpotLen = col_double(),
##   MBases = col_double(),
##   MBytes = col_double(),
##   InsertSize = col_double(),
##   LoadDate = col_date(format = ""),
##   ReleaseDate = col_date(format = "")
## )

## See spec(...) for full column specifications.

run_table

It’s important for what follows that the columns in countsMatrix and the run IDs in the run table agree and are in the same order. Let’s check that

stopifnot( all( run_table$run == colnames(countsMatrix) ) )

Differential expression analysis

Now, we can perform a standard differential expression analysis, e.g., using DESeq (Love et al, Genome Biology 15:550, 2014).

library( DESeq2 )

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:dplyr':
## 
##     first, rename

## The following object is masked from 'package:tidyr':
## 
##     expand

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice

## The following object is masked from 'package:purrr':
## 
##     reduce

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## The following object is masked from 'package:dplyr':
## 
##     count

## Loading required package: BiocParallel

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following object is masked from 'package:purrr':
## 
##     simplify

## The following objects are masked from 'package:base':
## 
##     aperm, apply

In DESeq2, all processing happens within a DESeqDataSet object, which is derived from Bioconductor’s SummerizedExperiment class. To make such an object, bind together a matrix of read counts with a data frame that describes the columns of the count matrix. The third element is the design formula, which is put together using the column names from the data frame. We will discuss it later.

dds0 <- DESeqDataSetFromMatrix( countsMatrix, run_table, ~ cell_line + dexamethasone )

## converting counts to integer mode

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

dds0

## class: DESeqDataSet 
## dim: 58735 16 
## metadata(1): version
## assays(1): counts
## rownames(58735): ENSG00000223972 ENSG00000227232 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(0):
## colnames(16): SRR1039508 SRR1039509 ... SRR1039522 SRR1039523
## colData names(5): run cell_line treatment albuterol dexamethasone

We first analyse only the dexamethasone and hence remove all samples treated with abluterol.

dds <- dds0[ , !dds0$albuterol ]
dds

## class: DESeqDataSet 
## dim: 58735 8 
## metadata(1): version
## assays(1): counts
## rownames(58735): ENSG00000223972 ENSG00000227232 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(5): run cell_line treatment albuterol dexamethasone

The actual analysis is done with a single command:

dds <- DESeq( dds )

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

Let’s check the results:

res <- results( dds )

res

## log2 fold change (MLE): dexamethasoneTRUE 
## Wald test p-value: dexamethasoneTRUE 
## DataFrame with 58735 rows and 6 columns
##                           baseMean     log2FoldChange             lfcSE
##                          <numeric>          <numeric>         <numeric>
## ENSG00000223972  0.899451678871487 -0.584259083975908  1.87432495061848
## ENSG00000227232   18.9656724190172 0.0612197275596309 0.412963486933081
## ENSG00000278267  0.774339343863466  0.219327119196226   1.9893077976767
## ENSG00000243485   0.16459638610649 0.0950010975060304  3.49432889369449
## ENSG00000284332                  0                 NA                NA
## ...                            ...                ...               ...
## ENSG00000271254   63.5947117488586 -0.879673726274981 0.263646889149669
## ENSG00000275405 0.0830439460944298 -0.527969751072956  3.51908100539528
## ENSG00000275987                  0                 NA                NA
## ENSG00000277475                  0                 NA                NA
## ENSG00000268674                  0                 NA                NA
##                               stat               pvalue               padj
##                          <numeric>            <numeric>          <numeric>
## ENSG00000223972 -0.311717071142396    0.755255553053589                 NA
## ENSG00000227232  0.148244892095148    0.882149504421758  0.984774363909095
## ENSG00000278267  0.110252983199672    0.912208743834165                 NA
## ENSG00000243485 0.0271872226101727    0.978310406820466                 NA
## ENSG00000284332                 NA                   NA                 NA
## ...                            ...                  ...                ...
## ENSG00000271254  -3.33656023445663 0.000848220460400001 0.0126618400005116
## ENSG00000275405 -0.150030576239507    0.880740491925834                 NA
## ENSG00000275987                 NA                   NA                 NA
## ENSG00000277475                 NA                   NA                 NA
## ENSG00000268674                 NA                   NA                 NA

Note the first line. These are the results for comparing the cases with dexamethasone versus the controls. If this is not what you want, you have to specify the correct contrast when calling results.

How many genes are significant at 10% false discovery rate?

sum( res$padj < .1, na.rm=TRUE )

## [1] 1375

To get a first overview over the results table, we can ask for an MA plot

plotMA( dds, ylim = c(-4,4) )

We can also do the MA plot manually: this will look more familiar to you, as you have learned how to use ggplot in the previous lessons

as_tibble(res) %>%
   mutate( signif = !is.na(padj) & padj < .1 ) %>%
ggplot( ) +
   geom_point(
      aes(
         x = baseMean,
         y = log2FoldChange,
         col = signif
      )
   ) +
   scale_x_log10() +
   scale_y_continuous( oob = scales::squish )

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 32275 rows containing missing values (geom_point).

Another useful plot is the vulcano plot

as_tibble(res) %>%
   mutate( signif = !is.na(padj) & padj < .1 ) %>%
ggplot( ) +
   geom_point(
      aes(
         x = log2FoldChange,
         y = -log10( pvalue),
         col = signif
      )
   ) +
   scale_x_continuous( oob = scales::squish ) +
   scale_y_continuous( oob = scales::squish, limits = c( 0, 100 ) ) +
   scale_color_manual( values = c( "black", "red" ) )

## Warning: Removed 32275 rows containing missing values (geom_point).

Getting gene names

So far, the rows in our result list are labelled only with Ensembl gene IDs (ENSG0000…), which are not very helpful. We need a mapping to gene names. We can download a table with such a mapping from Ensembl Biomart. There, choose, “Ensebl Genes”, “Human Genes”, then click on “Attributes” to select the table columns we want to get: Ensembl gene ID (“gene stable ID”) and “Gene name”. In case we need it, let’s also add “Chromosome name”, “Gene start”, “Gene end”, “Strand” and “Gene type”. Deselect “Transcript stable ID”: we want one row per gene, not per transcript. No “Filters”, either: we want all genes. Click on “Results” to see the table and export it to a file in TSV format.

Load that file in R

read_tsv("RNASeq/airway/mart_export.txt")

## Parsed with column specification:
## cols(
##   `Gene stable ID` = col_character(),
##   `Gene name` = col_character(),
##   `Chromosome/scaffold name` = col_character(),
##   `Gene description` = col_character(),
##   `Gene start (bp)` = col_double(),
##   `Gene end (bp)` = col_double()
## )

Using tidyverse’s left_join function, we can merge this table into our results table

res %>%
   as_tibble( rownames = "EnsemblID" ) %>%
   left_join( 
      read_tsv( "RNASeq/airway/mart_export.txt" ),
      by = c( "EnsemblID" = "Gene stable ID" ) ) %>%
   mutate( signif = !is.na(padj) & padj < .1 ) ->
     res

## Parsed with column specification:
## cols(
##   `Gene stable ID` = col_character(),
##   `Gene name` = col_character(),
##   `Chromosome/scaffold name` = col_character(),
##   `Gene description` = col_character(),
##   `Gene start (bp)` = col_double(),
##   `Gene end (bp)` = col_double()
## )

res

Now, we can have a look at the most significantly upregulated genes:

res %>%
   arrange( -sign( log2FoldChange ), padj )

Plotting some genes

Let’s have a look at one of the these genes. We simply get a row from the normalized count matrix, using DESeq2’s counts function, turn this row to a column and add it to the run table (which we can get back from dds with colData). Then we have all we need to use ggplot.

gene <- "ENSG00000189221"

colData(dds) %>%
   as_tibble() %>%
   mutate( ncount = counts( dds, normalized=TRUE )[ gene, ] ) %>%
ggplot +
   geom_point(
      aes(
         x = cell_line,
         y = ncount,
         col = treatment
      )
   ) +
   scale_y_continuous( trans = "log1p" )

SFB 1036 Course on Data Analysis: Lesson 7-8

Simon Anders

Differential Expression Analysis

Load count matrix and run table

Differential expression analysis

Getting gene names

Plotting some genes

Doing this interactively