Example data

We use data from this study:

Himes et al.: RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. PLoS One, 2014, 9:.

They have derived cell lines from airway smooth muscle cells obtained from the airways of four human donors, and tested the effect of two drugs, dexamathasone and albuterol on the trabnscriptome. For each cell line, they have four RNA-Seq samples, one for each of the four experimental conditons of treatment with dexamethasone, with albnuterol, with both and with neither drug.

Demamethasone is a synthetic glucocorticoid, and albuterol (a.k.a. salbutamol) is a bronchodilator, which is used e.g., in astma inhalers. It is an agonist of the β2 adrenoreceptor, which mediates relaxation of smooth muscle cells.

The sequencing reads from the 16 sequencing libraries have been deposited by the authors at the Gene Expression Omnibus under accession number GSE52778. Finding the actual reads requries a bit of clickling around in the maze that is GEO and SRA, and leads us eventually to Short read Archive (SRA) accession SRP033351. Clicking there on the link “Send to Run Selector”, we can download the “Runinfo Table”

We load this table in R

library( tidyverse )

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

read_tsv( "run_table.txt" )

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   AvgSpotLen = col_double(),
##   MBases = col_double(),
##   MBytes = col_double(),
##   InsertSize = col_double(),
##   LoadDate = col_date(format = ""),
##   ReleaseDate = col_date(format = "")
## )

## See spec(...) for full column specifications.

## # A tibble: 16 x 30
##    AvgSpotLen BioSample Experiment MBases MBytes Run   SRA_Sample
##         <dbl> <chr>     <chr>       <dbl>  <dbl> <chr> <chr>     
##  1        126 SAMN0242… SRX384345    2756   1588 SRR1… SRS508568 
##  2        126 SAMN0242… SRX384346    2542   1480 SRR1… SRS508567 
##  3        126 SAMN0242… SRX384347    2746   1593 SRR1… SRS508570 
##  4        126 SAMN0242… SRX384348    2636   1519 SRR1… SRS508569 
##  5        126 SAMN0242… SRX384349    3380   2055 SRR1… SRS508571 
##  6         87 SAMN0242… SRX384350    3615   2271 SRR1… SRS508572 
##  7        126 SAMN0242… SRX384351    4809   3101 SRR1… SRS508574 
##  8        114 SAMN0242… SRX384352    3219   1900 SRR1… SRS508573 
##  9        120 SAMN0242… SRX384353    3445   2147 SRR1… SRS508575 
## 10        126 SAMN0242… SRX384354    4121   2654 SRR1… SRS508576 
## 11        126 SAMN0242… SRX384355    3657   2354 SRR1… SRS508578 
## 12        107 SAMN0242… SRX384356    3223   2006 SRR1… SRS508577 
## 13        101 SAMN0242… SRX384357    3355   2109 SRR1… SRS508579 
## 14         98 SAMN0242… SRX384358    3883   2449 SRR1… SRS508580 
## 15        125 SAMN0242… SRX384359    3397   2006 SRR1… SRS508582 
## 16        126 SAMN0242… SRX384360    3737   2269 SRR1… SRS508581 
## # … with 23 more variables: Sample_Name <chr>, cell_line <chr>,
## #   ercc_mix <chr>, treatment <chr>, Assay_Type <chr>, BioProject <chr>,
## #   Center_Name <chr>, Consent <chr>, DATASTORE_filetype <chr>,
## #   DATASTORE_provider <chr>, InsertSize <dbl>, Instrument <chr>,
## #   LibraryLayout <chr>, LibrarySelection <chr>, LibrarySource <chr>,
## #   LoadDate <date>, Organism <chr>, Platform <chr>, ReleaseDate <date>,
## #   SRA_Study <chr>, cell_type <chr>, source_name <chr>, tissue <chr>

This is just a table with one row per library and many columns with various relevant and less relevant information on the libraries. Most important for us are the columns cell_line and treatment, as they contain the experimental conditions, as well as the Run accession number, which is the ID under which we can download the data from the Short Read Archive (SRA).

read_tsv( "run_table.txt" ) %>%
  select( Run, cell_line, treatment ) -> run_table

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   AvgSpotLen = col_double(),
##   MBases = col_double(),
##   MBytes = col_double(),
##   InsertSize = col_double(),
##   LoadDate = col_date(format = ""),
##   ReleaseDate = col_date(format = "")
## )

## See spec(...) for full column specifications.

run_table

## # A tibble: 16 x 3
##    Run        cell_line treatment              
##    <chr>      <chr>     <chr>                  
##  1 SRR1039508 N61311    Untreated              
##  2 SRR1039509 N61311    Dexamethasone          
##  3 SRR1039510 N61311    Albuterol              
##  4 SRR1039511 N61311    Albuterol_Dexamethasone
##  5 SRR1039512 N052611   Untreated              
##  6 SRR1039513 N052611   Dexamethasone          
##  7 SRR1039514 N052611   Albuterol              
##  8 SRR1039515 N052611   Albuterol_Dexamethasone
##  9 SRR1039516 N080611   Untreated              
## 10 SRR1039517 N080611   Dexamethasone          
## 11 SRR1039518 N080611   Albuterol              
## 12 SRR1039519 N080611   Albuterol_Dexamethasone
## 13 SRR1039520 N061011   Untreated              
## 14 SRR1039521 N061011   Dexamethasone          
## 15 SRR1039522 N061011   Albuterol              
## 16 SRR1039523 N061011   Albuterol_Dexamethasone

Downloading the reads

To download the read, we need the utility “fastq-dump” from the SRA Toolkit. So, we first need to download the SRA Toolkit from the NCBI web site, and install it. Then, we can write in the command line a command like

fastq-dump --split-files SRR1039508

The option --split-files is required because these are paired-end libraries, and we need to download two files for each run. The above command will produce two FASTQ files, named SRR1039508_1.fastq and SRR1039508_2.fastq, with the reads of the 5’ and the 3’ ends of each cDNA fragment.

We can write a simple R command to produce fastq-dump commands for all 16 files.

str_c( "fastq-dump --split-files ", run_table$Run, " &\n" ) %>% cat

## fastq-dump --split-files SRR1039508 &
##  fastq-dump --split-files SRR1039509 &
##  fastq-dump --split-files SRR1039510 &
##  fastq-dump --split-files SRR1039511 &
##  fastq-dump --split-files SRR1039512 &
##  fastq-dump --split-files SRR1039513 &
##  fastq-dump --split-files SRR1039514 &
##  fastq-dump --split-files SRR1039515 &
##  fastq-dump --split-files SRR1039516 &
##  fastq-dump --split-files SRR1039517 &
##  fastq-dump --split-files SRR1039518 &
##  fastq-dump --split-files SRR1039519 &
##  fastq-dump --split-files SRR1039520 &
##  fastq-dump --split-files SRR1039521 &
##  fastq-dump --split-files SRR1039522 &
##  fastq-dump --split-files SRR1039523 &

You can copy and paste these commands into your shell. The ampersand (&) will cause all commands to start at the same time. After several hours, you will have all the files.

Here is how the first 12 lines of one such file loook like

$ head -12  SRR1039512_1.fastq
@SRR1039512.1 HWI-ST177:314:C12K6ACXX:1:1101:1207:2192 length=63
ACGACAGGTCATCGTTCAAGCAGAATGCAGACAGGCCATTCACGAGCCCAAGTTGAAGAGAAG
+SRR1039512.1 HWI-ST177:314:C12K6ACXX:1:1101:1207:2192 length=63
FBDEFIDH2AGHIICDHGBDDFGGGDH<FFEA;FH=;BGGA@DFGB<EEE=??CCC>C>;=A?
@SRR1039512.2 HWI-ST177:314:C12K6ACXX:1:1101:1206:2234 length=63
GGTGAGGGACTGAGGCCCCTAAATTTTGGTCCCAGGGGAAAGGAAGAGGCCAGTTGGTCCAGT
+SRR1039512.2 HWI-ST177:314:C12K6ACXX:1:1101:1206:2234 length=63
HI@EBFGHCGGBGGI@GHGHHGHGIIIIIIGIIIIAHH>BFEEAEDBD?=?AB::AC85@3>A
@SRR1039512.3 HWI-ST177:314:C12K6ACXX:1:1101:1377:2202 length=63
ATATTGGCTTTATCTGTACAGGTTCCTTCATCACAAAACCAGTAACTTCCAGTGGATGAAGGC
+SRR1039512.3 HWI-ST177:314:C12K6ACXX:1:1101:1377:2202 length=63
HJFJJJIIJJJJJJJIIJIIIJIGHIJJIJIJJJIDHIJJJJHIIIJJIJJJHIJJIIIHIH?

These are the 5’ ends of the first three fragments. For each fragment, we have four lines: Read ID, sequence, “+” sign, quality string. For details, see the Wikipedia page on the FASTQ format.

Building an index

Next, we need to align the reads to the genome, i.e., to find for each read where in the genome it maps. For this, we need a tool called an aligner. We use the subread aligner by Liao et al. (Nucleic Acids Res., 2013, 41:e108), which you first need to install.

Then, as a preparation for installing, we need to download the full DNA sequence of the human genome and ask subread to transform it into an index, which it will then use as reference to quickly look up sequences when searching for alignments.

We download the genome from the Ensembl FTP site. Go to “DNA (FASTA)” for Homo sapiens; there, download the “DNA primary assembly” FASTA file, then unpack it

wget ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

As the file name tells you, this is genome “GRCh38”, i.e., version 38 of the human genome assembly as provided by the Genome Reference Consortium (GRC). It is important to keep track of the assembly version, or you will mix up gene coordinates.

We let subread build the index

subread-buildindex -o GRCh38 Homo_sapiens.GRCh38.dna.primary_assembly.fa

It produces five files, some of them very large, whose names start with GRCh38, as specified with -o. (This stands for Genome Reference Consortium, human genome, reference assembly version 38 – which is what we have downloaded.)

Aligning the reads

To align the reads from a specific run, say, SRR1039508, we use a command like

subread-align -i GRCh38 -r SRR1039508_1.fastq -R SRR1039508_2.fastq -o SRR1039508.bam -t 0 -T 2

The options used are as follows: -i tells the aligner which index to use. We indicate the files we have just produced with subread-buildindex by specifying GRCh38. Next, we give the pair of FASTQ files, first (after -r) the first-pass reads, then (after -R) the second-pass reads. Then we specify the output file after -o, using the extension bam. Last, we clarify that these are RNA and not DNA reads (-t 0) and that two CPUs should be used in parallel (-T 2, use more if you have them).

Again, we can use R produce all 16 commands for us. This ensure that we do not make a mistake and mix up run IDs.

str_c( "subread_align -i GRCh38 -r ", run_table$Run, "_1.fastq -R ", run_table$Run, "_2.fastq -o ", run_table$Run, ".bam -t 0 -T 2\n" ) %>% cat

## subread_align -i GRCh38 -r SRR1039508_1.fastq -R SRR1039508_2.fastq -o SRR1039508.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039509_1.fastq -R SRR1039509_2.fastq -o SRR1039509.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039510_1.fastq -R SRR1039510_2.fastq -o SRR1039510.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039511_1.fastq -R SRR1039511_2.fastq -o SRR1039511.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039512_1.fastq -R SRR1039512_2.fastq -o SRR1039512.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039513_1.fastq -R SRR1039513_2.fastq -o SRR1039513.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039514_1.fastq -R SRR1039514_2.fastq -o SRR1039514.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039515_1.fastq -R SRR1039515_2.fastq -o SRR1039515.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039516_1.fastq -R SRR1039516_2.fastq -o SRR1039516.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039517_1.fastq -R SRR1039517_2.fastq -o SRR1039517.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039518_1.fastq -R SRR1039518_2.fastq -o SRR1039518.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039519_1.fastq -R SRR1039519_2.fastq -o SRR1039519.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039520_1.fastq -R SRR1039520_2.fastq -o SRR1039520.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039521_1.fastq -R SRR1039521_2.fastq -o SRR1039521.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039522_1.fastq -R SRR1039522_2.fastq -o SRR1039522.bam -t 0 -T 2
##  subread_align -i GRCh38 -r SRR1039523_1.fastq -R SRR1039523_2.fastq -o SRR1039523.bam -t 0 -T 2

If you paste them all into your command line, they will be executed one after the other. After a few hours, everything will be aligned.

We can inspect an alignment file with

samtools view SRR1039508.bam | less -S

using the samtools suite. Do so and try to understand some of the columns, following the description of the SAM format on Wikipedia.

You can the Integrated Genome Viewer (IGV) to check out the alignments. Do so and observe how the read pairs map to the exons. When using IGV, make sure you choose the right genome assembly.

Counting

Next, we want to know, for each BAM file, how many of the reads align to each of the genes. This can be done with featureCount by Liao et al. (Bioinformatics (2014) 30:923). This tool is a refinement and a faster replacement for htseq-count (Anders et al., Bioinformatics (2014) 31:166), which is used in many workflows.

For this, we need a file with gene annotation, i.e., informations about the coordinates where genes, transcripts and exons start and end. These can be found for many species in the “Gene sets” column of the Ensembl download page. We use the file Homo_sapiens.GRCh38.95.gtf, which gives the gene annotation from Release 95 of Ensembl, using the coordinates of the GRCh38 genome assembly.

We call featureCounts with this command

featureCounts -p -T 8 -a Homo_sapiens.GRCh38.94.gtf -o featureCounts.out \
   SRR1039508.bam SRR1039509.bam SRR1039510.bam SRR1039511.bam SRR1039512.bam \
   SRR1039513.bam SRR1039514.bam SRR1039515.bam SRR1039516.bam SRR1039517.bam \
   SRR1039518.bam SRR1039519.bam SRR1039520.bam SRR1039521.bam SRR1039522.bam \
   SRR1039523.bam

First, the backslashes merely indicate that the command is broken in several lines. We use the following options: -a Homo_sapiens.GRCh38.94.gtf tells featureCounts to use the reference annotation that we have just downloaded. -o featureCount.out specifies the file name to save the resulting count table. -p directs featureCount to count paired reads as one cDNA fragment, not as two reads. -T 8 means to use 8 CPU cores in parallel. (This makes sense only if you have as many cores in your computer.) Then, we list all the BAM files for which we want counts. A full description can be found in the Subread package user guide.

After about half an hour, featureCounts finished, and writes out a file with a table of counts. We will examine this table in the next lesson.