Reading in the plate files.

Before continuing, open one of the CSV files in an arbitrary text editor and study their content. Note that the payload data starts with the 6th line and takes 15 lines.

If we only want to read in this block of 15 lines, we can use

library( tidyverse )

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

read_csv( "Patient_1.csv", skip=5, n_max=16 )

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )

## See spec(...) for full column specifications.

To make it easier to work with this data, we transform it from a wide table to a long table. In a “long table” (or, “tidy table” in the tidyverse nomenclature), every table row contains only data on one “object”. (Here, each well is an object.)

read_csv( "Patient_1.csv", skip=5, n_max=16 ) %>%
   rename( "plate_row" = "X1") %>%
   gather( "plate_col", "intensity", -plate_row )

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )

## See spec(...) for full column specifications.

Such a table is useful, for example, for plotting with ggplot:

read_csv( "Patient_1.csv", skip=5, n_max=16 ) %>%
   rename( "plate_row" = "X1") %>%
   gather( "plate_col", "intensity", -plate_row ) %>%
ggplot +
   geom_tile( aes( x = plate_col, y = plate_row, fill = intensity ) )

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )

## See spec(...) for full column specifications.

The plate is not rendered in the right order. Let’s fix this by reversing the rows and making clear to R that the columns should be sorted as numbers, not by alphabet. Also, let’s use log2 values for intensity and add white borders to the tiles. The coord_fixed ensures that x and y axis are scaled the same and hence the wells appear as squares.

read_csv( "Patient_1.csv", skip=5, n_max=16 ) %>%
   rename( "plate_row" = "X1") %>%
   gather( "plate_col", "intensity", -plate_row ) %>%
   mutate( plate_row = fct_rev( plate_row) ) %>%
   mutate( plate_col = as.integer(plate_col) ) %>%
ggplot +
   geom_tile( aes( x = plate_col, y = plate_row, fill = log2(intensity) ), color="white" ) +
   coord_fixed()

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )

## See spec(...) for full column specifications.

We want to work with all 5 plates, to let’s write a function which takes a patient number, read the file and produces a long table

read_plate <- function( patient_number ) {
   filename <- str_c( "Patient_", patient_number, ".csv" )
   read_csv( filename, skip=5, n_max=16 ) %>%
   rename( "plate_row" = "X1") %>%
   gather( "plate_col", "intensity", -plate_row ) %>%
   mutate( plate_col = as.integer(plate_col) ) %>%
   add_column( plate_number = patient_number, .before="plate_row" )
}

We can call this function with a number

read_plate( 2 )

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )

## See spec(...) for full column specifications.

We want to call this function for all five patients, i.e., for each of these values:

1:5

## [1] 1 2 3 4 5

(1:5 just returns the numbers from 1 to 5.)

The map_dfr function calls a given function for each element, expects this function to return a data frame (i.e., a table or tibble) and then combined all the returned tables row-wise (i.e., one below the other). There is also map_dfc, which combnines them column-wise (i.e., side-by-side). We use it to call our read_plate function for all five plates.

tbl <- map_dfr( 1:5, read_plate )

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )

## See spec(...) for full column specifications.

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )
## See spec(...) for full column specifications.

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )
## See spec(...) for full column specifications.

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )
## See spec(...) for full column specifications.

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )
## See spec(...) for full column specifications.

tbl

Now, we have all data in hand and can plot one big overview plot

tbl %>%
   mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
   geom_tile( aes( x = plate_col, y = plate_row, fill = log2(intensity) ), color="white" ) +
   facet_wrap( ~ plate_number ) + coord_fixed()

Plate annotation

Next, let’s read in the Excel file with the well annotation

readxl::read_excel( "PlateAnnotation.xlsx" )

## New names:
## * `` -> `..1`

As this is data from an unpublished experiment, we did not want to make the content of the drug master plate public. Therefore, I have replaced all the drug names in the “Drug” column with names of fruits from this list, only leaving the value “DMSO” as it was.

Apart from now having strange drug names, the table is a bit redundant, an has lengthy column names. Let’s clean this up a bit:

readxl::read_excel( "PlateAnnotation.xlsx" ) %>%
   select( 
      plate_row = Row,
      plate_col = Column,
      drug = Drug,
      conc_level = Drug_Concentration,
      conc_uMol = Conc_mikroMolar ) %>%
   mutate( conc_uMol = as.numeric( conc_uMol ) ) %>%
   mutate( plate_col = as.integer(plate_col) ) ->
      layout

## New names:
## * `` -> `..1`

layout

Checking the control wells

Where are the control wells?

layout %>%
mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
   geom_tile( aes( x=plate_col, y=plate_row, fill=(drug=="DMSO") ) ) + 
   coord_fixed()

As you can see, care was taken to spread the control wells over the whole plate. This helps to check for spatial artifacts, as we will see in a minute.

How equal are the control wells? Let’s plot their intensity.

tbl %>%
   left_join( layout ) %>%
   filter( drug == "DMSO" ) %>%
   mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
   geom_tile( aes( x = plate_col, y = plate_row, fill = log2(intensity) ), color="white" ) +
   facet_wrap( ~ plate_number ) + coord_fixed()

## Joining, by = c("plate_row", "plate_col")

We mainly see differences between plates, but we want to see heterogeneity within the plates. To make this more visible, let’s divide the DMSO control wells of each plate by their median DMSO intensity, to get relative intensity wells (relative to the plate control median), so we call this new columns rel_int.

We also use the scale_fill_gradient2 function to get a diverging color scale, centered at zero with white and showing gradients towards red for negative and towards blue for positive values. Such a diverging colour scale should always be used when visualizing data spreading around zero. As it would be nice to have the colour scale labelled on a natural scale, we replace fill=log2(rel_int) with simply fill=rel_int in the aes phrase, and instead specify the transformation in the colour scale specification: trans="log2".

tbl %>%
   left_join( layout ) %>%
   filter( drug == "DMSO" ) %>%
   group_by( plate_number ) %>%
   mutate( rel_int = intensity / median( intensity ) ) %>%
   mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
   geom_tile( aes( x = plate_col, y = plate_row, fill = rel_int ), color="white" ) +
   facet_wrap( ~ plate_number ) + coord_fixed() +
   scale_fill_gradient2( trans="log2")

## Joining, by = c("plate_row", "plate_col")

The one outlier on plate 4 has caused ggplot to choose very wide limits for the colour scale, making the other points very pale. Hence, we specify the limits manually (limits=c(.7,1.3)), and instruct ggplot to “squish” values outside these limits, i.e., to give them the colour of the limit. Unfortunately, ggplot’s automatic tic marking system fails here (producing ony a single tic mark for “1”), so we also have to specify the tics manually (breaks).

tbl %>%
   left_join( layout ) %>%
   filter( drug == "DMSO" ) %>%
   group_by( plate_number ) %>%
   mutate( rel_int = intensity / median( intensity ) ) %>%
   mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
   geom_tile( aes( x = plate_col, y = plate_row, fill = rel_int ), color="white" ) +
   facet_wrap( ~ plate_number ) +
   scale_fill_gradient2( trans="log2", limits=c(.7,1.3), oob=scales::squish, 
      breaks=c( .7, .8, .9, 1, 1.1, 1.2, 1.3 ) ) + 
   coord_fixed() +
   theme( panel.background = element_rect("darkgray"), panel.grid = element_blank() )

## Joining, by = c("plate_row", "plate_col")

Is this spread small enough to ignore the spatial artifacts? This may be easier to judge using a beeswarm plot.

tbl %>%
   left_join( layout ) %>%
   filter( drug == "DMSO" ) %>%
   group_by( plate_number ) %>%
   mutate( rel_int = intensity / median( intensity ) ) %>%
ggplot +
   ggbeeswarm::geom_beeswarm( aes( x = plate_number, y = rel_int ) ) +
   scale_y_continuous( trans="log2", breaks = seq( 0, 2, by=.1 ) )

## Joining, by = c("plate_row", "plate_col")

Drug response curves

Enough looking at control wells

Let’s just use the median of each plate’s control wells as normalization reference

tbl %>%
   left_join( layout ) %>%
   filter( drug=="DMSO" ) %>%
   group_by( plate_number ) %>%
   summarise( median_dmso_intensity = median( intensity ) ) ->
      ctrl_int_tbl

## Joining, by = c("plate_row", "plate_col")

ctrl_int_tbl

Now, use this to normalize our data

tbl %>% 
   left_join( layout ) %>%
   left_join( ctrl_int_tbl ) %>%
   mutate( rel_int = intensity / median_dmso_intensity ) ->
      tbl2

## Joining, by = c("plate_row", "plate_col")

## Joining, by = "plate_number"

tbl2

Let’s plot one drug, say “Passionfruit”

tbl2 %>%
   filter( drug == "Passionfruit") %>%
   mutate( plate_number = str_c( "Patient_", plate_number ) ) %>%
ggplot +
   geom_line( aes( x = conc_uMol, y = rel_int, col = plate_number ) ) +
   scale_x_log10()

By putting a for loop around these commands, we can make a PDF file with one such plot per page, one for each drugs.

We first get the names of all drugs

layout %>%
   filter( drug != "DMSO" ) %>%
   pull( "drug" ) %>%
   unique ->
      drugs
drugs

##  [1] "Passionfruit"      "Plum"              "Apricot"          
##  [4] "Dragonfruit"       "Soursop"           "Star fruit"       
##  [7] "Olive"             "Fig"               "Apple"            
## [10] "Purple mangosteen" "Plantain"          "Mulberry"         
## [13] "Currant"           "Orange"            "Mangosteen"       
## [16] "Blood orange"      "Bilberry"          "Raspberry"        
## [19] "Nance"             "Blackcurrant"      "Honeyberry"       
## [22] "Watermelon"        "Cucumber"          "Yuzu"             
## [25] "Jabuticaba"        "White sapote"      "Buddha's hand"    
## [28] "Salmonberry"       "Black sapote"      "Mandarine"        
## [31] "White currant"     "Papaya"            "Raisin"           
## [34] "Marionberry"       "Cranberry"         "Date palm|Date"   
## [37] "Plumcot"           "Mango"             "Redcurrant"       
## [40] "Pear"              "Juniper berry"     "Jackfruit"        
## [43] "Cantaloupe"        "Jostaberry"        "Guava"            
## [46] "Surinam cherry"    "Persimmon"         "Damson"           
## [49] "Goji berry"        "Clementine"        "Coconut"          
## [52] "Chico fruit"       "Grape"             "Cherry"           
## [55] "Feijoa"            "Salal"             "Kiwano"           
## [58] "Peach"             "Tamarind"          "Blackberry"       
## [61] "Blueberry"         "Star apple"        "Lychee"           
## [64] "Strawberry"

Now, we open a PDF file for writing (pdf), run the for loop, and close the PDF file (dev.off). The pdf command means that all subsequent plotting commands should result in the produced plot not being displayed on the screen as usual but rather added as a new page to the PDF file until this rerouting is switched off again with dev.off.

The for loop runs its body (the part in curly braces) once each of the drugs in the vector drugs, by putting the value of each drug into the variable dr (for ( dr in drugs )) and then runs the commands. These produce one plot. Whenever you use ggplot within a function or a block, you have to surround the whole ggplot command with parenthesis and hand it to print (for reasons we dobn’t want to go into now), as otherwise nothing gets plotted.

# start a PDF file
pdf( "dose_resp_curves.pdf" )

# Go through the list of drugs
for( dr in drugs ) {
   (tbl2 %>%
      filter( drug == dr ) %>%
      mutate( plate_number = str_c( "Patient_", plate_number ) ) %>%
   ggplot +
      geom_line( aes( x = conc_uMol, y = rel_int, col = plate_number ) ) +
      scale_x_log10() +
      ggtitle( dr ) ) %>% print
}

# close the PDF
dev.off()

## png 
##   2

You can look at the PDF thus produced here: dose_resp_curves.pdf

Getting AUC values

For each drug and each patient, we have five response values in the rel_int column, namely the responses to the five concentrations in which the drug is present on the master plate. Let us simply average over these to get one single value. In a way, this is roughly the same as calculating the area under the dose response curve.

tbl2 %>%
   filter( drug != "DMSO" ) %>%
   group_by( plate_number, drug ) %>%
   summarise( mean_viab = mean( rel_int ) )

We can use the spread fucntion to turn this long table into a wide table. This is the opposite of gather, which we have used further up to turn a wide table into a long one. Hence we add one line to the previous code chunk.

tbl2 %>%
   filter( drug != "DMSO" ) %>%
   group_by( plate_number, drug ) %>%
   summarise( mean_viab = mean( rel_int ) ) %>%
   spread( plate_number, mean_viab )

To plot this as a heatmap, we have to turn it from a tidyverse tibble into a matrix, because the heatmap plotting functions of R don’t work well with tidyverse objects. The as.matrix function does this. Before, we turn the column with the drug names into row names for the matrix, so that the matrix itself contains only numbers.

tbl2 %>%
   filter( drug != "DMSO" ) %>%
   group_by( plate_number, drug ) %>%
   summarise( mean_viab = mean( rel_int ) ) %>%
   spread( plate_number, mean_viab ) %>%
   column_to_rownames( "drug" ) %>%
   as.matrix ->
      responseMatrix

responseMatrix[ 1:10, ]

##                       1         2         3         4         5
## Apple         0.5620512 0.5024576 0.7448491 0.5766846 0.4617407
## Apricot       1.0422428 0.9323655 1.0457924 1.0102802 1.1376190
## Bilberry      1.0906224 1.1949816 0.8827831 1.0272608 0.9978405
## Blackberry    0.9031180 0.9153274 0.9585928 0.9555345 0.9849007
## Blackcurrant  0.9118808 1.0666942 0.9661327 1.0413223 0.8890081
## Black sapote  1.0048951 1.0733791 0.9823822 1.0030341 1.0314541
## Blood orange  1.1242877 1.1626094 0.9071811 1.1148978 1.0449368
## Blueberry     0.6031344 0.8027615 0.9612890 0.6184696 0.6523558
## Buddha's hand 0.7791521 0.8957545 0.9180749 0.6496263 0.8157282
## Cantaloupe    1.1569436 1.0069223 0.8863148 1.0394060 0.9973789

Now, we can use the pheatmap function from Raivo Kolde’s “pretty heatmaps” package to make our heatmap. Have a look at the help page for pheatmap to see its many options. We just use one, to tell it to perform hierarchical clustering on our drugs not using the usual Euclidean distance but correlation distance, which works better for drug response data.

library( pheatmap )
pheatmap( responseMatrix, clustering_distance_rows = "correlation" )

We can also look directly at the correlation matrix. In earlier lessons, we have used the cor function to calculate the correlation between to vectors. Here, we give it a matrix, and it calculates a matrix of correlations between all the matrix columns. As we want to see the correlation between rows, we have to transpose the matrix first, with t.

pheatmap( cor( t(responseMatrix) ) )

This correlation heatmap is quite noisy, simply because we have samples from only five patients. However, even now, you would see (if you had the real drug names instead of fruit names) that the red squares along the diagonal tend to correspond to drugs with related mode of action. A somewhat larger cohort can hence give a surprising amount of details. See, e.g., Figure 3 of this paper by Dietrich et al.

SFB 1036 Course on Data Analysis: Lesson 5

Simon Anders

Exploring data from a drug perturbation experiment in 384-well format

The experiment

Reading in the plate files.

Plate annotation

Checking the control wells

Drug response curves

Getting AUC values