This is data curtesy of Tobias Roider from Sascha Dietrich’s group at MMPU. For a small pilot study, Tobias took primary cell samples from the lymph nodes of five patients suffering from lymphoma. He spread the cells from each patient over one 384-well plate and copied aliquots from a drug master plate over these, incubated them, then checked the number of remaining viable cells in each well with a flourescence-based ATP-sensitive readout. We have six files, five CSV from the plate reader, and one Excel file with the information on the layout of the drug master plate.
If you want to redo this analysis, you can find the six files here: lymphoma_data.zip.
Before continuing, open one of the CSV files in an arbitrary text editor and study their content. Note that the payload data starts with the 6th line and takes 15 lines.
If we only want to read in this block of 15 lines, we can use
library( tidyverse )
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 2.0.1 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.3.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
read_csv( "Patient_1.csv", skip=5, n_max=16 )
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
To make it easier to work with this data, we transform it from a wide table to a long table. In a “long table” (or, “tidy table” in the tidyverse nomenclature), every table row contains only data on one “object”. (Here, each well is an object.)
read_csv( "Patient_1.csv", skip=5, n_max=16 ) %>%
rename( "plate_row" = "X1") %>%
gather( "plate_col", "intensity", -plate_row )
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
Such a table is useful, for example, for plotting with ggplot:
read_csv( "Patient_1.csv", skip=5, n_max=16 ) %>%
rename( "plate_row" = "X1") %>%
gather( "plate_col", "intensity", -plate_row ) %>%
ggplot +
geom_tile( aes( x = plate_col, y = plate_row, fill = intensity ) )
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
The plate is not rendered in the right order. Let’s fix this by reversing the rows and making clear to R that the columns should be sorted as numbers, not by alphabet. Also, let’s use log2 values for intensity and add white borders to the tiles. The coord_fixed
ensures that x and y axis are scaled the same and hence the wells appear as squares.
read_csv( "Patient_1.csv", skip=5, n_max=16 ) %>%
rename( "plate_row" = "X1") %>%
gather( "plate_col", "intensity", -plate_row ) %>%
mutate( plate_row = fct_rev( plate_row) ) %>%
mutate( plate_col = as.integer(plate_col) ) %>%
ggplot +
geom_tile( aes( x = plate_col, y = plate_row, fill = log2(intensity) ), color="white" ) +
coord_fixed()
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
We want to work with all 5 plates, to let’s write a function which takes a patient number, read the file and produces a long table
read_plate <- function( patient_number ) {
filename <- str_c( "Patient_", patient_number, ".csv" )
read_csv( filename, skip=5, n_max=16 ) %>%
rename( "plate_row" = "X1") %>%
gather( "plate_col", "intensity", -plate_row ) %>%
mutate( plate_col = as.integer(plate_col) ) %>%
add_column( plate_number = patient_number, .before="plate_row" )
}
We can call this function with a number
read_plate( 2 )
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
We want to call this function for all five patients, i.e., for each of these values:
1:5
## [1] 1 2 3 4 5
(1:5
just returns the numbers from 1 to 5.)
The map_dfr
function calls a given function for each element, expects this function to return a data frame (i.e., a table or tibble) and then combined all the returned tables row-wise (i.e., one below the other). There is also map_dfc
, which combnines them column-wise (i.e., side-by-side). We use it to call our read_plate
function for all five plates.
tbl <- map_dfr( 1:5, read_plate )
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
tbl
Now, we have all data in hand and can plot one big overview plot
tbl %>%
mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
geom_tile( aes( x = plate_col, y = plate_row, fill = log2(intensity) ), color="white" ) +
facet_wrap( ~ plate_number ) + coord_fixed()
Next, let’s read in the Excel file with the well annotation
readxl::read_excel( "PlateAnnotation.xlsx" )
## New names:
## * `` -> `..1`
As this is data from an unpublished experiment, we did not want to make the content of the drug master plate public. Therefore, I have replaced all the drug names in the “Drug” column with names of fruits from this list, only leaving the value “DMSO” as it was.
Apart from now having strange drug names, the table is a bit redundant, an has lengthy column names. Let’s clean this up a bit:
readxl::read_excel( "PlateAnnotation.xlsx" ) %>%
select(
plate_row = Row,
plate_col = Column,
drug = Drug,
conc_level = Drug_Concentration,
conc_uMol = Conc_mikroMolar ) %>%
mutate( conc_uMol = as.numeric( conc_uMol ) ) %>%
mutate( plate_col = as.integer(plate_col) ) ->
layout
## New names:
## * `` -> `..1`
layout
Where are the control wells?
layout %>%
mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
geom_tile( aes( x=plate_col, y=plate_row, fill=(drug=="DMSO") ) ) +
coord_fixed()
As you can see, care was taken to spread the control wells over the whole plate. This helps to check for spatial artifacts, as we will see in a minute.
How equal are the control wells? Let’s plot their intensity.
tbl %>%
left_join( layout ) %>%
filter( drug == "DMSO" ) %>%
mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
geom_tile( aes( x = plate_col, y = plate_row, fill = log2(intensity) ), color="white" ) +
facet_wrap( ~ plate_number ) + coord_fixed()
## Joining, by = c("plate_row", "plate_col")
We mainly see differences between plates, but we want to see heterogeneity within the plates. To make this more visible, let’s divide the DMSO control wells of each plate by their median DMSO intensity, to get relative intensity wells (relative to the plate control median), so we call this new columns rel_int
.
We also use the scale_fill_gradient2
function to get a diverging color scale, centered at zero with white and showing gradients towards red for negative and towards blue for positive values. Such a diverging colour scale should always be used when visualizing data spreading around zero. As it would be nice to have the colour scale labelled on a natural scale, we replace fill=log2(rel_int)
with simply fill=rel_int
in the aes
phrase, and instead specify the transformation in the colour scale specification: trans="log2"
.
tbl %>%
left_join( layout ) %>%
filter( drug == "DMSO" ) %>%
group_by( plate_number ) %>%
mutate( rel_int = intensity / median( intensity ) ) %>%
mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
geom_tile( aes( x = plate_col, y = plate_row, fill = rel_int ), color="white" ) +
facet_wrap( ~ plate_number ) + coord_fixed() +
scale_fill_gradient2( trans="log2")
## Joining, by = c("plate_row", "plate_col")
The one outlier on plate 4 has caused ggplot to choose very wide limits for the colour scale, making the other points very pale. Hence, we specify the limits manually (limits=c(.7,1.3)
), and instruct ggplot to “squish” values outside these limits, i.e., to give them the colour of the limit. Unfortunately, ggplot’s automatic tic marking system fails here (producing ony a single tic mark for “1”), so we also have to specify the tics manually (breaks
).
tbl %>%
left_join( layout ) %>%
filter( drug == "DMSO" ) %>%
group_by( plate_number ) %>%
mutate( rel_int = intensity / median( intensity ) ) %>%
mutate( plate_row = fct_rev( plate_row ) ) %>%
ggplot +
geom_tile( aes( x = plate_col, y = plate_row, fill = rel_int ), color="white" ) +
facet_wrap( ~ plate_number ) +
scale_fill_gradient2( trans="log2", limits=c(.7,1.3), oob=scales::squish,
breaks=c( .7, .8, .9, 1, 1.1, 1.2, 1.3 ) ) +
coord_fixed() +
theme( panel.background = element_rect("darkgray"), panel.grid = element_blank() )
## Joining, by = c("plate_row", "plate_col")
Is this spread small enough to ignore the spatial artifacts? This may be easier to judge using a beeswarm plot.
tbl %>%
left_join( layout ) %>%
filter( drug == "DMSO" ) %>%
group_by( plate_number ) %>%
mutate( rel_int = intensity / median( intensity ) ) %>%
ggplot +
ggbeeswarm::geom_beeswarm( aes( x = plate_number, y = rel_int ) ) +
scale_y_continuous( trans="log2", breaks = seq( 0, 2, by=.1 ) )
## Joining, by = c("plate_row", "plate_col")
Enough looking at control wells
Let’s just use the median of each plate’s control wells as normalization reference
tbl %>%
left_join( layout ) %>%
filter( drug=="DMSO" ) %>%
group_by( plate_number ) %>%
summarise( median_dmso_intensity = median( intensity ) ) ->
ctrl_int_tbl
## Joining, by = c("plate_row", "plate_col")
ctrl_int_tbl
Now, use this to normalize our data
tbl %>%
left_join( layout ) %>%
left_join( ctrl_int_tbl ) %>%
mutate( rel_int = intensity / median_dmso_intensity ) ->
tbl2
## Joining, by = c("plate_row", "plate_col")
## Joining, by = "plate_number"
tbl2
Let’s plot one drug, say “Passionfruit”
tbl2 %>%
filter( drug == "Passionfruit") %>%
mutate( plate_number = str_c( "Patient_", plate_number ) ) %>%
ggplot +
geom_line( aes( x = conc_uMol, y = rel_int, col = plate_number ) ) +
scale_x_log10()
By putting a for
loop around these commands, we can make a PDF file with one such plot per page, one for each drugs.
We first get the names of all drugs
layout %>%
filter( drug != "DMSO" ) %>%
pull( "drug" ) %>%
unique ->
drugs
drugs
## [1] "Passionfruit" "Plum" "Apricot"
## [4] "Dragonfruit" "Soursop" "Star fruit"
## [7] "Olive" "Fig" "Apple"
## [10] "Purple mangosteen" "Plantain" "Mulberry"
## [13] "Currant" "Orange" "Mangosteen"
## [16] "Blood orange" "Bilberry" "Raspberry"
## [19] "Nance" "Blackcurrant" "Honeyberry"
## [22] "Watermelon" "Cucumber" "Yuzu"
## [25] "Jabuticaba" "White sapote" "Buddha's hand"
## [28] "Salmonberry" "Black sapote" "Mandarine"
## [31] "White currant" "Papaya" "Raisin"
## [34] "Marionberry" "Cranberry" "Date palm|Date"
## [37] "Plumcot" "Mango" "Redcurrant"
## [40] "Pear" "Juniper berry" "Jackfruit"
## [43] "Cantaloupe" "Jostaberry" "Guava"
## [46] "Surinam cherry" "Persimmon" "Damson"
## [49] "Goji berry" "Clementine" "Coconut"
## [52] "Chico fruit" "Grape" "Cherry"
## [55] "Feijoa" "Salal" "Kiwano"
## [58] "Peach" "Tamarind" "Blackberry"
## [61] "Blueberry" "Star apple" "Lychee"
## [64] "Strawberry"
Now, we open a PDF file for writing (pdf
), run the for
loop, and close the PDF file (dev.off
). The pdf
command means that all subsequent plotting commands should result in the produced plot not being displayed on the screen as usual but rather added as a new page to the PDF file until this rerouting is switched off again with dev.off
.
The for loop runs its body (the part in curly braces) once each of the drugs in the vector drugs
, by putting the value of each drug into the variable dr
(for ( dr in drugs )
) and then runs the commands. These produce one plot. Whenever you use ggplot within a function or a block, you have to surround the whole ggplot command with parenthesis and hand it to print
(for reasons we dobn’t want to go into now), as otherwise nothing gets plotted.
# start a PDF file
pdf( "dose_resp_curves.pdf" )
# Go through the list of drugs
for( dr in drugs ) {
(tbl2 %>%
filter( drug == dr ) %>%
mutate( plate_number = str_c( "Patient_", plate_number ) ) %>%
ggplot +
geom_line( aes( x = conc_uMol, y = rel_int, col = plate_number ) ) +
scale_x_log10() +
ggtitle( dr ) ) %>% print
}
# close the PDF
dev.off()
## png
## 2
You can look at the PDF thus produced here: dose_resp_curves.pdf
For each drug and each patient, we have five response values in the rel_int
column, namely the responses to the five concentrations in which the drug is present on the master plate. Let us simply average over these to get one single value. In a way, this is roughly the same as calculating the area under the dose response curve.
tbl2 %>%
filter( drug != "DMSO" ) %>%
group_by( plate_number, drug ) %>%
summarise( mean_viab = mean( rel_int ) )
We can use the spread
fucntion to turn this long table into a wide table. This is the opposite of gather
, which we have used further up to turn a wide table into a long one. Hence we add one line to the previous code chunk.
tbl2 %>%
filter( drug != "DMSO" ) %>%
group_by( plate_number, drug ) %>%
summarise( mean_viab = mean( rel_int ) ) %>%
spread( plate_number, mean_viab )
To plot this as a heatmap, we have to turn it from a tidyverse tibble into a matrix, because the heatmap plotting functions of R don’t work well with tidyverse objects. The as.matrix
function does this. Before, we turn the column with the drug names into row names for the matrix, so that the matrix itself contains only numbers.
tbl2 %>%
filter( drug != "DMSO" ) %>%
group_by( plate_number, drug ) %>%
summarise( mean_viab = mean( rel_int ) ) %>%
spread( plate_number, mean_viab ) %>%
column_to_rownames( "drug" ) %>%
as.matrix ->
responseMatrix
responseMatrix[ 1:10, ]
## 1 2 3 4 5
## Apple 0.5620512 0.5024576 0.7448491 0.5766846 0.4617407
## Apricot 1.0422428 0.9323655 1.0457924 1.0102802 1.1376190
## Bilberry 1.0906224 1.1949816 0.8827831 1.0272608 0.9978405
## Blackberry 0.9031180 0.9153274 0.9585928 0.9555345 0.9849007
## Blackcurrant 0.9118808 1.0666942 0.9661327 1.0413223 0.8890081
## Black sapote 1.0048951 1.0733791 0.9823822 1.0030341 1.0314541
## Blood orange 1.1242877 1.1626094 0.9071811 1.1148978 1.0449368
## Blueberry 0.6031344 0.8027615 0.9612890 0.6184696 0.6523558
## Buddha's hand 0.7791521 0.8957545 0.9180749 0.6496263 0.8157282
## Cantaloupe 1.1569436 1.0069223 0.8863148 1.0394060 0.9973789
Now, we can use the pheatmap
function from Raivo Kolde’s “pretty heatmaps” package to make our heatmap. Have a look at the help page for pheatmap
to see its many options. We just use one, to tell it to perform hierarchical clustering on our drugs not using the usual Euclidean distance but correlation distance, which works better for drug response data.
library( pheatmap )
pheatmap( responseMatrix, clustering_distance_rows = "correlation" )
We can also look directly at the correlation matrix. In earlier lessons, we have used the cor
function to calculate the correlation between to vectors. Here, we give it a matrix, and it calculates a matrix of correlations between all the matrix columns. As we want to see the correlation between rows, we have to transpose the matrix first, with t
.
pheatmap( cor( t(responseMatrix) ) )
This correlation heatmap is quite noisy, simply because we have samples from only five patients. However, even now, you would see (if you had the real drug names instead of fruit names) that the red squares along the diagonal tend to correspond to drugs with related mode of action. A somewhat larger cohort can hence give a surprising amount of details. See, e.g., Figure 3 of this paper by Dietrich et al.