SFB 1036 Course on Data Analysis: Lesson 1

R and the Tidyverse

This write-up provides an overview over the content of the first lecture. We use here the statistical programming language R. For thsi first lecture, try to understand the statistical concepts but for now only glance over the R commands. We will discuss details in later lectures on what exactly they mean and how you can write them yourself.

However, if you want to look ahead: We restrict ourself to using a subset of R, called the “tidyverse”. The tidyverse is a project to “tidy up” the complicated universe of dozens of styles and thousands of commands that R has become in its over 30 years of history. For beginners, I recommend to start with learning only the tidyverse part of R.

There is an excellent textbook by the makers of the Tidyverse that teaches you data science using R:

Garrett Grolemund and Hadley Wickham: R for Data Science online version: for free at http://r4ds.had.co.nz/ print version: O’Reilly, 2017; ISBN-13: 978-1491910399

Exploratory and confirmatory analysis

Exploratory data analysis (EDA): If you get data, you should spend substantial time to explore the data set by looking at it from many angles, on the one hand to learn about peculiarities, strengths and weaknesses of your data, on the other hand, to find new insights and form hypotheses. Knowing many techniques for EDA is essential to do good data science.

Confirmatory analysis (inferential statistics): Once you have found something interesting or if you want to test a hypothesis, you have to try to reject the null hypothesis that what you see in the data is just the result of random fluctuation. We will delay the discussion of what this exactly means, why this is called “inference”, and what p values really are, to the next lesson.

It is important to be well versed in both of these basic data analysis tasks, and to not neglect one for the other. This first lesson will focus on simple EDA.

The NHANES data

Our example data set for today is the National Health and Nutrition Examination Survey (NHANES) of the Center for Disease Control and Prevention (CDC). We will use the NHANES data set for 2015-2016 (Continuous NHANES, data set “I”), and there the Demographics data (“DEMO”) and the Body Measures part (“BMX”) of the Examinations data. We first download the two data tables: DEMO_I.XPT and BMX_I.XPT. If you want to fully understand this data, look at the study overview and at the documentation pages for DEMO and for BMX

We first load the Tidyverse packages

library( tidyverse )

## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load the two data tables¹

nhanes_demo_i <- haven::read_xpt( "NHANES/DEMO_I.XPT" )
nhanes_bmx_i <- haven::read_xpt( "NHANES/BMX_I.XPT" )

Have a look at the demographics table

nhanes_demo_i

The meaning of the abbreviations for the column names are given in the data documentation page, and also in the columns’ “label attributes” which are only displayed if we explictely ask R for that:²

# Don't worry about this complicated looking command now; just look at the results
map_chr( nhanes_demo_i, attr, "label" ) %>% { tibble( colname = names(.), description = . ) }

Let’s also have a look at the body measurements table

nhanes_bmx_i

And the column descriptions:

map_chr( nhanes_bmx_i, attr, "label" ) %>% { tibble( colname = names(.), description = . ) }

To make our life easier, let’s merge the two columns, remove most columns, keep only those that we need and give them simpler names

# Again, don't worry about this complex looking code for now; we'll get there eventually.

# Merge the two tables, using the "SEQN" column to know which rows belong together
inner_join( 
    nhanes_demo_i, 
    nhanes_bmx_i, by="SEQN" ) %>%
# Select the columns we need, and rename them
  select( 
     subjectId = SEQN,
     age = RIDAGEYR,
     sex = RIAGENDR,
     height = BMXHT,
     weight = BMXWT,
     ethnicity = RIDRETH3,
     education = DMDEDUC2,
     bornInUS = DMDBORN4 ) %>%
# Change the values in the "sex" column from 1 and 2 to "male" and "female"  
  mutate(
    sex = fct_recode( factor(sex), 
      "male"="1", "female"="2" ),
    ethnicity = fct_recode( factor(ethnicity), 
      "mexican"="1", "hispanic"="2", "white"="3", "black"="4", 
         "asian"="6", "other"="7" ),
    education = fct_recode( factor(education),
      "below9"="1", "9to11"="2", "high_school"="3", "some_college"="4",
         "college"="5", "refused"="7", "dontknow"="9" ),
    bornInUS = fct_recode( factor(bornInUS),
      "yes"="1", "no"="2", "refused"="77", "dontknow"="99" )
  ) -> 
# Save the result ("->") in the following variable  
    nhanes

## Warning: Unknown levels in `f`: 7

## Warning: Unknown levels in `f`: 77

At the end of this complex operation, we have saved the result (using the arrow operator ->) in a new variable called nhanes. Lets’ have a look into it

nhanes

Data exploration with scatter plots

Scatter plots (also called x-y plots) are the basic tool to explore potential relations between two variables. For example, how do people grow with age?

ggplot( nhanes ) +
  geom_point(
    aes(
      x = age,
      y = height
    )
  )

## Warning: Removed 775 rows containing missing values (geom_point).

We can see that people seem to grow up to age 16 or so, and the stay at this height. (Is this correct? Time to think about cross-sectional versus longitudinal data…).

Let’s dissect the command: With ggplot( nhanes ), we tell R that we wish to plot data from the nhanes data table, using the “ggplot” plotting functionality. Then, we specify that we wish to make a scatter plot, i.e., a plot in which each entry in our data table (i.e., each table row) is represented as a point: ggplot( nhanes ) + geom_point( ... ). In ggplot’s parlance, the thing that represents a data entity (here, a point) is called a “geom”, and the specification what exactly it should represent is called (somewhat misleadingly) the “aesthetics” (aes). A geom_point needs at least two aesthetics specification, namely x and y.

Let’s refine things a bit:

ggplot( nhanes ) +
  geom_point(
    aes(
      x = age,
      y = height,
      color = sex
    ),
    size = 0.4,
    alpha = 0.5
  )

## Warning: Removed 775 rows containing missing values (geom_point).

Now I have added another aesthetic, called “color”, and set it to represent sex. Note how ggplot automatically has chosen two colors for the two sexes and added a color legend to the right. I have also set size to 0.4 to make the points a bit smaller and alpha = 0.5 makes them half transparent (alpha=1, the default, is opaque: you can see only the point on top; and alpha=0 would be fully transparent, i.e., invisible). These two parameters are aesthetics, too. They are, however, outside the aes(...) construct because they do not depend on the data table but are the same for all points. You can, however, also put them inside and use, e.g. differently sized points (so-called bubble or balloon plot).

To learn more about geom_point and the aestethics, type ?geom_plot to see the help page for this function.

Why does R warn us about missing data? Let’s check for which data records the values for age or height are not available (abbreviated: NA).

nhanes %>%
  filter( is.na(age) | is.na(height) )

How about weight?

ggplot( nhanes ) +
  geom_point(
    aes(
      x = height,
      y = weight,
      color = sex
    ),
    size = 0.4,
    alpha = 0.5
  )

## Warning: Removed 788 rows containing missing values (geom_point).

Let’s redo this using only adult subjects (age >= 20):

nhanes %>%
  filter( age >= 20 ) %>%
ggplot() +
  geom_point(
    aes(
      x = height,
      y = weight,
      color = sex
    ),
    size = 0.4,
    alpha = 0.5
  )

## Warning: Removed 68 rows containing missing values (geom_point).

Which of these people are overweight or obese? What is a healthy weight depends on body height. Supposedly, dividing weight by the square of the height (height^2) normalizes for the effect of height differences. This is called the body mass index (BMI):

nhanes %>%
  filter( age >= 20 ) %>%
ggplot() +
  geom_point(
    aes(
      x = height,
      y = weight / ( height/100 )^2,
      color = sex
    ),
    size = 0.4,
    alpha = 0.5
  )

## Warning: Removed 68 rows containing missing values (geom_point).

Note that we have divided height by 100 for the BMI, because our column gives height in centimeters but the BMI formula wants it in meters: BMI = weight[kg] / (height[m])^2.

For future use, let’s add a new column to our data table for the BMI. The (strangely named) function mutate allows us to add or change columns.

nhanes %>%
  mutate( bmi = weight / (height/100)^2 ) ->
    nhanes

Note the -> nhanes at the end, to write the result back into the variable. Otherwise, R would merely print the resulting table onto the screen and forget about it.

The WHO considers BMI values from 18.5 to 25 as normal, below that as underweight, above as overweight, and above 30 as obese. Let’s draw some lines to mark this boundaries

nhanes %>%
  filter( age >= 20 ) %>%
ggplot() +
  geom_point(
    aes(
      x = height,
      y = bmi,
      color = sex
    ),
    size = 0.4,
    alpha = 0.5
  ) +
  geom_hline(
    yintercept = c( 18.5, 25, 30 ),
    alpha = 0.4
  )

## Warning: Removed 68 rows containing missing values (geom_point).

Now, we have seen a second “geom”: geom_hline for a horizontal line. The aesthetics to say where to place the line is called (maybe a bit verbosely) yintercept. Instead of specifying a column name, we give the values directly: c( 18.5, 25, 30). The function c here builds a data vector (i.e., just a rowlist of numbers) out of the individual numbers. (“c” stands for “concatenate”.)

Histograms

It is hard to see in the scatter plots hwo many people are actually obese. A histogram might be better

nhanes %>%
  filter( age >= 20 ) %>%
ggplot() +
  geom_histogram(
    aes( x = bmi )
  )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 68 rows containing non-finite values (stat_bin).

The “histogram” geom automatically cuts the bmi values into 30 bins, counts the number of subjects that fall into each bin and plots these.

Remember that in a histogram, the area corresponds to numbers of subjects, not the height.

Let’s add marker lines for the WHO categories and stratify by sex

nhanes %>%
  filter( age >= 20 ) %>%
ggplot() +
  geom_histogram(
    aes( x = bmi ),
    binwidth = 2 ) +
  geom_vline( xintercept = c( 18.5, 25, 30 ), alpha=.3 ) +
  facet_grid( sex ~ . )

## Warning: Removed 68 rows containing non-finite values (stat_bin).

Here, we have a new feature: facetting (also called trellis plotting). The facet_grid term arranges the plots into a grid, such that each row corresponds to another level of the factor “sex” and each column corresponds to another level of “the factor the”empty factor" .. (In other words: The . behind the tilde³ is just a place holder, where we could put another column name if we wished to have more than one column.)

Let’s do the same with the height data

nhanes %>%
  filter( age >= 20 ) %>%
ggplot() +
  geom_histogram(
    aes( x = height, fill=sex ), 
    alpha=0.5, position="identity", binwidth = 2 )

## Warning: Removed 57 rows containing non-finite values (stat_bin).

Instead of facetting, we have this time put both histograms into the same plot, and used transparency (alpha) to make both visible. Note the position="identity": the dafault is position="stack" but stacking the bars on top of each other looks misleading (to me).

Another possibility of showing the same data is aa density estimate:

nhanes %>%
  filter( age >= 20 ) %>%
ggplot() +
  geom_density(
    aes( x = height, col=sex ) )

## Warning: Removed 57 rows containing non-finite values (stat_density).

Counting with tidyverse

From the BMI histogram above, it seems that more than half of the population is overweight. Can we get exact numbers? Yes:

nhanes %>%
  filter( age >= 20) %>%          # use adults only
  count( sex, bmi > 25 )          # Divide up the data into groups, count the number of rows in each group

The count function above is just a short-cut for this here

nhanes %>%
  filter( age >= 20) %>%
  group_by( sex, bmi > 25 ) %>%
  summarise( number = n() )

This is a construction that we will use a lot, so let’s dissect this simple first example: group_by divides up the data table rows into groups according to sex and according to bmi > 25. As you can see from the six rows of the resulting table, there are six combination of these factors and hence six groups, each comprising some of the rows of the data table. The function summarise now reduces each group to a single row in the output table, by applying the function n onto each group and storing the result obtained in the column number. The function n simply counts the number of data table rows in the group.

Let’s also study a slightly different construction:

nhanes %>%
  filter( age >= 20 ) %>%
  filter( !is.na(bmi) ) %>%    # Keep only rows for which the BMI is not NA  ("!" means "not")
  group_by( sex ) %>%   
  summarise( 
    number = n(),
    number_overweight = sum( bmi>25 ) )

Here, we have only one grouping variable, sex, with only two levels, male and female, and hence get an output table with only two rows. Now, summarise calculates two summery statistics and hence makes two new columns: number is just the number of rows in the group, as reported by n, and number_overweight is the number of subjects with a BMI over 25.

To understand this last part, let’s break it up into parts. First, let’s use mutate again (which, as you may recall, is the strangely named function that adds or changes a column).

nhanes %>%
  mutate( is_overweight = bmi>25 )

As you can see, a condition like bmi>25 simply yields a column with TRUE and FALSE as values.⁴. Internally, a TRUE is represented as 1 and a FALSE as 0. Hence, the sum over such a vector (as in sum( bmi>25 )) is just the number of ones, i.e., of TRUEs. This call to sum takes a whole column and outputs a single scalar. hence it has to appear inside summerise:

nhanes %>%
  filter( !is.na(bmi) ) %>%
  mutate( is_overweight = bmi>25 ) %>%
  summarise(
    number_overweight = sum( is_overweight )
  )

Putting all this back together, and adding one last line to divide the number of overweight people by the total number of people, we get the fraction of subjects that are overweight:

nhanes %>%
  filter( !is.na(bmi) ) %>%
  mutate( is_overweight = bmi>25 ) %>%
  group_by( sex ) %>%
  summarise(
    number = n(),
    number_overweight = sum( is_overweight )
  ) %>%
  mutate( percent_overweight = 100 * number_overweight / number )

If we only want to get the percentage, we can collapse a few lines and write this shorter

nhanes %>%
  filter( !is.na(bmi) ) %>%
  group_by( sex ) %>%
  summarise( number = 100 * sum( bmi > 25 ) / n() )

We have just learned some basics of “dplyr”, the part of the tidyverse concerned with manipulating data tables:

%>% hands over data from one command to the next in a chain of commands
filter selects specific table rows and removes the others. Inside a filter call, specify conditions to indicate the rows you want to keep.
select selects specific table columns and removes the other. Inside select, list the columns you want to keep, and possible indicate new names if you want to change the column name
mutate changes a column or adds a new column. Within the mutate call, you always have expressions of the structure column_name = some_function( ... ), where the calculation on the right hand side of the equal sign should take data vectors (i.e., columns) and result in data vectors of the same length. mutate may add columsn but the number of rows is the same before and after a mutation.
group_by assigns the rows to distinct groups, according to the columns indicated
summarise reduces each of these groups of data rows to a single row. The expressions in the summerise call hence have to be calculations that take vectors (columsn) but produce only single values.⁵

Once you are more familiar with these, you will find the “Data transformations with dplyr” Cheat Sheer a handy guide.

More data exploration

How does the prevalence of obesity change with age?

nhanes %>%
  filter( age >= 17.5 ) %>%          
  filter( !is.na(bmi) ) %>%        
  mutate( age_rounded = round( age/5 ) * 5 ) %>%
  group_by( sex, age_rounded ) %>%
  summarise( frac_overweight = sum( bmi>25 ) / n() ) %>%
ggplot + 
  geom_line( aes( x=age_rounded, y=frac_overweight, col=sex ) )

What was new this time? We used a new geom, geom_line, for a line chart. And we used a trick to bin our data into age bins of 5 years: Dividing the age by 5, rounding and then re-multiplying with 5.

FInally, here is more complicated example, just to show how to build more complex graphics

nhanes %>%
  filter( age >= 15 ) %>%          
  filter( !is.na(bmi) ) %>%        
  mutate( category = cut( bmi,     
     breaks = c( 0, 18.5, 25, 30, Inf ),
     labels = c( "underweight", "normal", "overweight", "obese" ),
     ordered = TRUE ) ) %>%
  mutate( category = fct_rev(category) ) %>%
  mutate( age = cut_interval( age, length=5, ordered=TRUE ) ) %>%
  count( sex, age, category ) %>%
  group_by( sex, age ) %>%
  mutate( frac = n / sum(n) ) %>%
ggplot + 
  geom_col(aes(x=age,y=frac,fill=category)) +
  facet_grid( sex ~ . )

More on dplyr

There are important operations that we have not touched yet, especially joining and reshaping data tables. These are crucial tools to replace error-prone copy-and-paste operations in Excel.

Footnotes

These tables are in the XPT format, which is the data export format of SAS, another statistics package. We use here a function from the tidyverse package “haven”, which collects functionality to read such data in “foreign” formats.↩
Such column description are not really typical of tidyverse R. They are here because the NHANES people seem to use SAS, another statistics package, which organises data this way.↩
“Tilde” (pronounced /ˈtɪldə/) is how one calls the “wavy dash” sign: ~↩
This is called a “logical” or “Boolean” vector.↩
Such calculations produce so-called “summary statistics”. Typical examples are the sum of the values (sum), their average (mean) or simply their number (n).↩