Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation “by group”. You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").
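
As a minimal sketch of how these verbs chain together with group_by(), the following uses R's built-in mtcars data set. That data set and the name mpg_by_cyl are assumptions for illustration and are not part of this lesson's data.

```r
library(dplyr)

# mtcars ships with R; it stands in here for any local data frame
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                 # keep cars with more than 100 horsepower
  mutate(kpl = mpg * 0.425144) %>%     # add kilometres per litre from miles per US gallon
  select(cyl, mpg, kpl) %>%            # keep only the columns used below
  group_by(cyl) %>%                    # perform the summary "by group"
  summarise(mean_kpl = mean(kpl), n = n()) %>%
  arrange(desc(mean_kpl))              # order groups from most to least efficient

mpg_by_cyl
```

Each verb takes a data frame and returns a data frame, which is why the steps can be piped one after another.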

dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code. Install the dbplyr package then read vignette("databases", package = "dbplyr").

Lesson

Data Transformation with dplyr

Variable recoding with dplyr

recode and recode_factor: Replace numeric values based on their position, and character values by their name;
if_else: Replace values based on a logical vector;
case_when: Vectorise multiple if and else if statements.
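
To make the differences between these functions concrete, here is a small sketch on toy vectors. The vectors mode_code and trip_min and their labels are hypothetical and are not taken from the NHTS data.

```r
library(dplyr)

# Hypothetical toy vectors for illustration only
mode_code <- c("01", "02", "99")
trip_min  <- c(5, 45, NA)

# recode(): replace character values by name; unmatched values take .default
mode_label <- recode(mode_code, "01" = "car", "02" = "van",
                     .default = NA_character_)

# recode_factor(): same idea, but returns a factor whose levels follow
# the order of the replacements rather than the order of the data
mode_factor <- recode_factor(c("02", "01", "02"), "01" = "car", "02" = "van")

# if_else(): a single logical test; `missing` sets the value for NA conditions
trip_len <- if_else(trip_min > 30, "long", "short", missing = "unknown")

# case_when(): several tests checked in order; values matching none become NA
trip_band <- case_when(
  trip_min < 10  ~ "short",
  trip_min < 60  ~ "medium",
  trip_min >= 60 ~ "long"
)
```

Note that as of dplyr 1.1.0 recode() is marked as superseded in favor of case_match(); it still works, and the examples in this lesson keep using it.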

Recoding: when to use which function

one-to-one, many-to-one: recode and recode_factor

Download the NHTS 2009 data file for the demos here (Right click & select Save As…)

library(tidyverse)

# load NHTS2009 travel diaries subset
dd <- read_csv("data/NHTS2009_dd.csv")

## Parsed with column specification:
## cols(
##   HOUSEID = col_integer(),
##   PERSONID = col_character(),
##   HHSIZE = col_integer(),
##   HH_RACE = col_character(),
##   HHFAMINC = col_character(),
##   URBRUR = col_character(),
##   TRIPPURP = col_character(),
##   TRPTRANS = col_character(),
##   TRVLMIN = col_integer(),
##   TRPMILES = col_double()
## )

# recode race (HH_RACE column) according to data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=951
dd %>% mutate(hh_race_str=recode(HH_RACE, 
                                 "01"="White",
                                 "02"="African American, Black",
                                 "03"="Asian Only",
                                 "04"="American Indian, Alaskan Native",
                                 "05"="Native Hawaiian, other Pacific",
                                 "06"="Multiracial",
                                 "07"="Hispanic/Mexican",
                                 "97"="Other specify",
                                 .default = NA_character_ # any unspecified values are assigned NA
                                 )) %>% 
  select(HH_RACE, hh_race_str)

## # A tibble: 304 x 2
##    HH_RACE hh_race_str
##      <chr>       <chr>
##  1      01       White
##  2      01       White
##  3      01       White
##  4      01       White
##  5      01       White
##  6      01       White
##  7      01       White
##  8      01       White
##  9      01       White
## 10      01       White
## # ... with 294 more rows
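
The race example above is a one-to-one recode. A many-to-one recode works the same way: several old values simply share one new label. The sketch below uses hypothetical purpose codes and labels, not the NHTS codebook.

```r
library(dplyr)

# Hypothetical trip-purpose codes; labels are illustrative only
purpose <- c("10", "11", "20", "30", "99")

purpose_grp <- recode(purpose,
                      "10" = "work",
                      "11" = "work",     # two codes collapse into one label
                      "20" = "school",
                      "30" = "medical",
                      .default = NA_character_)  # anything else becomes NA
purpose_grp
```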

a logical condition: if_else

# code driving & non-driving based on travel modes (TRPTRANS column) data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=1084
dd %>% mutate(driving=if_else(TRPTRANS %in% c("01", "02", "03", "04", "05", "06", "07"), 1, 0),
              driving=if_else(TRPTRANS %in% c("-1", "-7", "-8", "-9"), NA_real_, driving) # retain missing values as NA
             ) %>% 
  select(TRPTRANS, driving)

## # A tibble: 304 x 2
##    TRPTRANS driving
##       <chr>   <dbl>
##  1       03       1
##  2       03       1
##  3       03       1
##  4       03       1
##  5       03       1
##  6       03       1
##  7       03       1
##  8       03       1
##  9       03       1
## 10       03       1
## # ... with 294 more rows
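
One reason to prefer dplyr's if_else over base R's ifelse is that it checks that both branches have compatible types. The sketch below, on a hypothetical vector x of trip distances, shows the difference and the optional missing argument.

```r
library(dplyr)

x <- c(12, NA, 3)   # hypothetical trip distances in miles

# base ifelse() silently coerces mixed types: the result is character
base_result <- ifelse(x > 5, "far", 0)

# if_else() refuses to mix a character and a number, so the mistake surfaces early
strict_result <- tryCatch(
  if_else(x > 5, "far", 0),
  error = function(e) "type mismatch"
)

# With consistent types, `missing` says what an NA condition should become
labelled <- if_else(x > 5, "far", "near", missing = "unknown")
```

When a branch has to be NA, a typed missing value such as NA_real_ or NA_character_ keeps the result type unambiguous.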

multiple logical conditions: case_when

# code driving & non-driving based on travel modes (TRPTRANS column), this time using case_when; data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=1084
dd %>% mutate(driving=case_when(
  TRPTRANS %in% c("01", "02", "03", "04", "05", "06", "07") ~ 1, 
  TRPTRANS %in% c("-1", "-7", "-8", "-9") ~ as.double(NA), # retain missing values as NA
  TRUE ~ 0)) %>% 
  select(TRPTRANS, driving)

## # A tibble: 304 x 2
##    TRPTRANS driving
##       <chr>   <dbl>
##  1       03       1
##  2       03       1
##  3       03       1
##  4       03       1
##  5       03       1
##  6       03       1
##  7       03       1
##  8       03       1
##  9       03       1
## 10       03       1
## # ... with 294 more rows
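
In case_when the order of the conditions matters: each value takes the first branch whose condition is TRUE, which is why the catch-all TRUE ~ 0 comes last above. A small sketch on a hypothetical trip_miles vector:

```r
library(dplyr)

trip_miles <- c(0.5, 3, 25, NA)   # hypothetical distances

# Conditions are tested top to bottom; each value takes the first TRUE branch
ordered_ok <- case_when(
  is.na(trip_miles) ~ "unknown",
  trip_miles < 1    ~ "walkable",
  trip_miles < 10   ~ "short",
  TRUE              ~ "long"
)

# Putting the catch-all first shadows every later branch
catch_all_first <- case_when(
  TRUE           ~ "long",
  trip_miles < 1 ~ "walkable"
)
```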

# reclassify households into low, med, high income based on the HHFAMINC column, with bracket boundaries at $0, $30,000, and $60,000; data dictionary: http://nhts.ornl.gov/tables09/CodebookPage.aspx?id=949
dd <- dd %>% mutate(income_cat=case_when(
  HHFAMINC %in% c("01", "02", "03", "04", "05", "06") ~ "low income",
  HHFAMINC %in% c("07", "08", "09", "10", "11", "12") ~ "med income",
  HHFAMINC %in% c("13", "14", "15", "16", "17", "18") ~ "high income",
  TRUE ~ as.character(NA) # retain missing values as NA
  ))

# verify recoding results with group_by & tally
dd %>% group_by(HHFAMINC, income_cat) %>% 
  tally()

## # A tibble: 13 x 3
## # Groups:   HHFAMINC [?]
##    HHFAMINC  income_cat     n
##       <chr>       <chr> <int>
##  1       01  low income     4
##  2       02  low income     2
##  3       03  low income    12
##  4       04  low income     2
##  5       06  low income    18
##  6       07  med income     6
##  7       08  med income    10
##  8       12  med income     7
##  9       14 high income    20
## 10       16 high income    38
## 11       17 high income    64
## 12       18 high income   115
## 13       -7        <NA>     6
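
The same check can be written more compactly with count(), which combines group_by() and tally(). The sketch below uses a small hypothetical tibble named toy in place of dd, so it runs without the data file.

```r
library(dplyr)

# Hypothetical stand-in for dd with already recoded income categories
toy <- tibble(
  HHFAMINC   = c("01", "01", "07", "18", "-7"),
  income_cat = c("low income", "low income", "med income", "high income", NA)
)

# group_by() + tally() and count() give the same counts
by_tally <- toy %>% group_by(HHFAMINC, income_cat) %>% tally() %>% ungroup()
by_count <- toy %>% count(HHFAMINC, income_cat)

# Sanity check: no original code should map to more than one category
conflicts <- toy %>%
  distinct(HHFAMINC, income_cat) %>%
  count(HHFAMINC) %>%
  filter(n > 1)
```

An empty conflicts table indicates that each HHFAMINC code landed in exactly one income category.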

Programming with dplyr

Exercise

Filter days where there are missing values in bike counts and weather information. Count the number of days with missing values in either bike counts or weather information.
Calculate weekly, monthly, and annual bike counts from the daily bike counts data.
Join the bike counts data with the weather data. Which type of join works best here?
With the NHTS2009 travel diaries data, how do you calculate total miles traveled (using any mode) and miles traveled by driving for each household? (Hint: the TRPMILES column contains the trip distance for each member of a household.)
[Challenge] How do you compute the average household-level miles driving per capita by income categories (low, med, high)?
