Working with time-series data can be a challenge for new and experienced R users. You will often have to format the date, time, and timezone when working with raw data. R does not automatically recognize date-time formats and there are many formats for representing date-time (e.g. yyyy-mm-dd, mm-dd-yy, mm/dd/yyyy hh:mm:ss).
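
As a minimal sketch of what that formatting step looks like, lubridate's parsing functions are named after the order of the components in the input string. The sample strings below are made up for illustration and are not taken from the Biketown data.

```r
library(lubridate)

# Each parser is named for the order of the date components in the input
d1 <- ymd("2018-06-01")                 # yyyy-mm-dd
d2 <- mdy("06-01-18")                   # mm-dd-yy
d3 <- mdy_hms("06/01/2018 08:05:30")    # mm/dd/yyyy hh:mm:ss, UTC by default

# All three strings describe the same calendar day
d1 == d2
as_date(d3) == d1
```

Choosing the parser that matches the raw string is usually all that is needed; the result is a proper Date or POSIXct value rather than text.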

lubridate is a handy package that is installed as part of the tidyverse. In tidyverse versions before 2.0.0 it is not loaded when you call library(tidyverse), so you have to load it explicitly with library(lubridate) when you need it. From tidyverse 2.0.0 onward it is one of the core packages and loads automatically.


Use one of the Biketown data files you downloaded earlier in the writing functions exercise. Alternatively, use source() to load the get_data function and download the Biketown data for 06/2018 through 08/2018.

get_data(start = "06/2018", end = "08/2018")

# Load the packages used below
library(tidyverse)
library(lubridate)

# Function to read in all files and combine them into one data frame.
# This function only works if you explicitly set the working directory to
# the folder where the data is stored.

# After running the function, make sure to set the working directory
# back to the folder where your .Rproj file is stored.

folder <- "/home/tammy/Documents/ds19-class/data/biketown"
# `pattern` is a regular expression, so match names ending in ".csv"
filenames <- list.files(path = folder, pattern = "\\.csv$", all.files = FALSE,
                        full.names = FALSE, recursive = FALSE)
read_csv_filename <- function(filename) {
  ret <- read.csv(filename, stringsAsFactors = FALSE,
                  strip.white = TRUE, na.strings = "")
  ret$Source <- filename  # record which file each row came from
  ret
}
bike_raw <- plyr::ldply(filenames, read_csv_filename)

# check data structure
str(bike_raw)

# create new columns `start.datetime` and `end.datetime`
bike_df1 <- bike_raw %>%
  mutate(start.datetime = paste(StartDate, StartTime, sep = " "),
         end.datetime = paste(EndDate, EndTime, sep = " "))

# convert `start.datetime` and `end.datetime` into date time format with appropriate timezone
bike_df1$start.datetime <- mdy_hm(bike_df1$start.datetime, tz = "America/Los_Angeles")
bike_df1$end.datetime <- mdy_hm(bike_df1$end.datetime, tz = "America/Los_Angeles")
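
The same paste-then-parse step can be tried on a single value before running it on the whole data frame. This is a small sketch; the date and time strings are made up and only assumed to resemble the layout of the Biketown columns.

```r
library(lubridate)

# Made-up values, assumed to look like the Biketown date and time columns
start_date <- "6/1/2018"
start_time <- "8:05"

# Combine into one string, then parse as month-day-year hour-minute
start_dt <- mdy_hm(paste(start_date, start_time, sep = " "),
                   tz = "America/Los_Angeles")

start_dt        # a POSIXct value in Pacific time
tz(start_dt)    # the timezone attached to the value
```

Setting tz at parse time means the clock time in the file is read as local Portland time rather than UTC.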

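# convert `Duration` into a usable format
# (hms() returns a lubridate period; values that cannot be parsed become NA)
bike_df1$Duration <- hms(bike_df1$Duration)
# this throws a warning about NAs

# check for NAs in `bike_raw$Duration`
sum(is.na(bike_raw$Duration))
## [1] 3946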
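
To see how missing values pass through hms(), here is a small sketch on a made-up vector rather than the real Duration column. It assumes the raw values are "h:mm:ss" strings with some entries missing.

```r
library(lubridate)

# Made-up durations; the second value is missing, as some are in the raw data
dur_raw <- c("0:12:30", NA, "1:02:03")

# quiet = TRUE suppresses parsing warnings; values that cannot be parsed become NA
dur <- hms(dur_raw, quiet = TRUE)

# Convert to seconds so the result is a plain numeric vector
dur_sec <- period_to_seconds(dur)
dur_sec
sum(is.na(dur_sec))   # number of values that could not be parsed
```

Counting the NAs before and after parsing is a quick way to confirm that the warning comes from values that were already missing and not from a format mismatch.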

There are three time-span classes in lubridate that seem synonymous but represent very different things:

  1. duration: an exact span of time measured in seconds, with no start date involved
  2. interval: the span between two specific time points (stored in seconds)
  3. period: a span measured in clock units such as minutes, days, or months, handy when accounting for daylight saving time and leap years (hms() in the example above returns a period)
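
The difference between the three is easiest to see across a daylight saving change. The sketch below uses a made-up start time on the day before clocks moved forward in the US in 2018 (March 11).

```r
library(lubridate)

# Noon on the day before daylight saving time began in 2018
start <- ymd_hms("2018-03-10 12:00:00", tz = "America/Los_Angeles")

# Period: "one day later on the clock", so the clock time stays at noon
start + days(1)

# Duration: exactly 86400 seconds later, which lands at 1 pm after the shift
start + ddays(1)

# Interval: anchored to two specific instants
iv <- interval(start, start + days(1))

# The interval covers only 23 hours of elapsed time because of the DST jump
as.numeric(as.duration(iv), "hours")
```

Use a period when the clock time matters (same time tomorrow) and a duration when the elapsed time matters (exactly 24 hours from now).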
# calculate interval
bike_df1$interval <- interval(bike_df1$start.datetime, bike_df1$end.datetime)
# using floor_date to help aggregate data
# want weekly mean distance traveled

bike_wkagg <- bike_df1 %>%
  mutate(week.datetime = floor_date(start.datetime, unit = "week")) %>%
  group_by(week.datetime) %>%
  summarise(weekly.meandist = mean(Distance_Miles))
bike_wkagg$week.datetime <- as.Date(bike_wkagg$week.datetime)

weekly_meandist_fig <- bike_wkagg %>%
  ggplot(aes(x = week.datetime, y = weekly.meandist)) +
  geom_bar(stat = "identity", fill = "orange") +
  scale_x_date(date_breaks = "1 week") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
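
The weekly aggregation above relies on floor_date() mapping every timestamp in a week to the same value. A minimal sketch on made-up ride start times shows the rounding on its own:

```r
library(lubridate)

# Made-up ride start times spread over two calendar weeks
starts <- ymd_hm(c("2018-06-04 08:15", "2018-06-06 17:40", "2018-06-12 09:05"),
                 tz = "America/Los_Angeles")

# Round each timestamp down to the start of its week
# (weeks start on Sunday unless the week_start argument is changed)
week_floor <- floor_date(starts, unit = "week")
as.Date(week_floor, tz = "America/Los_Angeles")
```

The first two rides fall in the same week and so share a floor value, which is what lets group_by() collect them into one group.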

There are a few base R regular expression (regex) functions that are handy to know when working with strings:

  1. grep and grepl: grep looks for a pattern within a character vector and returns the indices of the elements that match. grepl does the same search but returns a logical vector with one TRUE or FALSE per element.
  2. sub and gsub: sub replaces the first match of a pattern within each string with a specified replacement. gsub replaces all matches.
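
A short sketch of the four functions side by side, using made-up hub names rather than values from the Biketown data:

```r
# Made-up hub names for illustration
hubs <- c("SE 2nd at Main", "Community Corral A",
          "NE Alberta at 10th", "Community Corral B")

grep("Community", hubs)     # positions of the matching elements
grepl("Community", hubs)    # one TRUE/FALSE per element

sub("a", "A", "banana")     # replaces only the first match
gsub("a", "A", "banana")    # replaces every match
```

grepl is usually the better fit inside mutate() or if_else() because its result has the same length as the input column.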
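# Create three different station categories for the start and end stations.
# An empty pattern ("") matches every non-missing string, so it cannot be used
# to find blank hubs. This version assumes a missing or empty hub name means the
# ride started or ended outside a station (blanks were read in as NA above).
bike_df2 <- bike_df1 %>%
  mutate(start.station.category = if_else(grepl("Community", StartHub), "Community Station",
                                    if_else(is.na(StartHub) | StartHub == "", "Outside Station",
                                      "BIKETOWN Station"))) %>%
  mutate(end.station.category = if_else(grepl("Community", EndHub), "Community Station",
                                    if_else(is.na(EndHub) | EndHub == "", "Outside Station",
                                      "BIKETOWN Station")))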

Further reading: Steve Fick’s Regular Expressions