install.packages("tidyverse")
This section assumes students know little about R and gets them up to speed with the basics. I am now convinced that it might be easier to teach beginners tidyverse than base R, as argued by David Robinson. We will dive right into tidyverse, which will be covered in more depth in Part II.
However, if you prefer or are more comfortable with base R, these lessons by Software Carpentry cover more or less the same content using mostly base R functions:
%>% pipes an object forward into a function or call expression
%>% with unary function calls
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
iris %>% head
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris %>% tail
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
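As a quick illustration of how the pipe rewrites a call, here is a minimal sketch. The vector x and the functions used are chosen only for illustration and are not part of the original lesson. Each comparison checks that the piped form and the ordinary call give the same result:

```r
library(magrittr)  # provides %>%; tidyverse packages such as dplyr re-export it

x <- c(1.234, 5.678)  # made-up values for illustration

# x %>% f is the same as f(x)
same_unary <- identical(x %>% sqrt, sqrt(x))

# x %>% f(y) is the same as f(x, y): the left-hand side becomes the first argument
same_first_arg <- identical(x %>% round(1), round(x, 1))

# with the . placeholder, the left-hand side goes wherever the dot is
same_placeholder <- identical(2 %>% rep("a", .), rep("a", 2))

print(c(same_unary, same_first_arg, same_placeholder))
```

All three comparisons should come out TRUE, since the pipe only changes how the call is written and not what is computed.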
## What are the outputs for these lines?
3 %>% `-`(4)
iris %>% head(5)
iris %>% tail(5)
## What are the outputs for these lines?
3 %>% `-`(4, .)
4 %>% c("A", "B", "C", "D", "E")[.]
More information available at: http://magrittr.tidyverse.org/
RStudio keyboard shortcut for %>%: Ctrl+Shift+M (Windows/Linux) or Cmd+Shift+M (Mac)
In programming as in writing, it is generally a good idea to stick to a consistent coding style. There are two style guides that you can adopt or customize to create your own:
RStudio is good for writing and testing your R code, but for work that needs to be repeated or takes a long time to finish, it may be easier to run your program/script from the command line instead.
Before we start, open the RStudio project you created following the RStudio project organization recommendations in the Overview section (assuming you downloaded and saved the bike counts data to the data directory of the project).
We can create an R script (from the File/New File/R Script menu of RStudio) that loads the bike counts for Hawthorne Bridge:
library(tidyverse)
library(readxl) # read_excel() comes from readxl, which library(tidyverse) does not attach
input_file <- "data/Hawthorne Tilikum Steel daily bike counts 073118.xlsx"
bridge_name <- "Hawthorne"
bikecounts <- read_excel(input_file,          # path to the input Excel file
                         sheet = bridge_name, # name of the worksheet; here it is the bridge name
                         skip = 1)            # each worksheet has a two-row header, so skip the first row
#names(bikecounts) <- c("date", "westbound", "eastbound", "total")
bikecounts$bridge <- bridge_name
head(bikecounts)
Choose a file name, for example, load_data.R, and save the script in the code directory of your RStudio project (create a code directory first if you haven’t yet).
Now we can run the script in a command line shell (you can open one in RStudio’s Tools/Shell… menu):
Rscript code/load_data.R
Notice that the script may not print output on the screen when called from the command line unless you explicitly call the print function.
But what if we have many files for which we would like to repeatedly show the basic information (rows, data types, etc.)? We can refactor our script to accept the file name and bridge name as command line arguments, so that the script can work with any acceptable file.
In an R script, you can use the commandArgs function to get the command line arguments:
args <- commandArgs()
print(args)
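So in our case, the script should take input_file and bridge_name from the command line arguments. We can get their values with:
# trailingOnly = TRUE keeps only the arguments after the script name;
# without it, args[1] would be the path to the R executable
args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
bridge_name <- args[2]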
Replace the two lines in load_data.R starting with input_file and bridge_name with these three lines.
Now our script can be invoked in the command line with:
Rscript code/load_data.R "data/Hawthorne Tilikum Steel daily bike counts 073118.xlsx" \
Hawthorne
(The quotation marks are needed for the file name when there are spaces in the name, and “\” breaks a command into two lines.)
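The argument handling described above can be sketched as follows. This is only an illustration: parse_args is a hypothetical helper name, not part of the original script, and the usage message is an assumption about how you might want the script to fail when it is called with the wrong number of arguments.

```r
# Hypothetical helper: check that exactly two arguments were given and name them.
parse_args <- function(args) {
  if (length(args) != 2) {
    stop("Usage: Rscript code/load_data.R <input_file> <bridge_name>",
         call. = FALSE)
  }
  list(input_file = args[1], bridge_name = args[2])
}

# trailingOnly = TRUE drops the R executable and its own options,
# keeping only what follows the script name on the command line.
args <- commandArgs(trailingOnly = TRUE)

if (length(args) == 0) {
  message("No arguments given; nothing to do.")
} else {
  opts <- parse_args(args)
  cat("input file: ", opts$input_file, "\n")
  cat("bridge name:", opts$bridge_name, "\n")
}
```

Checking the number of arguments up front gives a readable error message instead of a failure later on, when the file is read with a missing file name or sheet name.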
This section is adapted from Visual Debugging with RStudio.
Download foo.R from https://raw.githubusercontent.com/cities/datascience2017/master/code/foo.R and save it to the code directory of your project folder;
Open foo.R and source it;
In the RStudio Console pane, type foo("-1") and then press Enter.
Why does the foo function claim “-1 is larger than 0”? Let’s debug the foo function and find out.
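One thing worth checking while debugging is the type of the argument: "-1" in quotes is a character string, and when R compares a string with a number it converts the number to a string and compares the two as text, with a result that can depend on the locale. The sketch below uses check_sign, a hypothetical stand-in written for this illustration. The real foo.R may be written differently, so treat this as one possible line of investigation rather than the answer.

```r
# Hypothetical stand-in for foo(); the real foo.R may be written differently.
check_sign <- function(x) {
  # Fail early with a clear message instead of comparing a string with a number.
  if (!is.numeric(x)) {
    stop("x must be numeric, but has type ", typeof(x), call. = FALSE)
  }
  if (x > 0) "larger than 0" else "not larger than 0"
}

typeof("-1")                   # "character": the quotes make this a string
check_sign(-1)                 # "not larger than 0"
check_sign(as.numeric("-1"))   # convert first when the input arrives as text

# A string argument now stops with an informative error instead of a wrong answer.
result <- tryCatch(check_sign("-1"), error = function(e) conditionMessage(e))
print(result)
```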
Use the read_excel function in the readxl package to read data in Excel files:
library(readxl)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
input_file <- "data/Hawthorne Tilikum Steel daily bike counts 073118.xlsx"
bridge_name <- "Hawthorne"
# define a function that loads bike counts data
load_data <- function(input_file, bridge_name) {
  bikecounts <- read_excel(input_file,
                           sheet = bridge_name,
                           skip = 1)
  bikecounts$name <- bridge_name
  bikecounts
}
Tilikum <- load_data(input_file, "Tilikum")
Hawthorne <- load_data(input_file, "Hawthorne")
# use the column names of Tilikum for Hawthorne
names(Hawthorne) <- names(Tilikum)
Steel <- load_data(input_file, "Steel")
names(Steel) <- c("date", "lower", "westbound", "eastbound", "total", "name")
# combine the data frames for all three bridges
bikecounts <- bind_rows(Hawthorne,
                        Tilikum,
                        Steel %>% select(-lower)) # exclude the `lower` col in Steel data frame
# average daily bike counts by bridge
bikecounts %>%
  group_by(name) %>%
  summarize(avg_daily_counts=mean(total, na.rm=TRUE))
## # A tibble: 3 x 2
## name avg_daily_counts
## <chr> <dbl>
## 1 Hawthorne 3904.
## 2 Steel 2240.
## 3 Tilikum 1725.
# average monthly bike counts by bridge
bikecounts %>%
  # first create ym column as a unique month identifier
  group_by(name, ym=floor_date(date, "month")) %>%
  summarize(total_monthly_counts=sum(total), counts=n()) %>%
  # then average by month over years for each bridge
  group_by(name, month(ym)) %>%
  summarize(avg_monthly_counts=mean(total_monthly_counts))
## # A tibble: 36 x 3
## # Groups: name [?]
## name `month(ym)` avg_monthly_counts
## <chr> <dbl> <dbl>
## 1 Hawthorne 1 79782.
## 2 Hawthorne 2 81534.
## 3 Hawthorne 3 100500.
## 4 Hawthorne 4 130006.
## 5 Hawthorne 5 155227.
## 6 Hawthorne 6 151171.
## 7 Hawthorne 7 138820.
## 8 Hawthorne 8 164198.
## 9 Hawthorne 9 142954.
## 10 Hawthorne 10 126626
## # ... with 26 more rows
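To see what the two-step monthly summary above is doing, here is a minimal sketch on a tiny made-up data set. The toy values are invented for illustration and are not taken from the bike counts file. The first step adds up the daily counts within each calendar month, and the second step averages those monthly totals across years.

```r
library(dplyr)
library(lubridate)

# Made-up counts: two days in January 2017 and two days in January 2018
toy <- tibble(
  name  = "Hawthorne",
  date  = as.Date(c("2017-01-01", "2017-01-02", "2018-01-01", "2018-01-02")),
  total = c(100, 200, 300, 500)
)

# Step 1: one row per bridge and calendar month (floor_date gives the first day of the month)
monthly <- toy %>%
  group_by(name, ym = floor_date(date, "month")) %>%
  summarize(total_monthly_counts = sum(total)) %>%
  ungroup()

# Step 2: average the monthly totals over years, by month of the year
avg_monthly <- monthly %>%
  group_by(name, month = month(ym)) %>%
  summarize(avg_monthly_counts = mean(total_monthly_counts)) %>%
  ungroup()

print(monthly)      # January 2017 totals 300, January 2018 totals 800
print(avg_monthly)  # the January average is (300 + 800) / 2 = 550
```

Note that sum(total) returns NA for any month that contains a missing daily count. Whether such months should be dropped or summed with na.rm = TRUE depends on how you want to treat incomplete months.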