3 R Coding Basics

3.2 Best practices for data science

These are the best practices for data science recommended by G. Wilson et al. (2014):

  1. Write programs for people, not computers.
    1. a program should not require its readers to hold more than a handful of facts in memory at once
    2. make names consistent, distinctive, and meaningful
    3. make code style and formatting consistent
  2. Let the computer do the work
    1. make the computer repeat tasks
    2. save recent commands in a file for re-use
    3. use a build tool to automate workflows
  3. Make incremental changes
    1. work in small steps with frequent feedback and course correction
    2. use a version control system
    3. put everything that has been created manually in version control
  4. Don’t repeat yourself (or others)
    1. every piece of data must have a single authoritative representation in the system
    2. modularize code rather than copying and pasting
    3. re-use code instead of rewriting it
  5. Plan for mistakes
    1. add assertions to programs to check their operation
    2. use an off-the-shelf unit testing library
    3. turn bugs into test cases
    4. use a symbolic debugger
  6. Optimize software only after it works correctly
    1. use a profiler to identify bottlenecks
    2. write code in the highest-level language possible
  7. Document design and purpose, not mechanics.
    1. document interfaces and reasons, not implementations
    2. refactor code in preference to explaining how it works
    3. embed the documentation for a piece of software in that software
  8. Collaborate
    1. use pre-merge code reviews
    2. use pair programming when bringing someone new up to speed and when tackling particularly tricky problems
    3. use an issue tracking tool
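
As a minimal sketch of two of these practices (4.2, modularize rather than copy and paste, and 5.1, add assertions), the repeated summary logic below lives in one function that checks its input before doing any work. The function name summarize_counts and the numbers are made up for illustration and do not come from Wilson et al.

```r
# Hypothetical helper: one authoritative place for the summary logic.
summarize_counts <- function(counts) {
  # Assertions: fail early with a clear error if the input is not usable
  stopifnot(is.numeric(counts), length(counts) > 0,
            all(counts >= 0, na.rm = TRUE))
  c(mean = mean(counts, na.rm = TRUE),
    min  = min(counts, na.rm = TRUE),
    max  = max(counts, na.rm = TRUE))
}

# Made-up daily counts; NA marks a missing day
eastbound <- c(120, 95, NA, 210)
westbound <- c(130, 80, 150, 190)

# Re-use the same function instead of copying the three calculations
print(summarize_counts(eastbound))
print(summarize_counts(westbound))
```

Because the checks sit inside the function, every caller gets them for free, and a bug found later only has to be fixed in one place.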

3.3 R coding basics

This section assumes students know little about R and gets them up to speed with the basics:

  1. Data Structures
    • How can I read data in R?
    • What are the basic data types in R?
    • How do I represent categorical information in R?
  2. Exploring Data Frames
    • How can I manipulate a data frame?
  3. Subsetting Data
    • How can I work with subsets of data in R?
  4. Control Flow
    • How can I make data-dependent choices in R?
  5. Visualization with ggplot2
    • How can I create publication-quality graphics in R?
  6. Vectorization
    • How can I operate on all the elements of a vector at once?
  7. Functions Explained
    • How can I write a new function in R?
  8. Writing Good Software
    • How can I write software that other people can use?
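
To give a rough feel for several of these topics at once, here is a small sketch in base R. The data frame bikes, its values, and the function mean_total are made up for illustration only.

```r
# Made-up data frame: a factor column for categories, a numeric column for counts
bikes <- data.frame(
  bridge = factor(c("Hawthorne", "Tilikum", "Hawthorne", "Tilikum")),
  total  = c(4200, 1800, 3900, 2100)
)

str(bikes)             # exploring a data frame: column names and data types
levels(bikes$bridge)   # categorical information is stored as factor levels

# Subsetting: keep only the rows for one bridge
hawthorne <- bikes[bikes$bridge == "Hawthorne", ]

# Vectorization: the division is applied to every element at once, no loop needed
bikes$total_thousands <- bikes$total / 1000

# A small function that works on any data frame with a total column
mean_total <- function(df) {
  mean(df$total)
}
print(mean_total(hawthorne))
```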

3.4 Advanced Topics

3.4.1 Code Style Guide

In programming, as in writing, it is generally a good idea to stick to a consistent coding style. There are two style guides that you can adopt or customize to create your own:

3.4.2 R Command-Line Program

RStudio is good for writing and testing your R code, but for work that has to be repeated or takes a long time to finish, it may be easier to run your program/script from the command line instead.

We can create an R script (from the File/New File/R Script menu of RStudio) that loads the bike counts for Hawthorne Bridge:

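library(tidyverse)
library(readxl)  # provides read_excel(); library(tidyverse) does not attach it

input_file <- "data/Hawthorne Bridge daily bike counts 2012-2016 082117.xlsx"
bridge_name <- "Hawthorne"

# Read the spreadsheet and give the columns short, consistent names
bikecounts <- read_excel(input_file)
names(bikecounts) <- c("date", "westbound", "eastbound", "total")
bikecounts$bridge <- bridge_name

# Show the first few rows
head(bikecounts)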

Choose a file name, for example, load_data.R, and save the script in the code directory of your RStudio project.

Now we can run the script in a command line shell (you can open one in RStudio’s Tools/Shell… menu):

Rscript code/load_data.R

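Notice that R prints a value automatically only when the expression sits at the top level of the script, as head(bikecounts) does here. Inside a function or a loop, or when the script is run with source(), nothing appears on the screen unless you explicitly call the print function.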
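
A minimal sketch of when an explicit print call is needed, using a made-up vector x: values of top-level expressions are printed automatically, while expressions evaluated inside a loop body are not.

```r
x <- c(2, 4, 6)   # made-up values for illustration

mean(x)           # top-level expression: its value is printed automatically

for (v in x) {
  v * 2           # evaluated, but nothing is printed from inside the loop
}

for (v in x) {
  print(v * 2)    # printed because print() is called explicitly
}
```

Calling print explicitly is therefore the safer habit for any output a script is meant to show.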

But what if we have many files for which we would like to show the same basic information (rows, data types, etc.)? We can refactor our script to accept the file name and bridge name as command line arguments, so that the same script works with any file in the expected format.

In an R script, you can use the commandArgs function to get the command line arguments:

args <- commandArgs()
print(args)
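
As a hedged illustration of what commandArgs returns: called without arguments it gives the full command line, starting with the R executable and R's own options, while trailingOnly = TRUE keeps only the arguments supplied after the script name. The variable names all_args and user_args are chosen here for illustration.

```r
# Full command line: R executable, R's own options, then any user arguments
all_args <- commandArgs()

# Only the user-supplied arguments (empty if none were given)
user_args <- commandArgs(trailingOnly = TRUE)

print(all_args)
print(user_args)
```

Saving this as a script and running it with Rscript followed by a few extra words shows the difference between the two vectors.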

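In our case, the script should take input_file and bridge_name from the command line. Calling commandArgs with trailingOnly = TRUE returns only the arguments that follow the script name (without it, the first elements are the R executable and its own options), so we can get the two values with:

args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
bridge_name <- args[2]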

Replace the two lines in load_data.R starting with input_file and bridge_name with these three lines.

Now our script can be invoked in the command line with:

Rscript code/load_data.R "data/Hawthorne Bridge daily bike counts 2012-2016 082117.xlsx" Hawthorne
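
If the script is called with too few arguments, input_file or bridge_name would silently become NA. A minimal sketch of a guard against that is shown below; parse_bike_args is a hypothetical helper name, not part of the original script, and the usage message is only a suggestion.

```r
# Hypothetical helper: validate the command line arguments before using them
parse_bike_args <- function(args) {
  if (length(args) != 2) {
    stop("Usage: Rscript code/load_data.R <input_file> <bridge_name>",
         call. = FALSE)
  }
  list(input_file = args[1], bridge_name = args[2])
}

# In load_data.R this would be called as:
#   opts <- parse_bike_args(commandArgs(trailingOnly = TRUE))
# Here a vector is passed directly so that the sketch runs on its own.
opts <- parse_bike_args(
  c("data/Hawthorne Bridge daily bike counts 2012-2016 082117.xlsx", "Hawthorne")
)
print(opts$bridge_name)
```

Stopping with a usage message makes a wrong invocation fail immediately instead of producing a confusing error later, when the file is read.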

3.4.3 Debugging with RStudio

This section is adapted from Visual Debugging with RStudio.

  1. Download foo.R from https://raw.githubusercontent.com/cities/datascience2017/master/code/foo.R and save it to the code (or src) subdirectory of your project folder;
  2. Open foo.R and source it;
  3. In the RStudio Console pane, type foo("-1") and press Enter.

Why does the foo function claim “-1 is larger than 0”? Let’s debug the foo function and find out.
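
The contents of foo.R are not reproduced here, so the following is only a sketch of something worth inspecting while stepping through the function: foo("-1") passes a character string, not a number, and R compares a string with a number by first converting the number to a string. The function is_positive is a hypothetical defensive variant, not the course's foo.

```r
x <- "-1"
class(x)                      # "character": the quotes make this a string

# Comparing a string with a number converts the number to a string first,
# so these two comparisons are the same string comparison
identical(x > 0, x > "0")

as.numeric(x) > 0             # the numeric comparison the caller probably intended

# Hypothetical defensive variant: refuse non-numeric input
is_positive <- function(x) {
  stopifnot(is.numeric(x), length(x) == 1)
  x > 0
}
is_positive(-1)

# In an interactive session, debugonce(foo) followed by foo("-1") would step
# through foo in the debugger; it is left as a comment because it needs a console.
```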

3.5 Exercise

  1. Write a function that takes the name of a bike counts data file as input and returns a data frame;
    • use the readxl package to read data in Excel files
  2. Create an R script that uses your function to read in the data in the Tilikum and Hawthorne bike count files;
  3. Do quick summaries of the data for each bridge:
    • How many days of data are there for each bridge?
    • What are the average daily bike counts for each bridge? Minimum? Maximum?
    • What are the average weekly, monthly, and annual bike counts for each bridge?
  4. [Advanced] Write a function that calculates average daily, weekly, or monthly bike counts for each bridge based on a frequency argument.

3.6 Learning more

References

Wilson, Greg, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, et al. 2014. “Best Practices for Scientific Computing.” PLOS Biology 12 (1): e1001745. doi:10.1371/journal.pbio.1001745.