These are the best practices for data science recommended by [@Wilson2014]:
This section assumes students know little about R and gets them up to speed with the basics:
In programming, as in writing, it is generally a good idea to stick to a consistent coding style. There are two style guides that you can adopt or customize to create your own:
RStudio is good for writing and testing your R code, but for work that must be repeated or takes a long time to finish, it may be easier to run your script from the command line instead.
We can create an R script (from the File/New File/R Script menu of RStudio) that loads the bike counts for Hawthorne Bridge:
library(tidyverse)
library(readxl)  # read_excel() comes from readxl, which library(tidyverse) does not attach
input_file <- "data/Hawthorne Bridge daily bike counts 2012-2016 082117.xlsx"
bridge_name <- "Hawthorne"
bikecounts <- read_excel(input_file)
names(bikecounts) <- c("date", "westbound", "eastbound", "total")
bikecounts$bridge <- bridge_name
head(bikecounts)
Choose a file name, for example load_data.R, and save the script in the code directory of your RStudio project.
Now we can run the script in a command line shell (you can open one in RStudio’s Tools/Shell… menu):
Rscript code/load_data.R
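Notice that output may not appear on the screen when the script is run from the command line: values computed inside functions or loops, or in a file run with source(), are not printed automatically. Call the print function explicitly whenever you want to be sure the output is shown.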
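A minimal sketch of where explicit printing matters, assuming a toy data frame (`counts`, a hypothetical stand-in for the bike counts): a value evaluated inside a for loop is silently discarded, while wrapping it in print() writes it out.

```r
# Toy data standing in for the bike counts (hypothetical values).
counts <- data.frame(bridge = c("Hawthorne", "Tilikum"), total = c(10, 20))

# Inside a loop, a bare expression is evaluated but never displayed.
for (i in seq_len(nrow(counts))) {
  counts$total[i]
}

# Wrapping the value in print() makes the output explicit.
for (i in seq_len(nrow(counts))) {
  print(counts$total[i])
}
```

Only the second loop produces any output, one line per row.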
But what if we have many files for which we would like to repeatedly show the basic information (rows, data types etc)? We can refactor our script to accept the file name and bridge name from command line arguments, so that the script can work with any acceptable files.
In an R script, you can use the commandArgs function to get the command line arguments:
args <- commandArgs()
print(args)
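When a script is started with Rscript, the vector returned by commandArgs() typically begins with the R executable and its own options, and the arguments you typed follow a "--args" marker. The sketch below uses a hypothetical helper, `trailing_args`, and a made-up example vector to show how the user-supplied part can be separated out. In practice, `commandArgs(trailingOnly = TRUE)` does this for you.

```r
# Hypothetical helper: keep only what follows the "--args" marker.
trailing_args <- function(all_args) {
  marker <- match("--args", all_args)
  if (is.na(marker)) {
    return(character(0))  # no user-supplied arguments
  }
  all_args[-seq_len(marker)]
}

# Made-up example of what commandArgs() might return under Rscript.
example_args <- c("R", "--no-echo", "--file=code/load_data.R",
                  "--args", "data/counts.xlsx", "Hawthorne")

print(trailing_args(example_args))
```

The helper exists only to make the structure of the argument vector visible; a real script would normally rely on the built-in `trailingOnly` option.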
In our case, the script should take input_file and bridge_name from the command line arguments. We can get their values with:
args <- commandArgs(trailingOnly = TRUE)  # keep only the arguments typed after the script name
input_file <- args[1]
bridge_name <- args[2]
Replace the two lines in load_data.R starting with input_file and bridge_name with these three lines.
Now our script can be invoked from the command line with:
Rscript code/load_data.R "data/Hawthorne Bridge daily bike counts 2012-2016 082117.xlsx" Hawthorne
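Because the script now depends on its arguments, it can help to check them before loading any data. The following is a sketch only: `check_args` is a hypothetical helper, not part of the original script, and the usage message is an assumption about how you might want to report mistakes.

```r
# Hypothetical helper: validate the two expected command line arguments.
check_args <- function(args) {
  if (length(args) != 2) {
    stop("Usage: Rscript code/load_data.R <input_file> <bridge_name>",
         call. = FALSE)
  }
  if (!file.exists(args[1])) {
    stop("Input file not found: ", args[1], call. = FALSE)
  }
  list(input_file = args[1], bridge_name = args[2])
}

# In load_data.R this could replace the direct indexing of args:
# parsed <- check_args(commandArgs(trailingOnly = TRUE))
# input_file <- parsed$input_file
# bridge_name <- parsed$bridge_name
```

Failing early with a clear message is usually easier to diagnose than an error raised later by read_excel when the file name is missing or misspelled.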
This section is adapted from Visual Debugging with RStudio.
Download foo.R from https://raw.githubusercontent.com/cities/datascience2017/master/code/foo.R and save it to the code (or src) subdirectory of your project folder;
Open foo.R and source it;
Type foo("-1") in the console and then press Enter.
Why does the foo function claim “-1 is larger than 0”? Let’s debug the foo function and find out.