5 Import
5.1 Readings
- R4DS Chapter 10, Chapter 11
5.2 Tibbles vs. data.frame
5.2.1 Data frames are awesome
Whenever you have rectangular, spreadsheet-y data, your default data receptacle in R is a data frame. Do not depart from this without good reason. Data frames are awesome because…
- Data frames package related variables neatly together,
- keeping them in sync vis-a-vis row order
- applying any filtering of observations uniformly.
- Most functions for inference, modelling, and graphing are happy to be passed a data frame via a
data =
argument. This has been true in base R for a long time. - The set of packages known as the
tidyverse
takes this one step further and explicitly prioritizes the processing of data frames. This includes popular packages likedplyr
andggplot2
. In fact the tidyverse prioritizes a special flavor of data frame, called a “tibble.”
Data frames – unlike general arrays or, specifically, matrices in R – can hold variables of different flavors, such as character data (subject ID or name), quantitative data (white blood cell count), and categorical information (treated vs. untreated). If you use homogenous structures, like matrices, for data analysis, you are likely to make the terrible mistake of spreading a dataset out over multiple, unlinked objects. Why? Because you can’t put character data, such as subject name, into the numeric matrix that holds white blood cell count. This fragmentation is a Bad Idea.
In R Coding Basics we use data.frame
in base R as our main data structure. From now on we will start to use tibble
and other functions in tidyverse
as much as possible. This will provide a special type of data frame called a “tibble” that has nice default printing behavior, among other benefits such as speed performance and better default behavior.
First, install and load tidyverse packages if you haven’t yet:
if (!require(tidyverse))
install.packages("tidyverse"); library(tidyverse)
There are two main differences in the usage of a tibble vs. a classic data.frame
: printing and subsetting.
5.2.2 Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str()
:
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
## # A tibble: 1,000 x 5
## a b c d e
## <dttm> <date> <int> <dbl> <chr>
## 1 2017-11-27 17:18:32 2017-12-25 1 0.07659894 j
## 2 2017-11-27 20:29:09 2017-12-19 2 0.50174012 o
## 3 2017-11-27 11:53:27 2017-12-06 3 0.44544000 b
## 4 2017-11-27 03:44:09 2017-12-23 4 0.68567890 k
## 5 2017-11-27 07:16:38 2017-12-03 5 0.63353081 l
## 6 2017-11-27 17:45:31 2017-12-13 6 0.92768086 c
## 7 2017-11-27 12:35:46 2017-12-24 7 0.34549211 o
## 8 2017-11-27 22:13:37 2017-12-25 8 0.94264613 c
## 9 2017-11-27 11:26:04 2017-12-17 9 0.72518358 z
## 10 2017-11-27 03:24:34 2017-12-08 10 0.84166491 o
## # ... with 990 more rows
lubridate
is a R package for date and time. You need to install the package if you want to replicate the above code on your own computer.
Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.
First, you can explicitly print()
the data frame and control the number of rows (n
) and the width
of the display. width = Inf
will display all columns:
print(mtcars, n = 10, width = Inf)
You can also control the default print behaviour by setting options:
options(tibble.print_max = n, tibble.print_min = m)
: if more thanm
rows, print onlyn
rows. Useoptions(dplyr.print_min = Inf)
to always show all rows.Use
options(tibble.width = Inf)
to always print all columns, regardless of the width of the screen.
You can see a complete list of options by looking at the package help with package?tibble
.
A final option is to use RStudio’s built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.
View(mtcars)
5.2.3 Subsetting
So far all the tools you’ve learned have worked with complete data frames. If you want to pull out a single variable, you need some new tools, $
and [[
. [[
can extract by name or column position; $
only extracts by name but is a little less typing.
df <- tibble(
x = runif(5),
y = rnorm(5)
)
# Extract by name
df$x
## [1] 0.5551326 0.7178420 0.7773671 0.1380507 0.1864935
df[["x"]]
## [1] 0.5551326 0.7178420 0.7773671 0.1380507 0.1864935
# Extract by column position
df[[1]]
## [1] 0.5551326 0.7178420 0.7773671 0.1380507 0.1864935
Compared to a data.frame
, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
5.2.4 Converting
Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame()
to turn a tibble back to a data.frame
:
class(as.data.frame(df))
## [1] "data.frame"
Or use as_tibble()
to convert a data.frame to tibble:
class(as_tibble(mtcars))
## [1] "tbl_df" "tbl" "data.frame"
The main reason that some older functions don’t work with tibble is the [
function. We don’t use [
much in this book because dplyr::filter()
and dplyr::select()
allow you to solve the same problems with clearer code (but you will learn a little about it in vector subsetting). With base R data frames, [
sometimes returns a data frame, and sometimes returns a vector. With tibbles, [
always returns another tibble.
5.3 Data Import
The most common data in data science is rectangular, spreadsheet-y data that works best to be loaded as a R data frame, and this section focus on import data into R as data frames. tidyverse
provides the readr
package that reads comma-separated values (csv), tab delimited values (tsv), and fixed width files (fwf) as tibbles. Other R packages provide access to Excel spreadsheets, binary statistical data formats (SPSS, SAS, Stata), and relational databases:
- readr, reads flat files (csv, tsv, fwf) into R
- readxl, reads excel files (.xls and .xlsx) into R
- heaven, reads SPSS, Stata and SAS files into R
- rstats-db, R interface to databases
- RMySQL
- RSQLite
- RPostgres
- …
For this section, we primarily focus on importing data in flat files into data frames. readr
provides these functions:
read_csv()
reads comma delimited files,read_csv2()
reads semicolon separated files (common in countries where,
is used as the decimal place),read_tsv()
reads tab delimited files, andread_delim()
reads in files with any delimiter.read_fwf()
reads fixed width files. You can specify fields either by their widths withfwf_widths()
or their position withfwf_positions()
.read_table()
reads a common variation of fixed width files where columns are separated by white space.read_log()
reads Apache style log files. (But also check out webreadr which is built on top ofread_log()
and provides many more helpful tools.)
These functions all have similar syntax: once you’ve mastered one, you can use the others with ease. For the rest of this chapter we’ll focus on read_csv()
. Not only are csv files one of the most common forms of data storage, but once you understand read_csv()
, you can easily apply your knowledge to all the other functions in readr.
The first argument to read_csv()
is the most important: it’s the path to the file to read.
heights <- read_csv("data/heights.csv")
## Parsed with column specification:
## cols(
## earn = col_double(),
## height = col_double(),
## sex = col_character(),
## ed = col_integer(),
## age = col_integer(),
## race = col_character()
## )
heights
## # A tibble: 1,192 x 6
## earn height sex ed age race
## <dbl> <dbl> <chr> <int> <int> <chr>
## 1 50000 74.42444 male 16 45 white
## 2 60000 65.53754 female 16 58 white
## 3 30000 63.62920 female 16 29 white
## 4 50000 63.10856 female 16 91 other
## 5 51000 63.40248 female 17 39 white
## 6 9000 64.39951 female 15 26 white
## 7 29000 61.65633 female 12 49 white
## 8 32000 72.69854 male 17 46 white
## 9 2000 72.03947 male 15 21 hispanic
## 10 27000 72.23493 male 12 26 white
## # ... with 1,182 more rows
When you run read_csv()
it prints out a column specification that gives the name and type of each column. That’s an important part of readr, which we’ll come back to in [parsing a file].
You can also supply an inline csv file. This is useful for experimenting with readr and for creating reproducible examples to share with others:
read_csv("a,b,c
1,2,3
4,5,6")
## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 3
## 2 4 5 6
In both cases read_csv()
uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behaviour:
Sometimes there are a few lines of metadata at the top of the file. You can use
skip = n
to skip the firstn
lines; or usecomment = "#"
to drop all lines that start with (e.g.)#
.read_csv("The first line of metadata The second line of metadata x,y,z 1,2,3", skip = 2)
## # A tibble: 1 x 3 ## x y z ## <int> <int> <int> ## 1 1 2 3
read_csv("# A comment I want to skip x,y,z 1,2,3", comment = "#")
## # A tibble: 1 x 3 ## x y z ## <int> <int> <int> ## 1 1 2 3
The data might not have column names. You can use
col_names = FALSE
to tellread_csv()
not to treat the first row as headings, and instead label them sequentially fromX1
toXn
:read_csv("1,2,3\n4,5,6", col_names = FALSE)
## # A tibble: 2 x 3 ## X1 X2 X3 ## <int> <int> <int> ## 1 1 2 3 ## 2 4 5 6
(
"\n"
is a convenient shortcut for adding a new line. You’ll learn more about it and other types of string escape in [string basics].)Alternatively you can pass
col_names
a character vector which will be used as the column names:read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## # A tibble: 2 x 3 ## x y z ## <int> <int> <int> ## 1 1 2 3 ## 2 4 5 6
Another option that commonly needs tweaking is na
: this specifies the value (or values) that are used to represent missing values in your file:
read_csv("a,b,c\n1,2,.", na = ".")
## # A tibble: 1 x 3
## a b c
## <int> <int> <chr>
## 1 1 2 <NA>
This is all you need to know to read ~75% of CSV files that you’ll encounter in practice. You can also easily adapt what you’ve learned to read tab separated files with read_tsv()
and fixed width files with read_fwf()
. To read in more challenging files, you’ll need to learn more about how readr parses each column, turning them into R vectors.
5.4 Writing to a file
readr also comes with two useful functions for writing data back to disk: write_csv()
and write_tsv()
. Both functions increase the chances of the output file being read back in correctly by:
Always encoding strings in UTF-8.
Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.
If you want to export a csv file to Excel, use write_excel_csv()
— this writes a special character (a “byte order mark”) at the start of the file which tells Excel that you’re using the UTF-8 encoding.
The most important arguments are x
(the data frame to save), and path
(the location to save it). You can also specify how missing values are written with na
, and if you want to append
to an existing file.
write_csv(heights, "results/heights.csv")
Note that the type information is lost when you save to csv:
heights
## # A tibble: 1,192 x 6
## earn height sex ed age race
## <dbl> <dbl> <chr> <int> <int> <chr>
## 1 50000 74.42444 male 16 45 white
## 2 60000 65.53754 female 16 58 white
## 3 30000 63.62920 female 16 29 white
## 4 50000 63.10856 female 16 91 other
## 5 51000 63.40248 female 17 39 white
## 6 9000 64.39951 female 15 26 white
## 7 29000 61.65633 female 12 49 white
## 8 32000 72.69854 male 17 46 white
## 9 2000 72.03947 male 15 21 hispanic
## 10 27000 72.23493 male 12 26 white
## # ... with 1,182 more rows
write_csv(heights, "results/heights-2.csv")
read_csv("results/heights-2.csv")
## Parsed with column specification:
## cols(
## earn = col_double(),
## height = col_double(),
## sex = col_character(),
## ed = col_integer(),
## age = col_integer(),
## race = col_character()
## )
## # A tibble: 1,192 x 6
## earn height sex ed age race
## <dbl> <dbl> <chr> <int> <int> <chr>
## 1 50000 74.42444 male 16 45 white
## 2 60000 65.53754 female 16 58 white
## 3 30000 63.62920 female 16 29 white
## 4 50000 63.10856 female 16 91 other
## 5 51000 63.40248 female 17 39 white
## 6 9000 64.39951 female 15 26 white
## 7 29000 61.65633 female 12 49 white
## 8 32000 72.69854 male 17 46 white
## 9 2000 72.03947 male 15 21 hispanic
## 10 27000 72.23493 male 12 26 white
## # ... with 1,182 more rows
This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two alternatives:
write_rds()
andread_rds()
are uniform wrappers around the base functionsreadRDS()
andsaveRDS()
. These store data in R’s custom binary format called RDS:write_rds(heights, "results/heights.rds") read_rds("results/heights.rds")
## # A tibble: 1,192 x 6 ## earn height sex ed age race ## <dbl> <dbl> <chr> <int> <int> <chr> ## 1 50000 74.42444 male 16 45 white ## 2 60000 65.53754 female 16 58 white ## 3 30000 63.62920 female 16 29 white ## 4 50000 63.10856 female 16 91 other ## 5 51000 63.40248 female 17 39 white ## 6 9000 64.39951 female 15 26 white ## 7 29000 61.65633 female 12 49 white ## 8 32000 72.69854 male 17 46 white ## 9 2000 72.03947 male 15 21 hispanic ## 10 27000 72.23493 male 12 26 white ## # ... with 1,182 more rows
The feather package implements a fast binary file format that can be shared across programming languages:
install.packages("feather") library(feather) write_feather(heights, "results/heights.feather") read_feather("results/heights.feather")
Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which we will come back to in [Model]); feather currently does not.
5.5 Learn More
- R Data Import Tutorial by Data Camp
- Importing Data and Exporting Data by R Programming @ UC.
- Data Import & Export in R by National Park Services
- Scrape web data with APIs