Handling data frames the data.table way
I aim to introduce data.table to those who are somewhat familiar with base R and/or dplyr, with some examples showing similarities and differences. But this is just to anchor you to some familiar syntax that you have already encountered, and it is not necessary to run the base R and dplyr code. (I also do not make any speed comparisons or formal benchmarking; there are several examples of head-to-head benchmarking comparisons between data.table vs dplyr vs base R online and while data.table is arguably much faster, it may not be the same (or may not even matter) in your use case.)
The data.table package was created by Matt Dowle and co-developed by Arun Srinivasan. The data.table philosophy is
[]
DT[i, j, by = ]
where:
i
is row subsettingj
is column selection and any arbitrary R expression that returns
a listby =
is grouping variable(s)Additionally:
.N
for number of observations (rows)on =
is joining (merging) variable(s).()
shortcut for list()
.SD
and .SDcols
- subset of dataset (not discussed here)set*
functions - e.g. setnames()
, setDT()
, setorder()
perform actions on the data.table by reference (i.e. in place)Let’s load the packages we are going to need.
library(data.table)
library(dplyr)
library(ggplot2)
library(babynames)
fread()
We will mainly use the data from the babynames package for later work,
but intially let’s go through data.table’s fread
function.
fread
can take a url, a filename or the output of a bash command as
input. For example, let’s read a csv file from a url and write it to a
gzipped file locally using data.table::fwrite
flights <- fread("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
fwrite(flights, "flights.csv.gz")
We can read a compressed file directly (if R.utils
package is
installed)
fread("flights.csv.gz")
We can read output from arbitrary bash commands. Here we use the cut
command in bash to read in only columns 1-8 (the same can be done within
the fread()
function using the select =
argument)
fread(cmd = "zcat flights.csv.gz | cut -d, -f1-8")
Let’s make a copy of the babynames dataset, coerce it to a data.table.
class(babynames) # a tibble
babynames_df <- as.data.frame(babynames)
class(babynames_df) # a data.frame
babynames_dt <- as.data.table(babynames)
class(babynames_dt)
head(babynames_dt) # a data.table
[i, j]
Remember again:
DT[i, j, by = ]
where:
i
is row subsettingj
is column selection and any arbitrary R expression that returns
a listby =
is grouping variable(s)Additionally:
.N
for number of observations (rows)on =
is joining (merging) variable(s).()
shortcut for list()
.SD
and .SDcols
- subset of dataset (not discussed here)set*
functions - e.g. setnames()
, setDT()
, setorder()
perform actions on the data.table by reference (i.e. in place)Let’s try making a new data.table for only the year 2000. (Important to
remember that if there is only one argument within []
, base R
interprets this as the column (because data.frames are lists
internally), whereas data.table interprets this as the row.)
# base R
babynames_df[babynames_df$year == 1880, ] # try this: babynames_df[babynames_df$year == 1880]
# data.table
babynames_dt[year == 1880] # equivalent to babynames_dt[year == 2000, ]
# dplyr
babynames %>%
filter(year == 1880) # equivalent to filter(babynames, year == 2000)
In a data.table []
i
is always row (not so with base R!)j
outputs a list
, the subset is a data.table
,
if not, it is an atomic vectorj
can be coerced to a list to make sure the subset is a
data.table
using list()
or .()
Next, let’s subset the dataset to give us just the name
column for the
year 2000.
# base R
babynames_df[babynames_df$year == 2000, "name"] # equivalent to babynames_df[babynames_df$year == 2000, 3]
# data.table
babynames_dt[year == 2000, name] # NOT the same as babynames_dt[year == 2000, 3]
babynames_dt[year == 2000, list(name)] # the same as babynames_dt[year == 2000, 3]
babynames_dt[year == 2000, .(name)] # .() is just a shorthand for list()
# dplyr
babynames %>%
filter(year == 2000) %>%
select(name)
Introducing .N
count()
function in dplyrQ. How many different names were given to babies born in the year 2000?
# base R nrow(babynames_df[babynames_df$year == 2000, ]) # returns an atomic vector # data.table babynames_dt[year == 2000, .N] # returns an atomic vector here, but a data.table is returned if there is more than 1 group babynames_dt[year == 2000, .(.N)] # or wrap j in .() to consistently return a data.table # dplyr babynames %>% filter(year == 2000) %>% count() # returns a tibbleQ. How many babies were named “Gandalf”, in all the years that data has been collected?
# babynames_dt[i, j]By copying
Just like base R, and similar to mutate()
in dplyr.
babynames_dt$firstletter <- substr(babynames_dt$name, start = 1, stop = 1)
By reference :=
Changes the data.table in place.
babynames_dt[ , firstletter := substr(name, start = 1, stop = 1)]
head(babynames_dt)
by =
The by =
argument takes the grouping variable. For example, if we want
to group by the first letter
Q. How many baby names start with each letter of the alphabet?
counts_by_letter <- babynames_dt[ , .(total = sum(n)), by = firstletter] counts_by_letterQ. How many babies were named “Trinity” in *each of the years* between 1995 and 2004 (inclusive)? hint: subset by name and year range in `i`, select year and n columns in `j`
babynames_dt[name == "Trinity" & year > 1994 & year < 2006 , .(total = sum(n)), by = year] %>% ggplot(aes(x = year, y = total)) + geom_line()By copying - using order()
, which is a base R function that has
been optimised in data.table
counts_by_letter[order(firstletter)] # A to Z
counts_by_letter[order(-firstletter)] # Z to A
counts_by_letter[order(total)] # ascending total
counts_by_letter[order(-total)] # descending total
By reference
setorder(counts_by_letter, -firstletter)
counts_by_letter
on =
keywordletters_df <- data.table(firstletter = LETTERS, letter_no = 1:26)
letters_df
counts_by_letter[letters_df, on = .(firstletter)]
melt
and dcast
Making long tables wide using dcast
formula: LHS ~ RHS
.
represents no variable...
represents all other variables not explicitly mentioned in
LHS or RHSLet’s first make a new data.table that summarises the total number of individuals by the first letter of their name and the year of birth:
counts_by_letter_and_year <- babynames_dt[, .(total = sum(n)), by = .(firstletter, year)]
counts_by_letter_and_year
Let’s visualise that, for fun.
counts_by_letter_and_year %>%
ggplot(aes(x = year, y = total, colour = firstletter)) +
geom_line()
Using dcast()
to make a wide table (year in rows, each letter as a
separate column)
widetable <- dcast(counts_by_letter_and_year, year ~ firstletter)
widetable
How popular are Z names over the years?
widetable %>%
ggplot(aes(x = year, y = Z)) +
geom_line()
Making wide tables long using melt
id.vars
- the columns that are ‘kept as is’measure.vars
- columns that are ‘melted’variable.name
- name for the names of the melted columnsvalue.name
- name for the values from the melted columnswidetable
melt(widetable, id.vars = c("year"), variable.name = "letter", value.name = "total")
Q. Which palindromic names are the most common (in all years combined)?
# function to check if a string is a palindrome # returns TRUE if the reverse of a name is the same as the name itself # ignores capitalisation is_palindrome <- function(name) { rev_name = sapply(strsplit(name, ""), function(x) paste(rev(x), collapse = "")) return(tolower(name) == tolower(rev_name)) }# is_palindrome("lol")
Using data.table:
# in plain English: take the babynames dataset, subset it for palindromic names, and sum the number of individuals for each name, and sort by the total number
babynames_dt[is_palindrome(name), .(total = sum(n)), by = name][order(-total), ]
Equivalent code with dplyr
babynames_dt %>%
filter(is_palindrome(name)) %>%
group_by(name) %>%
summarise(total = sum(n)) %>%
arrange(desc(total))
Equivalent code with base R
babynames_palindromic <- babynames_df[is_palindrome(babynames_df$name), ]
babynames_aggregated <- aggregate(babynames_palindromic$n, by = list(babynames_palindromic$name), FUN = "sum")
babynames_aggregated[order(babynames_aggregated$x, decreasing = TRUE), ]
Bonus question:
Q. What are the top 10 most common names among millennials (let’s go with the Wikipedia definition of millennials = those born between the birth years 1981 and 1996 (inclusive))?