Handling data frames the data.table way

Abstract

Handling data frames the data.table way

Date
Jun 10, 2020 11:00 AM — 12:00 PM
Event
R code clinic, Big Data Institute
Location
remote

Summary of the content

  • The data.table philosophy
  • How data.table differs from base R & how it differs from dplyr
  • Run through with example dataset
    • loading data into R
    • subsetting and filtering
    • grouping and summarising
    • sorting
    • joining
    • reshaping
    • chaining several operations

I aim to introduce data.table to those who are somewhat familiar with base R and/or dplyr, with some examples showing similarities and differences. But this is just to anchor you to some familiar syntax that you have already encountered, and it is not necessary to run the base R and dplyr code. (I also do not make any speed comparisons or formal benchmarking; there are several examples of head-to-head benchmarking comparisons between data.table vs dplyr vs base R online and while data.table is arguably much faster, it may not be the same (or may not even matter) in your use case.)

The data.table package

The data.table package was created by Matt Dowle and co-developed by Arun Srinivasan. The data.table philosophy is

  • speed
  • concision - (almost) everything can be done within []
  • few dependencies

DT[i, j, by = ]

where:

  • i is row subsetting
  • j is column selection and any arbitrary R expression that returns a list
  • by = is grouping variable(s)

Additionally:

  • .N for number of observations (rows)
  • on = is joining (merging) variable(s)
  • .() shortcut for list()
  • .SD and .SDcols - subset of dataset (not discussed here)
  • set* functions - e.g. setnames(), setDT(), setorder() perform actions on the data.table by reference (i.e. in place)

Let’s load the packages we are going to need.

library(data.table)
library(dplyr)
library(ggplot2)
library(babynames)
  1. Loading data fread()

We will mainly use the data from the babynames package for later work, but intially let’s go through data.table’s fread function.

fread can take a url, a filename or the output of a bash command as input. For example, let’s read a csv file from a url and write it to a gzipped file locally using data.table::fwrite

flights <- fread("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
fwrite(flights, "flights.csv.gz")

We can read a compressed file directly (if R.utils package is installed)

fread("flights.csv.gz")

We can read output from arbitrary bash commands. Here we use the cut command in bash to read in only columns 1-8 (the same can be done within the fread() function using the select = argument)

fread(cmd = "zcat flights.csv.gz | cut -d, -f1-8")

Let’s make a copy of the babynames dataset, coerce it to a data.table.

class(babynames) # a tibble
babynames_df <- as.data.frame(babynames)
class(babynames_df) # a data.frame
babynames_dt <- as.data.table(babynames)
class(babynames_dt)
head(babynames_dt) # a data.table
  1. Subsetting data [i, j]

Remember again:

DT[i, j, by = ]

where:

  • i is row subsetting
  • j is column selection and any arbitrary R expression that returns a list
  • by = is grouping variable(s)

Additionally:

  • .N for number of observations (rows)
  • on = is joining (merging) variable(s)
  • .() shortcut for list()
  • .SD and .SDcols - subset of dataset (not discussed here)
  • set* functions - e.g. setnames(), setDT(), setorder() perform actions on the data.table by reference (i.e. in place)

Let’s try making a new data.table for only the year 2000. (Important to remember that if there is only one argument within [], base R interprets this as the column (because data.frames are lists internally), whereas data.table interprets this as the row.)

# base R
babynames_df[babynames_df$year == 1880, ] # try this: babynames_df[babynames_df$year == 1880]
# data.table
babynames_dt[year == 1880] # equivalent to babynames_dt[year == 2000, ]
# dplyr
babynames %>%
  filter(year == 1880) # equivalent to filter(babynames, year == 2000)

In a data.table []

  • the first argument i is always row (not so with base R!)
  • can refer to column or variable names directly (without quotes or referring to the data.frame, similar to dplyr)
  • if expression in j outputs a list, the subset is a data.table, if not, it is an atomic vector
  • j can be coerced to a list to make sure the subset is a data.table using list() or .()

Next, let’s subset the dataset to give us just the name column for the year 2000.

# base R
babynames_df[babynames_df$year == 2000, "name"] # equivalent to babynames_df[babynames_df$year == 2000, 3]
# data.table
babynames_dt[year == 2000, name] # NOT the same as babynames_dt[year == 2000, 3]
babynames_dt[year == 2000, list(name)] # the same as babynames_dt[year == 2000, 3]
babynames_dt[year == 2000, .(name)] # .() is just a shorthand for list()
# dplyr
babynames %>%
  filter(year == 2000) %>%
  select(name)

Introducing .N

  • stands for the number of rows in the current grouping (or subset)
  • similar to count() function in dplyr

Q. How many different names were given to babies born in the year 2000?

# base R nrow(babynames_df[babynames_df$year == 2000, ]) # returns an atomic vector # data.table babynames_dt[year == 2000, .N] # returns an atomic vector here, but a data.table is returned if there is more than 1 group babynames_dt[year == 2000, .(.N)] # or wrap j in .() to consistently return a data.table # dplyr babynames %>% filter(year == 2000) %>% count() # returns a tibble

Q. How many babies were named “Gandalf”, in all the years that data has been collected?

# babynames_dt[i, j]

detour - adding new columns

By copying

Just like base R, and similar to mutate() in dplyr.

babynames_dt$firstletter <- substr(babynames_dt$name, start = 1, stop = 1)

By reference :=

Changes the data.table in place.

babynames_dt[ , firstletter := substr(name, start = 1, stop = 1)]
head(babynames_dt)
  1. Grouping by =

The by = argument takes the grouping variable. For example, if we want to group by the first letter

Q. How many baby names start with each letter of the alphabet?

counts_by_letter <- babynames_dt[ , .(total = sum(n)), by = firstletter] counts_by_letter

Q. How many babies were named “Trinity” in *each of the years* between 1995 and 2004 (inclusive)? hint: subset by name and year range in `i`, select year and n columns in `j`

babynames_dt[name == "Trinity" & year > 1994 & year < 2006 , .(total = sum(n)), by = year] %>% ggplot(aes(x = year, y = total)) + geom_line()
  1. Sorting

By copying - using order(), which is a base R function that has been optimised in data.table

counts_by_letter[order(firstletter)]  # A to Z
counts_by_letter[order(-firstletter)] # Z to A
counts_by_letter[order(total)]   # ascending total
counts_by_letter[order(-total)]  # descending total

By reference

setorder(counts_by_letter, -firstletter)
counts_by_letter
  1. Joining using the on = keyword

letters_df <- data.table(firstletter = LETTERS, letter_no = 1:26)
letters_df
counts_by_letter[letters_df, on = .(firstletter)]
  1. Reshaping data.tables melt and dcast

Making long tables wide using dcast

formula: LHS ~ RHS

  • . represents no variable
  • ... represents all other variables not explicitly mentioned in LHS or RHS

Let’s first make a new data.table that summarises the total number of individuals by the first letter of their name and the year of birth:

counts_by_letter_and_year <- babynames_dt[, .(total = sum(n)), by = .(firstletter, year)]
counts_by_letter_and_year

Let’s visualise that, for fun.

counts_by_letter_and_year %>%
  ggplot(aes(x = year, y = total, colour = firstletter)) +
    geom_line()

Using dcast() to make a wide table (year in rows, each letter as a separate column)

widetable <- dcast(counts_by_letter_and_year, year ~ firstletter) 
widetable

How popular are Z names over the years?

widetable %>%
  ggplot(aes(x = year, y = Z)) + 
    geom_line()

Making wide tables long using melt

  • id.vars - the columns that are ‘kept as is’
  • measure.vars - columns that are ‘melted’
  • variable.name - name for the names of the melted columns
  • value.name - name for the values from the melted columns
widetable
melt(widetable, id.vars = c("year"), variable.name = "letter", value.name = "total")
  1. Bringing it all together

Q. Which palindromic names are the most common (in all years combined)?

# function to check if a string is a palindrome # returns TRUE if the reverse of a name is the same as the name itself # ignores capitalisation is_palindrome <- function(name) { rev_name = sapply(strsplit(name, ""), function(x) paste(rev(x), collapse = "")) return(tolower(name) == tolower(rev_name)) }
# is_palindrome("lol")

Using data.table:

# in plain English: take the babynames dataset, subset it for palindromic names, and sum the number of individuals for each name, and sort by the total number
babynames_dt[is_palindrome(name), .(total = sum(n)), by = name][order(-total), ]

Equivalent code with dplyr

babynames_dt %>%
  filter(is_palindrome(name)) %>%
  group_by(name) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total))

Equivalent code with base R

babynames_palindromic <- babynames_df[is_palindrome(babynames_df$name), ]
babynames_aggregated <- aggregate(babynames_palindromic$n, by = list(babynames_palindromic$name), FUN = "sum")
babynames_aggregated[order(babynames_aggregated$x, decreasing = TRUE), ]

Bonus question:

Q. What are the top 10 most common names among millennials (let’s go with the Wikipedia definition of millennials = those born between the birth years 1981 and 1996 (inclusive))?

Resources

Srinivasa Rao
Srinivasa Rao
Postdoctoral research associate

My research interests include prostate cancer, cancer genomics and tumour evolution.

Related