Slides | Srinivasa Rao

Regular Expressions

A ‘language’ for describing text patterns; the concept dates back to the 1950s
Bound by a set of rules (syntax); a set of special characters used to denote patterns
Multi-platform: available (natively or through libraries) in many languages and tools (R, Python, Java, sed, awk)
Use cases:
- read files with a specific naming pattern, e.g. 20190114_Mon_P1_W08_R2.csv, 20190114_Mon_P10_W01_R3.csv
- search for text patterns
- replace text patterns
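
The three use cases can be sketched in base R as follows. This is a minimal illustration, assuming a temporary directory and made-up file contents; `demo_dir`, `file_pattern` and the other variable names are hypothetical, and the pattern syntax itself is explained on the following slides.

```r
# Illustrative setup: create three empty files in a temporary directory
demo_dir <- file.path(tempdir(), "regex_demo")
dir.create(demo_dir, showWarnings = FALSE)
file_names <- c("20190114_Mon_P1_W08_R2.csv",
                "20190114_Mon_P10_W01_R3.csv",
                "README.txt")
invisible(file.create(file.path(demo_dir, file_names)))

# <8 digits>_<3 letters>_P<digits>_W<2 digits>_R<digits>.csv
file_pattern <- "^[0-9]{8}_[A-Za-z]{3}_P[0-9]+_W[0-9]{2}_R[0-9]+\\.csv$"

# Use case 1: pick up only the files that follow the naming pattern
matched_files <- list.files(demo_dir, pattern = file_pattern)

# Use case 2: search for a text pattern
has_p10 <- grepl("_P10_", file_names)

# Use case 3: replace a text pattern
renamed <- sub("\\.csv$", ".tsv", file_names)
```

`matched_files` can then be passed to a reader such as `read.csv()` in a loop or with `lapply()`.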

The basics

Character classes []
- any character: .
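- alphabet: [A-Z] or [[:upper:]], [a-z] or [[:lower:]], [A-Za-z] or [[:alpha:]]
- numeric: [0-9] or [[:digit:]] or \d
- alphanumeric: [A-Za-z0-9] or [[:alnum:]]
- note: POSIX classes such as [:alpha:] are only valid inside a bracket expression, hence the double brackets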
- whitespace (space, tab, linebreak): \s
quantifiers:
- one or more (of the preceding character): +
- zero or more: *
- zero or one: ?
- specified number: {m} (exactly m), {m,} (m or more), {m,n} (between m and n)
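
A short sketch of how character classes and quantifiers behave in base R. The input strings and the helper `extract_all()` are hypothetical and exist only for this illustration.

```r
x <- "Order A17 shipped 2019-01-14 to  Bay 3"

# Hypothetical helper: return every match of `pattern` in `text`
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text))[[1]]
}

upper_letters <- extract_all("[[:upper:]]", x)      # "O" "A" "B"
digit_runs    <- extract_all("[0-9]+", x)           # "17" "2019" "01" "14" "3"
four_digits   <- extract_all("[0-9]{4}", x)         # "2019"
long_words    <- extract_all("[[:alpha:]]{2,}", x)  # "Order" "shipped" "to" "Bay"
spaces        <- extract_all("\\s+", x)             # six runs of whitespace

# ? makes the preceding character optional, * allows it zero or more times
spellings <- extract_all("colou?r", "color colour")  # "color" "colour"
ab_runs   <- extract_all("ab*", "a ab abb")          # "a" "ab" "abb"
```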

The basics

anchors:
- start: ^ (except in the context of [^ ], where it is negation)
- end: $
capture groups:
- extract groups: ()
- refer to captured groups: \1, \2, etc.
metacharacters: . \ | ( ) [ { ^ $ * + ?
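
Anchors, capture groups and escaped metacharacters can be tried out in base R as sketched below. The example strings are made up; note that every regex backslash is doubled inside an R string.

```r
x <- c("cat", "concat", "catalog", "hat")

starts_with_cat <- grepl("^cat", x)    # TRUE FALSE TRUE FALSE
ends_with_cat   <- grepl("cat$", x)    # TRUE TRUE FALSE FALSE
exactly_cat     <- grepl("^cat$", x)   # TRUE FALSE FALSE FALSE
not_c_first     <- grepl("^[^c]", x)   # FALSE FALSE FALSE TRUE ([^c] is negation)

# Capture groups: reorder the pieces of a date with \1, \2, \3
iso_date <- sub("^([0-9]{4})([0-9]{2})([0-9]{2})$", "\\1-\\2-\\3", "20190114")
# iso_date is "2019-01-14"

# A backreference inside the pattern: any character followed by itself
doubled <- grepl("(.)\\1", c("hello", "world"))   # TRUE FALSE

# Metacharacters must be escaped to be matched literally
dot_any     <- grepl("a.c", "abc")     # TRUE: . matches any character
dot_literal <- grepl("a\\.c", "abc")   # FALSE: \. matches only a literal dot
```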

Examples

https://regexr.com/
in the string “the cat in the hat has a bat”:
- [ch]at matches cat and hat
- .at matches cat, hat and bat
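- .{2} matches every consecutive pair of characters ("th", "e ", "ca", ...), of which in is only one
- \b[[:alpha:]]{1,2}\b matches the one- and two-letter words in and a
- .\s. matches e c, t i, n t, e h, t h, s a (a b is not reported by a left-to-right search, because its a is already part of s a)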
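
The same patterns can be checked in R rather than on regexr.com. This is a minimal sketch; `extract_all()` is a hypothetical helper, and `perl = TRUE` is an assumption made here so that `\b` is handled by the PCRE engine.

```r
string <- "the cat in the hat has a bat"

# Hypothetical helper: all non-overlapping matches of `pattern` in `string`
extract_all <- function(pattern) {
  regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
}

m_chat  <- extract_all("[ch]at")                  # "cat" "hat"
m_at    <- extract_all(".at")                     # "cat" "hat" "bat"
m_pairs <- extract_all(".{2}")                    # 14 pairs: "th" "e " "ca" ...
m_short <- extract_all("\\b[[:alpha:]]{1,2}\\b")  # "in" "a"
m_space <- extract_all(".\\s.")                   # "e c" "t i" "n t" "e h" "t h" "s a"
```

Matches are found left to right and do not overlap, which is why a pattern such as `.\s.` cannot report two matches that share a character.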

Strings in R

Strings (“character” class) are written in R using either " or '
But what about special characters like newlines and tabs? They are represented as escape sequences. print() shows the escape sequences, whereas cat() interprets them.

string = "First\tline\nSecond\tline"
print(string)

## [1] "First\tline\nSecond\tline"

cat(string)

## First    line
## Second   line
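
One way to see that an escape sequence stands for a single character is to count characters with `nchar()`. A small sketch using the same string:

```r
string <- "First\tline\nSecond\tline"

n_chars <- nchar(string)   # 22: \t and \n each count as one character
n_tab   <- nchar("\t")     # 1: a real tab character
n_text  <- nchar("\\t")    # 2: a backslash followed by the letter t

writeLines(string)         # like cat(), but adds a final newline
```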

Strings in R

What if the string contains an invalid escape character?

regex_string = ".\s."

## Error: '\s' is an unrecognized escape in character string starting "".\s"

Regular expressions are represented as strings in R. But strings are processed first for escape characters. Unrecognised escape characters in strings throw an error, before even reaching the regex parser.
Double backslashes are therefore needed for regex escape sequences

regex_string = ".\\s."
string = "the cat in the hat has a bat"
regexpr(regex_string, string)

## [1] 3
## attr(,"match.length")
## [1] 3
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
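
As an alternative to doubling every backslash, raw strings can be used. The sketch below assumes R 4.0.0 or later, where the `r"(...)"` syntax is available; on older versions only the doubled form works.

```r
escaped <- ".\\s."     # classic form: each regex backslash is doubled
raw     <- r"(.\s.)"   # raw string (assumes R >= 4.0.0): no doubling needed

same_pattern <- identical(escaped, raw)   # TRUE: both hold the characters . \ s .
writeLines(escaped)                       # prints .\s. which is what the regex engine receives

string <- "the cat in the hat has a bat"
first_match <- regmatches(string, regexpr(raw, string))   # "e c"
```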

Two problems

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski


Base R functions that use regex

grep()
grepl()
regexpr()
gregexpr()
sub()
gsub()
strsplit()
list.files()
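
A quick tour of what these functions return, as a sketch on a made-up character vector. All of them take the pattern first and the text second.

```r
x <- c("apple pie", "banana split", "cherry tart")

idx       <- grep("an", x)                      # 2: index of the matching element
values    <- grep("an", x, value = TRUE)        # "banana split"
has_e     <- grepl("e", x)                      # TRUE FALSE TRUE
first_a   <- as.integer(regexpr("a", x))        # 1 2 9: first match in each element
all_a     <- as.integer(gregexpr("a", x)[[2]])  # 2 4 6: every match in "banana split"
first_sub <- sub("a", "A", x)                   # replaces the first match only
all_sub   <- gsub("a", "A", x)                  # replaces every match
words     <- strsplit(x, " ")                   # list with one character vector per element

# list.files() filters file names through its `pattern` argument
r_scripts <- list.files(pattern = "\\.R$")
```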

Stringr functions

As with other Tidyverse functions, stringr functions take the text as the first argument and the pattern as the second (the reverse of base R functions such as grepl(pattern, x)).

str_locate() - like regexpr(), but returns an integer matrix
str_detect() - like grepl()
str_split() - like strsplit()
str_extract() - like match = regexpr(pattern, string); substring(string, match, match + attr(match, "match.length") - 1)
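
The correspondence between the two families can be checked side by side. A minimal sketch, assuming the stringr package is installed; the variable names are illustrative.

```r
library(stringr)

string  <- "the cat in the hat has a bat"
pattern <- ".at"

# Base R: pattern first, text second
match <- regexpr(pattern, string)
base_extract <- substring(string, match, match + attr(match, "match.length") - 1)

# stringr: text first, pattern second
stringr_extract <- str_extract(string, pattern)   # "cat"
location        <- str_locate(string, pattern)    # integer matrix with columns start, end
detected        <- str_detect(string, pattern)    # TRUE, like grepl(pattern, string)
pieces          <- str_split(string, " ")[[1]]    # like strsplit(string, " ")[[1]]
```

In base R, `regmatches(string, match)` is a shorter way to get the same result as the `substring()` call.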

Test drive

library(babynames)
library(stringr)
library(dplyr)
library(ggplot2)
library(ggpubr)
babynames %>%
  group_by(name, sex) %>%
  summarise(n = sum(n)) -> sum_babynames

sum_babynames %>%
  filter(str_detect(name, "a$")) %>%
  group_by(sex) %>%
  count() %>%
  ggplot(aes(x = sex, y = n, fill = sex)) + 
    geom_col(colour = "black") +
    labs(title = "Names ending with 'a'") +
    theme_pubr(border = TRUE)

babynames %>%
  filter(str_detect(name, "a$")) %>%
  group_by(sex, year) %>%
  count() %>% 
  ggplot(aes(x = year, y = n, colour = sex)) + 
    geom_line() +
    labs(title = "Names ending with 'a'") +
    theme_pubr(border = TRUE)

sum_babynames %>%
  filter(str_detect(name, "[aeiou]$")) %>%
  group_by(sex) %>%
  count() %>%
  ggplot(aes(x = sex, y = n, fill = sex)) + 
    geom_col(colour = "black") +
    labs(title = "Names ending with vowel") +
    theme_pubr(border = TRUE)

babynames %>%
  filter(str_detect(name, "[eiou]$")) %>%
  group_by(sex, year) %>%
  count() %>% 
  ggplot(aes(x = year, y = n, colour = sex)) + 
    geom_line() +
    labs(title = "Names ending with a vowel other than 'a'") +
    theme_pubr(border = TRUE)

sum_babynames %>%
  filter(str_detect(name, "(.{2})\\1")) %>%
  group_by(sex) %>%
  #filter(sex == "M") %>%
  count() %>% 
  #print()
  ggplot(aes(x = sex, y = n, fill = sex)) + 
    geom_col(colour = "black") +
    labs(title = "Names with repetitive characters") +
    theme_pubr(border = TRUE)
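
# Note on "(.{2})\\1": the backreference must repeat the captured pair exactly,
# including case. A sketch on made-up names (not taken from babynames), using base R:

```r
names_demo <- c("Lulu", "Alala", "Anna")   # illustrative names only

case_sensitive <- grepl("(.{2})\\1", names_demo)
# FALSE TRUE FALSE: "Lu" is not followed by "Lu", but "la" is followed by "la"

case_insensitive <- grepl("(.{2})\\1", names_demo, ignore.case = TRUE, perl = TRUE)
# TRUE TRUE FALSE: with case ignored, "Lu" followed by "lu" counts as a repeat
```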

sum_babynames %>%
  filter(str_detect(name, "[HhZz]ero")) %>%
  group_by(sex) %>%
  #filter(sex == "M") %>%
  #count() %>% 
  print()

## # A tibble: 36 x 3
## # Groups:   sex [2]
##    name     sex       n
##    <chr>    <chr> <int>
##  1 Acheron  M        56
##  2 Cherod   M         6
##  3 Cherokee F      2414
##  4 Cherokee M       337
##  5 Cherol   F        17
##  6 Cherolyn F        60
##  7 Cheron   F       635
##  8 Cheron   M        99
##  9 Cheronda F       164
## 10 Cherone  F         7
## # … with 26 more rows

# Total number of births per year, used to normalise the counts below
babynames %>%
  group_by(year) %>%
  summarise(total = sum(n)) -> babynames_pop
left_join(babynames, babynames_pop, by = "year") -> babynames_complete

babynames_complete %>%
  filter(str_detect(name, "^[^AEIOUaeiou]+$")) %>%
  group_by(sex, year) %>%
  # Share of that year's births given a name without vowels
  summarise(share = sum(n / total)) %>%
  ggplot(aes(x = year, y = share, colour = sex)) +
    geom_line() +
    labs(title = "Names without any vowels") +
    theme_pubr(border = TRUE)

babynames %>%
  filter(str_detect(name, "^Joshua$")) %>%
  group_by(sex, year) %>% 
  #count() %>% head()
  ggplot(aes(x = year, y = n, colour = sex)) + 
    geom_line() +
    labs(title = "") +
    theme_pubr(border = TRUE)

Commenting within a regex
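
One way to comment a regex, sketched here under the assumption that the PCRE engine is used (`perl = TRUE` in base R): the `(?x)` flag makes the engine ignore whitespace in the pattern and treat everything from `#` to the end of the line as a comment. stringr offers a comparable option through `regex(pattern, comments = TRUE)`. The field descriptions in the comments are guesses based only on the shape of the file names.

```r
# (?x) switches on PCRE "extended" mode: whitespace in the pattern is ignored
# and everything from # to the end of the line is a comment
file_pattern <- "(?x)
  ^[0-9]{8}      # date, eight digits
  _[A-Za-z]{3}   # three-letter day name
  _P[0-9]+       # P field
  _W[0-9]{2}     # W field, two digits
  _R[0-9]+       # R field
  \\.csv$        # file extension
"

files <- c("20190114_Mon_P1_W08_R2.csv", "20190114_Mon_P10_W01_R3.csv", "notes.txt")
matches <- grepl(file_pattern, files, perl = TRUE)   # TRUE TRUE FALSE
```

Because whitespace is ignored in this mode, a literal space in the pattern has to be written as `\\s` or `[ ]`.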