Regular Expressions Every R programmer Should Know


Regular expressions. How they can be cruel! Well, we’re here to make them a tad easier. To do so, we’re going to make use of the stringr package

install.packages("stringr")
library("stringr")

We’re going to use the str_detect() and str_subset() functions, in particular the latter. These have the syntax

function_name(STRING, REGEX_PATTERN)

str_detect() is used to detect whether a string contains a certain pattern. In their most basic use, these functions match literal strings of text. For instance

jr = c("Theo is first", "Esther is second", "Colin - third")
str_detect(jr, "Theo")
## [1]  TRUE FALSE FALSE
str_detect(jr, "is")
## [1]  TRUE  TRUE FALSE

So str_detect() will return TRUE when the element contains the pattern we searched for. If we want to return the actual strings that contain these patterns, we use str_subset()

str_subset(jr, "Theo")
## [1] "Theo is first"
str_subset(jr, "is")
## [1] "Theo is first"    "Esther is second"
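The two functions are closely related: str_subset() gives the same strings as indexing the vector with the logical output of str_detect(). A small sketch of that equivalence, reusing the jr vector (the names by_detect and by_subset are made up for this example, and base R's grepl() is shown only for comparison)

```r
library("stringr")

jr = c("Theo is first", "Esther is second", "Colin - third")

# Indexing with the logical vector from str_detect() ...
by_detect = jr[str_detect(jr, "is")]
# ... returns the same strings as str_subset()
by_subset = str_subset(jr, "is")

identical(by_detect, by_subset)
## [1] TRUE

# Base R comparison: grepl() also returns one TRUE/FALSE per element
grepl("is", jr)
## [1]  TRUE  TRUE FALSE
```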

To practise our regex, we’ll need some text to practise on. Here we have a vector of filenames called files

files = c(
  "tmp-project.csv", "project.csv", 
  "project2-csv-specs.csv", "project2.csv2.specs.xlsx", 
  "project_cars.ods", "project-houses.csv", 
  "Project_Trees.csv","project-cars.R",
  "project-houses.r", "project-final.xls", 
  "Project-final2.xlsx"
)

I’m also going to give us a task. The task is to grab the files that have the format “project-objects” or “project_objects”. Of those files, let’s say we want only the csv and ods files, i.e. we want to grab “project_cars.ods”, “project-houses.csv” and “Project_Trees.csv”. As we introduce more regex, we’ll gradually tackle our task.

Regex: The backslash, \

Here we go! Our first regular expression. When typing regular expressions, there are a group of special characters called metacharacters that have other functions. These are:

.{[()\^$|?*+

The backslash is SUPER important because if we want to search for any of these characters without using their built-in function, we must escape the character with a backslash. For example, if we wanted to extract the names of all csv files, then perhaps we would think to search for the string “.csv”? Then we would do

str_subset(files, "\.csv") # throws an error: "\." is not a valid escape in an R string

Hang on a second, what? Ah yes. The backslash is a metacharacter too, and it is also the escape character inside R strings, so R rejects “\.” before the pattern ever reaches the regex engine. To hand the regex engine a single backslash, we need to escape the backslash itself!

str_subset(files, "\\.csv")
## [1] "tmp-project.csv"          "project.csv"             
## [3] "project2-csv-specs.csv"   "project2.csv2.specs.xlsx"
## [5] "project-houses.csv"       "Project_Trees.csv"
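One way to see why two backslashes are needed is to look at what R actually stores in the string. The sketch below uses base R's writeLines() and nchar(); the variable names are made up for illustration, and the raw string syntax r"(...)" assumes R 4.0.0 or later

```r
library("stringr")

pattern = "\\.csv"

# R's string parser turns "\\" into one backslash, so the regex
# engine receives the five characters \.csv
writeLines(pattern)
## \.csv
nchar(pattern)
## [1] 5

# Raw strings (R >= 4.0.0) skip the string-level escaping entirely
raw_pattern = r"(\.csv)"
identical(pattern, raw_pattern)
## [1] TRUE

# The escaped dot only matches a literal full stop
str_detect(c("abcsv", "a.csv"), pattern)
## [1] FALSE  TRUE

# An unescaped dot matches any character, so both elements match
str_detect(c("abcsv", "a.csv"), ".csv")
## [1] TRUE TRUE
```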

Much better. With regards to our task, this is already useful, as we want csv and ods files. However, you’ll notice that when we searched for files containing the string “.csv”, we got a file of type “.xlsx” as well, just because it had “.csv” somewhere in its name. Up step the hat and dollar…

Regex: The hat, ^, and dollar, $

The hat and dollar are used to specify the start and end of a line respectively. For instance, we can find all file names that start with “Proj” (take note of the capital “P”!)

str_subset(files, "^Proj")
## [1] "Project_Trees.csv"   "Project-final2.xlsx"

So what if we wanted specifically just “.csv” or “.ods” files, just like in our task? We could use the dollar to search for files ending in a specific extension

str_subset(files, "\\.csv$")
## [1] "tmp-project.csv"        "project.csv"           
## [3] "project2-csv-specs.csv" "project-houses.csv"    
## [5] "Project_Trees.csv"
str_subset(files, "\\.ods$")
## [1] "project_cars.ods"
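Using both anchors in one pattern pins the match to the whole file name rather than just one end of it. A minimal sketch on the same files vector (repeated here so the snippet runs on its own; the names anywhere and whole_name are made up)

```r
library("stringr")

files = c(
  "tmp-project.csv", "project.csv",
  "project2-csv-specs.csv", "project2.csv2.specs.xlsx",
  "project_cars.ods", "project-houses.csv",
  "Project_Trees.csv", "project-cars.R",
  "project-houses.r", "project-final.xls",
  "Project-final2.xlsx"
)

# No anchors: "project.csv" may appear anywhere in the name
anywhere = str_subset(files, "project\\.csv")
anywhere
## [1] "tmp-project.csv" "project.csv"

# Hat and dollar together: the name must be exactly "project.csv"
whole_name = str_subset(files, "^project\\.csv$")
whole_name
## [1] "project.csv"
```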

Now we can search for files that end in certain patterns. That’s all well and good, but we still can’t search for both together. Up step round parentheses and the pipe…

Regex: Round parentheses, (), and the pipe, |

Round parentheses and the pipe are best used in conjunction with each other. The parentheses specify a group and the pipe means “or”. With the pipe, we can search for files ending in one extension or another. For our task we need “.csv” and “.ods” files. Using the pipe

str_subset(files, "\\.csv$|\\.ods$")
## [1] "tmp-project.csv"        "project.csv"           
## [3] "project2-csv-specs.csv" "project_cars.ods"      
## [5] "project-houses.csv"     "Project_Trees.csv"

Alternatively we can use a group and pipe

str_subset(files, "\\.(csv|ods)$")
## [1] "tmp-project.csv"        "project.csv"           
## [3] "project2-csv-specs.csv" "project_cars.ods"      
## [5] "project-houses.csv"     "Project_Trees.csv"
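The group matters because the pipe splits everything on either side of it: without parentheses, the whole pattern is cut in two. A sketch of the difference, using a small made-up vector x (not part of our task)

```r
library("stringr")

# Hypothetical file names, chosen to show where the two patterns differ
x = c("data.csv", "data.ods", "methods", "notes.csv.bak")

# No group: matches ".csv" anywhere OR "ods" at the end
no_group = str_subset(x, "\\.csv|ods$")
no_group
## [1] "data.csv"      "data.ods"      "methods"       "notes.csv.bak"

# With a group: a full stop, then "csv" or "ods", then the end
with_group = str_subset(x, "\\.(csv|ods)$")
with_group
## [1] "data.csv" "data.ods"
```

On our files vector the two forms happen to agree, but only the grouped version says what we actually mean.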

Now we don’t have to write surrounding expressions more than once. Of course there are other csv and ods files that we don’t want to collect. Now we need a way of specifying a block of letters. Up step the square parentheses and the asterisk…

Regex: Square parentheses, [], and the asterisk, *

We can match a group of characters or digits using the square parentheses. Here I’m going to use a new function, str_extract(). This does what it says on the tin: it extracts the first part of each string that matches our pattern. For instance, we can extract the last lower case letter in each element of the vector, if such a thing exists

str_extract(files, "[a-z]$")
##  [1] "v" "v" "v" "x" "s" "v" "v" NA  "r" "s" "x"

Notice that one of the files ends with an upper case letter, so we get an NA. To include this we add “A-Z” (to add numbers we add 0-9, and most metacharacters can be written inside the square parentheses without escaping them)

str_extract(files, "[a-zA-Z]$")
##  [1] "v" "v" "v" "x" "s" "v" "v" "R" "r" "s" "x"

Now, this is obviously useless at the moment. This is where the asterisk comes in. The asterisk is what is called a quantifier. There are three other quantifiers (+, ? and {}), but we won’t cover them here. A quantifier specifies how many of the preceding character we want to match, and the asterisk means we want 0 or more of them. For instance, we could now extract all of the file extensions if we wished to

str_extract(files, "[a-zA-Z]*$")
##  [1] "csv"  "csv"  "csv"  "xlsx" "ods"  "csv"  "csv"  "R"    "r"    "xls" 
## [11] "xlsx"

So we go backwards from the end of the line collecting all the characters until we hit a character that isn’t a lower or upper case letter. We can now use this to grab the group of letters preceding the file extensions for our task

str_subset(files, "[a-zA-Z]*\\.(csv|ods)$")
## [1] "tmp-project.csv"        "project.csv"           
## [3] "project2-csv-specs.csv" "project_cars.ods"      
## [5] "project-houses.csv"     "Project_Trees.csv"
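The result is the same as with the plain extension pattern because the asterisk also allows zero letters, so “[a-zA-Z]*” can match nothing at all and never rules a file out on its own. A quick check of that, with the files vector repeated so the snippet runs on its own (the names with_letters and extension_only are made up)

```r
library("stringr")

files = c(
  "tmp-project.csv", "project.csv",
  "project2-csv-specs.csv", "project2.csv2.specs.xlsx",
  "project_cars.ods", "project-houses.csv",
  "Project_Trees.csv", "project-cars.R",
  "project-houses.r", "project-final.xls",
  "Project-final2.xlsx"
)

with_letters = str_subset(files, "[a-zA-Z]*\\.(csv|ods)$")
extension_only = str_subset(files, "\\.(csv|ods)$")

# Adding "[a-zA-Z]*" in front does not filter anything out
identical(with_letters, extension_only)
## [1] TRUE

# "Zero or more letters" is satisfied even by a string with no letters
str_detect("123", "[a-zA-Z]*")
## [1] TRUE
```

The block of letters only starts to filter once something required, such as a dash or underscore, has to come before it.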

Obviously we still have some pesky files in there that we don’t want. Up step the… only joking! We now actually have all the tools to complete the task. The filenames we want take the form project-objects or project_objects, so we know that preceding that block of letters for “objects” we want either a dash or an underscore. We can use a group and pipe for this

str_subset(files, "(\\_|\\-)[a-zA-Z]*\\.(csv|ods)$")
## [1] "tmp-project.csv"        "project2-csv-specs.csv"
## [3] "project_cars.ods"       "project-houses.csv"    
## [5] "Project_Trees.csv"

We still have two pesky files sneaking in there. How do those two files and the three files we want differ? Well, the files we want all start with “project-” or “project_”, whereas the other two don’t. We must also take note that the project could have a capital “P”. We can combat that using a group!

str_subset(files, "(P|p)roject(\\_|\\-)[a-zA-Z]*\\.(csv|ods)$")
## [1] "project_cars.ods"   "project-houses.csv" "Project_Trees.csv"

If we had a huge file list, we’d want to stop files such as “2Project_Trees.csv” filtering in as well. So we can just use the hat to specify the start of a line

str_subset(files, "^(P|p)roject(\\_|\\-)[a-zA-Z]*\\.(csv|ods)$")
## [1] "project_cars.ods"   "project-houses.csv" "Project_Trees.csv"
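The single-character alternatives in this pattern can also be written with square parentheses, which some people find easier to read. A sketch of an equivalent pattern, assuming the same files vector (the names pattern_groups and pattern_classes are made up; the underscore needs no escaping, and a hyphen placed last inside the brackets is taken literally)

```r
library("stringr")

files = c(
  "tmp-project.csv", "project.csv",
  "project2-csv-specs.csv", "project2.csv2.specs.xlsx",
  "project_cars.ods", "project-houses.csv",
  "Project_Trees.csv", "project-cars.R",
  "project-houses.r", "project-final.xls",
  "Project-final2.xlsx"
)

# The pattern from the post, using groups and pipes
pattern_groups = "^(P|p)roject(\\_|\\-)[a-zA-Z]*\\.(csv|ods)$"
# The same idea, using square parentheses for the one-character choices
pattern_classes = "^[Pp]roject[_-][a-zA-Z]*\\.(csv|ods)$"

task_files = str_subset(files, pattern_classes)
task_files
## [1] "project_cars.ods"   "project-houses.csv" "Project_Trees.csv"

identical(str_subset(files, pattern_groups), task_files)
## [1] TRUE

# Base R comparison: grep() with value = TRUE returns the matching strings
base_files = grep(pattern_classes, files, value = TRUE)
identical(base_files, task_files)
## [1] TRUE
```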

Regular expressions are definitely a trade worth learning. They play a big role in modern data analytics. For a good table of metacharacters, quantifiers and useful regular expressions, see this Microsoft page. Remember, in R you have to double escape metacharacters!

That’s all for now. Cheers for reading!


8 thoughts on “Regular Expressions Every R programmer Should Know”

  1. Why use stringr? The base functions like grep and sub are not much longer to write and are a lot more efficient and flexible. I would also advise against the dependency on an external package when it’s not absolutely necessary. str_subset(files, "\\.csv") takes 20 microseconds (from microbenchmark). grep("\\.csv", files, val=T) takes 7, almost 3 times faster, with only one extra character. grep(".csv", files, val=T, fixed=T) takes 2.4, close to a factor of 10 from stringr.

    • Mea culpa. Thanks for the feedback and I agree with your comment. We should have highlighted base R versions of the string functions. Essentially the blog post came off a training course we have been asked to develop specifically on the tidyverse. We are big fans of base R (see http://blog.jumpingrivers…. and the associated comments)

    • Actually stringr is faster for larger data, when you know how to use it:

      n <- 2e6
      set.seed(2)
      f <- files[sample.int(length(files), n, replace = T)]
      require(microbenchmark)
      microbenchmark(
        str_subset(f, "\\.csv"),
        grep("\\.csv", f, val=T),
        grep(".csv", f, val = T, fixed = T),
        str_subset(f, fixed(".csv")),
        times = 10)
      # Unit: milliseconds
      #                                 expr      min       lq      mean    median        uq       max neval cld
      #              str_subset(f, "\\.csv") 987.0476 999.5223 1012.6638 1015.0188 1022.8013 1031.0876    10   c
      #           grep("\\.csv", f, val = T) 348.6355 350.8006  369.8565  362.9126  387.5642  419.2893    10   b
      #  grep(".csv", f, val = T, fixed = T) 165.2346 168.5898  174.4053  170.5422  172.5742  214.7860    10   a
      #         str_subset(f, fixed(".csv")) 154.0928 157.0352  163.2285  160.2875  167.0760  180.2442    10   a

  2. Why do you have to escape the backslash? In my understanding the backslash is a metacharacter that escapes the built-in function of a following metacharacter. Thus, would the first backslash not simply escape the second backslash and the second backslash would therefore be printed?

  3. Thank you! One thing I don’t get is why use “\.csv” when “.csv” is working? What is the function of the single \?

    • When you use “.csv” the “.” stands for any value (which includes the full stop). Compare str_match(c("abcsv", "a.csv"), ".csv") with str_match(c("abcsv", "a.csv"), "\\.csv")
