Conditional RegEx Matching with Python

Recently I’ve needed to capture the entries of a datalines statement in SAS for editing. Generally, this is a straightforward problem if I only need to do it with one file, or if all of the files I am using are formatted identically. But then I started thinking about the more general case. SAS doesn’t care about the case of my keywords, so I need a case-insensitive match. I also need to account for possible extra whitespace. So far so good. But what if I have two different keywords that can start my data section, and the end of the data section is indicated with different characters depending on the chosen keyword? Could I still use a single regular expression?

SAS does in fact allow a number of different keywords to enter data in a data step. In my experience, the most common are the datalines and datalines4 statements. The main difference between them is how the end of the data is indicated. For datalines, a single semicolon is used, while datalines4 uses a sequence of four semicolons, thereby allowing the use of semicolons in the data itself. Both statements have aliases with matching behavior: cards and lines for datalines, and cards4 and lines4 for datalines4. A simple data step with these statements could look like this:

data person;
  input name $ sex $ age;
  datalines; /* or `cards` or `lines` */
Alfred M 14
Alice F 13
;

data person4;
  input name $ sex $ age;
  datalines4; /* or `cards4` or `lines4` */
Alfred M 14
Alice F 13
;;;;

We could write two separate regular expressions, one for the datalines/cards/lines statement and a second one for the datalines4/cards4/lines4 statement. But if the RegEx engine we are using allows conditionals, as the Python RegEx engine does, then we can write a single expression that captures both types of statements. The basic format of the conditional capture is (?(D)A|B), which can be read as “if capture group D is set, then match A, otherwise match B.” For more details, see here.
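To see the conditional on its own before applying it to SAS, here is a minimal sketch with a toy pattern. The pattern and the helper name `is_well_formed` are made up for illustration: a word may be wrapped in angle brackets, and the closing `>` is only required if the opening `<` was captured.

```python
import re

# Toy pattern (an assumption for illustration, not part of the SAS problem):
#   (<)?        optionally capture an opening "<" as group 1
#   \w+         a word
#   (?(1)>|$)   if group 1 is set, require ">", otherwise require end of string
bracketed = re.compile(r"(<)?\w+(?(1)>|$)")


def is_well_formed(text):
    """Return True if `text` is exactly `word` or `<word>`."""
    return bracketed.fullmatch(text) is not None


for sample in ["<tag>", "tag", "<tag", "tag>"]:
    print(repr(sample), is_well_formed(sample))
```

The first two samples are accepted and the last two are rejected: which closing condition applies depends on whether the optional group took part in the match.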

Using this technique, we can capture both types of statements in one go. The short form of the solution I found is this regular expression: r"(?:(?:(?:data)?lines)|cards)(4)?\s*;(.*?)(?(1);{4}|;)" with two flags set: case-insensitive and dot-all. The dot-all flag lets the dot match newlines, so the data content can span several lines. If we use Python’s verbose flag, we can also format the expression more readably:

import re

pattern = re.compile(
  r"""(?:(?:    # open two non-capturing groups
      (?:data)? # maybe match `data`, but don't capture
       lines)   # matches `lines`
      |cards)   # alternatively, matches `cards`
      (4)?      # a `4` may be present (capture group 1)
      \s*;      # there might be whitespace before the ;
      (.*?)     # lazy-match data content (capture group 2)
      (?(1)     # check if capture group 1 is set, if so
      ;{4}      # match `;;;;`
      |;)       # otherwise, match a single ;
  """, flags=re.DOTALL | re.X | re.I)

A great website to help you build up a regular expression is regex101.com. It lets you paste in a sample text and a regular expression. It then explains your expression and lists the capture groups by number, which can be convenient. It also lets you try out different RegEx engines. Try setting it to Python with the flags we mentioned, and see how it works!

D. Michael Senter
Research Statistician Developer

My research interests include data analytics and missing data.
