regex

Matching patterns and extracting information in strings

Basic Matching

Any sequence of letters or digits will match exactly that sequence. For example, the regex “the” will match precisely those places where there is a “t” followed by an “h” followed by an “e”.
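
A minimal sketch of literal matching, assuming Python's built-in re module; the sample string is made up for illustration. It shows that “the” also matches inside longer words.

```python
import re

text = "the other then"  # hypothetical sample input

# Each match is a place where "t", "h", "e" appear consecutively,
# including inside longer words such as "other" and "then".
positions = [m.start() for m in re.finditer("the", text)]
print(positions)  # [0, 5, 10]
```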

Meta Characters

  • . : matches any single character except line break
  • [] : character class, matches any character within the brackets
  • [^] : negated character class, matches any character not in the square brackets
  • * : matches 0 or more repetitions of the preceding symbol
  • + : matches 1 or more repetitions of the preceding symbol
  • ? : makes the preceding symbol optional (matches 0 or 1 occurrence)
  • {n,m} : matches at least n but no more than m repetitions of preceding symbol
  • () : capturing group, groups the sub-pattern in parentheses. Makes it possible to extract pieces from the match
  • | : alternation, matches either the expression before or the expression after the “|”
  • \ : escapes following character
  • ^ : match at beginning of the input
  • $ : match at the end of the input
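
The meta characters above can be tried out one at a time. This is a rough sketch using Python's re module; the patterns and sample strings are invented for illustration, and other regex engines may differ in details.

```python
import re

# . matches any single character except a line break
dot = re.findall(r"c.t", "cat cot c\nt")            # ['cat', 'cot']

# [] and [^] : character class and negated character class
char_class = re.findall(r"[bc]at", "bat cat rat")   # ['bat', 'cat']
negated = re.findall(r"[^bc]at", "bat cat rat")     # ['rat']

# * + ? {n,m} : repetition of the preceding symbol
star = re.findall(r"ab*", "a ab abbb")              # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")              # ['ab', 'abbb']
optional = re.findall(r"colou?r", "color colour")   # ['color', 'colour']
bounded = re.findall(r"a{2,3}", "a aa aaa aaaa")    # ['aa', 'aaa', 'aaa']

# () : capturing groups let you pull pieces out of the match
group = re.search(r"([a-z]+)=([0-9]+)", "width=42").groups()  # ('width', '42')

# | : alternation, and \ : escape (here a literal dot)
either = re.findall(r"cat|dog", "cat dog bird")     # ['cat', 'dog']
escaped = re.findall(r"\.", "a.b.c")                # ['.', '.']

# ^ and $ : anchors at the beginning and end of the input
starts_with = re.search(r"^the", "the end") is not None      # True
not_at_start = re.search(r"^the", "at the end") is None      # True
ends_with = re.search(r"end$", "the end") is not None        # True
```

Note that {2,3} is greedy: on “aaaa” it takes three characters and leaves a single “a” that is too short to match again.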

Character Sets

  • \w : matches word characters (letters, digits and underscore) [a-zA-Z0-9_]
  • \W : matches non-word characters [^a-zA-Z0-9_]
  • \d : matches any digit [0-9]
  • \D : matches any non-digit character [^\d]
  • \s : matches whitespace characters
  • \S : matches non-whitespace characters
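
A short sketch of the shorthand classes, again assuming Python's re module. The re.ASCII flag is used here so that \w, \d and \s behave like the bracket expressions listed above; without it, Python 3 also matches non-ASCII letters and digits. The sample string is made up.

```python
import re

sample = "id_7: 42 apples"  # hypothetical sample input

word = re.findall(r"\w+", sample, re.ASCII)       # ['id_7', '42', 'apples']
non_word = re.findall(r"\W+", sample, re.ASCII)   # [': ', ' ']
digits = re.findall(r"\d+", sample, re.ASCII)     # ['7', '42']
non_digits = re.findall(r"\D+", sample, re.ASCII) # ['id_', ': ', ' apples']
spaces = re.findall(r"\s", sample, re.ASCII)      # [' ', ' ']
non_spaces = re.findall(r"\S+", sample, re.ASCII) # ['id_7:', '42', 'apples']
```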

Lookaround

  • A(?=B): positive look ahead, match A when B is ahead/after
  • A(?!B): negative look ahead, match A when B is not ahead/after
  • (?<=B)A: positive look behind, match A when B is behind/before
  • (?<!B)A: negative look behind, match A when B is not behind/before
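
The four lookaround forms can be compared on one string. This is a sketch in Python's re module with A = “a” and B = “b”; the helper starts() and the sample text are hypothetical, introduced only for illustration.

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "ab ac ba ca"  # the letter "a" sits at indices 0, 3, 7 and 10

ahead_pos = starts(r"a(?=b)", text)    # [0]         "a" followed by "b"
ahead_neg = starts(r"a(?!b)", text)    # [3, 7, 10]  "a" not followed by "b"
behind_pos = starts(r"(?<=b)a", text)  # [7]         "a" preceded by "b"
behind_neg = starts(r"(?<!b)a", text)  # [0, 3, 10]  "a" not preceded by "b"
```

In every case only the “a” is part of the match; the lookaround itself consumes no characters.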

Multiple lookaround

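When multiple lookaround clauses are placed before a target match, the match succeeds only when all of them hold at that position, since each lookaround is a zero-width assertion that consumes no input. For example, (?<!B)(?<!C)A will only match A when it is preceded by neither B nor C. In engines that accept alternation inside a lookbehind, this is equivalent to (?<!B|C)A. Some engines (Python's re among them) require all alternatives in a lookbehind to have the same length, so the stacked form is the more portable one.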
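
A small check of that equivalence, sketched with Python's re module. The helper starts() and the sample strings are hypothetical; the behaviour with alternatives of unequal length is specific to Python's re and may differ in other engines.

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "Ba Ca Da a"  # "a" sits at indices 1, 4, 7 and 9

# Stacked lookbehinds and a single lookbehind with alternation agree
# when both alternatives have the same length.
stacked = starts(r"(?<!B)(?<!C)a", text)   # [7, 9]
combined = starts(r"(?<!B|C)a", text)      # [7, 9]

# With alternatives of different lengths, Python's re refuses to compile
# the combined form, while the stacked form still works.
try:
    re.compile(r"(?<!Bx|C)a")
    unequal_ok = True
except re.error:
    unequal_ok = False

stacked_unequal = starts(r"(?<!Bx)(?<!C)a", "Bxa Ca Da")  # [8]
```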