Preface
To start, I know that's a bold claim, and I feel uncomfortable even asserting it. The truth is that I've been working on this regular expression (regex) lexer/parser off and on for the last 3-4 years, have dug deep into the regular expression literature, and have generally found resources incomplete or confusing to muddle through. This led me to create a Whimsical flow chart that I reference often when I'm working on my side project, but I still from time to time find myself needing to refer to other things and it would be nice to have them all in one place.
Now, normally I would just put this off until someone asked me for such a resource, as I'm naturally the type of person who tends not to solve problems for themselves (apparently preferring to suffer?), but will bend over backward to help another. But it's November 28th, Virtual Coffee is doing NaNoWriMo for blog writing, we're behind the word count goal, and I'm overcome with the guilt of under-contributing. So you have Virtual Coffee, my favorite developer community, to thank for the monstrosity this is likely to turn into.
Without further ado, let's get into regular expressions.
What is a regular expression?
A regular expression is a sequence of characters that make up a search pattern to be used in a piece of text. Put another way, a regular expression is like a shorthand way of communicating what you're looking for in a document or other piece of text. We've all gone looking through a website or document, trying to find every instance of a particular word. Rather than scouring the page manually, relying on your own two eyes, chances are that you've employed the good ol' "Find" functionality, giving it the particular word you're interested in with the hope of reducing the amount of the document you have to visually scour yourself. Have you ever had a situation where you're looking for a particular word but you don't know if that word is going to appear in the middle of a sentence or the beginning of a sentence and therefore don't know which case you need to be searching the document with, and been frustrated that the Find only searches for the case you specifically use instead of the general term? Or worse yet, have you ever had to search for a word that's rather commonly used but you want to search for it cased in a certain way and the search has been case insensitive (searching for both uppercase and lowercase variants of your query)? Regular expressions to the rescue!
Anatomy of a regular expression
Regular expressions have two parts, a pattern (the criteria by which you want to search) and flags (modifiers that tell the system how to interpret the criteria). Flags are optional in regular expressions, but if you've ever seen a regular expression in the wild, more than likely you've encountered a regular expression with flags. In JavaScript, regular expressions have their own type -- RegExp!
/*
There are two ways of declaring a regular expression
in JavaScript, by defining a regular expression literal
or by calling the constructor function.
Regular expression literals are compiled when the script is loaded.
Regular expressions declared using the constructor function are compiled
at runtime.
*/
const literalRegex = /meow/g // regular expression literal
const constructedRegex = new RegExp("meow", "g") // constructor function
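One practical difference between the two forms is how backslashes are written, and the constructor is the usual choice when the pattern is only known at runtime. Here's a minimal sketch of both points; the variable names (animal, dynamicRegex, and so on) are made up for illustration.

```javascript
// In a literal, a backslash is written once.
const literalDigits = /\d+/g

// In a string handed to the constructor, the backslash itself must be escaped,
// otherwise the string "\d" collapses to a plain "d" before RegExp ever sees it.
const constructedDigits = new RegExp("\\d+", "g")

// Building a pattern from a value that is only known at runtime.
const animal = "cat" // hypothetical input
const dynamicRegex = new RegExp(animal + "s?", "i")

console.log(literalDigits.source === constructedDigits.source) // true
console.log(dynamicRegex.test("Two CATS")) // true
```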
Flags
There are 8 flags in JavaScript-flavored regex, each represented by a single character but accessible through a longer property name.
Flag | Property name | Meaning |
d | hasIndices | Generate indices for substring matches |
g | global | Global search |
i | ignoreCase | Case-insensitive search |
m | multiline | Allows ^ and $ to match at the start and end of each line, not just the whole input |
s | dotAll | Allows . to match newline characters |
u | unicode | Treats a pattern as a sequence of Unicode code points |
v | unicodeSets | An upgrade to the u mode with more Unicode features |
y | sticky | Match starting at current position in the target string |
You can inspect which flags are on any RegExp instance by accessing the flags property. You can also check for the presence of a particular flag by accessing the property name for that flag.
let regex = /meow/gi
console.log(regex.flags) // gi
console.log(regex.global) // true
console.log(regex.hasIndices) // false
console.log(regex.ignoreCase) // true
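To make the table a little more concrete, here is a rough sketch of a few of the flags changing the outcome of a search. The sample string and variable names are my own, chosen only for illustration.

```javascript
const text = "Meow\nmeow"

// Without g, match() returns only the first match; with g, all of them.
const first = text.match(/meow/i)
const all = text.match(/meow/gi)

// m lets ^ match at the start of each line, not just the start of the input.
const lineStarts = text.match(/^meow/gim)

// s lets . match the newline sitting between the two lines.
const acrossLines = /Meow.meow/s.test(text)

// y only attempts a match exactly at lastIndex, with no scanning ahead.
const sticky = /meow/y
sticky.lastIndex = 5
const stickyHit = sticky.test(text)

console.log(first[0]) // "Meow"
console.log(all) // ["Meow", "meow"]
console.log(lineStarts.length) // 2
console.log(acrossLines) // true
console.log(stickyHit) // true
```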
Categories of characters
Assertions
Think of assertions as rules governing if a match is possible. These include anchors (symbols that indicate the start and/or end of input), word boundaries, and lookarounds.
When used as the literal first character in a regular expression, ^ indicates that the pattern must match at the beginning of the input. If the m/multiline flag is set, ^ indicates that the pattern must match at the beginning of a line.
When used as the literal last character in a regular expression, $ indicates that the pattern must match up to the end of the input. If the m/multiline flag is set, $ indicates that the pattern must match up to the end of a line.
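The anchors described above can be sketched like this, using a made-up two-line string so the effect of the m flag is visible:

```javascript
const poem = "roses are red\nviolets are blue"

// Without m, ^ and $ only look at the very start and end of the input.
const startsWithRoses = /^roses/.test(poem)
const startsWithViolets = /^violets/.test(poem)
const endsWithRed = /red$/.test(poem)

// With m, every line gets its own start and end.
const lineStartsWithViolets = /^violets/m.test(poem)
const lineEndsWithRed = /red$/m.test(poem)

console.log(startsWithRoses) // true
console.log(startsWithViolets) // false
console.log(endsWithRed) // false
console.log(lineStartsWithViolets) // true
console.log(lineEndsWithRed) // true
```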
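\b indicates a word boundary, a position where a word character and a non-word character are next to each other. The start and end of the input count as non-word characters for this purpose.
\B indicates a non-word boundary, a position between two word characters or between two non-word characters.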
/*
Word boundaries are kind of tricky to grok,
so we'll give a little example here:
*/
let regex = /\bA/
regex.test("An apple") // true
regex.test("an Apple") // true
regex.test("anApple") // false
regex = /\bA\b/
regex.test("A lovely day") // true
regex.test("Another lovely day") // false
regex = /\Bon/
regex.test("onward") // false
regex.test("noon") // true
Lookaheads assert that the preceding token only matches if it is followed by something matching the pattern of the lookahead. The syntax of a lookahead is x(?=y), where x is the preceding token and y is the pattern that must follow x for x to match.
Negative lookaheads assert that the preceding token only matches if it is not followed by something matching the pattern of the negative lookahead. The syntax of a negative lookahead is x(?!y), where x is the preceding token and y is the pattern that must not follow x for x to match.
Lookbehinds assert that the pattern that follows them only matches if it is preceded by something matching the pattern of the lookbehind. The syntax for lookbehinds is (?<=y)x, where x is the pattern attempting to be matched and y is the pattern that must precede it for it to be a match.
Negative lookbehinds assert that the pattern that follows them only matches if it is not preceded by something matching the pattern of the lookbehind. The syntax for negative lookbehinds is (?<!y)x, where x is the pattern attempting to be matched and y is the pattern that must not precede it for it to be a match.
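Lookarounds are easier to see than to read about, so here is a small sketch of all four against one made-up string. Notice that the text inside the lookaround is checked but never becomes part of the match.

```javascript
const prices = "cost: $40, discount: 15%, tax: $3"

// Lookahead: digits only when a % follows.
const percents = prices.match(/\d+(?=%)/g)

// Negative lookahead: digits not followed by % (or by another digit,
// so the engine can't cheat by matching just the "1" of "15").
const notPercents = prices.match(/\d+(?![%\d])/g)

// Lookbehind: digits only when a $ comes right before them.
const dollars = prices.match(/(?<=\$)\d+/g)

// Negative lookbehind: digits not preceded by $ (or by another digit).
const notDollars = prices.match(/(?<![$\d])\d+/g)

console.log(percents) // ["15"]
console.log(notPercents) // ["40", "3"]
console.log(dollars) // ["40", "3"]
console.log(notDollars) // ["15"]
```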
Character classes
Character set refers to the ability to specify "any of these characters". For example, /[aeiou]/ would match either an "a" or an "e" or an "i" or an "o" or a "u".
You can also specify a range of characters in a character class by using a hyphen, as long as the character on the left of the hyphen is at a lower code point than the character on the right of the hyphen. If the character on the left of the hyphen is at a higher code point, you'll get a SyntaxError for Range out of order in character class. You can, however, use a regular hyphen in a character class as a way of matching a hyphen -- you just need to remember to put it at the start or end of the character class rather than in the middle.
Negated character set refers to the ability to specify "none of these characters". For example, /[^a-d]/ means "any character other than a, b, c, and d". The same rule about using a hyphen as a range character applies, but in order to specify that you do not want to match a hyphen, the hyphen must come either immediately after the ^ character or at the end of the negated character set.
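Here is a quick sketch of sets, ranges, negated sets, a literal hyphen, and the out-of-order range error. The sample strings are invented for the example.

```javascript
const vowels = "regular expression".match(/[aeiou]/g).join("")
const firstFour = "abcdef".match(/[a-d]/g).join("")
const notFirstFour = "abcdef".match(/[^a-d]/g).join("")

// A hyphen at the end of the set is a literal hyphen, not a range.
const hyphenated = "well-known fact".match(/[a-z-]+/)[0]

// A range written backwards throws when the pattern is compiled.
let rangeError = null
try {
  new RegExp("[d-a]")
} catch (error) {
  rangeError = error
}

console.log(vowels) // "euaeeio"
console.log(firstFour) // "abcd"
console.log(notFirstFour) // "ef"
console.log(hyphenated) // "well-known"
console.log(rangeError instanceof SyntaxError) // true
```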
Dot . is one of the most frequently used metacharacters in regex. It matches any character except line terminators (\n, \r, \u2028, \u2029) by default. If the s/dotAll flag is set, however, the . matches those too! Inside a character set, though, a . matches a literal period.
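A short sketch of the three behaviors of the dot described above (default, with the s flag, and inside a character set):

```javascript
const twoLines = "a\nb"

const dotDefault = /a.b/.test(twoLines) // . refuses to match the newline
const dotAll = /a.b/s.test(twoLines) // with s, . matches the newline too

// Inside a character set the dot is just a period.
const literalDot = /a[.]b/.test("a.b")
const notWildcard = /a[.]b/.test("axb")

console.log(dotDefault) // false
console.log(dotAll) // true
console.log(literalDot) // true
console.log(notWildcard) // false
```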
\d matches any Arabic numeral and is equivalent to [0-9].
\D matches any character that isn't an Arabic numeral, equivalent to [^0-9].
\w matches any alphanumeric character from the Latin alphabet and underscores, equivalent to [A-Za-z0-9_].
\W matches any character that is not an alphanumeric character from the Latin alphabet or an underscore, equivalent to [^A-Za-z0-9_].
\s matches a whitespace character, including spaces, tabs, form feeds, line feeds, and other Unicode spaces, equivalent to [\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].
\S matches a character that is not a whitespace character, equivalent to [^\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].
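As a rough illustration of the shorthand classes, here they are pulling different pieces out of one made-up string:

```javascript
const sample = "Order #42 shipped to unit_7B"

const numbers = sample.match(/\d+/g) // runs of digits
const words = sample.match(/\w+/g) // runs of letters, digits, and underscores
const chunks = sample.split(/\s+/) // split wherever there is whitespace

console.log(numbers) // ["42", "7"]
console.log(words) // ["Order", "42", "shipped", "to", "unit_7B"]
console.log(chunks) // ["Order", "#42", "shipped", "to", "unit_7B"]
```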
\t matches a horizontal tab.
\r matches a carriage return.
\n matches a linefeed.
\v matches a vertical tab.
\f matches a form-feed.
[\b] matches a backspace.
\0 matches a NULL character (char code 0). It is important to remember not to follow \0 with another digit, or it will be interpreted as an octal escape instead.
Control characters are non-printing characters that convey information about text. We'll list all the control characters in a separate post for ease of discovery, but they are written as \cX, where X is a letter from a-z or A-Z.
Hexadecimals are ways to match characters with the format of \xhh, with each h being a hexadecimal digit. Hexadecimal digits are 0-9a-fA-F. You'll recognize the first 27 hexadecimal values (\x00 through \x1A) if you're familiar with the control characters. We'll list all hexadecimal characters and their values in a separate post.
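To show how these escapes overlap, here is a small sketch in which several of them are just different spellings of the same character. The variable names are mine.

```javascript
const hexA = /\x41/.test("A") // 0x41 is 65, the code for "A"
const hexLineFeed = /\x0A/.test("\n") // 0x0A is 10, the line feed
const controlLineFeed = /\cJ/.test("\n") // J is the 10th letter, so \cJ is also 10
const controlTab = /\cI/.test("\t") // I is the 9th letter, the horizontal tab
const nul = /\0/.test("\u0000") // the NULL character
const backspace = /[\b]/.test("\u0008") // backspace, only inside a character set

console.log(hexA, hexLineFeed, controlLineFeed, controlTab, nul, backspace)
// true true true true true true
```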
Unicode characters match a UTF-16 code unit with their hexadecimal value. The format for Unicode when the unicode flag is not set is \uhhhh, with each h being a hexadecimal digit. If the unicode flag is set, there is also the option of \u{hhhh} or \u{hhhhh}. We'll list Unicode characters in a series of posts because this blogging platform really did not enjoy me trying to include them all in one post.
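Here is a minimal sketch of the two escape formats. The emoji (U+1F600) is just a convenient example of a character that doesn't fit in four hex digits.

```javascript
// \uhhhh works with or without the unicode flag.
const accented = /\u00e9/.test("caf\u00e9")

// U+1F600 needs the braces form, which the pattern only understands with u.
const emoji = "\u{1F600}"
const withFlag = /\u{1F600}/u.test(emoji)
const withoutFlag = /\u{1F600}/.test(emoji)

// With u, the dot also treats the emoji as one character instead of two code units.
const oneCharWithFlag = /^.$/u.test(emoji)
const oneCharWithoutFlag = /^.$/.test(emoji)

console.log(accented) // true
console.log(withFlag) // true
console.log(withoutFlag) // false
console.log(oneCharWithFlag) // true
console.log(oneCharWithoutFlag) // false
```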
Unicode Property Escapes are a way of matching a range of Unicode code points that correspond with the given Unicode character properties. The format for searching for characters that match a Unicode property is \p{binaryPropertyName} or \p{propertyName=Value}. Binary property names include those listed in this table. Property names include those listed in this table. Value refers to the options provided in this file. If you're searching for things that don't match a Unicode property, the syntax for that is \P{binaryPropertyName} or \P{propertyName=Value}. We will go into these options in another post. Please note that Unicode property escapes are only supported in unicode mode (when the unicode flag is set).
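As a taste before that other post, here is a sketch using one binary property (Uppercase) and one property=value pair (Script=Greek). I picked these two only because they are easy to eyeball; the sample strings are invented.

```javascript
// Binary property: any uppercase character.
const uppercase = "Hello World".match(/\p{Uppercase}/gu)

// Negated binary property: anything that is not uppercase.
const notUppercase = "Hi".match(/\P{Uppercase}/gu)

// property=value: characters from the Greek script (alpha and beta here).
const greek = "alpha is \u03b1, beta is \u03b2".match(/\p{Script=Greek}/gu)

console.log(uppercase) // ["H", "W"]
console.log(notUppercase) // ["i"]
console.log(greek.length) // 2
```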
\ is a special character that indicates that the next character should be treated specially, as you may have noticed from these earlier examples. It is also useful for treating a special character as a literal character (writing \. to match a literal period, for example).
Disjunction is the | symbol you may have encountered in some regex in the wild. This essentially splits a pattern into alternatives. For example, a|b can match a or b.
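One thing that tends to trip people up is how much of the pattern each side of the | covers. A rough sketch, with made-up strings:

```javascript
const pets = "I have a cat and a dog".match(/cat|dog/g)

// The | splits the WHOLE pattern, so this reads as "^cat" or "dog$".
const looseMatch = /^cat|dog$/.test("catalog")

// Wrapping the alternatives in a group limits what the | applies to.
const strictMatch = /^(?:cat|dog)$/.test("catalog")

console.log(pets) // ["cat", "dog"]
console.log(looseMatch) // true
console.log(strictMatch) // false
```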
Groups
Capturing groups refers to the ability to find a match to a pattern and remember the match to be used later. Capturing groups can be referenced by the number in which they appear, the first capture being given the number 1. The syntax for a capturing group is to surround the pattern in parentheses, like (pattern).
Named capturing groups are a way to create a capture group and, as the name implies, give it a name to be referenced by later. The syntax for a named capture group is (?<Name>pattern).
Non-capturing groups are useful when you want to group characters together for a match but don't want to be able to reference them later. The syntax for non-capturing groups is (?:pattern).
Backreferences are the way you can reference a capturing group by number, and the syntax is \Number.
Named backreferences are ways to reference named capturing groups as an alternative to using a numeric backreference. The syntax for named backreferences is \k<Name>, with Name matching the name of a named capture group.
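Here is a sketch that walks through each kind of group and backreference. The date string and group names (year, month, day, quote) are made up for the example.

```javascript
const date = "2023-11-28"

// Numbered capturing groups: index 0 is the whole match, captures start at 1.
const numbered = date.match(/(\d{4})-(\d{2})-(\d{2})/)

// Named capturing groups show up on the groups property.
const named = date.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)

// Non-capturing group: grouped for the +, but nothing is remembered.
const nonCapturing = "meowmeow".match(/(?:meow)+/)

// Backreference: \1 must repeat whatever group 1 matched.
const hasDoubleLetter = /(\w)\1/.test("hello")
const noDoubleLetter = /(\w)\1/.test("world")

// Named backreference: the closing quote must be the same kind as the opening one.
const quoted = /(?<quote>['"]).*?\k<quote>/.exec('say "hi" now')

console.log(numbered[1], numbered[3]) // "2023" "28"
console.log(named.groups.month) // "11"
console.log(nonCapturing.length) // 1
console.log(hasDoubleLetter, noDoubleLetter) // true false
console.log(quoted[0]) // "\"hi\""
```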
Quantifiers
Quantifiers are ways in which to specify how many times you want a certain character to appear, as an alternative to having to type each character individually. By default, quantifiers are greedy, which means they will match the most characters possible according to the pattern. However, if you follow a quantifier with a ?, you turn the quantifier into what's called a lazy quantifier, making it stop as soon as there is a match.
For example, given a string like "<div><a><span></span></a></div>":
/<.*>/ will match "<div><a><span></span></a></div>"
/<.*?>/ will match "<div>"
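The same greedy versus lazy comparison, as something you can paste into a console:

```javascript
const html = "<div><a><span></span></a></div>"

const greedy = html.match(/<.*>/)[0] // grabs as much as it can
const lazy = html.match(/<.*?>/)[0] // stops at the first possible >
const allTags = html.match(/<.*?>/g) // lazy plus g finds each tag separately

console.log(greedy === html) // true
console.log(lazy) // "<div>"
console.log(allTags.length) // 6
```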
* indicates that the preceding character or group of characters matches zero or more times.
+ indicates that the preceding character or group of characters matches one or more times.
? indicates that the preceding character or group of characters matches zero or one time.
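A quick sketch of the three symbols side by side. The patterns are anchored with ^ and $ so that only the quantifier decides the result.

```javascript
const starZero = /^ab*c$/.test("ac") // zero b's is fine for *
const starMany = /^ab*c$/.test("abbbc")
const plusZero = /^ab+c$/.test("ac") // + needs at least one b
const plusOne = /^ab+c$/.test("abc")
const optionalAbsent = /^colou?r$/.test("color")
const optionalPresent = /^colou?r$/.test("colour")
const optionalTwice = /^colou?r$/.test("colouur") // ? allows one u at most

// A quantifier after a group applies to the whole group.
const repeatedGroup = /^(?:ha)+$/.test("hahaha")

console.log(starZero, starMany) // true true
console.log(plusZero, plusOne) // false true
console.log(optionalAbsent, optionalPresent, optionalTwice) // true true false
console.log(repeatedGroup) // true
```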
Ranges are ways to specify numerically how few or how many times the previous character or group of characters has to match.
To match an exact number of times, the syntax is x{n}, where x is the character being matched and n is the number of times x should be matched.
To specify a minimum without a maximum, to ensure that a character matches at least a certain number of times, the syntax is x{n,}, with x being the character being matched and n being the minimum number of times you wish it to be matched.
To specify a minimum and maximum range, the syntax is x{min,max} (with no space after the comma), with x being the character being matched, min being the minimum number, and max being the maximum number. For example, x{2,5} would match xx, xxx, xxxx, or xxxxx.
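Here is a sketch of the three range forms, again anchored so only the count matters. The last line shows what I understand happens without the u flag if a space sneaks in after the comma: the braces stop being a quantifier and are matched literally.

```javascript
const exactly3 = /^x{3}$/.test("xxx")
const exactly3TooFew = /^x{3}$/.test("xx")

const atLeast2 = /^x{2,}$/.test("xxxxxx")
const atLeast2TooFew = /^x{2,}$/.test("x")

const between2And5 = /^x{2,5}$/.test("xxxxx")
const between2And5TooMany = /^x{2,5}$/.test("xxxxxx")

// With a space, {2, 5} is no longer a quantifier; it matches those literal characters.
const spacedIsLiteral = /^x{2, 5}$/.test("x{2, 5}")

console.log(exactly3, exactly3TooFew) // true false
console.log(atLeast2, atLeast2TooFew) // true false
console.log(between2And5, between2And5TooMany) // true false
console.log(spacedIsLiteral) // true
```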