Input Consumption
A naïve way to think about regexes is “a way to find a substring within a larger string”
(often called, colorfully, “needle in a haystack”). While this naïve conception is often
all you need, it will limit your ability to understand the true nature of regexes and
leverage them for more powerful tasks.
A more sophisticated way to think about a regex is as a pattern for consuming input strings.
The matches (what you’re looking for) become a byproduct of this thinking.
A good way to conceptualize the way regexes work is to think of a common children’s
word game: a grid of letters in which you are supposed to find words. We’ll ignore
diagonal and vertical matches; as a matter of fact, let’s think only of the first line of
this word game:
X J A N L I O N A T U R E J X E E L N P
Humans are very good at this game. We can look at this, and pretty quickly pick out
LION, NATURE, and EEL (and ION while we’re at it). Computers—and regexes—are
not as clever. Let’s look at this word game as a regex would; not only will we see how
regexes work, but we will also see some of the limitations that we need to be aware of.
To simplify things, let’s tell the regex that we’re looking for LION, ION, NATURE,
and EEL; in other words, we’ll give it the answers and see if it can verify them.
The regex starts at the first character, X. It notes that none of the words it’s looking for
start with the letter X, so it says “no match.” Instead of just giving up, though, it
moves on to the next character, J. It finds the same situation with J, and then moves
on to A. As we move along, we consider the letters the regex engine is moving past as
being consumed. Things don’t get interesting until we hit the L. The regex engine then
says, “Ah, this could be LION!” Because this could be a potential match, it doesn’t
consume the L; this is an important point to understand. The regex goes along, matching
the I, then the O, then the N. Now it recognizes a match; success! Now that it has
recognized a match it can then consume the whole word, so L, I, O, and N are now
consumed. Here’s where things get interesting. LION and NATURE overlap. As
humans, we are untroubled by this. But the regex is very serious about not looking at
things it’s already consumed, so it doesn’t “go back” to try to find matches there.
The regex won’t find NATURE because the N has already been consumed; all it will
find is ATURE, which is not one of the words it is looking for. For the same reason,
it won’t find ION, whose letters were consumed as part of LION. It will, however,
eventually find EEL.
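To see this walk-through in action, here is a minimal sketch that hands the four answers to a real regex. It assumes the row of letters is written as one string with no spaces, and the names input, re, and found are made up for this illustration. It also assumes an environment that supports String.prototype.matchAll (ES2020).

```javascript
// The first row of the word game, written without spaces (an assumption
// made for this sketch so that the words are contiguous).
const input = "XJANLIONATUREJXEELNP";

// The "answers" we give the regex, as alternatives. The g flag asks for
// every match, not just the first one.
const re = /LION|ION|NATURE|EEL/g;

// matchAll walks the input from left to right. Each match object holds the
// matched text in m[0] and the position where the match started in m.index.
const found = [...input.matchAll(re)].map(m => `${m[0]}@${m.index}`);

console.log(found); // [ 'LION@4', 'EEL@15' ]
```

Only LION and EEL are reported. ION and NATURE both begin inside the letters that were consumed when LION matched, so the regex never considers them.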
Now let’s go back to the example and change the O in LION to an X. What will
happen then? When the regex gets to the L, it will again recognize a potential match
(LION), and therefore not consume the L. It will move on to the I without consuming
it. Then it will get to the X; at this point, it realizes that there’s no match: it’s not look‐