• If there is a match, the regex consumes all the characters in the match at once;
matching continues with the next character (if the regex is global, which we’ll talk
about later).
This is the general algorithm, and it probably won’t surprise you that the details are
much more complicated. In particular, the algorithm can be aborted early if the regex
can determine that there won’t be a match.
As we move through the specifics of the regex metalanguage, try to keep this
algorithm in mind; imagine your strings being consumed from left to right, one character
at a time, until there are matches, at which point whole matches are consumed at
once.
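One way to watch this consumption happen is to step through matches by hand. The following is a minimal sketch, not a description of how any engine is implemented: the input string and the pattern are made up for illustration, and it relies on the standard exec method and lastIndex property of a global regex.

```javascript
// Hypothetical input and pattern, chosen only to illustrate consumption.
const input = 'xabyab';
const re = /ab/g;

const steps = [];
let m;
while ((m = re.exec(input)) !== null) {
    // m.index is where the match starts; re.lastIndex is where matching resumes.
    steps.push({ text: m[0], start: m.index, resumeAt: re.lastIndex });
}

console.log(steps);
// [ { text: 'ab', start: 1, resumeAt: 3 },
//   { text: 'ab', start: 4, resumeAt: 6 } ]
```

Each match consumes both of its characters at once: after the first match starting at index 1, matching resumes at index 3, not index 2.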
Alternation
Imagine you have an HTML page stored in a string, and you want to find all tags that
can reference an external resource (<a>, <area>, <link>, <script>, <source>, and
sometimes <meta>). Furthermore, some of the tags may be mixed case (<Area>, <LINK>,
etc.). Regular expression alternations can be used to solve this problem:
const html = 'HTML with <a href="/one">one link</a>, and some JavaScript.' +
    '<script src="stuff.js"></script>';
const matches = html.match(/area|a|link|script|source/ig);    // first attempt
The vertical bar (|) is a regex metacharacter that signals alternation. The ig signifies
to ignore case (i) and to search globally (g). Without the g, only the first match would
be returned. This would be read as “find all instances of the text area, a, link, script, or
source, ignoring case.” The astute reader might wonder why we put area before a; this
is because regexes evaluate alternations from left to right. In other words, if a came
first and the string had an area tag in it, the regex would match the a and then move
on. The a would then be consumed, and rea would not match anything. So you have to
match area first, then a; otherwise, area will never match.
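The effect of alternation order can be checked directly. This is a small sketch using a made-up one-tag string (not the html example above) so that the two orderings are easy to compare:

```javascript
// Hypothetical one-tag input, used only to compare the two orderings.
const tag = '<area>';

// a listed first: the a alternative wins at each position, so area never matches.
const aFirst = tag.match(/a|area/ig);

// area listed first: the longer alternative is tried before a.
const areaFirst = tag.match(/area|a/ig);

console.log(aFirst);       // [ 'a', 'a' ]
console.log(areaFirst);    // [ 'area' ]
```

With a first, the regex matches the two separate a characters in area and never reports the tag name as a whole.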
If you run this example, you’ll find that you have many unintended matches: the
word link (inside the <a> tag), and instances of the letter a that are not an HTML tag,
just a regular part of English. One way to solve this would be to change the regex
to /<area|<a|<link|<script|<source/ (angle brackets are not regex metacharacters),
but we’re going to get even more sophisticated still.
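To see the difference, the two regexes can be run side by side on the same string. In this sketch, the ig flags on the second regex are an assumption carried over from the first attempt (the regex is shown above without flags), and the variable names firstAttempt and withBrackets are just illustrative:

```javascript
const html = 'HTML with <a href="/one">one link</a>, and some JavaScript.' +
    '<script src="stuff.js"></script>';

// First attempt: also matches the word link and stray letters in ordinary text.
const firstAttempt = html.match(/area|a|link|script|source/ig);

// Requiring a leading angle bracket restricts matches to opening tags.
// The ig flags are assumed here, carried over from the first attempt.
const withBrackets = html.match(/<area|<a|<link|<script|<source/ig);

console.log(firstAttempt.includes('link'));    // true
console.log(withBrackets);                     // [ '<a', '<script' ]
```

Closing tags such as </a> are not matched by the second regex, because the character after the angle bracket is a slash rather than a tag name.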
Matching HTML
In the previous example, we perform a very common task with regexes: matching
HTML. Even though this is a common task, I must warn you that, while you can
generally do useful things with HTML using regexes, you cannot parse HTML with
regexes. Parsing means to completely break something down into its component