parts. Regexes are capable of parsing regular languages only (hence the name). Regular
languages are extremely simple, and most often you will be using regexes on more
complex languages. Why the warning, then, if regexes are still useful on more
complex languages? Because it’s important to understand the limitations of regexes,
and recognize when you need to use something more powerful. Even though we will
be using regexes to do useful things with HTML, it’s possible to construct HTML that
will defeat our regex. To have a solution that works in 100% of the cases, you would
have to employ a parser. Consider the following example:
const html = '<br> <![CDATA[<br>]]>';
const matches = html.match(/<br>/ig);
This regex will match twice; however, there is only one true <br> tag in this example;
the other matching string is simply non-HTML character data (CDATA). Regexes are
also extremely limited when it comes to matching hierarchical structures (such as an
<a> tag within a <p> tag). The theoretical explanations for these limitations are
beyond the scope of this book, but the takeaway is this: if you’re struggling to make a
regex to match something very complicated (such as HTML), consider that a regex
simply might not be the right tool.
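The trouble with hierarchical structures can be seen in a small sketch. The strings and variable names below are made up for illustration, and the regexes use repetition (`.*` and `.*?`) only to show where a match ends; neither form can keep track of which closing tag belongs to which opening tag.

```javascript
// Hypothetical input: one <div> nested inside another.
const nested = '<div>outer <div>inner</div> tail</div>';

// .*? takes as few characters as possible, so the match ends at the
// FIRST closing tag and cuts the outer <div> short.
const lazy = nested.match(/<div>.*?<\/div>/i);
console.log(lazy[0]);    // <div>outer <div>inner</div>

// Hypothetical input: two separate <div> elements side by side.
const siblings = '<div>a</div> and <div>b</div>';

// .* takes as many characters as possible, so the match runs to the
// LAST closing tag and swallows both elements as if they were one.
const greedy = siblings.match(/<div>.*<\/div>/i);
console.log(greedy[0]);  // <div>a</div> and <div>b</div>
```

Either choice gives a wrong answer for one of the two inputs, which is the kind of situation where a parser is the better tool.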
Character Sets
Character sets provide a compact way to represent alternation of a single character
(we will combine it with repetition later, and see how we can extend this to multiple
characters). Let’s say, for example, you wanted to find all the numbers in a string. You
could use alternation:
const beer99 = "99 bottles of beer on the wall " +
    "take 1 down and pass it around -- " +
    "98 bottles of beer on the wall.";
const matches = beer99.match(/0|1|2|3|4|5|6|7|8|9/g);
How tedious! And what if we wanted to match not numbers but letters? Numbers and
letters? Lastly, what if you wanted to match everything that’s not a number? That’s
where character sets come in. At their simplest, they provide a more compact way of
representing single-character alternation. Even better, they allow you to specify ranges.
Here’s how we might rewrite the preceding:
const m1 = beer99.match(/[0123456789]/g);    // okay
const m2 = beer99.match(/[0-9]/g);           // better!
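As a quick check, the following sketch confirms that the alternation and the character set find the same digits, and shows how ranges answer the earlier questions about letters, and about letters and numbers together. The variable names are illustrative only, and the string is repeated so the snippet runs on its own.

```javascript
const song = "99 bottles of beer on the wall " +
    "take 1 down and pass it around -- " +
    "98 bottles of beer on the wall.";

// Alternation and the character set match exactly the same characters.
const viaAlternation = song.match(/0|1|2|3|4|5|6|7|8|9/g);
const viaSet = song.match(/[0-9]/g);
console.log(viaAlternation.join(''));   // 99198
console.log(viaSet.join(''));           // 99198

// Letters only (the string contains no uppercase letters).
const letters = song.match(/[a-z]/g);

// Letters and digits: two ranges in the same set.
const lettersAndDigits = song.match(/[a-z0-9]/g);
console.log(lettersAndDigits.length - letters.length);   // 5 (the digits)
```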
You can even combine ranges. Here’s how we would match letters, numbers, and
some miscellaneous punctuation (this will match everything in our original string
except whitespace):
const match = beer99.match(/[\-0-9a-z.]/ig);
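One way to convince yourself that this set really matches everything except whitespace is to glue the matches back together and compare the result with the string after its spaces are removed. This is only a sketch; the variable names are illustrative, and it relies on the fact that the only whitespace in this string is the space character.

```javascript
const song = "99 bottles of beer on the wall " +
    "take 1 down and pass it around -- " +
    "98 bottles of beer on the wall.";

// Every hyphen, digit, letter, and period, one character per match.
const nonSpace = song.match(/[\-0-9a-z.]/ig);

// The same string with its spaces removed.
const stripped = song.split(' ').join('');

console.log(nonSpace.join('') === stripped);   // true
```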