they can also have dashes (but they have to start with a letter, and they can’t end with
a dash).
This example isn’t perfect. For instance, it would match the URL //gotcha (no TLD)
just as it would match //valid.com. However, matching only completely valid URLs is a
much more complicated task, and not necessary for this example.
If you’re feeling a little fed up with all the caveats (“this will match
invalid URLs”), remember that you don’t have to do everything all
the time, all at once. As a matter of fact, I use a very similar regex to
the previous one all the time when scanning websites. I just want to
pull out all the URLs—or suspect URLs—and then do a second
analysis pass to look for invalid URLs, broken URLs, and so on.
Don’t get too caught up in making perfect regexes that cover every
case imaginable. Not only is that sometimes impossible, but it is
often unnecessary effort when it is possible. Obviously, there is a
time and place to consider all the possibilities—for example, when
you are screening user input to prevent injection attacks. In that
case, you will want to take the extra care and make your regex
ironclad.
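The two-pass approach described above can be sketched roughly as follows. The pattern in looseUrl, the sample html string, and the hasTld helper are all assumptions made for illustration; they are not necessarily the regex from the preceding example, and the second pass is only a crude "does the host end in a TLD-like suffix" check, not real URL validation.

```javascript
// First pass: a deliberately loose pattern (hypothetical, for illustration)
// that grabs an optional scheme, "//", and a run of hostname characters.
const looseUrl = /(?:https?:)?\/\/[a-z][a-z0-9.-]*[a-z0-9]/ig;

// Sample markup (made up for this sketch).
const html =
  '<link href="//gotcha/styles.css">\n' +
  '<link href="//valid.com/styles.css">\n' +
  '<link href="https://secure.com/more.css">';

// Pull out everything that looks like a URL, including suspect ones.
const candidates = html.match(looseUrl) || [];

// Second pass: a crude filter that keeps only candidates whose
// hostname ends in a dot followed by at least two letters.
const hasTld = candidate => /\.[a-z]{2,}$/i.test(candidate);
const valid = candidates.filter(hasTld);

console.log(candidates); // [ '//gotcha', '//valid.com', 'https://secure.com' ]
console.log(valid);      // [ '//valid.com', 'https://secure.com' ]
```

Keeping the first regex loose and pushing the stricter checks into a separate step means each piece stays small enough to read and test on its own.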
Lazy Matches, Greedy Matches
What separates the regex dilettantes from the pros is understanding lazy versus
greedy matching. Regular expressions, by default, are greedy, meaning they will match
as much as possible before stopping. Consider this classic example.
You have some HTML, and you want to replace, for example, <i> text with <strong>
text. Here’s our first attempt:
const input = "Regex pros know the difference between\n" +
   "<i>greedy</i> and <i>lazy</i> matching.";
input.replace(/<i>(.*)<\/i>/ig, '<strong>$1</strong>');
The $1 in the replacement string will be replaced by the contents of the group (.*) in
the regex (more on this later).
Go ahead and try it. You’ll find the following disappointing result:
"Regex pros know the difference between
<strong>greedy</i> and <i>lazy</strong> matching."
To understand what’s going on here, think back to how the regex engine works: it
consumes input until it satisfies the match before moving on. By default, it does so in
a greedy fashion: it finds the first <i> and then says, “I’m not going to stop until I see
an </i> and I can’t find any more past that.” Because there are two instances of </i>, it
ends at the second one, not the first.
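To make the greedy behavior concrete, the sketch below inspects what the group (.*) actually captured, and then contrasts it with a lazy quantifier (*?), which is standard JavaScript regex syntax for "match as little as possible." The variable names greedy and lazy are just labels chosen for this illustration.

```javascript
const input = "Regex pros know the difference between\n" +
   "<i>greedy</i> and <i>lazy</i> matching.";

// Greedy: (.*) grabs as much of the line as it can, so the match
// only ends at the LAST </i> on the line.
const greedy = input.match(/<i>(.*)<\/i>/i);
console.log(greedy[1]);   // greedy</i> and <i>lazy

// Lazy: (.*?) grabs as little as it can, so each match
// ends at the FIRST </i> that follows its <i>.
const lazy = input.replace(/<i>(.*?)<\/i>/ig, '<strong>$1</strong>');
console.log(lazy);
// Regex pros know the difference between
// <strong>greedy</strong> and <strong>lazy</strong> matching.
```

The only difference between the two patterns is the question mark after the asterisk; everything else, including the $1 replacement, is unchanged.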