In my last post I talked about the kinds of errors our newly implemented rule about character classes found in open source code.
Today, I’ll talk about boundaries, another regex feature that can lead to bugs when used incorrectly, and a rule of ours that can help you avoid such issues. I’ll also talk about complexity and maintainability in regular expressions and our rule that can help you find regular expressions that are too complex.

Boundary markers such as `^` and `$` allow you to anchor the regex pattern to the beginning and end of the line (or string, depending on which flags you use) respectively. This means that when you want to match a literal `^` or `$`, you need to escape these special characters with a backslash.

If you fail to escape `^` or `$`, you may end up with a pattern that doesn't match anything at all. To detect such problems, we offer a rule that points out cases where a boundary is used in such a way that it can never produce a successful match. Here’s one example of this problem that we found while checking code on GitHub:

`Pattern.compile("^[a-zA-Z][a-zA-Z0-9_.][@.](!#$%&*()-+=^){8,30}$")`

Here the `$` after `#` and the `^` after `=` sit inside a group in parentheses, not a character class in square brackets, so they act as boundaries rather than literal characters. The `$` would have to match the end of the input while more characters are still required after it, which can never happen.
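To make the escaping point concrete, here is a minimal sketch using `java.util.regex`. The class name `BoundaryEscapeDemo`, the helper `found`, and the sample strings are made up for illustration; they are not taken from the code we analyzed.

```java
import java.util.regex.Pattern;

final class BoundaryEscapeDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Unescaped '$' is an end boundary, so the "100" after it can never match.
        System.out.println(found("US$100", "US$100"));           // false

        // "\\$" in Java source is the regex \$, a literal dollar sign.
        System.out.println(found("US\\$100", "US$100"));         // true

        // Unescaped '^' is a start boundary, so the "2" before it rules out a match.
        System.out.println(found("2^8", "2^8"));                 // false

        // Pattern.quote turns a whole string into a literal pattern.
        System.out.println(found(Pattern.quote("2^8"), "2^8"));  // true
    }
}
```

When the whole pattern is meant literally, `Pattern.quote` avoids having to remember which characters need a backslash.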
Another case where this rule applies is when you use an end-of-line/string boundary at the beginning of a pattern, or a beginning-of-line/string boundary at the end. This could be a case of confusing the meanings of `^` and `$`:

`Pattern.compile(".*A^")`
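The following sketch shows the difference between the misplaced boundary and what was presumably intended. That the author meant "ends with A" is my assumption, and the class name `SwappedBoundaryDemo`, the helper `found`, and the inputs are illustrative only.

```java
import java.util.regex.Pattern;

final class SwappedBoundaryDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // '^' after 'A' demands the start of the input after a character was consumed.
        System.out.println(found(".*A^", "BANANA"));  // false

        // '$' is the end boundary: the input must end with 'A'.
        System.out.println(found(".*A$", "BANANA"));  // true

        // The mirror-image mistake: '$' at the front, then more characters required.
        System.out.println(found("$A.*", "APPLE"));   // false

        // '^' is the start boundary: the input must start with 'A'.
        System.out.println(found("^A.*", "APPLE"));   // true
    }
}
```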
"^(?:(?:31(\\/|-|\\.)(?:0?[13578]|1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[13-9]|1[0-2])\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)0?2\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$"
Me neither.
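When reading a pattern like this is impractical, one way to learn what it accepts is to probe it with sample inputs. Below is a minimal sketch that does so with the regex above, copied verbatim and split at its three top-level alternatives for readability. The class name `DateRegexProbe`, the helper `accepts`, and the sample inputs are my own.

```java
import java.util.regex.Pattern;

final class DateRegexProbe {
    // The regex from above, split at its three top-level alternatives.
    static final Pattern DATE = Pattern.compile(
        "^(?:(?:31(\\/|-|\\.)(?:0?[13578]|1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[13-9]|1[0-2])\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$"
        + "|^(?:29(\\/|-|\\.)0?2\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$"
        + "|^(?:0?[1-9]|1\\d|2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$");

    // Hypothetical helper: true if the whole input matches the pattern.
    static boolean accepts(String input) {
        return DATE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("29/02/2020"));  // true:  2020 is a leap year
        System.out.println(accepts("29/02/2021"));  // false: 2021 is not a leap year
        System.out.println(accepts("31/04/2021"));  // false: April has 30 days
        System.out.println(accepts("31-12-1999"));  // true:  '-' is an allowed separator
        System.out.println(accepts("31/12-1999"));  // false: separators must be the same
    }
}
```

The probes suggest a day/month/year validator with leap-year handling, but recovering that intent took experiments, not reading.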
Another example, one might say *the* example, of a complicated regex is the one commonly used to match email addresses (you’ll see lots of versions of it flying around on the internet). I won’t include it in this blog post for space reasons, but it’s more than 6,000(!) characters long. And perhaps the worst part is that we’ve been guilty of using this regex internally.

To us, reusing an overly complicated regex without understanding it sounds like a trap. Whether it works or not may depend on your situation, including which email addresses you want to consider valid and which invalid. For example, if you wanted to write an email to yourself at your local mail server, the mail application shouldn't stop you from addressing the mail to `host@localhost` (a.k.a. Local Host), but when validating email addresses for a web form, you might want to restrict addresses to non-local domains. Few people could tell whether the "standard" email regex accepts `host@localhost` just by looking at the code. And it would certainly be a non-trivial engineering effort to change it to reject an `@localhost` address if it currently accepts one (or vice versa).

To help you keep track of complicated regular expressions you’re using, SonarQube, SonarCloud and SonarLint offer a dedicated rule for finding regular expressions whose complexity exceeds a configurable threshold. This rule is inspired by a concept developed by SonarSource and takes into account how all the regex operators, combined with each other, raise the complexity of a given regex. Here’s what it says about the date regex from above: