<!DOCTYPE qhelp PUBLIC
  "-//Semmle//qhelp//EN"
  "qhelp.dtd">
<qhelp>
<include src="ReDoSIntroduction.inc.qhelp" />
<example>
<p>
Consider this use of a regular expression, which removes
all leading and trailing whitespace in a string:
</p>
<sample language="javascript">
text.replace(/^\s+|\s+$/g, ''); // BAD</sample>
<p>
The sub-expression <code>"\s+$"</code> will match the
whitespace characters in <code>text</code> from left to right, but it
can start matching anywhere within a whitespace sequence. This is
problematic for strings that do <strong>not</strong> end with a whitespace
character. Such a string will force the regular expression engine to
process each whitespace sequence once per whitespace character in the
sequence.
</p>
<p>
This ultimately means that the time cost of trimming a
string is quadratic in the length of the string. So a string like
<code>"a b"</code> will take milliseconds to process, but a similar
string with a million spaces instead of just one will take several
minutes.
</p>
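<p>
As a rough illustration of the quadratic growth described above, the
following sketch counts how many characters a simplified left-to-right
matcher inspects while trying <code>\s+$</code> at every start position.
The function name <code>countTrailingWhitespaceSteps</code> is hypothetical,
and the model ignores the additional backtracking and any optimizations of a
real engine, so the counts only indicate the growth rate and are not
measurements.
</p>

<sample language="javascript">
// Hypothetical helper (not part of any library): a simplified model of
// the work done when trying to match \s+$ at each start position.
function countTrailingWhitespaceSteps(text) {
  const isSpace = (ch) => /\s/.test(ch);
  let steps = 0;
  for (let start = 0; text.length > start; start++) {
    let pos = start;
    // \s+ consumes whitespace greedily from this start position.
    while (text.length > pos) {
      steps++; // one character inspected
      if (!isSpace(text[pos])) break;
      pos++;
    }
    // The run reached the end of the input, so $ matches and the search stops.
    if (pos === text.length) break;
  }
  return steps;
}</sample>

<p>
Under this model, <code>"a b"</code> costs 4 inspections, whereas a string
with 1000 spaces between <code>a</code> and <code>b</code> costs 501502,
roughly half the square of the number of spaces. A string that does end with
whitespace, such as <code>a</code> followed by 1000 spaces, costs only 1001.
</p>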
<p>
Avoid this problem by rewriting the regular expression so that it does
not contain the ambiguity about when to start matching whitespace
sequences. For instance, use a negative look-behind
(<code>/^\s+|(?&lt;!\s)\s+$/g</code>), or simply use the built-in trim
method (<code>text.trim()</code>).
</p>
<p>
Note that the sub-expression <code>"^\s+"</code> is
<strong>not</strong> problematic as the <code>^</code> anchor restricts
when that sub-expression can start matching, and as the regular
expression engine matches from left to right.
</p>
</example>
<example>
<p>
As a similar but slightly subtler problem, consider this
regular expression, which matches lines with numbers, possibly written
using scientific notation:
</p>
<sample language="javascript">
/^0\.\d+E?\d+$/.test(str) // BAD</sample>
<p>
The problem with this regular expression is in the
sub-expression <code>\d+E?\d+</code> because the second
<code>\d+</code> can start matching digits anywhere after the first
match of the first <code>\d+</code> if there is no <code>E</code> in
the input string.
</p>
<p>
This is problematic for strings that do <strong>not</strong>
end with a digit. Such a string will force the regular expression
engine to process each digit sequence once per digit in the sequence,
again leading to a quadratic time complexity.
</p>
<p>
To make the processing faster, the regular expression
should be rewritten such that the two <code>\d+</code> sub-expressions
do not have overlapping matches: <code>^0\.\d+(E\d+)?$</code>.
</p>
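<p>
The following sketch is an illustrative check rather than part of the
rule. It compares how the flagged pattern and the suggested rewrite classify
a few inputs. The helper name <code>compare</code> is hypothetical. One
observation worth verifying against your own requirements: the rewrite
is slightly more permissive, because it accepts a single fractional digit
such as <code>"0.1"</code>, which the original pattern rejects.
</p>

<sample language="javascript">
const original = /^0\.\d+E?\d+$/;     // flagged pattern from the example
const rewritten = /^0\.\d+(E\d+)?$/;  // suggested replacement

// Hypothetical helper: reports how both patterns classify one input.
function compare(str) {
  return { original: original.test(str), rewritten: rewritten.test(str) };
}

compare("0.25E3"); // both patterns accept
compare("0.5E");   // both patterns reject
compare("0.1");    // only the rewritten pattern accepts</sample>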
</example>
<example>
<p>
Sometimes it is unclear how a regular expression can be rewritten to
avoid the problem. In such cases, it often suffices to limit the
length of the input string. For instance, the following
regular expression is used to match numbers, and on some non-number
inputs it can have quadratic time complexity:
</p>
<sample language="javascript">
/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str) // BAD</sample>
<p>
It is not immediately obvious how to rewrite this regular expression
to avoid the problem. However, you can mitigate performance issues by
limiting the length of the input to 1000 characters, so that matching
always finishes in a reasonable amount of time.
</p>
<sample language="javascript">
if (str.length > 1000) {
    throw new Error("Input too long");
}

/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)</sample>
</example>
<include src="ReDoSReferences.inc.qhelp"/>
</qhelp>