Update docs to be about Java

Joe Farebrother
2022-02-22 17:10:20 +00:00
parent c312b4b6b0
commit 5364001aa2
3 changed files with 19 additions and 12 deletions

View File

@@ -14,13 +14,13 @@
</p> </p>
<sample language="python"> <sample language="java">
re.sub(r"^\s+|\s+$", "", text) # BAD Pattern.compile("^\\s+|\\s+$").matcher(text).replaceAll("") // BAD
</sample> </sample>
<p> <p>
The sub-expression <code>"\s+$"</code> will match the The sub-expression <code>"\\s+$"</code> will match the
whitespace characters in <code>text</code> from left to right, but it whitespace characters in <code>text</code> from left to right, but it
can start matching anywhere within a whitespace sequence. This is can start matching anywhere within a whitespace sequence. This is
problematic for strings that do <strong>not</strong> end with a whitespace problematic for strings that do <strong>not</strong> end with a whitespace
@@ -45,14 +45,14 @@
Avoid this problem by rewriting the regular expression to Avoid this problem by rewriting the regular expression to
not contain the ambiguity about when to start matching whitespace not contain the ambiguity about when to start matching whitespace
sequences. For instance, by using a negative look-behind sequences. For instance, by using a negative look-behind
(<code>^\s+|(?&lt;!\s)\s+$</code>), or just by using the built-in strip (<code>"^\\s+|(?&lt;!\\s)\\s+$"</code>), or just by using the built-in trim
method (<code>text.strip()</code>). method (<code>text.trim()</code>).
</p> </p>
<p> <p>
Note that the sub-expression <code>"^\s+"</code> is Note that the sub-expression <code>"^\\s+"</code> is
<strong>not</strong> problematic as the <code>^</code> anchor restricts <strong>not</strong> problematic as the <code>^</code> anchor restricts
when that sub-expression can start matching, and as the regular when that sub-expression can start matching, and as the regular
expression engine matches from left to right. expression engine matches from left to right.
@@ -70,8 +70,8 @@
using scientific notation: using scientific notation:
</p> </p>
<sample language="python"> <sample language="java">
^0\.\d+E?\d+$ # BAD "^0\\.\\d+E?\\d+$"
</sample> </sample>
<p> <p>
@@ -97,7 +97,7 @@
To make the processing faster, the regular expression To make the processing faster, the regular expression
should be rewritten such that the two <code>\d+</code> sub-expressions should be rewritten such that the two <code>\d+</code> sub-expressions
do not have overlapping matches: <code>^0\.\d+(E\d+)?$</code>. do not have overlapping matches: <code>"^0\\.\\d+(E\\d+)?$"</code>.
</p> </p>
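
As an illustration of the two rewrites recommended in this file, here is a minimal Java sketch. The class and method names (`WhitespaceAndNumberExamples`, `stripEdges`, `isScientific`) are hypothetical and not part of the commit; only the two regular expressions are taken from the documentation text above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class WhitespaceAndNumberExamples {
    // The negative look-behind means the trailing branch can only start at
    // the first character of a whitespace run, not anywhere inside it.
    private static final Pattern EDGE_WHITESPACE =
            Pattern.compile("^\\s+|(?<!\\s)\\s+$");

    // The exponent digits are only allowed after an explicit 'E', so the two
    // \d+ sub-expressions can no longer match the same characters.
    private static final Pattern SCIENTIFIC =
            Pattern.compile("^0\\.\\d+(E\\d+)?$");

    static String stripEdges(String text) {
        return EDGE_WHITESPACE.matcher(text).replaceAll("");
    }

    static boolean isScientific(String text) {
        return SCIENTIFIC.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println("[" + stripEdges("  a b  ") + "]"); // prints "[a b]"
        System.out.println(isScientific("0.5E10"));            // prints "true"
        System.out.println(isScientific("0.5E"));              // prints "false"
    }
}
```

Where plain trimming is all that is needed, `text.trim()` as suggested in the documentation avoids the regular expression entirely.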

View File

@@ -10,7 +10,7 @@
<p> <p>
Consider this regular expression: Consider this regular expression:
</p> </p>
<sample language="python"> <sample language="java">
^_(__|.)+_$ ^_(__|.)+_$
</sample> </sample>
<p> <p>
@@ -24,7 +24,7 @@
This problem can be avoided by rewriting the regular expression to remove the ambiguity between This problem can be avoided by rewriting the regular expression to remove the ambiguity between
the two branches of the alternative inside the repetition: the two branches of the alternative inside the repetition:
</p> </p>
<sample language="python"> <sample language="java">
^_(__|[^_])+_$ ^_(__|[^_])+_$
</sample> </sample>
</example> </example>
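
The rewritten expression can be exercised from Java roughly as follows. This is a sketch under assumptions: the class name `UnderscoreExample` and the test input are made up for illustration, and the input is kept short on purpose.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class UnderscoreExample {
    // Inside the repetition, "__" and "[^_]" can never match the same
    // character, so the engine has only one way to consume each position.
    private static final Pattern UNAMBIGUOUS =
            Pattern.compile("^_(__|[^_])+_$");

    static boolean matches(String input) {
        return UNAMBIGUOUS.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Build a run of underscores followed by a character that prevents a match.
        StringBuilder sb = new StringBuilder("_");
        for (int i = 0; i < 30; i++) {
            sb.append("__");
        }
        sb.append('x');

        System.out.println(matches("_a__b_"));      // prints "true"
        System.out.println(matches(sb.toString())); // prints "false"
    }
}
```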

View File

@@ -17,7 +17,7 @@
<p> <p>
The regular expression engine provided by Python uses a backtracking non-deterministic finite The regular expression engine provided by Java uses a backtracking non-deterministic finite
automaton to implement regular expression matching. While this approach automaton to implement regular expression matching. While this approach
is space-efficient and allows supporting advanced features like is space-efficient and allows supporting advanced features like
capture groups, it is not time-efficient in general. The worst-case capture groups, it is not time-efficient in general. The worst-case
@@ -38,6 +38,11 @@
references. references.
</p> </p>
<p>
Note that Java versions 9 and above have some mitigations against ReDoS; however, they aren't perfect
and more complex regular expressions can still be affected by this problem.
</p>
</overview> </overview>
<recommendation> <recommendation>
@@ -48,6 +53,8 @@
ensure that the strings matched with the regular expression are short ensure that the strings matched with the regular expression are short
enough that the time-complexity does not matter. enough that the time-complexity does not matter.
Alternatively, a regex library that guarantees linear-time execution, such as Google's RE2J, may be used.
</p> </p>
</recommendation> </recommendation>
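
The recommendation to keep matched strings short could look like the sketch below. The class `BoundedMatcher` and the limit `MAX_INPUT_LENGTH` are assumptions made for illustration; a suitable limit depends on the pattern and the application.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper, for illustration only.
final class BoundedMatcher {
    // Assumed limit; choose a value appropriate for the inputs being validated.
    static final int MAX_INPUT_LENGTH = 1_000;

    private final Pattern pattern;

    BoundedMatcher(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Rejects over-long inputs before the regex engine ever sees them.
    boolean matches(String input) {
        if (input.length() > MAX_INPUT_LENGTH) {
            return false;
        }
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        BoundedMatcher matcher = new BoundedMatcher("^0\\.\\d+(E\\d+)?$");
        System.out.println(matcher.matches("0.25E3")); // prints "true"
    }
}
```

Rejecting long inputs up front bounds the worst-case running time regardless of how the pattern itself behaves.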