diff --git a/java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.qhelp b/java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.qhelp new file mode 100644 index 00000000000..fa8a3563d23 --- /dev/null +++ b/java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.qhelp @@ -0,0 +1,108 @@ + + + + + + + +

+ + Consider this use of a regular expression, which removes + all leading and trailing whitespace in a string: + +

+ + + re.sub(r"^\s+|\s+$", "", text) # BAD + + +

+ + The sub-expression "\s+$" will match the + whitespace characters in text from left to right, but it + can start matching anywhere within a whitespace sequence. This is + problematic for strings that do not end with a whitespace + character. Such a string will force the regular expression engine to + process each whitespace sequence once per whitespace character in the + sequence. + +

+ +

+ + This ultimately means that the time cost of trimming a + string is quadratic in the length of the string. So a string like + "a b" will take milliseconds to process, but a similar + string with a million spaces instead of just one will take several + minutes. + +

+ +

+ + Avoid this problem by rewriting the regular expression to + not contain the ambiguity about when to start matching whitespace + sequences. For instance, by using a negative look-behind + (^\s+|(?<!\s)\s+$), or just by using the built-in strip + method (text.strip()). + +

+ +

+ + Note that the sub-expression "^\s+" is + not problematic as the ^ anchor restricts + when that sub-expression can start matching, and as the regular + expression engine matches from left to right. + +

+ +
+ + + +

+ + As a similar, but slightly subtler problem, consider the + regular expression that matches lines with numbers, possibly written + using scientific notation: +

+ + + ^0\.\d+E?\d+$ # BAD + + +

+ + The problem with this regular expression is in the + sub-expression \d+E?\d+ because the second + \d+ can start matching digits anywhere after the first + match of the first \d+ if there is no E in + the input string. + +

+ +

+ + This is problematic for strings that do not + end with a digit. Such a string will force the regular expression + engine to process each digit sequence once per digit in the sequence, + again leading to a quadratic time complexity. + +

+ +

+ + To make the processing faster, the regular expression + should be rewritten such that the two \d+ sub-expressions + do not have overlapping matches: ^0\.\d+(E\d+)?$. + +

+ +
+ + + +
diff --git a/java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.ql b/java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.ql new file mode 100644 index 00000000000..13d5bb8e8a6 --- /dev/null +++ b/java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.ql @@ -0,0 +1,34 @@ +/** + * @name Polynomial regular expression used on uncontrolled data + * @description A regular expression that can require polynomial time + * to match may be vulnerable to denial-of-service attacks. + * @kind path-problem + * @problem.severity warning + * @precision high + * @id java/polynomial-redos + * @tags security + * external/cwe/cwe-730 + * external/cwe/cwe-400 + */ + +import java +import semmle.code.java.security.performance.SuperlinearBackTracking +import semmle.code.java.dataflow.DataFlow +// import semmle.python.security.dataflow.PolynomialReDoS +import DataFlow::PathGraph + +from + PolynomialReDoS::Configuration config, DataFlow::PathNode source, DataFlow::PathNode sink, + PolynomialReDoS::Sink sinkNode, PolynomialBackTrackingTerm regexp +where + config.hasFlowPath(source, sink) and + sinkNode = sink.getNode() and + regexp.getRootTerm() = sinkNode.getRegExp() +// not ( +// source.getNode().(Source).getKind() = "url" and +// regexp.isAtEndLine() +// ) +select sinkNode.getHighlight(), source, sink, + "This $@ that depends on $@ may run slow on strings " + regexp.getPrefixMessage() + + "with many repetitions of '" + regexp.getPumpString() + "'.", regexp, "regular expression", + source.getNode(), "a user-provided value" diff --git a/java/ql/src/Security/CWE/CWE-730/ReDoS.qhelp b/java/ql/src/Security/CWE/CWE-730/ReDoS.qhelp new file mode 100644 index 00000000000..9cfbcc32354 --- /dev/null +++ b/java/ql/src/Security/CWE/CWE-730/ReDoS.qhelp @@ -0,0 +1,34 @@ + + + + + + + +

+ Consider this regular expression: +

+ + ^_(__|.)+_$ + +

+ Its sub-expression "(__|.)+?" can match the string "__" either by the + first alternative "__" to the left of the "|" operator, or by two + repetitions of the second alternative "." to the right. Thus, a string consisting + of an odd number of underscores followed by some other character will cause the regular + expression engine to run for an exponential amount of time before rejecting the input. +

+

+ This problem can be avoided by rewriting the regular expression to remove the ambiguity between + the two branches of the alternative inside the repetition: +

+ + ^_(__|[^_])+_$ + +
+ + + +
diff --git a/java/ql/src/Security/CWE/CWE-730/ReDoS.ql b/java/ql/src/Security/CWE/CWE-730/ReDoS.ql new file mode 100644 index 00000000000..f72bfc3fc13 --- /dev/null +++ b/java/ql/src/Security/CWE/CWE-730/ReDoS.ql @@ -0,0 +1,25 @@ +/** + * @name Inefficient regular expression + * @description A regular expression that requires exponential time to match certain inputs + * can be a performance bottleneck, and may be vulnerable to denial-of-service + * attacks. + * @kind problem + * @problem.severity error + * @precision high + * @id java/redos + * @tags security + * external/cwe/cwe-730 + * external/cwe/cwe-400 + */ + +import java +import semmle.code.java.security.performance.ExponentialBackTracking + +from RegExpTerm t, string pump, State s, string prefixMsg +where + hasReDoSResult(t, pump, s, prefixMsg) and + // exclude verbose mode regexes for now + not t.getRegex().getAMode() = "VERBOSE" +select t, + "This part of the regular expression may cause exponential backtracking on strings " + prefixMsg + + "containing many repetitions of '" + pump + "'." diff --git a/java/ql/src/Security/CWE/CWE-730/ReDoSIntroduction.inc.qhelp b/java/ql/src/Security/CWE/CWE-730/ReDoSIntroduction.inc.qhelp new file mode 100644 index 00000000000..f533097c222 --- /dev/null +++ b/java/ql/src/Security/CWE/CWE-730/ReDoSIntroduction.inc.qhelp @@ -0,0 +1,54 @@ + + + +

+ + Some regular expressions take a long time to match certain + input strings to the point where the time it takes to match a string + of length n is proportional to nk or even + 2n. Such regular expressions can negatively affect + performance, or even allow a malicious user to perform a Denial of + Service ("DoS") attack by crafting an expensive input string for the + regular expression to match. + +

+ +

+ + The regular expression engine provided by Python uses a backtracking non-deterministic finite + automata to implement regular expression matching. While this approach + is space-efficient and allows supporting advanced features like + capture groups, it is not time-efficient in general. The worst-case + time complexity of such an automaton can be polynomial or even + exponential, meaning that for strings of a certain shape, increasing + the input length by ten characters may make the automaton about 1000 + times slower. + +

+ +

+ + Typically, a regular expression is affected by this + problem if it contains a repetition of the form r* or + r+ where the sub-expression r is ambiguous + in the sense that it can match some string in multiple ways. More + information about the precise circumstances can be found in the + references. + +

+
+ + + +

+ + Modify the regular expression to remove the ambiguity, or + ensure that the strings matched with the regular expression are short + enough that the time-complexity does not matter. + +

+ +
+
diff --git a/java/ql/src/Security/CWE/CWE-730/ReDoSReferences.inc.qhelp b/java/ql/src/Security/CWE/CWE-730/ReDoSReferences.inc.qhelp new file mode 100644 index 00000000000..2b3e5f17c62 --- /dev/null +++ b/java/ql/src/Security/CWE/CWE-730/ReDoSReferences.inc.qhelp @@ -0,0 +1,16 @@ + + + +
  • + OWASP: + Regular expression Denial of Service - ReDoS. +
  • +
  • Wikipedia: ReDoS.
  • +
  • Wikipedia: Time complexity.
  • +
  • James Kirrage, Asiri Rathnayake, Hayo Thielecke: + Static Analysis for Regular Expression Denial-of-Service Attack. +
  • +
    +