add a bad-tag-filter query for Python and JavaScript

2026-05-02 12:15:17 +02:00 · 2021-08-30 10:09:20 +02:00
parent fd64ff9ef1
commit 99ed4a1a89
20 changed files with 887 additions and 16 deletions
--- a/python/ql/src/Security/CWE-116/BadTagFilter.qhelp
+++ b/python/ql/src/Security/CWE-116/BadTagFilter.qhelp
@@ -0,0 +1,55 @@
+<!DOCTYPE qhelp PUBLIC
+  "-//Semmle//qhelp//EN"
+  "qhelp.dtd">
+<qhelp>
+
+<overview>
+<p>
+Parsing arbitrary HTML with regular expressions is impossible, but it is possible to match
+single HTML tags. If such a regexp is not written carefully, however, it can be easy
+to circumvent, which can lead to XSS or other security issues.
+</p>
+<p>
+Many of these mistakes arise because browsers have very forgiving HTML parsers:
+browsers will often render invalid HTML that contains parser errors.
+Regular expressions that attempt to match HTML must therefore also recognize tags containing these parser errors.
+</p>
+</overview>
+
+<recommendation>
+<p>
+Use a well-tested sanitization or parser library if at all possible. These libraries are much more
+likely to handle corner cases correctly than a custom implementation.
+</p>
+</recommendation>
+
+<example>
+<p>
+For example, assume we want to write a function that filters out all <code>&lt;script&gt;</code> tags.
+Such a function might be written as follows:
+</p>
+
+<sample src="examples/BadTagFilter.py" />
+
+<p>
+This sanitizer does not filter out all <code>&lt;script&gt;</code> tags.
+Browsers accept not only <code>&lt;/script&gt;</code> as a script end tag, but also tags such as <code>&lt;/script foo="bar"&gt;</code>, even though the latter is a parser error.
+This means that an attack string such as <code>&lt;script&gt;alert(1)&lt;/script foo="bar"&gt;</code> will not be filtered by
+the function, but <code>alert(1)</code> will be executed by a browser if the string is rendered as HTML.
+</p>
+
+<p>
+Other corner cases are that HTML comments can also end with <code>--!&gt;</code>,
+and that HTML tag names can contain uppercase characters.
+</p>
+</example>
+
+<references>
+<li>Securitum: <a href="https://research.securitum.com/the-curious-case-of-copy-paste/">The Curious Case of Copy &amp; Paste</a>.</li>
+<li>stackoverflow.com: <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454">You can't parse [X]HTML with regex</a>.</li>
+<li>HTML Standard: <a href="https://html.spec.whatwg.org/multipage/parsing.html#comment-end-bang-state">Comment end bang state</a>.</li>
+<li>stackoverflow.com: <a href="https://stackoverflow.com/questions/25559999/why-arent-browsers-strict-about-html">Why aren't browsers strict about HTML?</a>.</li>
+</references>
+</qhelp>
+
+
--- a/python/ql/src/Security/CWE-116/BadTagFilter.ql
+++ b/python/ql/src/Security/CWE-116/BadTagFilter.ql
@@ -0,0 +1,19 @@
+/**
+ * @name Bad HTML filtering regexp
+ * @description Matching HTML tags using regular expressions is hard to do right, and can easily lead to security issues.
+ * @kind problem
+ * @problem.severity warning
+ * @security-severity 7.8
+ * @precision high
+ * @id py/bad-tag-filter
+ * @tags correctness
+ *       security
+ *       external/cwe/cwe-116
+ *       external/cwe/cwe-020
+ */
+
+import semmle.python.security.BadTagFilterQuery
+
+from HTMLMatchingRegExp regexp, string msg
+where msg = min(string m | isBadRegexpFilter(regexp, m) | m order by m.length(), m) // there might be multiple, we arbitrarily pick the shortest one
+select regexp, msg
--- a/python/ql/src/Security/CWE-116/examples/BadTagFilter.py
+++ b/python/ql/src/Security/CWE-116/examples/BadTagFilter.py
@@ -0,0 +1,8 @@
+import re
+
+def filterScriptTags(content):
+    oldContent = ""
+    while oldContent != content:
+        oldContent = content
+        content = re.sub(r'<script.*?>.*?</script>', '', content, flags=re.DOTALL | re.IGNORECASE)
+    return content
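
The bypass described in the qhelp text can be reproduced with a short script. This is only a sketch for reviewers: it copies `filterScriptTags` from `examples/BadTagFilter.py` so that it runs on its own, and the variable names `well_formed` and `attack` are illustrative and not part of the commit.

```python
import re

# Copied from examples/BadTagFilter.py so this sketch is self-contained.
def filterScriptTags(content):
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content, flags=re.DOTALL | re.IGNORECASE)
    return content

# A well-formed script element is removed as intended.
well_formed = '<script>alert(1)</script>'
print(repr(filterScriptTags(well_formed)))

# The end tag carries an attribute, which browsers tolerate as a parser error.
# The regexp requires a literal "</script>", so nothing matches and the
# payload passes through unchanged.
attack = '<script>alert(1)</script foo="bar">'
print(repr(filterScriptTags(attack)))
```

The first call prints an empty string, while the second prints the attack string unmodified, which is the behaviour the new query is meant to flag.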