mirror of
https://github.com/github/codeql.git
synced 2026-05-02 12:15:17 +02:00
add a bad-tag-filter query for Python and JavaScript
55
python/ql/src/Security/CWE-116/BadTagFilter.qhelp
Normal file
@@ -0,0 +1,55 @@
<!DOCTYPE qhelp PUBLIC
  "-//Semmle//qhelp//EN"
  "qhelp.dtd">
<qhelp>

  <overview>
    <p>
      Regular expressions cannot parse general HTML, but they can match individual HTML tags.
      However, a regular expression that is not written carefully can be easy to circumvent,
      which can lead to cross-site scripting (XSS) or other security issues.
    </p>
    <p>
      Many of these mistakes arise because browsers have very forgiving HTML parsers:
      browsers will often render invalid HTML despite parser errors.
      Regular expressions that attempt to match HTML must therefore also recognize tags that contain such errors.
    </p>
  </overview>

  <recommendation>
    <p>
      Use a well-tested sanitization or parser library if at all possible. These libraries are much more
      likely to handle corner cases correctly than a custom implementation.
    </p>
  </recommendation>

  <example>
    <p>
      For example, suppose we want a function that filters out all <code>&lt;script&gt;</code> tags.
      Such a function might be written as follows:
    </p>

    <sample src="examples/BadTagFilter.py" />

    <p>
      This sanitizer does not filter out all <code>&lt;script&gt;</code> tags.
      Browsers accept not only <code>&lt;/script&gt;</code> as a script end tag, but also tags such as <code>&lt;/script foo="bar"&gt;</code>, even though this is a parser error.
      This means that an attack string such as <code>&lt;script&gt;alert(1)&lt;/script foo="bar"&gt;</code> will not be filtered by
      the function, and <code>alert(1)</code> will be executed by a browser if the string is rendered as HTML.
    </p>

    <p>
      Other corner cases include that HTML comments can end with <code>--!&gt;</code>,
      and that HTML tag names can contain uppercase characters.
    </p>
  </example>

  <references>
    <li>Securitum: <a href="https://research.securitum.com/the-curious-case-of-copy-paste/">The Curious Case of Copy &amp; Paste</a>.</li>
    <li>stackoverflow.com: <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454">You can't parse [X]HTML with regex</a>.</li>
    <li>HTML Standard: <a href="https://html.spec.whatwg.org/multipage/parsing.html#comment-end-bang-state">Comment end bang state</a>.</li>
    <li>stackoverflow.com: <a href="https://stackoverflow.com/questions/25559999/why-arent-browsers-strict-about-html">Why aren't browsers strict about HTML?</a>.</li>
  </references>
</qhelp>
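The recommendation in the help text above (use a parser rather than a regular expression) could look roughly like the following. This is only a sketch built on the standard library's `html.parser`: the class `TextOnlyExtractor` and the function `strip_tags` are hypothetical names, and the approach (drop every tag, drop script and style bodies, escape the remaining text) is an assumption about what the caller wants. `html.parser` is not a sanitizer, and its handling of malformed markup can differ between Python versions, so a well-tested sanitization library remains the better choice for real use.

```python
from html import escape
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Hypothetical helper: keeps text content only, dropping all tags
    and everything inside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <SCRIPT> is covered as well.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Escape the text so the result never contains a raw '<'.
        if self.skip_depth == 0:
            self.parts.append(escape(data))


def strip_tags(content):
    parser = TextOnlyExtractor()
    parser.feed(content)
    parser.close()
    return "".join(parser.parts)


print(strip_tags('<b>Hi</b> <script>alert(1)</script foo="bar"> there'))
```

Because the output is rebuilt from escaped text rather than produced by deleting matches from the input, a tag that the parser fails to recognize cannot survive as markup in the result.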
19
python/ql/src/Security/CWE-116/BadTagFilter.ql
Normal file
@@ -0,0 +1,19 @@
/**
 * @name Bad HTML filtering regexp
 * @description Matching HTML tags using regular expressions is hard to do right, and can easily lead to security issues.
 * @kind problem
 * @problem.severity warning
 * @security-severity 7.8
 * @precision high
 * @id py/bad-tag-filter
 * @tags correctness
 *       security
 *       external/cwe/cwe-116
 *       external/cwe/cwe-020
 */

import semmle.python.security.BadTagFilterQuery

from HTMLMatchingRegExp regexp, string msg
where msg = min(string m | isBadRegexpFilter(regexp, m) | m order by m.length(), m) // there might be multiple, we arbitrarily pick the shortest one
select regexp, msg
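The `min(... order by m.length(), m)` aggregate in the query above reports one message per regular expression: the shortest one, with the string value itself as a tie-breaker. A rough Python analogue of that selection is sketched below. The function name `pick_message` and the sample messages are made up for illustration, and Python's string comparison is assumed to be close enough to QL's ordering for this purpose.

```python
def pick_message(messages):
    """Return the shortest message; ties on length are broken by string order."""
    return min(messages, key=lambda m: (len(m), m))


# Hypothetical messages that a single regexp might trigger.
candidates = [
    "does not match end tags with attributes",
    "does not match comments ending with --!>",
    "does not match mixed-case tags",
]

print(pick_message(candidates))
```

Ordering by a `(length, value)` pair makes the choice deterministic, so the same alert text is produced on every run.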
|
||||
8
python/ql/src/Security/CWE-116/examples/BadTagFilter.py
Normal file
@@ -0,0 +1,8 @@
import re

def filterScriptTags(content):
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content, flags=re.DOTALL | re.IGNORECASE)
    return content
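The bypass described in the query help can be reproduced directly. The sketch below copies `filterScriptTags` from the example file so that it runs on its own. The two input strings are illustrative: the second is the attack string quoted in the help text.

```python
import re


def filterScriptTags(content):
    # Copied from examples/BadTagFilter.py: the end tag must be exactly "</script>".
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content, flags=re.DOTALL | re.IGNORECASE)
    return content


well_formed = '<p>hi</p><script>alert(1)</script>'
attack = '<script>alert(1)</script foo="bar">'

# The well-formed script element is removed.
print(filterScriptTags(well_formed))

# The end tag with an attribute never matches "</script>", so nothing is removed.
print(filterScriptTags(attack) == attack)
```

The loop that reapplies the substitution until a fixed point only guards against nested payloads. It does not help here, because the pattern never matches the malformed end tag in the first place.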