mirror of
https://github.com/github/codeql.git
synced 2026-04-28 10:15:14 +02:00
Merge pull request #6561 from erik-krogh/htmlReg
JS/Py/Ruby: add a bad-tag-filter query
54
python/ql/src/Security/CWE-116/BadTagFilter.qhelp
Normal file
@@ -0,0 +1,54 @@
<!DOCTYPE qhelp PUBLIC
  "-//Semmle//qhelp//EN"
  "qhelp.dtd">
<qhelp>

<overview>
<p>
It is possible to match some single HTML tags using regular expressions (parsing general HTML using
regular expressions is impossible). However, if the regular expression is not written carefully, it might
be possible to circumvent it, which can lead to cross-site scripting or other security issues.
</p>
<p>
Some of these mistakes arise because browsers have very forgiving HTML parsers and
will often render invalid HTML containing syntax errors.
Regular expressions that attempt to match HTML should also recognize tags containing such syntax errors.
</p>
</overview>

<recommendation>
<p>
Use a well-tested sanitization or parser library if at all possible. These libraries are much more
likely to handle corner cases correctly than a custom implementation.
</p>
</recommendation>

<example>
<p>
The following example attempts to filter out all <code>&lt;script&gt;</code> tags.
</p>

<sample src="examples/BadTagFilter.py" />

<p>
The above sanitizer does not filter out all <code>&lt;script&gt;</code> tags.
Browsers will not only accept <code>&lt;/script&gt;</code> as script end tags, but also tags such as <code>&lt;/script foo="bar"&gt;</code>, even though this is a parser error.
This means that an attack string such as <code>&lt;script&gt;alert(1)&lt;/script foo="bar"&gt;</code> will not be filtered by
the function, and <code>alert(1)</code> will be executed by a browser if the string is rendered as HTML.
</p>

<p>
Other corner cases include that HTML comments can end with <code>--!&gt;</code>,
and that HTML tag names can contain upper case characters.
</p>
</example>

<references>
<li>Securitum: <a href="https://research.securitum.com/the-curious-case-of-copy-paste/">The Curious Case of Copy &amp; Paste</a>.</li>
<li>stackoverflow.com: <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454">You can't parse [X]HTML with regex</a>.</li>
<li>HTML Standard: <a href="https://html.spec.whatwg.org/multipage/parsing.html#comment-end-bang-state">Comment end bang state</a>.</li>
<li>stackoverflow.com: <a href="https://stackoverflow.com/questions/25559999/why-arent-browsers-strict-about-html">Why aren't browsers strict about HTML?</a>.</li>
</references>
</qhelp>
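The two further corner cases named in the qhelp text (comments ending in `--!>` and upper case tag names) can be checked with a short Python sketch. The `COMMENT_RE` and `SCRIPT_OPEN_RE` patterns below are hypothetical filters written for illustration; they are not part of this commit.

```python
import re

# Hypothetical filters for illustration only; neither appears in this commit.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)  # expects comments to end in -->
SCRIPT_OPEN_RE = re.compile(r'<script')            # case-sensitive on purpose

standard_comment = '<!-- hidden -->'
bang_comment = '<!-- hidden --!>'  # per the qhelp text, this also ends a comment

# The standard comment is stripped; the --!> variant passes through untouched.
print(repr(COMMENT_RE.sub('', standard_comment)))
print(repr(COMMENT_RE.sub('', bang_comment)))

# An upper case tag name slips past a case-sensitive pattern.
print(SCRIPT_OPEN_RE.search('<SCRIPT>') is None)
print(re.search(r'<script', '<SCRIPT>', re.IGNORECASE) is not None)
```

A pattern that misses either variant leaves markup in place that a browser would still interpret, which is the kind of gap the query is meant to flag.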
19
python/ql/src/Security/CWE-116/BadTagFilter.ql
Normal file
@@ -0,0 +1,19 @@
/**
 * @name Bad HTML filtering regexp
 * @description Matching HTML tags using regular expressions is hard to do right, and can easily lead to security issues.
 * @kind problem
 * @problem.severity warning
 * @security-severity 7.8
 * @precision high
 * @id py/bad-tag-filter
 * @tags correctness
 *       security
 *       external/cwe/cwe-116
 *       external/cwe/cwe-020
 */

import semmle.python.security.BadTagFilterQuery

from HTMLMatchingRegExp regexp, string msg
where msg = min(string m | isBadRegexpFilter(regexp, m) | m order by m.length(), m) // there might be multiple, we arbitrarily pick the shortest one
select regexp, msg
8
python/ql/src/Security/CWE-116/examples/BadTagFilter.py
Normal file
@@ -0,0 +1,8 @@
import re

def filterScriptTags(content):
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content, flags= re.DOTALL | re.IGNORECASE)
    return content
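The bypass described in BadTagFilter.qhelp can be reproduced with a short sketch. It copies `filterScriptTags` from the example file so that it runs on its own; the `plain` and `attack` variable names are introduced here for illustration.

```python
import re


def filterScriptTags(content):
    # Copy of the flawed filter from examples/BadTagFilter.py.
    oldContent = ""
    while oldContent != content:
        oldContent = content
        content = re.sub(r'<script.*?>.*?</script>', '', content,
                         flags=re.DOTALL | re.IGNORECASE)
    return content


# A well-formed element is matched by the pattern and removed entirely.
plain = '<script>alert(1)</script>'
# The attack string from the qhelp text: the end tag carries an attribute.
attack = '<script>alert(1)</script foo="bar">'

print(repr(filterScriptTags(plain)))
print(repr(filterScriptTags(attack)))
```

The second call returns its input unchanged because the pattern requires a literal `</script>`, which the attack string never contains.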