Commit Graph

2702 Commits

Author SHA1 Message Date
Taus
2bb44d49d9 Python: Perform more deduplication
This cut the evaluation time on `django` down from 1.2 seconds to ~0.8
seconds (but the impact will likely be greater on bigger projects).
2021-07-14 13:38:05 +00:00
Taus
09993406f1 Python: Add explanatory QLDoc comment 2021-07-14 10:42:07 +00:00
Taus
1decf23785 Python: Fix bad join order for sensitive data
Not the prettiest of solutions, but it does the job. Basically, we were
calculating (and re-calculating) the same big relation between strings
and regexes and then checking whether the latter matched the former.

This resulted in tuple counts like the following:

```
[2021-07-12 16:09:24] (12s) Tuple counts for SensitiveDataSources::SensitiveDataModeling::SensitiveVariableAssignment#class#ff#shared/4@7489c6:
4918074 ~0%     {4} r1 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH Flow::NameNode::getId_dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0 'arg0', Lhs.1 'arg1', Rhs.0, Rhs.1 'arg3'
2654    ~0%     {4} r2 = JOIN r1 WITH PRIMITIVE regexpMatch#bb ON Lhs.3 'arg3',Lhs.1 'arg1'
                return r2
```
(The above being just the bit that handles `DefinitionNode` in
`SensitiveVariableAssignment`, and taking 12 seconds to evaluate.)

By applying a bit of manual inlining and magic, this becomes somewhat
more manageable:

```
[2021-07-12 15:59:44] (1s) Tuple counts for SensitiveDataSources::SensitiveDataModeling::sensitiveString#ff/2@8830e2:
27671  ~2%      {3} r1 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveParameterName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

334012 ~2%      {3} r2 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

361683 ~11%     {3} r3 = r1 UNION r2

154644 ~0%      {3} r4 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveFunctionName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

149198 ~1%      {3} r5 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveStrConst#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

124257 ~5%      {3} r6 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveAttributeName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

273455 ~21%     {3} r7 = r5 UNION r6
428099 ~30%     {3} r8 = r4 UNION r7
789782 ~78%     {3} r9 = r3 UNION r8
1121   ~77%     {3} r10 = JOIN r9 WITH PRIMITIVE regexpMatch#bb ON Lhs.2 'result',Lhs.1
1121   ~70%     {2} r11 = SCAN r10 OUTPUT In.0 'classification', In.2 'result'
                return r11
```
(The above being the total for all the sensitive names we care about,
taking only 1.2 seconds to evaluate.)

Incidentally, you may wonder why this has _fewer_ results than before.
The answer is control flow splitting -- every sensitively-named
`DefinitionNode` would have been matched in isolation previously. By
pre-matching on just the names of these, we can subsequently join
against those names that are known to be sensitive, which is a much
faster operation.

(We also get the benefit of deduplicating the strings that are matched,
before actually performing the match, so if, say, an attribute name and
a variable name are identical, then we'll only match them once.)

We also exclude all docstrings as relevant string constants, as these
presumably don't actually flow anywhere.
2021-07-12 16:10:49 +00:00
Taus
a73e382dfe Python: Prevent bad join in hashlib model
I'm not entirely sure what triggered this bad join order, but some
combination of the use of abstract classes and the exclusion of `new`
caused this to go really wrong:

```
WeakSensitiveDataHashing.ql-15:Stdlib::Stdlib::HashlibDataPassedToHashClass#class#ffff ......... 15.5s
```

with the following tuple counts:
```
[2021-07-12 13:20:15] (16s) Tuple counts for Stdlib::Stdlib::HashlibDataPassedToHashClass#class#ffff/4@217901:
148810  ~3%        {3} r1 = JOIN DataFlowPublic::CallCfgNode#class#ff#shared WITH project#DataFlowPublic::CallCfgNode::getArg_dispred#fff ON FIRST 1 OUTPUT "hashlib", Lhs.1 'node', Lhs.0 'this'
148810  ~4%        {3} r2 = JOIN r1 WITH ApiGraphs::API::Impl::MkModuleImport#ff@staged_ext ON FIRST 1 OUTPUT Rhs.1, Lhs.1 'node', Lhs.2 'this'
7589310 ~486%      {4} r3 = JOIN r2 WITH ApiGraphs::API::Impl::edge#2#fff@staged_ext ON FIRST 1 OUTPUT Lhs.1 'node', Lhs.2 'this', Rhs.1, InverseAppend("getMember(\"","\")",Rhs.1)
6994070 ~490%      {4} r4 = SELECT r3 ON In.3 != "new"
6994070 ~4503%     {2} r5 = SCAN r4 OUTPUT In.1 'this', In.0 'node'

22      ~4%        {3} r6 = JOIN DataFlowPublic::CallCfgNode#class#ff#shared WITH project#DataFlowPublic::CallCfgNode::getArgByName_dispred#fff ON FIRST 1 OUTPUT "hashlib", Lhs.1 'node', Lhs.0 'this'
22      ~0%        {3} r7 = JOIN r6 WITH ApiGraphs::API::Impl::MkModuleImport#ff@staged_ext ON FIRST 1 OUTPUT Rhs.1, Lhs.1 'node', Lhs.2 'this'
1122    ~437%      {4} r8 = JOIN r7 WITH ApiGraphs::API::Impl::edge#2#fff@staged_ext ON FIRST 1 OUTPUT Lhs.1 'node', Lhs.2 'this', Rhs.1, InverseAppend("getMember(\"","\")",Rhs.1)
1034    ~460%      {4} r9 = SELECT r8 ON In.3 != "new"
1034    ~4549%     {2} r10 = SCAN r9 OUTPUT In.1 'this', In.0 'node'

6995104 ~4503%     {2} r11 = r5 UNION r10
5213851 ~4683%     {3} r12 = JOIN r11 WITH ApiGraphs::API::Node::getACall_dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1 'hashClass', Lhs.1 'node', Lhs.0 'this'
6478480 ~4646%     {6} r13 = JOIN r12 WITH ApiGraphs::API::Impl::edge#2#fff_201#join_rhs ON FIRST 1 OUTPUT "hashlib", Rhs.1, Lhs.1 'node', Lhs.2 'this', Lhs.0 'hashClass', Rhs.2
1410    ~4693%     {5} r14 = JOIN r13 WITH ApiGraphs::API::Impl::MkModuleImport#ff@staged_ext ON FIRST 2 OUTPUT Lhs.2 'node', Lhs.3 'this', Lhs.4 'hashClass', Lhs.5, InverseAppend("getMember(\"","\")",Lhs.5)
1222    ~4540%     {5} r15 = SELECT r14 ON In.4 'hashName' != "new"
1222    ~4540%     {4} r16 = SCAN r15 OUTPUT In.1 'this', In.4 'hashName', In.2 'hashClass', In.0 'node'
```

By factoring out the insides, the biggest iteration now looks like

```
[2021-07-12 14:17:36] (0s) Tuple counts for Stdlib::Stdlib::HashlibDataPassedToHashClass#class#ffff/4@85bb21:
148810 ~0%     {2} r1 = JOIN DataFlowPublic::CallCfgNode#class#ff#shared WITH project#DataFlowPublic::CallCfgNode::getArg_dispred#fff ON FIRST 1 OUTPUT Lhs.1 'node', Lhs.0 'this'
148810 ~0%     {2} r2 = JOIN r1 WITH Stdlib::Stdlib::hashlibMember#ff#nonempty CARTESIAN PRODUCT OUTPUT Lhs.1 'this', Lhs.0 'node'

22     ~0%     {2} r3 = JOIN DataFlowPublic::CallCfgNode#class#ff#shared WITH project#DataFlowPublic::CallCfgNode::getArgByName_dispred#fff ON FIRST 1 OUTPUT Lhs.1 'node', Lhs.0 'this'
22     ~0%     {2} r4 = JOIN r3 WITH Stdlib::Stdlib::hashlibMember#ff#nonempty CARTESIAN PRODUCT OUTPUT Lhs.1 'this', Lhs.0 'node'

148832 ~0%     {2} r5 = r2 UNION r4
110933 ~2%     {3} r6 = JOIN r5 WITH ApiGraphs::API::Node::getACall_dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1 'hashClass', Lhs.1 'node', Lhs.0 'this'
26     ~0%     {4} r7 = JOIN r6 WITH Stdlib::Stdlib::hashlibMember#ff_10#join_rhs ON FIRST 1 OUTPUT Lhs.2 'this', Rhs.1 'hashName', Lhs.0 'hashClass', Lhs.1 'node'
               return r7
```

(The tuple counts themselves are not directly comparable.)
2021-07-12 14:22:21 +00:00
CodeQL CI
1d56748eed Merge pull request #6200 from yoff/pythonJS-make-expbtlib-private
Approved by RasmusWL, esbena
2021-07-02 09:09:18 -07:00
CodeQL CI
a25933aa56 Merge pull request #5926 from RasmusWL/small-cleanups
Approved by tausbn
2021-07-02 04:59:54 -07:00
Rasmus Wriedt Larsen
81fab487a4 Python: Apply suggestions from code review
Co-authored-by: Taus <tausbn@github.com>
2021-07-02 13:27:41 +02:00
Rasmus Wriedt Larsen
22c155687e Python: Fix code after removing getPostUpdateNode 2021-07-02 13:25:25 +02:00
Rasmus Wriedt Larsen
7a6eee50ff Revert "Python: Add getPostUpdateNode to DataFlow::Node"
This reverts commit 9137f04bd3.
2021-07-02 13:23:02 +02:00
Rasmus Wriedt Larsen
e56dfe75bd Python: AttrRef getOjbect/1 -> accesses/2
See this thread for discussion:
https://github.com/github/codeql/pull/5926#discussion_r635384981
2021-07-02 13:21:12 +02:00
Rasmus Lerchedahl Petersen
6f2642607e Python: make the import of RedosUtil public
This mirrors `SuperlinearBacktracking.qll`
An alternative is to keep it private and import it again
in the query files.
2021-07-02 12:32:04 +02:00
Rasmus Lerchedahl Petersen
77c329fb0f Python/JS: Make much more private 2021-07-02 12:13:52 +02:00
Rasmus Lerchedahl Petersen
1fc9638486 Python: port redos .qhelp from js 2021-07-02 11:36:46 +02:00
Taus
f151338def Merge pull request #6198 from RasmusWL/fix-cleartext-logging
Python: Some minor fixes to `py/clear-text-logging-sensitive-data`
2021-07-01 18:28:25 +02:00
Rasmus Lerchedahl Petersen
eee56e0156 Python/JS: Make most of the new library private 2021-07-01 15:34:06 +02:00
Anders Schack-Mulligen
37f8794d01 Merge pull request #6165 from edoardopirovano/fix-regression
Performance: Improve join order in data flow library
2021-07-01 14:13:18 +02:00
Rasmus Wriedt Larsen
b0309dd321 Python: Limit SensitiveDataSources to prevent _some_ cross-talk 2021-07-01 12:08:12 +02:00
Rasmus Wriedt Larsen
f64e58a21c Python: Fix a QLDoc for SensitiveDataSources 2021-07-01 12:05:59 +02:00
Rasmus Wriedt Larsen
d9e2f504f8 Python: Fix clear text logging sink
No need to restrict it to arguments that are calls
2021-06-30 20:31:17 +02:00
Taus
e4af14638b Merge pull request #6175 from yoff/python-port-ReDoS
Python: port ReDoS queries from Javascript
2021-06-30 16:26:07 +02:00
yoff
6a77b890af Merge pull request #6155 from RasmusWL/port-cleartext-queries
Python: Port cleartext queries
2021-06-30 15:52:34 +02:00
Rasmus Lerchedahl Petersen
a176e6ac30 Python: comment out temporarily unused predicate 2021-06-30 15:28:31 +02:00
Rasmus Lerchedahl Petersen
45e30b0c06 Python: comment out temporarily unused predicate 2021-06-30 15:04:37 +02:00
Rasmus Lerchedahl Petersen
c306cee04e Python: mimic JS file hierarchy 2021-06-30 15:03:22 +02:00
Rasmus Lerchedahl Petersen
651f8abba0 Python: Avoid multiple results for toString 2021-06-30 14:39:49 +02:00
Rasmus Wriedt Larsen
c2708176b1 Python: Support %-style formatting for MarkupSafe 2021-06-30 14:15:41 +02:00
Rasmus Wriedt Larsen
c84658dff1 Python: Use MethodCallNode for MarkupSafe string-format 2021-06-30 13:58:09 +02:00
Rasmus Wriedt Larsen
d6e8fafdbd Python: Proper sorting in Frameworks.qll 2021-06-30 13:55:26 +02:00
Rasmus Wriedt Larsen
075953860b Merge branch 'main' into markupsafe-modeling 2021-06-30 13:55:08 +02:00
Rasmus Lerchedahl Petersen
72986e1e28 Python: Add some comments on the booelan sweep
pattern
2021-06-30 12:50:36 +02:00
Rasmus Lerchedahl Petersen
52d91917aa Merge branch 'python-port-ReDoS' of github.com:yoff/codeql into python-port-ReDoS 2021-06-30 12:25:59 +02:00
Rasmus Lerchedahl Petersen
6dfbf80494 Python: Disable use of toUnicode
until supporting CLI is released
2021-06-30 12:21:52 +02:00
Rasmus Wriedt Larsen
e5d65992b4 Python: Use DefinitionNode instead of Assign
Based on https://github.com/github/codeql/pull/6155#discussion_r660964666:

> Hmm... Would it be better to do this using DefinitionNode instead of
> Assign? The latter is fairly limited in what it can represent, and also
> raises questions of whether this definition is sound with regard to
> control-flow splitting.
2021-06-30 12:08:32 +02:00
yoff
c19522e921 Apply suggestions from code review
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
2021-06-30 11:49:45 +02:00
Edoardo Pirovano
8354f66c29 Performance: Improve join order in data flow library 2021-06-29 18:23:22 +01:00
Rasmus Wriedt Larsen
94bcda3bae Python: Highlight problem picking DataFlow::Node for Assign 2021-06-29 15:32:16 +02:00
Rasmus Lerchedahl Petersen
6f2cdbf59e Python: Give up on providing values for form feeds 2021-06-29 11:14:27 +02:00
Rasmus Lerchedahl Petersen
ffb8938e52 Python: undo autoformat character mangling 2021-06-29 11:06:17 +02:00
Rasmus Lerchedahl Petersen
135b71b649 Python: Apply performance fix by @hvitved 2021-06-29 11:01:33 +02:00
Rasmus Lerchedahl Petersen
591b6ef69c Python: Add ReDoS as identical files from JS
The library specific file is `RegExpTreeView`.
The files are recorded as identical via the mapping
in `identical-files.json`.
2021-06-28 17:04:48 +02:00
Rasmus Lerchedahl Petersen
2c27ce7aa5 Python: Make ast viewer see regexes
This work is due to @erik-krogh who also
 - made corresponding fixes to `RegexTreeView.qll`
 - implemented `toUnicode` so it is available on `String`s
2021-06-28 17:04:48 +02:00
Rasmus Lerchedahl Petersen
d953ba8dd4 Python: A parse-tree-view of regular expressions
This contains several contributions from @erik-krogh
and also some fixes from @nickrolfe
2021-06-28 17:04:48 +02:00
Rasmus Lerchedahl Petersen
21007d21f4 Python: track if qualifiers allow unbounded
repeats. This in preparation for ReDoS
2021-06-28 17:04:48 +02:00
Rasmus Lerchedahl Petersen
74ca1d00b9 Python: More precise regex parsing 2021-06-28 17:04:48 +02:00
Rasmus Lerchedahl Petersen
e5f07cc4d3 Python: inline test of regex components
- Added naive implementation of `charRange` so the test can run.
- Made predicates public as needed.
2021-06-28 17:04:48 +02:00
Rasmus Wriedt Larsen
9573048ee8 Python: Port py/clear-text-logging-sensitive-data 2021-06-25 14:35:31 +02:00
Rasmus Wriedt Larsen
68cfeb0b5c Python: Model logging from the logging module 2021-06-25 14:26:35 +02:00
Rasmus Wriedt Larsen
c05e375401 Python: Fix indentation of hashlib modeling 2021-06-25 14:26:35 +02:00
Rasmus Wriedt Larsen
36c9ceb13b Python: Add Logging concept 2021-06-25 14:26:35 +02:00
Rasmus Wriedt Larsen
a7eb1b3a12 Python: Minor QLDoc fixup 2021-06-25 14:26:35 +02:00