mirror of https://github.com/github/codeql.git synced 2026-04-25 08:45:14 +02:00

Taus 1decf23785 Python: Fix bad join order for sensitive data

Not the prettiest of solutions, but it does the job. Basically, we were
calculating (and re-calculating) the same big relation between strings
and regexes and then checking whether the latter matched the former.

This resulted in tuple counts like the following:

```
[2021-07-12 16:09:24] (12s) Tuple counts for SensitiveDataSources::SensitiveDataModeling::SensitiveVariableAssignment#class#ff#shared/4@7489c6:
4918074 ~0%     {4} r1 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH Flow::NameNode::getId_dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0 'arg0', Lhs.1 'arg1', Rhs.0, Rhs.1 'arg3'
2654    ~0%     {4} r2 = JOIN r1 WITH PRIMITIVE regexpMatch#bb ON Lhs.3 'arg3',Lhs.1 'arg1'
                return r2
```
(The above being just the bit that handles `DefinitionNode` in
`SensitiveVariableAssignment`, and taking 12 seconds to evaluate.)

By applying a bit of manual inlining and magic, this becomes somewhat
more manageable:

```
[2021-07-12 15:59:44] (1s) Tuple counts for SensitiveDataSources::SensitiveDataModeling::sensitiveString#ff/2@8830e2:
27671  ~2%      {3} r1 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveParameterName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

334012 ~2%      {3} r2 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

361683 ~11%     {3} r3 = r1 UNION r2

154644 ~0%      {3} r4 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveFunctionName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

149198 ~1%      {3} r5 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveStrConst#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

124257 ~5%      {3} r6 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveAttributeName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0

273455 ~21%     {3} r7 = r5 UNION r6
428099 ~30%     {3} r8 = r4 UNION r7
789782 ~78%     {3} r9 = r3 UNION r8
1121   ~77%     {3} r10 = JOIN r9 WITH PRIMITIVE regexpMatch#bb ON Lhs.2 'result',Lhs.1
1121   ~70%     {2} r11 = SCAN r10 OUTPUT In.0 'classification', In.2 'result'
                return r11
```
(The above being the total for all the sensitive names we care about,
taking only 1.2 seconds to evaluate.)

Incidentally, you may wonder why this has _fewer_ results than before.
The answer is control flow splitting: it can produce several
`DefinitionNode`s for the same name, and previously every
sensitively-named `DefinitionNode` was matched in isolation. By
pre-matching on just the names of these, we can subsequently join
against those names that are known to be sensitive, which is a much
faster operation.

(We also get the benefit of deduplicating the strings that are matched,
before actually performing the match, so if, say, an attribute name and
a variable name are identical, then we'll only match them once.)
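
As a rough illustration of the idea above (not the QL change itself), the
following Python sketch contrasts matching every node against every regex
with matching each distinct name once and then joining nodes on the name.
The patterns, node list, and function names are all hypothetical stand-ins
chosen for this example; only the overall shape mirrors the commit.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for maybeSensitiveRegexp: (classification, pattern).
SENSITIVE_PATTERNS = [
    ("password", r"(?i).*pass(wd|word).*"),
    ("secret", r"(?i).*(secret|token).*"),
]


def match_per_node(nodes):
    """Old shape: run every regex against every (node, name) pair."""
    calls = 0
    hits = set()
    for node_id, name in nodes:
        for classification, pattern in SENSITIVE_PATTERNS:
            calls += 1
            if re.fullmatch(pattern, name):
                hits.add((node_id, classification))
    return hits, calls


def match_names_then_join(nodes):
    """New shape: match each distinct name once, then join nodes on name."""
    calls = 0
    sensitive = defaultdict(set)  # name -> set of classifications
    for name in {name for _, name in nodes}:
        for classification, pattern in SENSITIVE_PATTERNS:
            calls += 1
            if re.fullmatch(pattern, name):
                sensitive[name].add(classification)
    # The join is a cheap lookup by name; no regex is evaluated here.
    hits = {
        (node_id, classification)
        for node_id, name in nodes
        for classification in sensitive.get(name, ())
    }
    return hits, calls


# Several nodes share one name, as control flow splitting might produce.
NODES = [(1, "password"), (2, "password"), (3, "password"),
         (4, "count"), (5, "api_token")]
```

Both functions return the same set of (node, classification) pairs for
`NODES`, but the second evaluates the regexes once per distinct name rather
than once per node.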

We also exclude all docstrings as relevant string constants, as these
presumably don't actually flow anywhere.
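
For readers unfamiliar with what counts as a docstring here, this is a
minimal Python sketch of the same filtering idea using the standard `ast`
module. It is an analogy only: the commit does this in QL, and the function
name `non_docstring_strings` is made up for this example.

```python
import ast


def non_docstring_strings(source):
    """Return the string constants in `source`, skipping docstrings."""
    tree = ast.parse(source)
    docstring_ids = set()
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, scopes) and node.body:
            first = node.body[0]
            # A docstring is a bare string expression that opens the body.
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                docstring_ids.add(id(first.value))
    return [
        n.value
        for n in ast.walk(tree)
        if isinstance(n, ast.Constant)
        and isinstance(n.value, str)
        and id(n) not in docstring_ids
    ]


SAMPLE = '''
def login():
    """Check the password."""
    key = "password"
    return key
'''
```

On `SAMPLE`, only the assigned `"password"` literal is kept; the docstring,
although it also mentions a password, is dropped before any matching.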

2021-07-12 16:10:49 +00:00

.devcontainer

Update devcontainer memory settings

2020-09-02 12:04:34 -07:00

.github

Apply code review findings

2021-06-24 09:13:08 +02:00

.vscode

Apply suggestions from code review

2021-03-11 15:57:44 +01:00

change-notes

Python: Update change notes for 1.26

2020-12-02 14:01:46 +01:00

config

Python: mimic JS file hierarchy

2021-06-30 15:03:22 +02:00

cpp

C++: Address code review.

2021-07-12 11:43:43 +02:00

csharp

C#: Remove Query.qll top-level modules

2021-07-04 09:35:27 +02:00

docs

Merge branch 'main' into markupsafe-modeling

2021-06-30 13:55:08 +02:00

java

Add changed framework coverage reports

2021-07-12 00:06:55 +00:00

javascript

Merge pull request #6200 from yoff/pythonJS-make-expbtlib-private

2021-07-02 09:09:18 -07:00

misc

Fix markdown link in framework coverage PR comment

2021-07-02 11:56:00 +02:00

python

Python: Fix bad join order for sensitive data

2021-07-12 16:10:49 +00:00

.codeqlmanifest.json

Replace an odd queries.xml with qlpack.yml

2021-06-06 09:04:18 -04:00

.editorconfig

Normalize all text files to LF

2018-09-23 16:24:31 -07:00

.gitattributes

.gitattributes: PDB files are binary

2019-03-13 10:42:28 +00:00

.gitignore

add .venv/ to .gitignore

2021-01-22 14:44:18 +01:00

.lgtm.yml

JS: Exclude test cases from extraction

2019-05-07 14:36:35 +01:00

CODE_OF_CONDUCT.md

Update code of conduct in line with GH

2020-04-23 10:19:13 +01:00

CODEOWNERS

Add @codeql-go as code owners for the shared data-flow library files

2021-03-02 10:39:47 +00:00

CONTRIBUTING.md

Fix dead link in CONTRIBUTING.md

2021-03-11 13:36:19 +01:00

LICENSE

Relicense under MIT

2020-04-07 12:03:26 +01:00

README.md

Docs: Rename default branch

2020-08-14 12:03:00 +01:00

README.md

CodeQL

This open source repository contains the standard CodeQL libraries and queries that power LGTM and the other CodeQL products that GitHub makes available to its customers worldwide. For the queries, libraries, and extractor that power Go analysis, visit the CodeQL for Go repository.

How do I learn CodeQL and run queries?

There is extensive documentation on getting started with writing CodeQL. You can use the interactive query console on LGTM.com or the CodeQL for Visual Studio Code extension to try out your queries on any open source project that's currently being analyzed.

Contributing

We welcome contributions to our standard library and standard checks. Do you have an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Before you do, though, please take the time to read our contributing guidelines. You can also consult our style guides to learn how to format your code for consistency and clarity, how to write query metadata, and how to write query help documentation for your query.

License

The code in this repository is licensed under the MIT License by GitHub.

Visual Studio Code integration

If you use Visual Studio Code to work in this repository, there are a few integration features to make development easier.

CodeQL for Visual Studio Code

You can install the CodeQL for Visual Studio Code extension to get syntax highlighting, IntelliSense, and code navigation for the QL language, as well as unit test support for testing CodeQL libraries and queries.

Tasks

The .vscode/tasks.json file defines custom tasks specific to working in this repository. To invoke one of these tasks, select the Terminal | Run Task... menu option, and then select the desired task from the dropdown. You can also invoke the Tasks: Run Task command from the command palette.

Languages

CodeQL 32.3%

Kotlin 27.4%

C# 17.1%

Java 7.7%

Python 4.6%

Other 10.7%