Not the prettiest of solutions, but it does the job. Basically, we were
calculating (and re-calculating) the same big relation between strings
and regexes and then checking whether the latter matched the former.
This resulted in tuple counts like the following:
```
[2021-07-12 16:09:24] (12s) Tuple counts for SensitiveDataSources::SensitiveDataModeling::SensitiveVariableAssignment#class#ff#shared/4@7489c6:
4918074 ~0% {4} r1 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH Flow::NameNode::getId_dispred#ff CARTESIAN PRODUCT OUTPUT Lhs.0 'arg0', Lhs.1 'arg1', Rhs.0, Rhs.1 'arg3'
2654 ~0% {4} r2 = JOIN r1 WITH PRIMITIVE regexpMatch#bb ON Lhs.3 'arg3',Lhs.1 'arg1'
return r2
```
(The above being just the bit that handles `DefinitionNode` in
`SensitiveVariableAssignment`, and taking 12 seconds to evaluate.)
By applying a bit of manual inlining and magic, this becomes somewhat
more manageable:
```
[2021-07-12 15:59:44] (1s) Tuple counts for SensitiveDataSources::SensitiveDataModeling::sensitiveString#ff/2@8830e2:
27671 ~2% {3} r1 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveParameterName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0
334012 ~2% {3} r2 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0
361683 ~11% {3} r3 = r1 UNION r2
154644 ~0% {3} r4 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveFunctionName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0
149198 ~1% {3} r5 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveStrConst#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0
124257 ~5% {3} r6 = JOIN SensitiveDataHeuristics::HeuristicNames::maybeSensitiveRegexp#ff WITH SensitiveDataSources::SensitiveDataModeling::sensitiveAttributeName#f CARTESIAN PRODUCT OUTPUT Lhs.0 'classification', Lhs.1, Rhs.0
273455 ~21% {3} r7 = r5 UNION r6
428099 ~30% {3} r8 = r4 UNION r7
789782 ~78% {3} r9 = r3 UNION r8
1121 ~77% {3} r10 = JOIN r9 WITH PRIMITIVE regexpMatch#bb ON Lhs.2 'result',Lhs.1
1121 ~70% {2} r11 = SCAN r10 OUTPUT In.0 'classification', In.2 'result'
return r11
```
(The above being the total for all the sensitive names we care about,
taking only 1.2 seconds to evaluate.)
Incidentally, you may wonder why this has _fewer_ results than before.
The answer is control flow splitting -- every sensitively-named
`DefinitionNode` would have been matched in isolation previously. By
pre-matching on just the names of these, we can subsequently join
against those names that are known to be sensitive, which is a much
faster operation.
(We also get the benefit of deduplicating the strings that are matched,
before actually performing the match, so if, say, an attribute name and
a variable name are identical, then we'll only match them once.)
We also exclude all docstrings as relevant string constants, as these
presumably don't actually flow anywhere.
CodeQL
This open source repository contains the standard CodeQL libraries and queries that power LGTM and the other CodeQL products that GitHub makes available to its customers worldwide. For the queries, libraries, and extractor that power Go analysis, visit the CodeQL for Go repository.
How do I learn CodeQL and run queries?
There is extensive documentation on getting started with writing CodeQL. You can use the interactive query console on LGTM.com or the CodeQL for Visual Studio Code extension to try out your queries on any open source project that's currently being analyzed.
Contributing
We welcome contributions to our standard library and standard checks. Do you have an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Before you do, though, please take the time to read our contributing guidelines. You can also consult our style guides to learn how to format your code for consistency and clarity, how to write query metadata, and how to write query help documentation for your query.
License
The code in this repository is licensed under the MIT License by GitHub.
Visual Studio Code integration
If you use Visual Studio Code to work in this repository, there are a few integration features to make development easier.
CodeQL for Visual Studio Code
You can install the CodeQL for Visual Studio Code extension to get syntax highlighting, IntelliSense, and code navigation for the QL language, as well as unit test support for testing CodeQL libraries and queries.
Tasks
The .vscode/tasks.json file defines custom tasks specific to working in this repository. To invoke one of these tasks, select the Terminal | Run Task... menu option, and then select the desired task from the dropdown. You can also invoke the Tasks: Run Task command from the command palette.