Although this test is added under the `wrong` folder, the current results from
this CodeQL test are actually correct (compared with the Python
interpreter). However, they don't match what the extractor does when invoked
with `codeql database create`.
Since I deemed it "more than an easy fix" to change the extractor behavior for
`codeql database create` to match the real Python behavior, and it turned out to
be quite a challenge to change the extractor behavior for all tests, I'm just
going to make THIS ONE test-case behave like the extractor will with `codeql
database create`...
This is a first commit, to show how the extractor works with qltest by default.
Inspired by the debugging in https://github.com/github/codeql/issues/4640
The old Semmle duplicate-code detection code has never been run when
extracting databases for the CodeQL CLI, except that `codeql test run`
will run it _just_ in order to support tests of the feature. With the
sunsetting of Odasa there's no need to even _test_ the feature anymore.
This commit removes those tests that fail when the duplicate-code
detector is turned off. Once it is merged and bumped, we can finally
remove it from `codeql`.
From a local evaluation against flask DB, after
https://github.com/github/codeql/pull/4649 was merged we would get:
```
Tuple counts for TypeTracker::callStep#ff/2@a21b71:
9876 ~0% {3} r1 = SCAN DataFlowPrivate::DataFlowCall::getArg_dispred#fff AS I OUTPUT I.<2>, I.<0>, I.<1>
9876 ~2% {3} r2 = JOIN r1 WITH project#DataFlowPrivate::DataFlowCall::getArg_dispred#fff AS R ON FIRST 1 OUTPUT r1.<2>, R.<0>, r1.<1>
72388997 ~0% {4} r3 = JOIN r2 WITH DataFlowPublic::ParameterNode::isParameterOf_dispred#fff_201#join_rhs AS R ON FIRST 1 OUTPUT r2.<2>, R.<2>, r2.<1>, R.<1>
4952 ~0% {2} r4 = JOIN r3 WITH DataFlowPrivate::DataFlowCall::getCallable_dispred#ff AS R ON FIRST 2 OUTPUT r3.<2>, r3.<3>
return r4
```
Add a step from that `CfgNode` to the corresponding `EssaNode`.
The intended effect is seen in `ImpliesDataflow.expected`.
The effect seen in other `.expected`-files is that parameter nodes
change type, that the extra steps are seen, and that flow from
`EssaVar`s is mirrored in flow from `CfgNode`s.
There is one surprise, which is the `.0` node in
`coverage/localFlow.expected`.
This makes it easy to extend the sources/sinks of the configuration and re-run
the query from the query console on LGTM.com.
The file location `semmle.<lang>.security.dataflow.<QueryName>.qll` matches
what we currently do in other languages (JS and C# sampled).
I did not follow the pattern in other languages for wrapping all the code in a
`module CodeInjection`, since I didn't understand the value in doing so -- I
would like confirmation from the other teams if we _should_ actually do that,
before merging.
Also fixes a bug ("`B`" was not recognised as a bytestring prefix).
The basic idea behind this fix is that the set of possible prefixes is
fairly small, so it's easier just to precompute them, and then join
them with the entire prefix of the string in question (rather than
look at each string in isolation, get its prefix, and _then_ check
whether it looks like it's a unicode string prefix, which essentially
is what the code did before).
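As a rough illustration of the precompute-then-join idea (in plain Python
rather than QL, so this is a sketch and not the code in this commit): the
helper names `case_variants`, `literal_prefix` and `is_bytes_literal` are
made up for the example, and the list of base prefixes is taken from the
Python language reference, not from the library.

```python
from itertools import product


def case_variants(prefix):
    """All upper/lower-case spellings of a prefix, e.g. 'br' -> br, bR, Br, BR."""
    return {
        "".join(chars)
        for chars in product(*[(c.lower(), c.upper()) for c in prefix])
    }


# Precomputed once: every spelling of a bytes-literal prefix.
# The set is tiny (10 entries), which is the point of the approach.
BYTES_PREFIXES = set().union(*(case_variants(p) for p in ("b", "br", "rb")))


def literal_prefix(text):
    """Everything before the first quote character of a literal's source text."""
    for i, ch in enumerate(text):
        if ch in "'\"":
            return text[:i]
    return text


def is_bytes_literal(text):
    # A plain set lookup on the whole prefix, instead of inspecting
    # each string's prefix character by character.
    return literal_prefix(text) in BYTES_PREFIXES
```

With a precomputed set like this, an upper-case `B` prefix is covered
automatically, since every case variant is generated up front.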
Here, `context.appliesTo(n)` was being distributed across all of the
disjuncts, which caused poor performance.
The new helper predicate, `literal_node_class`, should be fairly small,
since it only applies to a subset of `ControlFlowNode`s, and only
assigns a limited set of `ClassObjectInternal`s to these nodes.
Since the number of relevant attributes in the `re` module is fairly
small, it made sense to factor this out in a separate predicate, and
the join order also became more sensible.
This is only _really_ expensive when there are a _lot_ of strings in
the database, but for this case, where we're always extracting the
same substring of the string, it's easier -- and faster -- to just
use a substring operation directly.
This fixes the major performance problem with type tracking on
some (pathological) databases.
The interface could probably be improved a bit. In particular, I'm
thinking that we might want to have `DataFlow::exprNode` return a
`LocalSourceNode` so that a cast isn't necessary in order to use
`flowsTo`.
I have added two `cached` annotations. The one on `flowsTo` is
crucial, as performance regresses without it. The one on
`simpleLocalFlowStep` may not be needed, but Java has a similar
annotation, and to me it makes sense to have this relation cached.