There was unfortunately a semantic merge conflict between #3419 and
#3587 that caused a performance regression on (at least) OpenJDK.
This reverts commit 982fb38807, reversing
changes made to b841cacb83.
Replace most direct calls to `TRawInstruction()` with calls to `getInstructionTranslatedElement()` and `getInstructionTag()`, matching existing practice. One tiny RA diff in an inconsequential join order in `getInstructionVariable`.
This noopt predicate is no longer necessary. It's equivalent to `instruction = TRawInstruction(element, tag)`, which is already materialized and has a more favorable column order anyway.
These predicates are only used within the new single IR stage, so there's no need to cache them beyond that. RA diffs are trivial. Where previously many of the predicate on `Instruction` were inline wrappers around cached predicates from `IRConstruction`, now the predicates from `IRConstruction` get inlined into the `Instruction` predicates, and the `Instruction` predicates get materialized. The net amount of work is the same, but now it's not getting cached unnecessarily.
We were creating a `TranslatedFunction` even for functions that were not from source code, but then telling the IR package that those functions didn't have IR. This resulted in having prologue/epilogue instructions (e.g. `EnterFunction`, `ExitFunction`) with no enclosing `IRFunction`.
Some of our IR consistency failure query predicates already produced results in the schema as an `@kind problem` query, including `$@` replacements for the enclosing `IRFunction` to make it easier to figure out which function to dump when debugging. This change moves the rest of the query predicates in `IRConsistency.qll` to do the same. In addition, it wraps each call to `getEnclosingIRFunction()` to return an `OptionalIRFunction`, which can be either a real `IRFunction` or a placeholder in case `getEnclosingIRFunction()` returned no results. This exposes a couple new consistency failures in `syntax-zoo`, which will be fixed in a subsequent commit.
This change also deals with consistency failures when the enclosing `IRFunction` has more than one `Function` or `Location`. For multiple `Function`s, we concatenate the function names. For multiple `Location`s, we pick the first one in lexicographical order. This changes the number of results produced in the existing tests, but does't change the actual number of problems.
Generalize logic by recognizing not only calls to
`Microsoft.AspNetCore.Mvc.ViewFeatures.HtmlHelper.Raw()`, but calls to all `Raw()`
methods that implement `Microsoft.AspNetCore.Mvc.Rendering.IHtmlHelper.Raw()`.
Instead of a vague reference to a code comment for another language, the
`controlsBlock` predicate now has the whole comment in it directly.
I've adjusted the wording so it should be reasonably correct for C/C++.
As with the other comments in this file, I don't distinguish between the
condition and its block. I think that makes the explanation clearer
without losing any detail we care about.
To make the code fit the wording of the comment, I changed the
`hasBranchEdge/2` predicate into `getBranchSuccessor/1`.
On kamailio/kamailio the `DataFlowUtil::simpleInstructionLocalFlowStep`
predicate was slow because of the case for single-field structs, where
there was a large tuple-count bulge when joining with
`getFieldSizeOfClass`:
3552902 ~2% {2} r1 = SCAN Instruction::CopyInstruction::getSourceValueOperand_dispred#3#ff AS I OUTPUT I.<1>, I.<0>
2065347 ~2% {2} r35 = JOIN r1 WITH Operand::NonPhiMemoryOperand::getAnyDef_dispred#3#ff AS R ON FIRST 1 OUTPUT r1.<1>, R.<1>
2065827 ~2% {3} r36 = JOIN r35 WITH Instruction::Instruction::getResultType_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r35.<1>, r35.<0>
2065825 ~3% {3} r37 = JOIN r36 WITH Type::Type::getSize_dispred#ff AS R ON FIRST 1 OUTPUT r36.<1>, r36.<2>, R.<1>
2068334 ~2% {4} r38 = JOIN r37 WITH Instruction::Instruction::getResultType_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r37.<2>, r37.<0>, r37.<1>
314603817 ~0% {3} r39 = JOIN r38 WITH DataFlowUtil::getFieldSizeOfClass#fff_120#join_rhs AS R ON FIRST 2 OUTPUT r38.<3>, R.<2>, r38.<2>
8 ~0% {2} r40 = JOIN r39 WITH Instruction::Instruction::getResultType_dispred#3#ff AS R ON FIRST 2 OUTPUT r39.<2>, r39.<0>
That's 314M tuples.
Strangely, there is no such bulge on more well-behaved snapshots like
mysql/mysql-server.
With this commit the explosion is gone:
...
2065825 ~0% {4} r37 = JOIN r36 WITH Type::Type::getSize_dispred#ff AS R ON FIRST 1 OUTPUT r36.<0>, R.<1>, r36.<1>, r36.<2>
1521 ~1% {3} r38 = JOIN r37 WITH DataFlowUtil::getFieldSizeOfClass#fff_021#join_rhs AS R ON FIRST 2 OUTPUT r37.<2>, R.<2>, r37.<3>
8 ~0% {2} r39 = JOIN r38 WITH Instruction::Instruction::getResultType_dispred#3#ff AS R ON FIRST 2 OUTPUT r38.<0>, r38.<2>
The `controlsBlock` predicate had some dramatic bulges in its tuple
counts. To make matters worse, those bulges were in materialized
intermediate predicates like `#shared` and `#antijoin_rhs`, not just in
the middle of a pipeline.
The problem was particularly evident on kamailio/kamailio, where
`controlsBlock` was the slowest predicate in the IR libraries:
IRGuards::IRGuardCondition::controlsBlock_dispred#fff#shared#4 ........ 58.8s
IRGuards::IRGuardCondition::controlsBlock_dispred#fff#antijoin_rhs .... 33.4s
IRGuards::IRGuardCondition::controlsBlock_dispred#fff#antijoin_rhs#1 .. 26.7s
The first of the above relations had 201M rows, and the others
had intermediate bulges of similar size.
The bulges could be observed even on small projects although they did
not cause measurable performance issues there. The
`controlsBlock_dispred#fff#shared#4` relation had 3M rows on git/git,
which is a lot for a project with only 1.5M IR instructions.
This commit borrows an efficient implementation from Java's
`Guards.qll`, tweaking it slightly to fit into `IRGuards`. Performance
is now much better:
IRGuards::IRGuardCondition::controlsBlock_dispred#fff ................... 6.1s
IRGuards::IRGuardCondition::hasDominatingEdgeTo_dispred#ff .............. 616ms
IRGuards::IRGuardCondition::hasDominatingEdgeTo_dispred#ff#antijoin_rhs . 540ms
After this commit, the biggest bulge in `controlsBlock` is the size of
`IRBlock::dominates`. On kamailio/kamailio this is an intermediate tuple
count of 18M rows in the calculation of `controlsBlock`, which in the
end produces 11M rows.
The optimizer picked a terrible join order in `VirtualDispatch::DataSensitiveCall::flowsFrom()`. Telling it that `getAnOutNode()` has a unique result convinces it to join first on the `Callable`, rather than on the `ReturnKind`.
There wasn't a good join order for the "store to global var" case in the
virtual dispatch library. When a global variable had millions of
accesses but few stores to it, the `flowsFrom` predicate would join to
see all those millions of accesses before filtering down to stores only.
The solution is to pull out a `storeIntoGlobal` helper predicate that
pre-computes which accesses are stores.
To make the code clearer, I've also pulled out a repeated chunk of code
into a new `addressOfGlobal` helper predicate.
For the kamailio/kamailio project, these are the tuple counts before:
Starting to evaluate predicate DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#cur_delta/3[3]@21a1df (iteration 3)
Tuple counts for DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#cur_delta:
...
59002 ~0% {3} r17 = SCAN DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#prev_delta AS I OUTPUT I.<1>, true, I.<0>
58260 ~1% {3} r31 = JOIN r17 WITH DataFlowUtil::Node::asVariable_dispred#fb AS R ON FIRST 1 OUTPUT R.<1>, true, r17.<2>
2536187389 ~6% {3} r32 = JOIN r31 WITH Instruction::VariableInstruction::getASTVariable_dispred#fb_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, true, r31.<2>
2536187389 ~6% {3} r33 = JOIN r32 WITH project#Instruction::VariableAddressInstruction#class#3#ff AS R ON FIRST 1 OUTPUT r32.<0>, true, r32.<2>
58208 ~0% {3} r34 = JOIN r33 WITH Instruction::StoreInstruction::getDestinationAddress_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, true, r33.<2>
Tuple counts after:
Starting to evaluate predicate DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#cur_delta/3[3]@6073c5 (iteration 3)
Tuple counts for DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#cur_delta:
...
59002 ~0% {3} r17 = SCAN DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#prev_delta AS I OUTPUT I.<1>, true, I.<0>
58260 ~1% {3} r23 = JOIN r17 WITH DataFlowUtil::Node::asVariable_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, true, r17.<2>
58208 ~0% {3} r24 = JOIN r23 WITH DataFlowDispatch::VirtualDispatch::storeIntoGlobal#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, true, r23.<2>
58208 ~0% {3} r25 = JOIN r24 WITH DataFlowUtil::InstructionNode#ff_10#join_rhs AS R ON FIRST 1 OUTPUT true, r24.<2>, R.<1>
Notice that the final tuple count, 58208, is the same before and after.
The kamailio/kamailio project seems to have been affected by this issue
because it has global variables to do with logging policy, and these
variables are loaded from in every place where their logging macro is
used.
The four cached predicates used to access common properties of instructions took a `TStageInstruction` as a parameter. This requires the calling code, in `Instruction.qll`, to then join the results with `hasInstruction()` to filter out results for `TRawInstruction`s that were discarded as unreachable. By simply switching the parameter types to `Instruction`, we can force that join to happen in the cached predicate itself. This makes the various accessor predicates on `Instruction` trivially inlinable to the cached predicate, instead of being joins of two huge relations that might have to be recomputed in later stages.
Most of the predicates on `Instruction` are thin wrappers around cached predicates in the `IRConstruction` or `SSAConstruction` modules. However, `getResultIRType()` has to join `Construction::getInstructionResultType()` with `LanguageType::getIRType()`. `getResultIRType()` is called frequently both within the IR code and by IR consumers, and that's a big join to have to repeat in multiple stages.
I looked at most of the other predicates in `Instruction.qll`, and didn't see any other predicates that met all of the criteria of "large, commonly called, and not already inline".
There were a few places in the IR itself where we use `Instruction.getResultType()`, which returns the C++ `Type` of the result, instead of `Instruction.getResultIRType()`, which returns the language-neutral `IRType` of the result. By removing this usage, we can avoid evaluating `getResultType()` at all.
There are still other uses of `Instruction.getResultType()` in other libraries. We should switch those as well.
Before this change, evaluation of the IR was spread out across about 5 stages. This resulted in a lot of redundant evaluation, especially tuple numbering of large IPA types like `TInstruction`. This change makes two small changes that, when combined, ensure that the IR is evaluated all in one stage:
First, we mark `TInstruction` as `cached`. This collapses all of the work to create instructions, across all three IR phases, into a single phase.
Second, we make the `SSA` module in `SSAConstruction.qll` just contain aliases to `cached` predicates defined in the `Cached` module. This ensures that all of the `Operand`-related SSA computation happens in the same stage as all of the `Instruction`-related SSA computation.
- Added experimental/Security/CWE/CWE-094/MvelInjection.ql
- Added experimental/Security/CWE/CWE-094/MvelInjectionLib.qll
- Added a qhelp file with an example of vulnerable code
- Added tests and stubs for mvel2-2.4.7
When running `sync-files` (or `sync-identical-files`) with the `--latest` switch, if one or more of the files in a group does not exist, the script will crash. This happens all the time when I add a new group, or add a new file path in an existing group. This has bothered me for a long time, so I finally fixed it when I ran into it again today.
I've changed the script as follows:
- If _none_ of the paths in the group exist, print an error message listing the paths in the group. This happens with or without `--latest`.
- If `--latest` is specified, copy the master file to the paths of the missing files.
This updates C#'s IR to share `TInstruction` across stages the same way C++ does. The only interesting part is that, since we have not yet ported full alias analysis to C#, I stubbed out the required parts of the aliased SSA interface in `AliasedSSAStub.qll`.
Making these part of the IPA object identity changes the failure mode for cases where we assign multiple result types to an instruction. Previously, we would just have one instruction with two result types, but now we'd have two instructions, which breaks things worse. This change goes back to how things were before, to avoid any new surprises on real-world code with invalid ASTs or IR.
The conflicts came from how `this` is now a parameter but not a
`Parameter` on `master`.
Conflicts:
cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/defaulttainttracking.cpp
cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/tainted.expected
cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/test_diff.expected
cpp/ql/test/library-tests/dataflow/dataflow-tests/dataflow-ir-consistency.expected
cpp/ql/test/library-tests/dataflow/fields/ir-flow.expected
cpp/ql/test/library-tests/syntax-zoo/dataflow-ir-consistency.expected
The taint-tracking configuration in `ExposureOfPrivateInformation.ql`
overlaps with the XSS taint-tracking configuration, as witnessed by this import chain:
```
semmle.code.csharp.security.dataflow.ExposureOfPrivateInformation.qll imports
semmle.code.csharp.security.dataflow.flowsinks.ExternalLocationSink imports
semmle.code.csharp.security.dataflow.flowsinks.Remote imports
semmle.code.csharp.security.dataflow.XSS
```
(The same for `CleartextStorage.qll` and `LogForging.ql`.)
The fix is to use `TaintTracking2` for the XSS configuration.
Now that `TInstruction` is shared between IR stages, several of the per-stage IR construction predicates can now be moved into the `Raw` interface exposed only by the initial construction of IR from the ASTs. This also removed a couple predicates that were not used previously at all.
Each stage of the IR reuses the majority of the instructions from previous stages. Previously, we've been wrapping each reused old instruction in a branch of the `TInstruction` type for the next stage. This causes use to create roughly three times as many `TInstruction` objects as we actually need.
Now that IPA union types are supported in the compiler, we can share a single `TInstruction` IPA type across stages. We create a single `TInstruction` IPA type, with individual branches of this type for instructions created directly from the AST (`TRawInstruction`) and for instructions added by each stage of SSA construction (`T*PhiInstruction`, `T*ChiInstruction`, `T*UnreachedInstruction`). Each stage then defines a `TStageInstruction` type that is a union of all of the branches that can appear in that particular stage. The public `Instruction` class for each phase extends the `TStageInstruction` type for that stage.
The interface that each stage exposes to the pyrameterized modules in the IR is now split into three pieces:
- The `Raw` module, exposed only by the original IR construction stage. This module identifies which functions have IR, which `TRawInstruction`s exist, and which `IRVariable`s exist.
- The `SSA` module, exposed only by the two SSA construction stages. This identifiers which `Phi`, `Chi`, and `Unreached` instructions exist.
- The global module, exposed by all three stages. This module has all of the predicates whose implementation is different for each stage, like gathering definitions of `MemoryOperand`s.
Similarly, there is now a single `TIRFunction` IPA type that is shared across all three stages. There is a single `IRFunctionBase` class that exposes the stage-indepdendent predicates; the `IRFunction` class for each stage extends `IRFunctionBase`.
Most of the other changes are largely mechanical.
- Added experimental/Security/CWE/CWE-094/SpelInjection.ql
and a couple of libraries
- Added a qhelp file with a few examples
- Added tests and stubs for Spring
This QL
```codeql
import python
import semmle.python.dataflow.TaintTracking
import semmle.python.security.strings.Untrusted
from CollectionKind ck
where
ck.(DictKind).getMember() instanceof StringKind
or
ck.getMember().(DictKind).getMember() instanceof StringKind
select ck, ck.getAQlClass(), ck.getMember().getAQlClass()
```
generates these 6 results.
```
1 {externally controlled string} ExternalStringDictKind UntrustedStringKind
2 {externally controlled string} StringDictKind UntrustedStringKind
3 [{externally controlled string}] SequenceKind ExternalStringDictKind
4 [{externally controlled string}] SequenceKind StringDictKind
5 {{externally controlled string}} DictKind ExternalStringDictKind
6 {{externally controlled string}} DictKind StringDictKind
```
StringDictKind was only used in *one* place in our library code. As illustrated
above, it pollutes our set of TaintKinds. Effectively, every time we make a
flow-step for dictionaries with tainted strings as values, we do it TWICE --
once for ExternalStringDictKind, and once for StringDictKind... that is just a
waste.
If I wanted to use my own TaintKind and not have any interaction with
`UntrustedStringKind` that wouldn't be possible today since these standard http
libraries import it directly. (also, I wouldn't get any sources of my custom
TaintKind from turbogears or bottle). I changed them to use the same pattern of
`ExternalStringKind` as everything else does.
While investigating a bug with `TInstruction` sharing, I discovered that we had a case where alias analysis could create two `VirtualVariable`s for the same `Allocation`. For an indirect parameter allocation, we were using the type of the pointer variable as the type of the indirect allocation, instead of just `Unknown`. If the `IRType` of the pointer variable was the same type as the type of at least one access to the indirect allocation, we'd create both an `EntireAllocationVirtualVariable` and a `VariableVirtualVariable` for the allocation.
I added a new consistency test to guard against this in the future. This also turned out to be the root cause of the one existing known consistency failure in the IR tests.
Currently, only `arguments[c]` for a constant value `c` is supported.
This allows us to detect the prototype-pollution vulnerabilities in (old versions of) `extend`, `jquery`, and `node.extend`.
All these queries have been deprecated since 2018. There is
unfortunately no way to deprecate a library, but it's been years since
we populated any databases using the VCS library, so nobody should be
using it.
I took most of the docs from the corresponding predicates in
JavaScript's `CodeDuplication.qll`. Where JavaScript had a corresponding
predicate but didn't have QLDoc, I added new QLDoc to both.
The `cpp/too-few-arguments` query produced alerts for ambiguous
databases where a function had multiple possible declarations, with some
declarations having the right number of parameters and some having too
many. With this change, the query errs on the side of caution in those
cases and does not produce an alert.
This fixes false positives on racket/racket.
The new `hasDefiniteNumberOfParameters` is exactly the negation of the
old `hasZeroParamDecl`.
I got my eyes on this one since it was using a deprecated method, BUT it was
also doing the thing, since File.getName() is the same as
File.getAbsolutePath(), and that doesn't match the description :\
In a near-future release overriding a deprecated predicate without making as
deprecated would give a compiler warning.
Not fixing the XML one. [I can see that this shouldn't be reported
anymore](https://github.com/github/codeql/pull/3520#issuecomment-631552943), and
it's not safe to remove since it was only marked as deprecated in
e6425bb4cf.
In a couple of cases, we use `glval<unknown>` as the result type of an instruction because we can't come up with anything better. Two examples are the result of `VariableAddress[#ellipsis]`, and the address of the temp variable that holds the lvalue result of the conditional operator in `(a ? b : c) = y`. In both cases, we call `getTypeForGLValue(any(UnknownType t))`, but that would have multiple results because `result.hasType(any(UnknownType t), true)` also holds for `CppFunctionGLValueType`. I tightened the result type to ensure we get the right one.
The QL compiler is about to be changed to emit a warning when overriding a deprecated predicate. This PR marks the existing overrides of deprecated predicates as `deprecated` themselves, which avoids the warning.
The `Print.qll` models seem to preserve the `isWideCharDefault()` predicate for backwards compatibility, so we can't remove them and must continue overriding them.
The `XML.qll` override is necessary because both superclasses declare the `getName()` predicate. One is `deprecated`, and the other is `abstract`, so we have to have an override.
These queries are currently run by default, but don't have their results displayed.
Looking through results on LGTM.com, they are either false positives (e.g., `BitwiseSignCheck` which flags many perfectly harmless operations and `CompareIdenticalValues` which mostly flags NaN checks) or harmless results that developers are unlikely to care about (e.g., `EmptyArrayInit` or `MisspelledIdentifier`).
With this PR, the only queries that are still run but not displayed are security queries, where different considerations may apply.
No evidence that Spring Actuators are being used, e.g. `http.authorizeRequests().anyRequest().permitAll()`
Only safe Actuators are enabled, e.g. `EndpointRequest.to("health", "info")`
The virtual-dispatch code for globals was missing any relationship
between the union field access and the global variable, which meant it
propagated function-pointer flow between any two fields of a global
struct. This resulted in false positives from
`cpp/tainted-format-string` on projects using SDL, such as
WohlSoft/PGE-Project.
In addition to fixing that bug, this commit also brings the code up to
date with the new style of modeling flow through global variables:
`DataFlow::Node.asVariable()`.
This test demonstrates that IR data flow conflates unrelated fields of a
global struct-typed variable and that this bug is not present in the old
AST-based implementation of `semmle.code.cpp.security.TaintTracking`.
This cleans up the test results, which were confusing because functions
like `sink` had multiple locations.
There are some additional results now involving casts to `const char *`
because previously it varied whether `sink` used `const`, and now it
always does.
The previous changes made the optimizer choose a bad join order for the RHS of the antijoin in `addressOperandAllocationAndOffset`. Once again, `unique` to the rescue.
After #3382 changed the escape analysis to model qualifiers as escaping,
there was an imbalance in the SSA library, where `addressTakenVariable`
excludes variables from SSA analysis if they have their address taken
but are _not_ passed by reference. This showed up as a missing result in
`TOCTOUFilesystemRace.ql`, demonstrated with a test case in #3432.
This commit changes the definition of "pass by reference" to include
call qualifiers, which allows SSA modeling of variables that have member
function calls on them.
Adding a new test case leads to changes in all `.expected` files in its
directory.
The new results show that the `DefinitionsAndUses` library does not
model `std::addressof` correctly, but that library is not intended to be
used for new code.
I didn't add this support in `AddressConstantExpression.qll` since I
think it would require extra work and testing to get the constexprness
right. My long-term plan for `AddressConstantExpression.qll` is to move
its functionality to the extractor.
It does so by first defining what a pointer dereference is (on the IR
`Instruction` level), and then using the array length analysis and the range
analysis together to prove that some of these pointer dereferences are safe.
The class org.apache.commons.io.serialization.ValidatingObjectInputStream
is an implementation of ObjectInputStream that validates the deserialized
classes against a white list. Therefore, this class should not be considered an
unsafe deserialization sink.
These already have results for BoundMethodValue, although
1) it's a bit strange that `getParameter(-1)` has results
2) why does `Method(Function C.n, class C)` exists? this would only be relevant
if `n` was a classmethod, but it isn't. It's not a problem that it exsits per
se, but curious.
The `fieldFlow` predicate contained a fragile join that has become
ordered wrong recently, either as result of an unrelated change in the
data-flow library or as part of the stats change for the last dbscheme
change.
The minimal fix is to use `getEnclosingCallable` instead of
`getFunction` since the former uses `unique` to ensure good join
ordering in its callers. A longer-term fix should be applied to the AST
base libraries, but this will be invasive and require independent
testing.
Tuple counts on Wireshark before (cancelled after a few minutes):
(747s) Starting to evaluate predicate DataFlowUtil::localFlowStep#ff/2@bdba82
(848s) Tuple counts for DataFlowUtil::localFlowStep#ff:
1766640980 ~1% {2} r1 = JOIN DataFlowUtil::Node::getFunction_dispred#ff_10#join_rhs AS L WITH DataFlowUtil::Node::getFunction_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT L.<1>, R.<1>
1327 ~0% {2} r2 = JOIN r1 WITH project#DataFlowImplLocal::Configuration::hasFlow#fbb AS R ON FIRST 2 OUTPUT r1.<0>, r1.<1>
9691232 ~0% {2} r3 = DataFlowUtil::simpleLocalFlowStep#ff@staged_ext \/ r2
return r3
After:
(0s) Starting to evaluate predicate DataFlowUtil::localFlowStep#ff/2@a852a0
(0s) Tuple counts for DataFlowUtil::localFlowStep#ff:
49017 ~4% {3} r1 = JOIN project#DataFlowImplLocal::Configuration::hasFlow#fff AS L WITH DataFlowUtil::Node::getEnclosingCallable_dispred#ff AS R ON FIRST 1 OUTPUT L.<1>, R.<1>, R.<0>
42359 ~0% {2} r2 = JOIN r1 WITH DataFlowUtil::Node::getEnclosingCallable_dispred#ff AS R ON FIRST 2 OUTPUT r1.<2>, r1.<0>
9732264 ~0% {2} r3 = DataFlowUtil::simpleLocalFlowStep#ff@staged_ext \/ r2
return r3
Report an alert in _any_ `Web.config` file, as long as it does not have an
`X-Frame-Options` entry (as opposed to only reporting alerts when _all_
`Web.config` files lack the entry).
This gives a slight speedup, and I think it makes the code shorter and
clearer.
On Wireshark, the time from the beginning of the `IRBlock` stage until
just before evaluation of `getInstruction` drops from 44s to 34s.
Flow from a definition by reference of a field into its object was
working inconsistently and in a very syntax-dependent way. For a
function `f` receiving a reference, `f(a->x)` could propagate data back
to `a` via the _reverse read_ mechanism in the shared data-flow library,
but for a function `g` receiving a pointer, `g(&a->x)` would not work.
And `f((*a).x)` would not work either.
In all cases, the issue was that the shared data-flow library propagates
data backwards between `PostUpdateNode`s only, but there is no
`PostUpdateNode` for `a->x` in `g(&a->x)`. This pull request inserts
such post-update nodes where appropriate and links them to their
neighbors. In this exapmle, flow back from the output parameter of `g`
passes first to the `PostUpdateNode` of `&`, then to the (new)
`PostUpdateNode` of `a->x`, and finally, as a _reverse read_ with the
appropriate field projection, to `a`.
This case was added in dccc0f4db. The surrounding code has changed a lot
since then, and the case no longer seems to have an effect except to
create some dead ends and possibly cycles in the local flow graph.
Also clarify the docs on `Call` to decrease the likelyhood of such an
omission happening again.
The updated test reflects that `f1.operator()` lets the address of `f1`
escape from the caller.
For each pointer, we start tracking (starting from the allocation or an array declaration)
1) how long is the chunk of memory allocated
2) where the current pointer is in this chunk of memory.
This information might not always exist, but when it does, it is reliable.
Currently only works intraprocedurally.
We already modelled usage of the protected `MessageDigest(String algo)`
constructor as a crypto algorithm specification. For some reason we did
not model the more commonly used public `MessageDigest.getInstance` method.
I've added a `.vscode/extensions.json` file that will automatically recommend the CodeQL for Visual Studio Code extension to anyone who opens the repo in VS Code (without the extension already installed).
This `select` clause had become very slow after we started caching
`ElementBase::toString` because the query used string concatenation to
produce alert messages, and those string concatenations were done very
early in the pipeline, producing lots of strings that would be discarded
moments later.
By using `$@` to interpolate elements into strings, the concatenation is
done outside of QL.
Testing on a Chromium snapshot, this commit takes us from
#select#ff ................ 6m2s
to
#select#cpe#134#fff ....... 15.2s
If you're developing one of the libraries that has muiltiple copies auto-generated by `sync-files.py`, you can now run `sync-files.py --latest` by going to the `Terminal | Run Task...` menu in VS Code and selecting the `Sync Identical Files` task. You can set a keyboard binding to run this task for quicker access.
Logic is hard, and I made a mistake inverting the formula for the second case, so the
predicate never held for a sizeExpr like sizeof(int)*sizeof(void).
Now, this case is correctly handled by the fallback.
Based on PR feedback. This will avoid a syntactic wart and make the
invocation do the right thing both with and without
`language[monotonicAggregates]`.
Since this test inheriently has `--max-import-depth=1`, by using six, we would
never look at the actual source-code of urllib.parse/urlparse and therefore the
test would never show if we understood the library code good enough that we
could propagate taint out-of-the-box.
All tests moved by one line... that is why the diff is so big
This yields more precise size information in a lot of the common cases of C allocation code,
as the common pattern malloc(count * sizeof(type)) is now understood.
We don't hyphenate "QL-library" and there were a few typos. Feel free to further revise this if I've changed the meaning too much.
As discussed separately, I was unable to raise this as a PR in GitHub.com and had to resort to a direct commit.
(cherry picked from commit e29468135d)
We don't hyphenate "QL-library" and there were a few typos. Feel free to further revise this if I've changed the meaning too much.
As discussed separately, I was unable to raise this as a PR in GitHub.com and had to resort to a direct commit.
This predicate is effectively a Cartesian product between all enum
types. It's infeasible to compute it in full, so luckily the optimizer
has been able to apply enough magic to make it feasible. That's not a
robust solution, and it has indeed broken on at least one version of the
1.24 release candidate.
On a Chromium snapshot where I ran the LGTM suite overnight, the
`m#MistypedFunctionArguments::arithTypesMatch#bb` predicate (magic for
`arithTypesMatch`) took 170m5s. That was commit b69fdf5 from the
internal repo. I tried to reproduce it in VSCode, this time with commit
646646, but it wasn't quite as bad: the predicate took only 38 seconds.
In any case, making the problematic predicate `pragma[inline]` removes
the slow magic and makes the `MistypedFunctionArguments.ql` query
faster.
IR generation was not handling the special two-operand flavor of the `?:` operator that GCC supports as an extension. The extractor doesn't quite give us enough information to do this correctly (see github/codeql-c-extractor-team#67), but we can get pretty close.
About half of the code could be shared between the two-operand and three-operand flavors. The main differences for the two-operand flavor are:
1. The "then" operand isn't a child of the `ConditionalExpr`. Instead, we just reuse the original value of the "condition" operand, skipping any implicit cast to `bool` (see comment for rationale).
2. For the three-operand flavor, we generate the condition as control flow rather than the computation of a `bool` value, to avoid creating unnecessarily complicated branching. For the two-operand version, we just compute the value, since we have to reuse that value in the "then" branch anyway.
I've added IR tests for these new cases. I've also updated the expectations for `SignAnalysis.ql` based on the fix. @rdmarsh2, can you please double-check that these diffs look correct? I believe they do, but you're the range/sign analysis expert.
Includes a fairly exhaustive test case for arithmetic operations involving `_Complex` and/or `_Imaginary` types. Thanks to these new tests, I discovered that the extractor treats certain arithmetic operations on `_Imaginary` types as separate expression kinds, so I added support for those kinds in IR construction.
`getVariableType()` is used to compute the actual semantic type of a variable from its declared type. That's where we handle pointer and function decay for parameters, and it's also where we handle arrays of unknown bound initialized with an initializer of known bound.
Previously, even if neither of the above situations applied, the type that we returned was the `getUnspecifiedType()` of the variable. This meant that, for example, `const char* p` would be treated as `char *`. This is inconsistent with how we handle types elsewhere in IR construction, where we preserve typedefs and cv-qualifiers when creating the `CppType` of an `IRVariable`, `Instruction`, or `Operand`.
The only visible effect this fix has is to fix the inferred result type for `Phi` instructions for variables affect by this change in `getVariableType()` behavior. Previously, we would see the variable accessed as both `const char*` and as `char*`, so we'd fall back to the canonical pointer type, which is `decltype(nullptr)`. Now, we see the same type for all accesses to the variable, so we use that type as the type of the SSA memory location and as the result type of the `Phi` instruction.
When `PrintSSA.qll` is imported, IR dumps will be annotated with the alias analysis information used during SSA construction. When printing this information, we incorrectly treated instructions at offset -1, which should only be `Phi` instructions, as `Chi` instructions for the instruction at offset 0. This produced phantom annotations, but did not affect the correctness of the actual IR.
- Merged the two notes for `cpp/uncontrolled-allocation-size` into one.
- Added note about renaming of a query id.
- Moved the use of IR in queries from the library section to the queries
section, rephrasing the note in terms of query results/performance
rather than library implementation.
- Grouped, without text changes, the three notes about the `Allocation`
library
- Grouped all the notes about standard-library models, abbreviating them
to eliminate the common text.
- Removed the note about `strlen` (#2647) since that should no longer
affect the results of queries or IR data flow after we started using
unsound IR for data flow.
The IR generation for `InitializeIndirection` currently connects its load operand to the result of the corresponding `InitializeParameter` instruction. This isn't exactly wrong, but it doesn't fit the IR invariant of "All unmodeled uses consume `UnmodeledDefinition`". Our current code doesn't care, because we just throw away all of the existing def-use information, modeled or otherwise, when we build unaliased SSA. However, some upcoming SSA changes don't work correctly if this invariant is broken.
I've added the trivial IR generation change, along with a new sanity query.
When the extractor can't prove that control flow will never reach the end of a non-`void`-returning function without reaching an explicit `return` statement, it inserts an implicit `return` without an operand. If control actually reaches this point, the behavior is undefined.
We were previously generating invalid IR for these implicit `return` statements, because the lack of an operand meant that there was no definition of the return value variable along that path. Instead, I've changed the IR generation to emit an `Unreached` instruction for the implicit `return`. This ensures that we don't create a control flow edge from the end of the body to the function epilogue.
The change to the range analysis test avoids having that test depend on the previous bad IR behavior, while still preserving the original spirit of the test.
These classes/predicates are not used by anything in our codebase, and is using
deprecated classes/predicates, so I think it's safe to assume they should also
have been marked with the deprecated annotation.
Changes the QL compiler warnings with:
-WARNING: Type Configuration has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/dataflow/TaintTracking.qll:663,50-63)
-WARNING: Type Configuration has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/dataflow/TaintTracking.qll:666,19-32)
-WARNING: Type Configuration has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/dataflow/TaintTracking.qll:671,19-32)
-WARNING: Type Configuration has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/dataflow/TaintTracking.qll:733,16-39)
-WARNING: Type CustomPointsToAttribute has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:181,28-51)
-WARNING: Type CustomPointsToFact has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:155,60-78)
-WARNING: Type CustomPointsToFact has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:159,19-37)
-WARNING: Type CustomPointsToFact has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:41,33-51)
+WARNING: Type CustomPointsToFact has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:41,44-62)
Flow through partial chi-instruction operands was introduced to make
definition-by-reference work, but its implementation also allowed all
other partial writes to propagate. In particular, tainting a field would
taint the whole struct, which in turn led to taint propagating across
unrelated fields of a struct.
The security test `CWE-134/semmle/argv/argvLocal.c` shows that we also
want to propagate taint from an array element to the whole array, and it
also seems right to propagate taint from a union member to the whole
union.
I was considering if this was actually something different than
Value.hasAttribute, and the names were just accidentially the same. But after
looking at the definition for Value, I'm happy about marking this as an
override (I did not test whether it was neede though):
```codeql
class Value extends TObject {
...
/** Holds if this value has the attribute `name` */
predicate hasAttribute(string name) { this.(ObjectInternal).hasAttribute(name) }
```
This workaround from `DataFlowUtil.qll` should be useful for any query
that selects an `Expr`. In particular, it's useful for IR data flow.
This commit does not include test changes.
Every data-flow node should have a unique enclosing function (_callable_
in the terminology of the data-flow library), but this was not evident
for the optimizer, and it led to a bad join order in `pathStep`. This
commit fixes the join order for C++ AST data flow. All other copies of
data flow seem to be fine.
These are the tuple counts for OpenJDK before this commit:
(231s) Tuple counts for DataFlowImplLocal::pathStep#fffff#cur_delta:
5882 ~0% {6} r1 = SCAN DataFlowImplLocal::PathNodeMid#class#ffffff#prev_delta AS I OUTPUT I.<2>, I.<0>, I.<1>, I.<3>, I.<4>, I.<5>
1063406780 ~0% {7} r2 = JOIN r1 WITH DataFlowImplCommon::CallContext::relevantFor_dispred#ff AS R ON FIRST 1 OUTPUT r1.<2>, R.<1>, r1.<1>, r1.<0>, r1.<3>, r1.<4>, r1.<5>
5882 ~1% {6} r3 = JOIN r2 WITH DataFlowUtil::Node::getFunction_dispred#ff AS R ON FIRST 2 OUTPUT r2.<0>, r2.<6>, r2.<2>, r2.<3>, r2.<4>, r2.<5>
105 ~0% {5} r4 = JOIN r3 WITH project#DataFlowImplLocal::LocalFlowBigStep::localFlowBigStep#ffffff_021#join_rhs AS R ON FIRST 2 OUTPUT r3.<2>, r3.<3>, r3.<4>, r3.<5>, R.<2>
5882 ~1% {6} r5 = JOIN r2 WITH DataFlowUtil::Node::getFunction_dispred#ff AS R ON FIRST 2 OUTPUT r2.<5>, r2.<2>, r2.<0>, r2.<3>, r2.<4>, r2.<6>
5882 ~0% {6} r6 = JOIN r5 WITH DataFlowImplLocal::TNil#ff_1#join_rhs AS R ON FIRST 1 OUTPUT r5.<2>, false, r5.<5>, r5.<1>, r5.<3>, r5.<4>
0 ~0% {5} r7 = JOIN r6 WITH DataFlowImplLocal::LocalFlowBigStep::localFlowBigStep#ffffff_02413#join_rhs AS R ON FIRST 3 OUTPUT R.<4>, r6.<3>, r6.<4>, r6.<5>, R.<3>
0 ~0% {5} r8 = JOIN r7 WITH DataFlowImplLocal::TNil#ff AS R ON FIRST 1 OUTPUT r7.<1>, r7.<2>, r7.<3>, R.<1>, r7.<4>
105 ~0% {5} r9 = r4 \/ r8
The problem is that `DataFlowUtil::Node::getFunction_dispred#ff`
(`getEnclosingCallable`) is joined too late.
After this commit, the tuple counts look like this:
(13s) Tuple counts for DataFlowImplLocal::pathStep#fffff#cur_delta:
5882 ~1% {6} r1 = SCAN DataFlowImplLocal::PathNodeMid#class#ffffff#prev_delta AS I OUTPUT I.<1>, I.<0>, I.<2>, I.<3>, I.<4>, I.<5>
5882 ~3% {7} r2 = JOIN r1 WITH DataFlowUtil::Node::getEnclosingCallable_dispred#ff AS R ON FIRST 1 OUTPUT r1.<2>, R.<1>, r1.<1>, r1.<0>, r1.<3>, r1.<4>, r1.<5>
5882 ~1% {6} r3 = JOIN r2 WITH DataFlowImplCommon::CallContext::relevantFor_dispred#ff AS R ON FIRST 2 OUTPUT r2.<3>, r2.<6>, r2.<2>, r2.<0>, r2.<4>, r2.<5>
105 ~0% {5} r4 = JOIN r3 WITH project#DataFlowImplLocal::LocalFlowBigStep::localFlowBigStep#ffffff_021#join_rhs AS R ON FIRST 2 OUTPUT r3.<2>, r3.<3>, r3.<4>, r3.<5>, R.<2>
5882 ~1% {6} r5 = JOIN r2 WITH DataFlowImplCommon::CallContext::relevantFor_dispred#ff AS R ON FIRST 2 OUTPUT r2.<5>, r2.<2>, r2.<3>, r2.<0>, r2.<4>, r2.<6>
5882 ~0% {6} r6 = JOIN r5 WITH DataFlowImplLocal::TNil#ff_1#join_rhs AS R ON FIRST 1 OUTPUT r5.<2>, false, r5.<5>, r5.<1>, r5.<3>, r5.<4>
0 ~0% {5} r7 = JOIN r6 WITH DataFlowImplLocal::LocalFlowBigStep::localFlowBigStep#ffffff_02413#join_rhs AS R ON FIRST 3 OUTPUT R.<4>, r6.<3>, r6.<4>, r6.<5>, R.<3>
0 ~0% {5} r8 = JOIN r7 WITH DataFlowImplLocal::TNil#ff AS R ON FIRST 1 OUTPUT r7.<1>, r7.<2>, r7.<3>, R.<1>, r7.<4>
105 ~0% {5} r9 = r4 \/ r8
There is a slight slowdown coming from the introduction of a new
predicate `DataFlowImplLocal::pathStep#fffff#join_rhs`, which is used
only in the standard order:
(12s) Tuple counts for DataFlowImplLocal::pathStep#fffff#join_rhs:
282057 ~0% {2} r1 = SCAN DataFlowImplCommon::CallContext::relevantFor_dispred#ff AS I OUTPUT I.<1>, I.<0>
9159890 ~1% {2} r2 = JOIN r1 WITH DataFlowUtil::Node::getEnclosingCallable_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>
return r2
The evaluation of `unique` is cheap but not free:
DataFlowUtil::Node::getEnclosingCallable_dispred#ff .............. 3.9s
DataFlowUtil::Node::getEnclosingCallable_dispred#ff_10#join_rhs .. 3.5s
The first of these two predicates evaluates `unique`, and the second
simply reorders columns. They take about the same time, which suggests
that `unique` is about as fast as it can be, given the number of tuples
it needs to push around. Note that the column reordering predicate is
only needed because of the standard order.
This revert brings back the performance problems in
`DataFlowImplLocal.qll` so they can be fixed in a different way. The fix
in #2872 was asymptotically good but had undesired overhead because it
introduced another predicate in the SCC that existed purely for join
ordering.
I did the revert by inlining the helper predicate, eliminating the
`enclosing` variable, and re-ordering the resulting lines to what they
were before #2872.
For some reason I thought that these two queries were special because
they manipulate `SecurityOptions` to change the taint-tracking sources.
It turns out it was just the opposite: the queries used to be special
because they invalidated the cache for the `tainted` predicate, but that
predicate is no longer used, so these queries are no longer special.
This consistency check seems to have value for AST data flow, but I've
disabled it on the IR for now.
This commit also includes two unrelated changes that seem to fix a
semantic merge conflict.
How could the tests fail because of autoformatting, you may ask?
The answer is deprecation warnings. These specify the location of the deprecated
entity, and due to autoformatting these moved around.
We currently use a script to keep certain duplicate QL files in sync across the repo. For historical reasons, this script has lived in the private repo alongside the rest of CodeQL, even though it's only used for files in the public `ql` repo. This PR moves the script into the public `ql` repo. It is still invoked by Jenkins scripts that live in the private repo during CI, but it can also be invoked directly without having a checkout of the private repo. This is useful for anyone who is modifying the dataflow or IR libraries with only a QL checkout.
This PR adds better support for differentiating complex and imaginary floating-point types from real floating-point types, in both the AST and in the IR type system.
*AST Changes*
- Introduces the new class `TypeDomain`, which can be either `RealDomain`, `ImaginaryDomain` or `ComplexDomain`. "type domain" is the term used for this concept in the C standard, and I couldn't think of a better one.
- Introduces `FloatingPointType.getDomain()`, to get the type domain of the type.
- Introduces `FloatingPointType.getBase()`, to get the numeric base of the type (either 2 or 10).
- Introduces three new subtypes of `FloatingPointType`: `RealNumberType`, `ComplexNumberType`, and `ImaginaryNumberType`, which differentiate between the types based on their type domain. Note that the decimal types (e.g., `_Decimal32`) are included in `RealNumberType`.
- Introduces two new subtypes of `FloatingPointType`: `BinaryFloatingPointType` and `DecimalFloatingPointType`, which differentiate between the types based on their numeric base, independent of type domain.
*IR Changes*
- `IRFloatingPointType` now has two additional parameters: the base and the type domain.
- New test that ensures that C++ types get mapped to the correct IR types.
- New IR test that verifies the IR for some basic usage of complex FP types.
With the new `unique` aggregate added to QL, we can express directly
what the "min = max" pattern emulates.
Replacing "min and max" with `unique` might in general lead to fewer
results, but that happens only in cases where the aggregate expression
has multiple values. For the three predicates changed in this commit,
that should only happen on malformed databases.
When running ql/python/ql/src/Security/CWE-079/ReflectedXss.ql against the
database for flask.
Iitially there were 10 million result-tuples for iterable_unpacking_descent.
With this change, we're down to roughly 2100,
I had included `InitializeNonLocal` in the recursion because it made
everything look better in the presence of a bug that's since been fixed.
Taking it out means the sanity test is again aligned with the old
`isChiForAllAliasedMemory`.
`Instruction.getDefinitionOverlap()` depends on `SSAConstruction::getMemoryOperandDefinition()`, which in turn depends on `SSAConstruction::hasMemoryOperandDefinition()`. When the definition in question came from a `Chi` instruction, `hasMemoryOperandDefinition()` incorrectly bound `overlap` to the overlap relationship between the original (non-`Chi`) instruction and the use. The fix is to make use of the `actualDefLocation` parameter to `getDefinitionOrChiInstruction()`, which specifies the location for the result of the `Chi` in that case.
The result of `getDefinitionOverlap()` should never be `MayPartiallyOverlap`, because if that were the case, we should have inserted as `Chi` instruction and hooked the definition up to that instead.
There are quite a few existing failures.
This predicate replaces `isChiForAllAliasedMemory`, which was always
intended to be temporary. A test is added to `IRSanity.qll` to verify
that the new predicate corresponds exactly with (a fixed version of) the
old one.
The implementation of the new predicate,
`Cached::hasConflatedMemoryResult` in `SSAConstruction.qll`, is faster
to compute than the old `isChiForAllAliasedMemory` because it uses
information that's readily available during SSA construction.
In the Unix ABI, `std::va_list` is defined as `typedef struct __va_list_tag { ... } va_list[1];`, which means that any `std::va_list` used as a function parameter decays to `struct __va_list_tag*`. Handling this actually made the QL code slightly cleaner. The only tricky bit is that we have to determine what type to use as the actual `va_list` type when loading, storing, or modifying a `std::va_list`. To do this, we look at the type of the argument to the `va_*` macro. A detailed QLDoc comment explains the details.
I added a test case for passing a `va_list` as an argument, and then manipulating that `va_list` in the callee.
This PR changes the IR we generate for functions that accept a variable argument list. Rather than simply using `BuiltInOperationInstruction` to model the various `va_*` macros as mysterious function-like operations, we now model them in more detail. The intent is to enable better alias analysis and taint flow through varargs.
The `va_start` macro now generates a unary `VarArgsStart` instruction that takes the address of the ellipsis pseudo-parameter as its operand, and returns a value of type `std::va_list`. This value is then stored into the actual `std::va_list` variable via a regular `Store`.
The `va_arg` macro now loads the `std::va_list` argument, then emits a `VarArg` instruction on the result. This returns the address of the vararg argument to be loaded. That address is later used as the address operand of a regular `Load` to return the value of the argument. To model the side effect of moving to the next argument, we emit a `NextVarArg` instruction that takes the previous `std::va_list` value and returns an updated one, which is then stored back into the `std::va_list` variable.
The `va_end` macro just emits a `VarArgsEnd` unary instruction that takes the address of the `std::va_list` argument and does nothing, since `va_end` doesn't really do anything on most compiler implementations anyway.
The `va_copy` macro is just modeled as a plain copy.
Since `runtimeExprInStaticInitializer` only looks at expressions at the
top level of an initializer or directly below some number of top-level
aggregate literals, there is no need for `inStaticInitializer` to
include expressions strictly below those in the AST.
I tested this on Wireshark, which has very large static initializers,
but found no measureable difference in run time. There are some
differences in tuple counts and iteration counts, though:
- `inStaticInitializer` changes from 6,241,153 rows (86 iterations) to
5,031,617 rows (7 iterations).
- `runtimeExprInStaticInitializer` changes from 386,350 rows to 4,705
rows.
- `hasDynamicInitialization` has 410 rows both before and after, which
suggests that this change does not affect results.
Even though there is no impact on this snapshot at this time, things
might look different if/when the restriction on aggregate literals to
100 children is removed in the extractor.
This change introduces a new synthesized `IRVariable` in every varargs function. This variable represents the entire set of arguments passed to the ellipsis by the caller. We give it an opaque type big enough hold all of the arguments passed by the largest vararg call in the database. It is treated just like any other parameter. It is initialized the same, it has indirect buffers, etc.
I had to introduce a couple new APIs to `Call` and `Function`. The QLDoc comments should explain these. I added tests for these new APIs as well.
The next step will be to change the IR generation for the `va_*` macros to manipulate the ellipsis parameter.
The spaceship (<=>) operator adds a new row to the C++ precendence
table. In preparation for that shift the necessary precedences up one
to create a suitable hole.
Note: In investigations I belive precedence 14 was not used. However,
in order to make review easier I have kept that gap.
- Introduce `ReadTaintNode` and `TaintStoreNode` to simplify logic for taint
getters and taint setters, respectively.
- `nodeCandFwd2`: Restrict `stored` column after a read, based on what it might
be before a store of the same field.
- `nodeCand2`: Restrict `read` column (renamed from `stored`) after a store, based
on what it might be after a read of the same field.
- Move big step predicates into a `LocalFlowBigStep` module.
- Define predicates by dispatch in `AccessPath[Front]` class.
- `flowCandFwd0`: Restrict `apf` column after a read, as it should be able to match
a Boolean `read` column from `nodeCand2`.
- `flowFwd0`: Restrict columns `ap` and `apf` after a read, by introducing a
`flowConsCandFwd` predicate (similar to what is done in the previous pruning steps).
- `flowFwd0`: Restrict columns `ap` and `apf` after a store, by introducing a
`flowConsCand` predicate (similar to what is done in the previous pruning steps).
Added a new `StaticLocalVariable` class, which made several other pieces of the original change a bit cleaner.
Fixed test failures due to a mistake in the original `CFG.qll` change.
Added a test case for static local variables with constructors.
Removed the `Uninitialized` instruction from the initialization of a static local, because all objects with static storage duration are zero-initialized at startup.
Fixed expectations for `SignAnalysis.ql` to reflect that a bad result is now fixed.
Previously, the IR for the initialization of a static local variable ran the initialization unconditionally, every time the declaration was reached during execution. This means that we don't model the possibility that an access to the static variable fetches a value that was set on a previous execution of the function.
I've added some simple modelling of the correct behavior to the IR. For each static local variable that has a dynamic initializer, we synthesize a (static) `bool` variable to hold whether the initializer for the original variable has executed. When executing a declaration, we check the value of the synthesized variable, and skip the initialization code if it is `true`. If it is `false`, we execute the initialization code as before, and then set the flag to `true`. This doesn't capture the thread-safe nature of static initialization, but I think it's more than enough to handle anything we're likely to care about for the foreseeable future.
In `TranslatedDeclarationEntry.qll`, I split the translation of a static local variable declaration into two `TranslatedElement`s: one for the declaration itself, and one for the initialization. The declaration part handles the checking and setting of the flag; the initialization just does the initialization as before.
I've added an IR test case that has static variables with constant, zero, and dynamic initialization. I've also verified the new IR generated for @jbj's previous test cases for constant initialization.
I inverted the sense of the `hasConstantInitialization()` predicate to be `hasDynamicInitialization()`. Mostly this just made more sense to me, but I think it also fixed a potential bug where `hasConstantInitialization()` would not hold for a zero-initialized variable. Technically, constant initialization isn't the same as zero initialization, but I believe that most code really cares about the distinction between dynamic initialization and static initialization, where static initialization includes both constant and zero initialization.
I've fixed up the C# side of IR generation to continue working, but it doesn't use any of the dynamic initialization stuff. In theory, it could use something similar to model the initialization of static fields.
For syntax errors, we simply report the major version.
For unused imports, we were getting a result for `typing.py` when run under
Python 3.7.3. To prevent this import from being considered, I've set the maximum
import depth to `0`.
The only Semmle references now left in the public Markdown files are in
URLs and in legal text. There are also two Semmle references left in
`docs/language/vale-styles/README.md` because I didn't understand them
well enough to change them.
Since this code is shared between the AST CFG and the IR construction,
it seems right to have only one copy. That copy lives on a new class
`StaticStorageDurationVariable`, which may prove useful on its own.
Obviously the result module shouldn't be a package 🤦 I was confusing
myself, since I wanted to say that `Module::named("Crypto.Cipher")` should be a package :D
The problem with the deleted code is that it would add flow to what might be an
unrelated ControlFlowNode, which is illustrated in the query below (that gives
results on flask)
from ControlFlowNode arg, CallNode call, CallNode other_call
where
call.getNode().getAKeyword().getValue() = arg.getNode() and
not call.getAnArg() = arg and
other_call.getAnArg() = arg and
not other_call = call
select call, arg, other_call
The problem here was that in the base case, there was no relationship between
`left_parent` and `right_parent`. These could be any two tuples or lists, even
if they were not part of an assignment statement.
To fix this, we add a bit of manual "magic", requiring that both of these
arguments must belong to the left and right-hand sides of the same assignment
statement.
(Note that this is in principle _still_ a gross overapproximation, but since
assignment statements are usually quite restricted in size, I don't expect this
to be a major problem.)
Having that module in `Instruction.qll` slowed down the parsing of that
file both humans and the compiler.
This commit moves the `InstructionSanity` module to `IRSanity.qll`
without making any changes to its contents apart from adding some
imports.
On Wireshark with 6GB RAM, I've observed `definitionReachesRank` to be
the slowest predicate in the IR. It seems that the implementation was
slow because the optimizer failed to eliminate the common
`reachesRank - 1` subexpression. This led to context being pushed into
the `not`, which got implemented as `MATERIALIZE`. That wouldn't
normally be a disaster, but this is one of the largest predicates in the
IR SSA construction, and iteration 2 was very slow.
Before:
(1505s) Starting to evaluate predicate SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta/4[1]@93f592 (iteration 1)
(1535s) Tuple counts for SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta:
130670697 ~0% {4} r1 = SCAN project#SSAConstruction::DefUse::hasDefinitionAtRank#fffff AS I OUTPUT I.<0>, I.<1>, I.<2>, (I.<2> + 1)
130670697 ~6% {5} r2 = JOIN r1 WITH SSAConstruction::DefUse::exitRank#fff AS R ON FIRST 2 OUTPUT r1.<0>, r1.<1>, r1.<2>, r1.<3>, R.<2>
130670697 ~6% {5} r3 = SELECT r2 ON r2.<3> <= r2.<4>
130670697 ~0% {4} r4 = SCAN r3 OUTPUT r3.<0>, r3.<1>, r3.<2>, r3.<3>
return r4
(1535s) - SSAConstruction::DefUse::definitionReachesRank#ffff_delta has 130670697 rows (order for disjuncts: delta=<standard>).
(1535s) Starting to evaluate predicate SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta/4[2]@866c14 (iteration 2)
(1626s) Tuple counts for SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta:
261341394 ~107% {4} r1 = JOIN SSAConstruction::DefUse::definitionReachesRank#ffff#prev_delta AS L WITH SSAConstruction::DefUse::definitionReachesRank#ffff#join_rhs AS R ON FIRST 3 OUTPUT R.<0>, R.<1>, R.<2>, (1 + L.<3>)
261341394 ~107% {4} r2 = r1 AND NOT SSAConstruction::DefUse::definitionReachesRank#ffff#prev AS R(r1.<0>, r1.<1>, r1.<2>, r1.<3>)
130670697 ~0% {5} r3 = SCAN r2 OUTPUT r2.<0>, r2.<1>, (r2.<3> - 1), r2.<2>, r2.<3>
106034590 ~1% {4} r4 = JOIN r3 WITH project#SSAConstruction::DefUse::hasDefinitionAtRank#fffff AS R ON FIRST 3 OUTPUT r3.<0>, r3.<1>, r3.<3>, r3.<4>
106034590 {4} r5 = MATERIALIZE r4 AS antijoin_rhs
24636107 ~3% {4} r6 = r2 AND NOT r5(r2.<0>, r2.<1>, r2.<2>, r2.<3>)
24636107 ~0% {5} r7 = JOIN r6 WITH SSAConstruction::DefUse::exitRank#fff AS R ON FIRST 2 OUTPUT r6.<0>, r6.<1>, r6.<2>, r6.<3>, R.<2>
2749441 ~0% {5} r8 = SELECT r7 ON r7.<3> <= r7.<4>
2749441 ~4% {4} r9 = SCAN r8 OUTPUT r8.<0>, r8.<1>, r8.<2>, r8.<3>
return r9
(1626s) - SSAConstruction::DefUse::definitionReachesRank#ffff_delta has 2749441 rows (order for disjuncts: delta=<standard>).
After:
(12s) Tuple counts for SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta:
130670697 ~0% {4} r1 = SCAN project#SSAConstruction::DefUse::hasDefinitionAtRank#fffff AS I OUTPUT I.<0>, I.<1>, I.<2>, (I.<2> + 1)
return r1
(12s) - SSAConstruction::DefUse::definitionReachesRank#ffff_delta has 130670697 rows (order for disjuncts: delta=<standard>).
(12s) Starting to evaluate predicate SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta/4[2]@fff64c (iteration 2)
(34s) Tuple counts for SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta:
108784031 ~0% {4} r1 = SSAConstruction::DefUse::definitionReachesRank#ffff#prev_delta AS L AND NOT SSAConstruction::DefUse::exitRank#fff AS R(L.<0>, L.<1>, L.<3>)
2749441 ~5% {4} r2 = r1 AND NOT project#SSAConstruction::DefUse::hasDefinitionAtRank#fffff AS R(r1.<0>, r1.<1>, r1.<3>)
2749441 ~4% {4} r3 = SCAN r2 OUTPUT r2.<0>, r2.<1>, r2.<2>, (r2.<3> + 1)
2749441 ~4% {4} r4 = r3 AND NOT SSAConstruction::DefUse::definitionReachesRank#ffff#prev AS R(r3.<0>, r3.<1>, r3.<2>, r3.<3>)
return r4
(34s) - SSAConstruction::DefUse::definitionReachesRank#ffff_delta has 2749441 rows (order for disjuncts: delta=<standard>).
Note that the row counts are exactly the same before and after.
The charpred of class `GVN` in `ASTValueNumbering.qll` got inlined into
the member predicate `getAnInstruction` and caused a tuple explosion on
Wireshark in the query `StrncpyFlippedArgs.ql`.
I interrupted the predicate after 10 minutes and got these intermediate
tuple counts:
(5208s) Tuple counts for ASTValueNumbering::GVN::getAnInstruction_dispred#ff:
8754900909 ~5% {3} r1 = JOIN ValueNumberingInternal::tvalueNumber#ff_10#join_rhs AS L WITH ValueNumberingInternal::tvalueNumber#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, L.<1>, L.<0>
4390274632 ~150085% {2} r2 = JOIN r1 WITH project#SSAConstruction::Cached::getInstructionUnconvertedResultExpression AS R ON FIRST 1 OUTPUT r1.<2>, r1.<1>
return r2
After this change, the `getAnInstruction` predicate is itself inlined,
like it should be. The new non-inlined charpred takes 2.1s and has these
tuple counts:
(2s) Tuple counts for ASTValueNumbering::GVN#f:
9158442 ~117% {1} r1 = JOIN project#SSAConstruction::Cached::getInstructionUnconvertedResultExpression AS L WITH ValueNumberingInternal::tvalueNumber#ff@staged_ext AS R ON FIRST 1 OUTPUT R.<1>
return r1
The code is now quadratic in the number of statements in a basic block,
whereas before it was quadratic in the number of _control-flow nodes_ in
a basic block.
The `iDominates` relation is directly on control-flow nodes, and its
transitive closure is far too large. It got compiled into a recursion
rather than `fastTC`, and I've observed that recursion to take about an
hour on a medium-size customer snapshot.
The fix is to check for dominance at the basic-block level.
Also rewrites `multi_assignment_points_to` to be a bit more readable.
I'm not entirely sure that we want an unknown instance of `object` rather than
just `UnknownInternal`. The latter gets filtered out in the characteristic
predicate for `Value`, though, so I opted for the slightly more permissive
variant.
The `cpp/missing-case-in-switch` performed badly on some snapshots, to
the extent where it was as slow as the most expensive IR stages
(example: ChakraCore). This commit makes it faster, removing a
`pragma[noopt]` along the way.
The intermediate tuple counts on a customer codebase drop from 84M to
3M, while the content hash of `getAMissingCase` is the same.
Before:
(124s) Tuple counts for Stmt::EnumSwitch::getAMissingCase#ff#antijoin_rhs:
20867789 ~0% {3} r1 = JOIN Stmt::SwitchStmt::getASwitchCase_dispred#ff AS L WITH Stmt::EnumSwitch::getAMissingCase#ff#shared AS R ON FIRST 1 OUTPUT L.<1>, R.<0>, R.<1>
20122830 ~0% {3} r2 = JOIN r1 WITH Stmt::SwitchCase::getExpr_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<2>
20122830 ~0% {3} r3 = JOIN r2 WITH Expr::Expr::getValue_dispred#ff AS R ON FIRST 1 OUTPUT r2.<2>, r2.<1>, R.<1>
83961918 ~0% {4} r4 = JOIN r3 WITH Enum::EnumConstant::getInitializer_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r3.<1>, r3.<0>, r3.<2>
83961918 ~0% {4} r5 = JOIN r4 WITH initialisers AS R ON FIRST 1 OUTPUT R.<2>, r4.<3>, r4.<1>, r4.<2>
234348 ~185% {2} r6 = JOIN r5 WITH Expr::Expr::getValue_dispred#ff AS R ON FIRST 2 OUTPUT r5.<2>, r5.<3>
return r6
...
(124s) Tuple counts for Stmt::EnumSwitch::getAMissingCase#ff:
663127 ~4% {2} r1 = Stmt::EnumSwitch::getAMissingCase#ff#shared AS L AND NOT Stmt::EnumSwitch::getAMissingCase#ff#antijoin_rhs AS R(L.<0>, L.<1>)
return r1
(124s) Registering Stmt::EnumSwitch::getAMissingCase#ff + [] with content 2060ff326cvhihcsvoph6k9divuv4
(124s) >>> Wrote relation Stmt::EnumSwitch::getAMissingCase#ff with 663127 rows and 2 columns.
After:
(5s) Tuple counts for Stmt::EnumSwitch::getAMissingCase_dispred#ff#antijoin_rhs:
746029 ~0% {2} r1 = JOIN Stmt::EnumSwitch::getAMissingCase_dispred#ff#shared AS L WITH Enum::Enum::getAnEnumConstant_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, L.<1>
3116197 ~2% {3} r2 = JOIN r1 WITH Enum::EnumConstant::getInitializer_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<0>
3116197 ~0% {3} r3 = JOIN r2 WITH initialisers AS R ON FIRST 1 OUTPUT R.<2>, r2.<1>, r2.<2>
3116197 ~311% {3} r4 = JOIN r3 WITH Expr::Expr::getValue_dispred#ff AS R ON FIRST 1 OUTPUT r3.<1>, R.<1>, r3.<2>
234348 ~185% {2} r5 = JOIN r4 WITH Stmt::EnumSwitch::matchesValue#ff AS R ON FIRST 2 OUTPUT r4.<0>, r4.<2>
return r5
(5s) Registering Stmt::EnumSwitch::getAMissingCase_dispred#ff#antijoin_rhs + [] with content 173483d71508vl534mvlr1g0ehi12
(5s) >>> Wrote relation Stmt::EnumSwitch::getAMissingCase_dispred#ff#antijoin_rhs with 82902 rows and 2 columns.
(5s) Starting to evaluate predicate Stmt::EnumSwitch::getAMissingCase_dispred#ff/2@ae4c0b
(5s) Tuple counts for Stmt::EnumSwitch::getAMissingCase_dispred#ff:
746029 ~2% {2} r1 = JOIN Stmt::EnumSwitch::getAMissingCase_dispred#ff#shared AS L WITH Enum::Enum::getAnEnumConstant_dispred#ff AS R ON FIRST 1 OUTPUT L.<1>, R.<1>
663127 ~4% {2} r2 = r1 AND NOT Stmt::EnumSwitch::getAMissingCase_dispred#ff#antijoin_rhs AS R(r1.<0>, r1.<1>)
return r2
(5s) Registering Stmt::EnumSwitch::getAMissingCase_dispred#ff + [] with content 2060ff326cvhihcsvoph6k9divuv4
(5s) >>> Wrote relation Stmt::EnumSwitch::getAMissingCase_dispred#ff with 663127 rows and 2 columns.
This turned out to be a fairly simple but easy to make bug. When we want to
figure out the value pointed-to in a multi-assignment, we look at the left hand
side to see what value from the right hand side we should assign. Unfortunately,
we accidentally attempted to look up this information in the _left hand side_ of
the assignment, resulting in no points-to information at all. The only thing
needed to fix this was to properly link up the left and right hand sides: using
the left hand side to figure out what index to look at, and then looking up the
points-to information for the corresponding place in the right hand side.
This query used two fastTC operations that were already somewhat
inefficient on their own but could send the evaluator into an OOM loop
when run in parallel without enough RAM.
The fix is to recurse manually, starting just from the expressions that
are potential candidates for alerts.
Move the stdlib tests from test/{2,3}/library-tests/ into /test/library-tests/,
and deal with version by using sys.version_info (results should be the same for
both versions).
six tests were moved from /library-tests/web/client/stdlib => /library-tests/web/client/six
To ease the rollout of this test, currently we only report missing points-to
information for nodes that either
- appear as an argument in a call to a function named `check`, or
- appear inside a scope where the first line is annotated with a comment ending
in "check".
The idea behind the second version is that once we have points-to running at a
level where no node inside a scope that _ought_ to have points-to is missing
this information, we can simply remove all uses of `check(...)` from inside this
scope, and annotate the entire scope with `# check`. Once this has been done for
the entire file, we can then remove all the comments and just require
_everything_ to be checked.
Note that I don't expect all nodes to have the need for points-to information.
For instance, there are nodes representing scope entry and exit, and for these
it doesn't make sense to require that they "point-to" anything. Similarly,
`NameNode` appearing in a "store" (i.e. as the left hand side of an assignment)
do not strictly need to have points-to information, although it might be more
intuitive if they did.
Thus, the `relevant_node` predicate will almost certainly need to be extended to
exclude these kinds of nodes.
Fixes issue https://github.com/github/codeql-c-analysis-team/issues/20
This change undoes the workaround in https://github.com/Semmle/ql/pull/2736, and replaces it with a fix for the underlying cause. The problem was that the IR construction code for side effects incorrectly assumed that `BufferAccessOpcode` included `SizedBufferAccessOpcode`. I think that was actually a perfectly reasonable assumption to make, so I changed the `Opcode` hierarchy to make it true.
Following the setup I invented for library-tests/taint/unpacking.
TestStep is still a bit annoying, since the output is not easy to eyeball; but
for now I guess we can live with it :)
I honestly didn't get the point of DistinctStringKinds.ql, other than showing we
can handle multiple taint kinds
This is a temporary fix!
Added minimal working example (MWE) as a regression, so it's easier to fix the
real problem.
only Python 3 is facing the problem -- and without --max-import-depth=1 the test
times out at 10 minutes :O
Introduced a regression, since the old code was:
```
predicate is_a_string_type(ClassObject seqtype) {
seqtype = theBytesType() and major_version() = 2
or
seqtype = theUnicodeType()
}
```
but *now* we're good!
+ Extend PropertyInternal.getSetter to handle non-decorator
+ Add PropertyInternal.getDeleter
It seems like a bit hacky way to do things, since we're not using the
PropertySetterOrDeleter class at all, but for now I'll leave it be.
e.(StrConst).isDocString() can only hold if e instanceof StrConst, since we have
that condition on the line above, we can safely remove this condition.
* Adds the...Type() predicates as foresight modernizations.
* Removes predicates that are not currently ported/portable
* Adds range types
* Update python/ql/src/semmle/python/objects/ObjectAPI.qll
Co-Authored-By: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
* Update python/ql/src/semmle/python/objects/ObjectAPI.qll
Co-Authored-By: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
* Swaps xType for just x, at least when it's new
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
The `isInCycle` predicate would take a long time on Wireshark with 6GB
RAM, sometimes OOMing in the fastTC HOP. Analyzing wireshark with 6GB is
important because that's the standard configuration on our Jenkins
workers. With this commit, I can analyze Wireshark with 6GB on my
laptop.
The `getNonPhiOperandDef` predicate on Wireshark is 34M tuples, while
`getDefIfHasNeighbors` is 11M tuples, and the TC of
`getDefIfHasNeighbors` is 23M tuples (487 MB).
When multiple results are available, we usually name the function
`getAnArgument` or `getASomething`. The support for django copied the way bottle
did things, so this commits cleans up both
Without this fix, running the full LGTM suite would get the IR evaluated
twice. That's because we have multiple IPA types and constructors with
the same name (like `TInstruction` and `MkIRFunction`), and the QL
compiler chooses how to disambiguate those names differently depending
on import order.
I've tested that the IR is only evaluated once now by running the whole
suite on a tiny project (jbj/magicrescue) and looking at the output of
perl -ne 'print if /^RESULTS IN:/ .. /^\[/ and not /^\[/' runSnapshotQueries-debug.log | sort |uniq -c |sort -n |less
Instructions that are removed from the normal value numbering recursion
because they have a duplicated type or AST element get unique value
numbers rather than going unnumbered. This ensures comparisons of value
numbers using `!=` hold for filtered instructions.
This entry in `.codeqlmanifest.json` was intended to allow
unpacking the CodeQL CLI as a subdirectory of `ql`, and things
would Just Work.
However, it is not necessary anymore because recent releases of
the CLI will search their own directory as a fallback
_independently_ of the parent directory.
On the contrary, removing this link will make internal testing
easier because you then run a test build of the CLI with
`--search-path` pointing to the `ql` checkout without inadvertently
making extractors in a _different_ build that is unpacked there visible.
On some snapshots, notably ffmpeg, the IR `ValueNumbering` recursion
would generate billions of tuples and eventually run out of space.
It turns out it was fairly common for an `Instruction` to get more than
one `ValueNumber` in the base cases for `VariableAddressInstruction` and
`InitializeParameterInstruction`, and it could also happen in an
instruction with more than one operand of the same `OperandTag`. When a
binary operation was applied to an instruction with `m` value numbers
and another instruction with `n` value numbers, the result would get
`m * n` value numbers. This led to doubly-exponential growth in the
number of value numbers in rare cases.
The underlying reason why a `VariableAddressInstruction` could get
multiple value numbers is that it was keyed on the associated
`IRVariable`, and the `IRVariable` is defined in part by the type of its
underlying `Variable` (or other AST element). If the extractor defines a
variable to have multiple types because of linker ambiguity, this leads
to the creation of multiple `IRVariable`s. That should ideally be solved
in `TIRVariable.qll`, but for now I've put a workaround in
`ValueNumberingInternal.qll` instead.
To remove the problem with instructions having multiple operands, the
construction in `Operand.qll` will now filter out any such operand. It
wasn't enough to apply that filter to the `raw` stage, so I've applied
it to all three stages.
Before I added `--max-import-depth=2`, there was a bit of trouble, where it
would alert on `from pkg_ok import foo2` -- since all the `pkg_ok.foo<n>`
modules were missing, I guess the analysis didn't make any assumptions on
whether `foo2` is a module or a regular attribute.
So I've been thinking a bit about import pkg_ok.foo1 after reading the Python
references for imports of submodules
https://docs.python.org/3/reference/import.html#submodules
> When a submodule is loaded using any mechanism (...) a binding is placed in the
parent module’s namespace to the submodule object. For example, if package spam
has a submodule foo, after importing spam.foo, spam will have an attribute foo
which is bound to the submodule.
That does at least explain what is going on here.
I feel that import pkg_ok.foo1 might be a very contrived example. In principle
it should be an alert, since the module pkg_ok ends up with an import of itself,
but my gut feeling is that in practice it's not a very important piece of code
to give alerts for. if we really care about giving these import related alerts,
we could probably add a new query for this pattern, as it's kind of surprising
that it works when you're just an ordinary python programmer.
This predicate got an unfortunate join order, leading to these tuple
counts on ElektraInitiative/libelektra:
(290s) Tuple counts for AliasedSSA::VariableMemoryLocation::coversEntireVariable_dispred#f:
57117 ~0% {3} r1 = SCAN IRType::IRType::getByteSize_dispred#ff AS I OUTPUT 0, (I.<1> * 8), I.<0>
421445272 ~0% {3} r2 = JOIN r1 WITH AliasedSSA::VariableMemoryLocation#fffffff_5601#join_rhs AS R ON FIRST 2 OUTPUT R.<3>, r1.<2>, R.<2>
103282 ~2% {1} r3 = JOIN r2 WITH AliasConfiguration::Allocation::getIRType_dispred#ff AS R ON FIRST 2 OUTPUT r2.<2>
return r3
With this commit, we get these tuple counts instead:
(0s) Tuple counts for AliasedSSA::VariableMemoryLocation::varIRTypeHasBitRange#bff:
361874 ~0% {3} r1 = SCAN AliasedSSA::VariableMemoryLocation#fffffff AS I OUTPUT I.<1>, 0, I.<0>
361874 ~0% {3} r2 = JOIN r1 WITH AliasConfiguration::Allocation::getIRType_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, 0, r1.<2>
361874 ~1% {3} r3 = JOIN r2 WITH IRType::IRType::getByteSize_dispred#ff AS R ON FIRST 1 OUTPUT r2.<2>, 0, (R.<1> * 8)
return r3
(0s) Tuple counts for AliasedSSA::VariableMemoryLocation::coversEntireVariable_dispred#f:
103282 ~2% {1} r1 = JOIN AliasedSSA::VariableMemoryLocation#fffffff_056#join_rhs AS L WITH AliasedSSA::VariableMemoryLocation::varIRTypeHasBitRange#bff AS R ON FIRST 3 OUTPUT L.<0>
103282 ~2% {1} r2 = STREAM DEDUP r1
return r2
We had these tuple counts on ElektraInitiative/libelektra (note that the
`modelFlow` predicate got inlined into
`simpleInstructionLocalFlowStep`):
(652s) Tuple counts for DataFlowUtil::simpleInstructionLocalFlowStep#ff:
...
19701 ~1% {4} r27 = JOIN r26 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r26.<2>, r26.<1>, r26.<0>
7908 ~0% {3} r28 = JOIN r27 WITH SSAConstruction::Cached::getInstructionIndex#ff@staged_ext AS R ON FIRST 2 OUTPUT r27.<0>, r27.<2>, r27.<3>
4023 ~0% {3} r29 = JOIN r28 WITH Instruction::WriteSideEffectInstruction#class#ff AS R ON FIRST 1 OUTPUT r28.<1>, r28.<2>, r28.<0>
...
1060807009 ~3% {3} r34 = JOIN r33 WITH SSAConstruction::Cached::getInstructionIndex#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r33.<1>, r33.<2>
15670 ~5% {2} r35 = JOIN r34 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 2 OUTPUT r34.<0>, r34.<2>
7973 ~0% {2} r36 = JOIN r35 WITH Instruction::ReadSideEffectInstruction::getSideEffectOperand_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r35.<1>
...
In this predicate there are two cases (`WriteSideEffectInstruction` and
`ReadSideEffectInstruction`) where we need to join on both the call and
the argument index of a side effect. It works well enough for the first
case, `WriteSideEffectInstruction`, where the call is joined on before
the index, but it explodes in the second case,
`ReadSideEffectInstruction`, where the index is joined first. To fix the
second case, and to guard against future optimizer accidents in the
first case, this commit changes both of those cases to use a new helper
predicate that makes it possible to join on both columns at once. The
resulting tuple counts are:
(3s) Tuple counts for DataFlowUtil::simpleInstructionLocalFlowStep#ff:
...
7908 ~0% {3} r27 = JOIN r26 WITH DataFlowUtil::getSideEffectFor#fff AS R ON FIRST 2 OUTPUT R.<2>, r26.<2>, r26.<0>
4023 ~0% {3} r28 = JOIN r27 WITH Instruction::WriteSideEffectInstruction#class#ff AS R ON FIRST 1 OUTPUT r27.<1>, r27.<2>, r27.<0>
...
15670 ~5% {2} r33 = JOIN r32 WITH DataFlowUtil::getSideEffectFor#fff AS R ON FIRST 2 OUTPUT R.<2>, r32.<2>
7973 ~0% {2} r34 = JOIN r33 WITH Instruction::ReadSideEffectInstruction::getSideEffectOperand_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r33.<1>
...
The bulge is now limited to a factor of two, and that's just because I
didn't write separate versions of `getSideEffectFor` for
`ReadSideEffectInstruction` and `WriteSideEffectInstruction`.
Avoids a Cartesian product on nodes:
```
[2020-02-07 11:01:22] (432s) Tuple counts for dom#DataFlowImpl::TPathNodeSink#ff:
0 ~0% {2} r1 = JOIN DataFlowImpl::Configuration::isSource_dispred#ff AS L WITH DataFlowImpl::Configuration::isSink_dispred#ff AS R ON FIRST 2 OUTPUT R.<1>, R.<0>
101611 ~0% {2} r2 = SCAN DataFlowImpl::PathNodeMid#class#ffffff AS I OUTPUT I.<5>, I.<0>
3534537047 ~3% {3} r3 = JOIN r2 WITH DataFlowImpl::Configuration::isSink_dispred#ff AS R ON FIRST 1 OUTPUT r2.<1>, R.<1>, R.<0>
251 ~41% {3} r4 = JOIN r3 WITH project#DataFlowImpl::pathStep#fffff AS R ON FIRST 2 OUTPUT R.<2>, r3.<2>, r3.<1>
251 ~50% {2} r5 = JOIN r4 WITH DataFlowImpl::TNil#ff_1#join_rhs AS R ON FIRST 1 OUTPUT r4.<2>, r4.<1>
251 ~50% {2} r6 = r1 \/ r5
323 ~67% {3} r7 = JOIN r6 WITH DataFlowImpl::flow#ff AS R ON FIRST 1 OUTPUT r6.<1>, r6.<0>, R.<1>
288 ~58% {3} r8 = SELECT r7 ON r7.<2> >= r7.<0>
251 ~53% {3} r9 = SELECT r8 ON r8.<2> <= r8.<0>
251 ~50% {2} r10 = SCAN r9 OUTPUT r9.<1>, r9.<0>
```
Joining on the enclosing callable before the kind is crucial, as witnessed by this pipeline:
```
[2020-02-06 17:58:21] (1086s) Starting to evaluate predicate DataFlowImplCommon::getReturnPosition#ff/2@83c546
[2020-02-06 18:53:16] (4382s) Tuple counts for DataFlowImplCommon::getReturnPosition#ff:
385478 ~1% {3} r1 = SCAN DataFlowImplCommon::Cached::TReturnPosition0#fff@staged_ext AS I OUTPUT I.<2>, I.<0>, I.<1>
385478 ~2% {3} r2 = JOIN r1 WITH DataFlowImplCommon::Cached::TReturnPosition0#fff_2#join_rhs AS R ON FIRST 1 OUTPUT r1.<2>, r1.<1>, r1.<0>
58638116860 ~0% {3} r3 = JOIN r2 WITH DataFlowImplCommon::ReturnNodeExt::getKind_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r2.<1>, r2.<2>
914049 ~0% {2} r4 = JOIN r3 WITH DataFlowImplCommon::returnNodeGetEnclosingCallable#ff AS R ON FIRST 2 OUTPUT r3.<0>, r3.<2>
return r4
```
We can't understand the real `six.py` file, so we have some internal plumbing
that enables us to handle six anyway. While updating that, I had a hell of a lot
of trouble with these tests.
What we actually want, is to see that we can understand what the values imported
from six are (i.e., their points-to information). I added a few more, that I
think would be useful. If we can figure out all of these, I don't actually care
if we're doing it by understanding the real `six.py` file, or by some internal
trick.
I verified that we don't get results with the real `six.py` file by disabling
our internal tricks, and putting a copy of six.py just next to test.py.
We used to have an other file that would list all the properties we knew and
their value, but that turned out to be a fragile and annoying test, since the
results differed from which version of python you ran it with (3.5 vs 3.8) and
which machine you ran it on (my machien vs jenkins). I don't care about the
results in this file, and I can certainly not eyeball it to see if it's correct
or not.
Also fixed a minor bug where we should have been treating `AllNonLocalMemory` as _totally_ overlapping an access to a non-local variable, rather than _partially_ overlapping it. This fix is exhibited both in the new test case and in a couple existing test functions in `ssa.cpp`.
The predicate
```
argumentValueFlowsThrough(ArgumentNode arg, OutNode out, CallContext cc)
```
has been generalized to
```
argumentValueFlowsThrough(
DataFlowCall call, ArgumentNode arg, Node out, ContentOption contentIn,
ContentOption contentOut
)
```
This enables us to summarize normal flow-through (as before), getters, setters,
as well as getter-setters.
Replacing `Value.booleanValue`. We wanted to match `Object.booleanValue` that
only gives a result if it is either `true` or `false`, but also wanted to keep
the flexibility to see if the Value _could_ be `true`/`false`. We don't have a
motivating usecase, so let's see if we ever need it :P
+ fix modernisation regression on py/jinja2/autoescape-false
Since `IntObjectInternal` extends `TInt`, and `TInt` is defined for all
instances of `Builtin.intValue`, and `Builtin.intValue` includes both `int` and
`long`, we don't need to handles Longs in a special manner, as we did in NumericObject.
It turns out that the evaluator will evaluate the GVN stage even when no
predicate from it is needed after optimization of the subsequent stages.
The GVN library is expensive to evaluate, and it'll become even more
expensive when we switch its implementation to IR.
This PR disables the use of GVN in `DataFlow::BarrierNode` for the AST
data-flow library, which should improve performance when evaluating a
single data-flow query on a snapshot with no cache. Precision decreases
slightly, leading to a new FP in the qltests.
There is no corresponding change for the IR data-flow library since IR
GVN is not very expensive.
We were hitting a combinatorial explosion in `hasDefinitionAtRank` for functions that contain a large number of string literals. The problem was that every `Chi` instruction for `AliasedVirtualVariable` was treated as a definition of every string literal. We already mark string literals as `isReadOnly()`, but we were allowing `AliasedVirtualVariable` to define read-only locations so that the `AliasedDefinition` instruction would provide the initial definition for all string literals.
To fix this, I've introduced the new `InitializeNonLocal` instruction, which is inserted in the prologue of every function right after `AliasedDefinition`. It provides the initial definition for every non-stack memory location, including read-only locations, but is never written to anywhere else. It is the conterpart of the `AliasedUse` instruction in the function epilogue, which represents the use of all non-stack memory after the function returns. I considered renaming `AliasedUse` to `ReturnNonLocal`, to match the `InitializeXXX`/`ReturnXXX` pattern we already use for parameters and indirections, but held off to avoid unnecessary churn. Any thoughts on whether I should make this name change?
This change has a significant speedup in evaluation time for a few of our troublesome databases:
`attnam/ivan`: 13%
`awslabs/s2n`: 26%
`SinaMostafanejad/OpenRDM`: 7%
`zcoinofficial/zcoin`: 8%
As a temporary workaround in the `DefaultTaintTracking` library, we
funnel flow across calls by conflating pointer and object both at the
caller and the callee.
The three cases in `adjustedSink` were deleted because they are now
covered by the one case for `ReadSideEffectInstruction` in
`instructionTaintStep`.
When enabling `DefaultTaintTracking`, this commit on top of #2736 has
the effect effect of recovering two lost results:
--- a/cpp/ql/test/query-tests/Security/CWE/CWE-119/semmle/tests/OverflowDestination.expected
+++ b/cpp/ql/test/query-tests/Security/CWE/CWE-119/semmle/tests/OverflowDestination.expected
@@ -1,2 +1,4 @@
| overflowdestination.cpp:30:2:30:8 | call to strncpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
| overflowdestination.cpp:46:2:46:7 | call to memcpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
+| overflowdestination.cpp:53:2:53:7 | call to memcpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
+| overflowdestination.cpp:64:2:64:7 | call to memcpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
In the internal repo, we recover one lost result. Additionally, there
are two queries that gain an extra source for an existing sink. I'll
classify that as noise. The new results look like this:
foo(argv); // this `argv` is a new source for the sink in `bar`
bar(argv); // this `argv` is the existing source for the sink in `bar`
The type of the source argument to `memcpy` is `void *`, and somehow
that meant that the copied object itself got type `void`. Since that has
size 0, the SSA construction did not model it as reading from the last
write.
This is probably not the right fix, but maybe it's good enough for now.
The right fix would ensure that the type reported by
`hasOperandMemoryAccess` is `UnknownType`.
When `DefaultTaintTracking.qll` is enabled, this commit has the effect
of restoring a lost results:
--- a/cpp/ql/test/query-tests/Security/CWE/CWE-119/semmle/tests/OverflowDestination.expected
+++ b/cpp/ql/test/query-tests/Security/CWE/CWE-119/semmle/tests/OverflowDestination.expected
@@ -1 +1,2 @@
| overflowdestination.cpp:30:2:30:8 | call to strncpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
+| overflowdestination.cpp:46:2:46:7 | call to memcpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
Some of the tests currently fail, since they can't reproduce the old tests
results (since the sinks/sources defined in the library code are not
HttpResponseTaintSink/HttpRequestTaintSource)
This commit undoes the code sharing between `TranslatedAssignExpr` (`=`)
and `TranslatedAssignOperation` (`+=`, `<<=`, ...). In the next commit,
when we change how the `Load` works on the LHS of
`TranslatedAssignOperation`, these classes will become so different that
sharing is no longer helpful.
Previously, the `Load` would be associated with the `CrementOperation`
rather than its operand, which gave surprising results when mapping
taint sinks back to `Expr`.
The changes in `raw_ir.expected` are to add `Copy` operations on the
`x++` in code like `y = x++`. This is now needed because the result that
`x++` would otherwise have (the Load) no longer belongs to the `++`
expression. Copies are inserted to ensure that all expressions are
associated with an `Instruction` result.
The changes in `*aliased_ssa_ir.expected` appear to be just wobble.
Performance on `taers232c/GAMADV-X` (which exhibited pathological behaviour in
the most recent dist upgrade) went from ~670s to ~313s on
`py/hardcoded-credentials`.
There are still a few tuple counts in the 10-100 million range, but this commit
takes care of all of the ones that numbered in the billions. (A single tuple
count in the 100-1000 million range remains, but it appears to be less critical,
taking only two seconds to calculate.)
This code of conduct outlines expectations for participation in the Semmle open source community, including any open source repositories on GitHub.com, as well as steps for reporting unacceptable behavior. We are committed to providing a welcoming and inspiring community for all.
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
People violating this code of conduct may be banned from the community.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
Our community strives to:
* Be friendly and patient: Remember you might not be communicating in someone else’s primary spoken or programming language, and others may not have your level of understanding.
* Be welcoming: Our community welcomes and supports people of all backgrounds and identities. This includes, but is not limited to members of any race, ethnicity, culture, national origin, color, immigration status, social and economic class, educational level, sex, sexual orientation, gender identity and expression, age, size, family status, political belief, religion, and mental and physical ability.
* Be respectful: We are a world-wide community of professionals, and we conduct ourselves professionally. Disagreement is no excuse for poor behavior and poor manners. Disrespectful and unacceptable behavior includes, but is not limited to:
* Violent threats or language.
* Discriminatory or derogatory jokes and language.
* Posting sexually explicit or violent material.
* Posting, or threatening to post, people’s personally identifying information (“doxing”).
* Insults, especially those using discriminatory terms or slurs.
* Behavior that could be perceived as sexual attention.
* Advocating for or encouraging any of the above behaviors.
* Understand disagreements: Disagreements, both social and technical, are useful learning opportunities. Seek to understand others’ viewpoints and resolve differences constructively.
## Our Standards
This code is not exhaustive or complete. It serves to capture our common understanding of a productive, collaborative environment. We expect the code to be followed in spirit as much as in the letter.
Examples of behavior that contributes to a positive environment for our
community include:
# Scope
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
This code of conduct applies to all repositories and communities for Semmle open source projects, regardless of whether or not the repository explicitly calls out its use of this code. The code also applies in public spaces when an individual is representing the Semmle open source community. Examples include using an official project email address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
# Reporting Code of Conduct Issues
We encourage members of the community to resolve issues on their own whenever possible. This builds a broader and deeper understanding and ultimately a healthier interaction. In the event that an issue cannot be resolved locally, please feel free to report your concerns by contacting code-of-conduct@semmle.com.
In your report please include:
* Your contact information.
* Names (real, usernames or pseudonyms) of any individuals involved. If there are additional witnesses, please include them as well.
* Your account of what occurred, and if you believe the incident is ongoing. If there is a publicly available record (e.g. a mailing list archive or a public chat log), please include a link or attachment.
* Any additional information that may be helpful.
## Enforcement Responsibilities
All reports will be reviewed by a multi-person team and will result in a response that is deemed necessary and appropriate to the circumstances. Where additional perspectives are needed, the team may seek insight from others with relevant expertise or experience. The confidentiality of the person reporting the incident will be kept at all times. Involved parties are never part of the review team.
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Anyone asked to stop unacceptable behavior is expected to comply immediately. If an individual engages in unacceptable behavior, the review team may take any action they deem appropriate, including a permanent ban from the community.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
*This text is licensed under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. It is based on a template established by the [TODO Group](http://todogroup.org/) and variants thereof used by numerous other large communities (e.g., [Microsoft](https://microsoft.github.io/codeofconduct/), [Facebook](https://code.fb.com/codeofconduct/), [Yahoo](https://yahoo.github.io/codeofconduct), [Twitter](https://github.com/twitter/code-of-conduct), [GitHub](https://blog.github.com/2015-07-20-adopting-the-open-code-of-conduct/)) and the Scope section from the [Contributor Covenant version 1.4](http://contributor-covenant.org/version/1/4/).*
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
opensource@github.com.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
We welcome contributions to our standard library and standard checks. Got an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request!
We welcome contributions to our CodeQL libraries and queries. Got an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Contributions to this project are [released](https://help.github.com/articles/github-terms-of-service/#6-contributions-under-repository-license) to the public under the [project's open source license](LICENSE).
Before we accept your pull request, we require that you have agreed to our Contributor License Agreement, this is not something that you need to do before you submit your pull request, but until you've done so, we will be unable to accept your contribution.
There is lots of useful documentation to help you write queries, ranging from information about query file structure to tutorials for specific target languages. For more information on the documentation available, see [CodeQL queries](https://help.semmle.com/QL/learn-ql/writing-queries/writing-queries.html) on [help.semmle.com](https://help.semmle.com).
## Adding a new query
If you have an idea for a query that you would like to share with other Semmle users, please open a pull request to add it to this repository.
Follow the steps below to help other users understand what your query does, and to ensure that your query is consistent with the other Semmle queries.
## Submitting a new experimental query
1.**Consult the documentation for query writers**
If you have an idea for a query that you would like to share with other CodeQL users, please open a pull request to add it to this repository. New queries start out in a `<language>/ql/src/experimental` directory, to which they can be merged when they meet the following requirements.
There is lots of useful documentation to help you write queries, ranging from information about query file structure to tutorials for specific target languages. For more information on the documentation available, see [Writing CodeQL queries](https://help.semmle.com/QL/learn-ql/writing-queries/writing-queries.html) on [help.semmle.com](https://help.semmle.com).
1.**Directory structure**
2.**Format your code correctly**
There are five language-specific query directories in this repository:
All of Semmle's standard queries and libraries are uniformly formatted for clarity and consistency, so we strongly recommend that all contributions follow the same formatting guidelines. If you use CodeQL for VS Code, you can autoformat your query in the [Editor](https://help.semmle.com/codeql/codeql-for-vscode/reference/editor.html#autoformatting). For more information, see the [CodeQL style guide](https://github.com/Semmle/ql/blob/master/docs/ql-style-guide.md).
* C/C++: `cpp/ql/src`
* C#:`csharp/ql/src`
* Java: `java/ql/src`
* JavaScript: `javascript/ql/src`
* Python: `python/ql/src`
3.**Make sure your query has the correct metadata**
Each language-specific directory contains further subdirectories that group queries based on their `@tags` or purpose.
- Experimental queries and libraries are stored in the `experimental` subdirectory within each language-specific directory in the [CodeQL repository](https://github.com/github/codeql). For example, experimental Java queries and libraries are stored in `java/ql/src/experimental` and any corresponding tests in `java/ql/test/experimental`.
- The structure of an `experimental` subdirectory mirrors the structure of its parent directory.
- Select or create an appropriate directory in `experimental` based on the existing directory structure of `experimental` or its parent directory.
Query metadata is used by Semmle's analysis to identify your query and make sure the query results are displayed properly.
The most important metadata to include are the `@name`, `@description`, and the `@kind`.
Other metadata properties (`@precision`, `@severity`, and `@tags`) are usually added after the query has been reviewed by Semmle staff.
For more information on writing query metadata, see the [Query metadata style guide](https://github.com/Semmle/ql/blob/master/docs/query-metadata-style-guide.md).
2.**Query metadata**
4.**Make sure the `select` statement is compatible with the query type**
- The query `@id` must conform to all the requirements in the [guide on query metadata](docs/query-metadata-style-guide.md#query-id-id). In particular, it must not clash with any other queries in the repository, and it must start with the appropriate language-specific prefix.
- The query must have a `@name` and `@description` to explain its purpose.
- The query must have a `@kind` and `@problem.severity` as required by CodeQL tools.
The `select` statement of your query must be compatible with the query type (determined by the `@kind` metadata property) for alert or path results to be displayed correctly in LGTM and CodeQL for VS Code.
For more information on `select` statement format, see [Introduction to query files](https://help.semmle.com/QL/learn-ql/writing-queries/introduction-to-queries.html#select-clause) on help.semmle.com.
For details, see the [guide on query metadata](docs/query-metadata-style-guide.md).
5.**Save your query in a `.ql` file in the correct language directory in this repository**
Make sure the `select` statement is compatible with the query `@kind`. See [About CodeQL queries](https://help.semmle.com/QL/learn-ql/writing-queries/introduction-to-queries.html#select-clause) on help.semmle.com.
There are five language-specific directories in this repository:
* C/C++: `ql/cpp/ql/src`
* C#:`ql/csharp/ql/src`
* Java: `ql/java/ql/src`
* JavaScript: `ql/javascript/ql/src`
* Python: `ql/python/ql/src`
3.**Formatting**
Each language-specific directory contains further subdirectories that group queries based on their `@tags` properties or purpose. Select the appropriate subdirectory for your new query, or create a new one if necessary.
- The queries and libraries must be autoformatted, for example using the "Format Document" command in [CodeQL for Visual Studio Code](https://help.semmle.com/codeql/codeql-for-vscode/procedures/about-codeql-for-vscode.html).
6.**Write a query help file**
4.**Compilation**
Query help files explain the purpose of your query to other users. Write your query help in a `.qhelp` file and save it in the same directory as your new query.
For more information on writing query help, see the [Query help style guide](https://github.com/Semmle/ql/blob/master/docs/query-help-style-guide.md).
- Compilation of the query and any associated libraries and tests must be resilient to future development of the [supported](docs/supported-queries.md) libraries. This means that the functionality cannot use internal libraries, cannot depend on the output of `getAQlClass`, and cannot make use of regexp matching on `toString`.
- The query and any associated libraries and tests must not cause any compiler warnings to be emitted (such as use of deprecated functionality or missing `override` annotations).
5.**Results**
- The query must have at least one true positive result on some revision of a real project.
Experimental queries and libraries may not be actively maintained as the [supported](docs/supported-queries.md) libraries evolve. They may also be changed in backwards-incompatible ways or may be removed entirely in the future without deprecation warnings.
After the experimental query is merged, we welcome pull requests to improve it. Before a query can be moved out of the `experimental` subdirectory, it must satisfy [the requirements for being a supported query](docs/supported-queries.md).
## Using your personal data
If you contribute to this project, we will record your name and email
address (as provided by you with your contributions) as part of the code
repositories, which might be made public. We might also use this information
to contact you in relation to your contributions, as well as in the
normal course of software development. We also store records of your
CLA agreements. Under GDPR legislation, we do this
on the basis of our legitimate interest in creating the CodeQL product.
If you contribute to this project, we will record your name and email address (as provided by you with your contributions) as part of the code repositories, which are public. We might also use this information to contact you in relation to your contributions, as well as in the normal course of software development. We also store records of CLA agreements signed in the past, but no longer require contributors to sign a CLA. Under GDPR legislation, we do this on the basis of our legitimate interest in creating the CodeQL product.
Please do get in touch (privacy@semmle.com) if you have any questions about
this or our data protection policies.
## Contributor License Agreement
This Contributor License Agreement (“Agreement”) is entered into between Semmle Limited (“Semmle,” “we” or “us” etc.), and You (as defined and further identified below).
Accordingly, You hereby agree to the following terms for Your present and future Contributions submitted to Semmle:
1.**Definitions**.
* "You" (or "Your") shall mean the Contribution copyright owner (whether an individual or organization) or legal entity authorized by the copyright owner that is making this Agreement with Semmle. For legal entities, the entity making a Contribution and all other entities that control, are controlled by, or are under common control with that entity are considered to be a single Contributor. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
* "Contribution(s)" shall mean the code, documentation or other original works of authorship, including any modifications or additions to an existing work, submitted by You to Semmle for inclusion in, or documentation of, any of the products or projects owned or managed by Semmle (the "Work(s)"). For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to Semmle or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, Semmle for the purpose of discussing and/or improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by You as "Not a Contribution."
2.**Grant of Copyright License**. You hereby grant to Semmle and to recipients of software distributed by Semmle a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works.
3.**Grant of Patent License**. You hereby grant to Semmle and to recipients of software distributed by Semmle a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution(s) alone or by combination of Your Contribution(s) with the Work to which such Contribution(s) was submitted. If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that Your Contribution, or the Work to which You have contributed, constitutes direct or contributory patent infringement, then any patent licenses granted to that entity under this Agreement for that Contribution or Work shall terminate as of the date such litigation is filed.
4.**Ownership**. Except as set out above, You keep all right, title, and interest in Your Contribution. The rights that You grant to us under this Agreement are effective on the date You first submitted a Contribution to us, even if Your submission took place before the date You entered this Agreement.
5.**Representations**. You represent and warrant that: (i) the Contributions are an original work and that You can legally grant the rights set out in this Agreement; (ii) the Contributions and Semmle’s exercise of any license rights granted hereunder, does not and will not, infringe the rights of any third party; (iii) You are not aware of any pending or threatened claims, suits, actions, or charges pertaining to the Contributions, including without limitation any claims or allegations that any or all of the Contributions infringes, violates, or misappropriate the intellectual property rights of any third party (You further agree that You will notify Semmle immediately if You become aware of any such actual or potential claims, suits, actions, allegations or charges).
6.**Employer**. If Your employer(s) has rights to intellectual property that You create that includes Your Contributions, You represent and warrant that Your employer has waived such rights for Your Contributions to Semmle, or that You have received permission to make Contributions on behalf of that employer and that You are authorized to execute this Agreement on behalf of Your employer.
7.**Inclusion of Code**. We determine the code that is in our Works. You understand that the decision to include the Contribution in any project or source repository is entirely that of Semmle, and this agreement does not guarantee that the Contributions will be included in any product.
8.**Disclaimer**. You are not expected to provide support for Your Contributions, except to the extent You desire to provide support. You may provide support for free, for a fee, or not at all. Except as set forth herein, and unless required by applicable law or agreed to in writing, You provide Your Contributions on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND.
9.**General**. The failure of either party to enforce its rights under this Agreement for any period shall not be construed as a waiver of such rights. No changes or modifications or waivers to this Agreement will be effective unless in writing and signed by both parties. In the event that any provision of this Agreement shall be determined to be illegal or unenforceable, that provision will be limited or eliminated to the minimum extent necessary so that this Agreement shall otherwise remain in full force and effect and enforceable. This Agreement shall be governed by and construed in accordance with the laws of the State of California in the United States without regard to the conflicts of laws provisions thereof. In any action or proceeding to enforce rights under this Agreement, the prevailing party will be entitled to recover costs and attorneys’ fees.
Please do get in touch (privacy@github.com) if you have any questions about this or our data protection policies.
This open source repository contains the standard CodeQL libraries and queries that power [LGTM](https://lgtm.com), and the other products that [Semmle](https://semmle.com) makes available to its customers worldwide.
This open source repository contains the standard CodeQL libraries and queries that power [LGTM](https://lgtm.com) and the other CodeQL products that [GitHub](https://github.com) makes available to its customers worldwide. For the queries, libraries, and extractor that power Go analysis, visit the [CodeQL for Go repository](https://github.com/github/codeql-go).
## How do I learn CodeQL and run queries?
@@ -9,8 +9,20 @@ You can use the [interactive query console](https://lgtm.com/help/lgtm/using-que
## Contributing
We welcome contributions to our standard library and standard checks. Do you have an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Before you do, though, please take the time to read our [contributing guidelines](CONTRIBUTING.md). You can also consult our [style guides](https://github.com/Semmle/ql/tree/master/docs) to learn how to format your code for consistency and clarity, how to write query metadata, and how to write query help documentation for your query.
We welcome contributions to our standard library and standard checks. Do you have an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Before you do, though, please take the time to read our [contributing guidelines](CONTRIBUTING.md). You can also consult our [style guides](https://github.com/github/codeql/tree/master/docs) to learn how to format your code for consistency and clarity, how to write query metadata, and how to write query help documentation for your query.
## License
The code in this repository is licensed under [Apache License 2.0](LICENSE) by [Semmle](https://semmle.com).
The code in this repository is licensed under the [MIT License](LICENSE) by [GitHub](https://github.com).
## Visual Studio Code integration
If you use Visual Studio Code to work in this repository, there are a few integration features to make development easier.
### CodeQL for Visual Studio Code
You can install the [CodeQL for Visual Studio Code](https://marketplace.visualstudio.com/items?itemName=GitHub.vscode-codeql) extension to get syntax highlighting, IntelliSense, and code navigation for the QL language, as well as unit test support for testing CodeQL libraries and queries.
### Tasks
The `.vscode/tasks.json` file defines custom tasks specific to working in this repository. To invoke one of these tasks, select the `Terminal | Run Task...` menu option, and then select the desired task from the dropdown. You can also invoke the `Tasks: Run Task` command from the command palette.
| Boost\_asio TLS Settings Misconfiguration (`cpp/boost/tls-settings-misconfiguration`) | Query id change | The identifier was updated to use dashes in place of underscores (previous identifier `cpp/boost/tls_settings_misconfiguration`). |
| Buffer not sufficient for string (`cpp/overflow-calculated`) | More true positive results | This query now identifies a wider variety of buffer allocations using the `semmle.code.cpp.models.interfaces.Allocation` library. |
| No space for zero terminator (`cpp/no-space-for-terminator`) | More true positive results | This query now identifies a wider variety of buffer allocations using the `semmle.code.cpp.models.interfaces.Allocation` library. |
| Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) | | This query is no longer run on LGTM. |
| Memory is never freed (`cpp/memory-never-freed`) | More true positive results | This query now identifies a wider variety of buffer allocations using the `semmle.code.cpp.models.interfaces.Allocation` library. |
| Memory may not be freed (`cpp/memory-may-not-be-freed`) | More true positive results | This query now identifies a wider variety of buffer allocations using the `semmle.code.cpp.models.interfaces.Allocation` library. |
| Missing return statement (`cpp/missing-return`) | Fewer false positive results | Functions containing `asm` statements are no longer highlighted by this query. |
| Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) | | This query is no longer run on LGTM. |
| No space for zero terminator (`cpp/no-space-for-terminator`) | Fewer false positive results | This query has been modified to be more conservative when identifying which pointers point to null-terminated strings. This approach produces fewer, more accurate results. |
| Missing return statement (`cpp/missing-return`) | Fewer false positive results and more accurate locations | Functions containing `asm` statements are no longer highlighted by this query. The locations reported by this query are now more accurate in some cases. |
| No space for zero terminator (`cpp/no-space-for-terminator`) | More results with greater precision | The query gives more precise results for a wider variety of buffer allocations. String arguments to formatting functions are now (usually) expected to be null terminated strings. Use of the `semmle.code.cpp.models.interfaces.Allocation` library identifies problems with a wider variety of buffer allocations. This query is also more conservative when identifying which pointers point to null-terminated strings. |
| Overflow in uncontrolled allocation size (`cpp/uncontrolled-allocation-size`) | Fewer false positive results | The query now produces fewer, more accurate results. Cases where the tainted allocation size is range checked are more reliably excluded. |
| Overloaded assignment does not return 'this' (`cpp/assignment-does-not-return-this`) | Fewer false positive results | This query no longer reports incorrect results in template classes. |
| Pointer overflow check (`cpp/pointer-overflow-check`),<br> Possibly wrong buffer size in string copy (`cpp/bad-strncpy-size`),<br> Signed overflow check (`cpp/signed-overflow-check`) | More correct results | A new library is used for determining which expressions have identical value, giving more precise results. There is a performance cost to this, and the LGTM suite will overall run slower than before. |
| Unsafe array for days of the year (`cpp/leap-year/unsafe-array-for-days-of-the-year`) | | This query is no longer run on LGTM. |
| Unsigned comparison to zero (`cpp/unsigned-comparison-zero`) | More correct results | This query now also looks for comparisons of the form `0 <= x`. |
## Changes to libraries
* The data-flow library has been improved when flow through functions needs to be
combined with both taint tracking and flow through fields allowing more flow
to be tracked. This affects and improves some security queries, which may
report additional results.
* Created the`semmle.code.cpp.models.interfaces.Allocation` library to model allocation such as `new` expressions and calls to `malloc`. This in intended to replace the functionality in `semmle.code.cpp.commons.Alloc` with a more consistent and useful interface.
* Created the `semmle.code.cpp.models.interfaces.Deallocation` library to model deallocation such as `delete` expressions and calls to `free`. This in intended to replace the functionality in `semmle.code.cpp.commons.Alloc` with a more consistent and useful interface.
* The built-in C++20 "spaceship operator" (`<=>`) is now supported via the QL
class `SpaceshipExpr`. Overloaded forms are modeled as calls to functions
named `operator<=>`.
* The data-flow library (`semmle.code.cpp.dataflow.DataFlow` and
`semmle.code.cpp.dataflow.TaintTracking`) has been improved, which affects
and improves some security queries. The improvements are:
- Track flow through functions that combine taint tracking with flow through fields.
- Track flow through clone-like functions, that is, functions that read contents of a field from a
parameter and stores the value in the field of a returned object.
* The security pack taint tracking library
(`semmle.code.cpp.security.TaintTracking`) uses a new intermediate
representation. This provides a more precise analysis of flow through
parameters and pointers. For new queries, however, we continue to recommend
using `semmle.code.cpp.dataflow.TaintTracking`.
* The global value numbering library
(`semmle.code.cpp.valuenumbering.GlobalValueNumbering`) uses a new
intermediate representation to provide a more precise analysis of
heap-allocated memory and pointers to stack variables.
* New libraries have been created to provide a more consistent and useful interface
for modeling allocation and deallocation. These replace the old
`semmle.code.cpp.commons.Alloc` library.
* The new `semmle.code.cpp.models.interfaces.Allocation` library models
allocations, such as `new` expressions and calls to `malloc`.
* The new `semmle.code.cpp.models.interfaces.Deallocation` library
models deallocations, such as `delete` expressions and calls to `free`.
* The predicate `freeCall` in `semmle.code.cpp.commons.Alloc` has been
deprecated. The `Allocation` and `Deallocation` models in
`semmle.code.cpp.models.interfaces` should be used instead.
* The new class `StackVariable` should be used in place of `LocalScopeVariable`
in most cases. The difference is that `StackVariable` does not include
variables declared with `static` or `thread_local`.
* As a rule of thumb, custom queries about the _values_ of variables should
be changed from `LocalScopeVariable` to `StackVariable`, while queries
about the _name or scope_ of variables should remain unchanged.
* The `LocalScopeVariableReachability` library is deprecated in favor of
`StackVariableReachability`. The functionality is the same.
* The models library models `strlen` in more detail, and includes common variations such as `wcslen`.
* The taint tracking library (`semmle.code.cpp.dataflow.TaintTracking`) has had
the following improvements:
*The library now models data flow through `strdup` and similar functions.
*The library now models data flow through formatting functions such as `sprintf`.
* As a rule of thumb, custom queries about the _values_ of variables should
be changed from `LocalScopeVariable` to `StackVariable`, while queries
about the _name or scope_ of variables should remain unchanged.
* The `LocalScopeVariableReachability` library is deprecated in favor of
`StackVariableReachability`. The functionality is the same.
* Taint tracking and data flow now features better modeling of commonly-used
| Assembly path injection (`cs/assembly-path-injection`) | security, external/cwe/cwe-114 | Finds user-controlled data used to load an assembly. |
| Insecure configuration for ASP.NET requestValidationMode (`cs/insecure-request-validation-mode`) | security, external/cwe/cwe-016 | Finds where this attribute has been set to a value less than 4.5, which turns off some validation features and makes the application less secure. |
| Page request validation is disabled (`cs/web/request-validation-disabled`) | security, frameworks/asp.net, external/cwe/cwe-016 | Finds where ASP.NET page request validation has been disabled, which could make the application less secure. |
| Serialization check bypass (`cs/serialization-check-bypass`) | security, external/cwe/cwe-20 | Finds where data is not validated in a deserialization method. |
| XML injection (`cs/xml-injection`) | security, external/cwe/cwe-091 | Finds user-controlled data that is used to write directly to an XML document. |
| Assembly path injection (`cs/assembly-path-injection`) | security, external/cwe/cwe-114 | Finds user-controlled data used to load an assembly. Results are shown on LGTM by default. |
| Insecure configuration for ASP.NET requestValidationMode (`cs/insecure-request-validation-mode`) | security, external/cwe/cwe-016 | Finds where this attribute has been set to a value less than 4.5, which turns off some validation features and makes the application less secure. By default, the query is not run on LGTM. |
| Insecure SQL connection (`cs/insecure-sql-connection`) | security, external/cwe/cwe-327 | Finds unencrypted SQL connection strings. Results are not shown on LGTM by default. |
| Page request validation is disabled (`cs/web/request-validation-disabled`) | security, frameworks/asp.net, external/cwe/cwe-016 | Finds where ASP.NET page request validation has been disabled, which could make the application less secure. By default, the query is not run on LGTM. |
| Serialization check bypass (`cs/serialization-check-bypass`) | security, external/cwe/cwe-20 | Finds where data is not validated in a deserialization method. Results are not shown on LGTM by default. |
| XML injection (`cs/xml-injection`) | security, external/cwe/cwe-091 | Finds user-controlled data that is used to write directly to an XML document. Results are shown on LGTM by default. |
| Useless assignment to local variable (`cs/useless-assignment-to-local`) | Fewer false positive results | Results have been removed when the variable is named `_` in a `foreach` statement. |
| Dereferenced variable may be null (`cs/dereferenced-value-may-be-null`) | More results | Results are reported from parameters with a default value of `null`. |
## Removal of old queries
| Information exposure through an exception (`cs/information-exposure-through-exception`) | More results | The query now recognizes writes to cookies, writes to ASP.NET (`Inner`)`Text` properties, and email contents as additional sinks. |
| Information exposure through transmitted data (`cs/sensitive-data-transmission`) | More results | The query now recognizes writes to cookies and writes to ASP.NET (`Inner`)`Text` properties as additional sinks. |
| Potentially dangerous use of non-short-circuit logic (`cs/non-short-circuit`) | Fewer false positive results | Results have been removed when the expression contains an `out` parameter. |
| Useless assignment to local variable (`cs/useless-assignment-to-local`) | Fewer false positive results | Results have been removed when the value assigned is an (implicitly or explicitly) cast default-like value. For example, `var s = (string)null` and `string s = default`. Results have also been removed when the variable is named `_` in a `foreach` statement. |
| XPath injection (`cs/xml/xpath-injection`) | More results | The query now recognizes calls to methods on `System.Xml.XPath.XPathNavigator` objects. |
## Changes to code extraction
* Tuple expressions, for example `(int,bool)` in `default((int,bool))` are now extracted correctly.
* Expression nullability flow state is extracted.
* Expression nullability flow state is extracted.
* Implicitly typed `stackalloc` expressions are now extracted correctly.
* The difference between `stackalloc` array creations and normal array creations is extracted.
## Changes to libraries
* The data-flow library has been improved when flow through methods needs to be
combined with both taint tracking and flow through fields allowing more flow
to be tracked. This affects and improves most security queries, which may
report additional results.
* The data-flow library has been improved, which affects and improves most security queries. The improvements are:
- Track flow through methods that combine taint tracking with flow through fields.
- Track flow through clone-like methods, that is, methods that read the contents of a field from a
parameter and store the value in the field of a returned object.
* The taint tracking library now tracks flow through (implicit or explicit) conversion operator calls.
* [Code contracts](https://docs.microsoft.com/en-us/dotnet/framework/debug-trace-profile/code-contracts) are now recognized, and are treated like any other assertion methods.
* Expression nullability flow state is given by the predicates `Expr.hasNotNullFlowState()` and `Expr.hasMaybeNullFlowState()`.
## Changes to autobuilder
*`stackalloc` array creations are now represented by the QL class `Stackalloc`. Previously they were represented by the class `ArrayCreation`.
* A new class `RemoteFlowSink` has been added to model sinks where data might be exposed to external users. Examples include web page output, emails, and cookies.
| Disabled Spring CSRF protection (`java/spring-disabled-csrf-protection`) | security, external/cwe/cwe-352 | Finds disabled Cross-Site Request Forgery (CSRF) protection in Spring. |
| Disabled Spring CSRF protection (`java/spring-disabled-csrf-protection`) | security, external/cwe/cwe-352 | Finds disabled Cross-Site Request Forgery (CSRF) protection in Spring. Results are shown on LGTM by default. |
| Failure to use HTTPS or SFTP URL in Maven artifact upload/download (`java/maven/non-https-url`) | security, external/cwe/cwe-300, external/cwe/cwe-319, external/cwe/cwe-494, external/cwe/cwe-829 | Finds use of insecure protocols during Maven dependency resolution. Results are shown on LGTM by default. |
| LDAP query built from user-controlled sources (`java/ldap-injection`) | security, external/cwe/cwe-090 | Finds LDAP queries vulnerable to injection of unsanitized user-controlled input. Results are shown on LGTM by default. |
| Left shift by more than the type width (`java/lshift-larger-than-type-width`) | correctness | Finds left shifts of ints by 32 bits or more and left shifts of longs by 64 bits or more. Results are shown on LGTM by default. |
| Suspicious date format (`java/suspicious-date-format`) | correctness | Finds date format patterns that use placeholders that are likely to be incorrect. |
| Suspicious date format (`java/suspicious-date-format`) | correctness | Finds date format patterns that use placeholders that are likely to be incorrect. Results are shown on LGTM by default. |
| Dereferenced variable may be null (`java/dereferenced-value-may-be-null`) | Fewer false positives | Final fields with a non-null initializer are no longer reported. |
| Expression always evaluates to the same value (`java/evaluation-to-constant`) | Fewer false positives | Expressions of the form `0 * x` are usually intended and no longer reported. Also left shift of ints by 32 bits and longs by 64 bits are no longer reported as they are not constant, these results are instead reported by the new query `java/lshift-larger-than-type-width`. |
| Useless null check (`java/useless-null-check`) | More true positives | Useless checks on final fields with a non-null initializer are now reported. |
| Dereferenced variable may be null (`java/dereferenced-value-may-be-null`) | Fewer false positive results | Final fields with a non-null initializer are no longer reported. |
| Expression always evaluates to the same value (`java/evaluation-to-constant`) | Fewer false positive results | Expressions of the form `0 * x` are usually intended and no longer reported. Also left shift of ints by 32 bits and longs by 64 bits are no longer reported as they are not constant, these results are instead reported by the new query `java/lshift-larger-than-type-width`. |
| Useless null check (`java/useless-null-check`) | More true positive results | Useless checks on final fields with a non-null initializer are now reported. |
## Changes to libraries
* The data-flow library has been improved when flow through methods needs to be
combined with both taint tracking and flow through fields allowing more flow
to be tracked. This affects and improves most security queries, which may
report additional results.
* The data-flow library has been improved, which affects and improves most security queries. The improvements are:
- Track flow through methods that combine taint tracking with flow through fields.
- Track flow through clone-like methods, that is, methods that read contents of a field from a
parameter and stores the value in the field of a returned object.
* Identification of test classes has been improved. Previously, one of the
match conditions would classify any class with a name containing the string
"Test" as a test class, but now this matching has been replaced with one that
@@ -36,6 +38,6 @@ The following changes in version 1.24 affect Java analysis in all applications.
general file classification mechanism and thus suppression of alerts, and
also any security queries using taint tracking, as test classes act as
default barriers stopping taint flow.
* Parentheses are now no longer modelled directly in the AST, that is, the
* Parentheses are now no longer modeled directly in the AST, that is, the
`ParExpr` class is empty. Instead, a parenthesized expression can be
identified with the `Expr.isParenthesized()` member predicate.
| Cross-site scripting through exception (`js/xss-through-exception`) | security, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights potential XSS vulnerabilities where an exception is written to the DOM. Results are not shown on LGTM by default. |
| Regular expression always matches (`js/regex/always-matches`) | correctness, regular-expressions | Highlights regular expression checks that trivially succeed by matching an empty substring. Results are shown on LGTM by default. |
| Missing await (`js/missing-await`) | correctness | Highlights expressions that operate directly on a promise object in a nonsensical way, instead of awaiting its result. Results are shown on LGTM by default. |
| Prototype pollution in utility function (`js/prototype-pollution-utility`) | security, external/cwe/cwe-400, external/cwe/cwe-471 | Highlights recursive copying operations that are susceptible to prototype pollution. Results are shown on LGTM by default. |
| Polynomial regular expression used on uncontrolled data (`js/polynomial-redos`) | security, external/cwe/cwe-730, external/cwe/cwe-400 | Highlights expensive regular expressions that may be used on malicious input. Results are shown on LGTM by default. |
| Prototype pollution in utility function (`js/prototype-pollution-utility`) | security, external/cwe/cwe-400, external/cwe/cwe-471 | Highlights recursive assignment operations that are susceptible to prototype pollution. Results are shown on LGTM by default. |
| Regular expression always matches (`js/regex/always-matches`) | correctness, regular-expressions | Highlights regular expression checks that trivially succeed by matching an empty substring. Results are shown on LGTM by default. |
| Unsafe jQuery plugin (`js/unsafe-jquery-plugin`) | | Highlights potential XSS vulnerabilities in unsafely designed jQuery plugins. Results are shown on LGTM by default. |
| Unnecessary use of `cat` process (`js/unnecessary-use-of-cat`) | correctness, security, maintainability | Highlights command executions of `cat` where the fs API should be used instead. Results are shown on LGTM by default. |
| Clear-text logging of sensitive information (`js/clear-text-logging`) | More results | More results involving `process.env` and indirect calls to logging methods are recognized. |
| Duplicate parameter names (`js/duplicate-parameter-name`) | Fewer results | This query now recognizes additional parameters that reasonably can have duplicated names. |
| Incomplete string escaping or encoding (`js/incomplete-sanitization`) | Fewer false positive results | This query now recognizes additional cases where a single replacement is likely to be intentional. |
| Unbound event handler receiver (`js/unbound-event-handler-receiver`) | Fewer false positive results | This query now recognizes additional ways event handler receivers can be bound. |
| Duplicate parameter names (`js/duplicate-parameter-name`) | Fewer results | This query now ignores additional parameters that reasonably can have duplicated names. |
| Expression has no effect (`js/useless-expression`) | Fewer false positive results | The query now recognizes block-level flow type annotations and ignores the first statement of a try block. |
| Use of call stack introspection in strict mode (`js/strict-mode-call-stack-introspection`) | Fewer false positive results | The query no longer flags expression statements. |
| Identical operands (`js/redundant-operation`) | Fewer results | This query now excludes cases where the operands change a value using ++/-- expressions. |
| Incomplete string escaping or encoding (`js/incomplete-sanitization`) | Fewer false positive results | This query now recognizes and excludes additional cases where a single replacement is likely to be intentional. |
| Incomplete URL scheme check (`js/incomplete-url-scheme-check`) | More results | This query now recognizes additional variations of URL scheme checks. |
| Missing CSRF middleware (`js/missing-token-validation`) | Fewer false positive results | The query reports fewer duplicates and only flags handlers that explicitly access cookie data. |
| Superfluous trailing arguments (`js/superfluous-trailing-arguments`) | Fewer results | This query now excludes cases where a function uses the `Function.arguments` value to process a variable number of parameters. |
| Syntax error (`js/syntax-error`) | Lower severity | This results of this query are now displayed with lower severity. |
| Unbound event handler receiver (`js/unbound-event-handler-receiver`) | Fewer false positive results | This query now recognizes additional ways event handler receivers can be bound. |
| Uncontrolled command line (`js/command-line-injection`) | More results | This query now recognizes additional ways of constructing arguments to `cmd.exe` and `/bin/sh`. |
| Uncontrolled data used in path expression (`js/path-injection`) | More results | This query now recognizes additional ways dangerous paths can be constructed and used. |
| Use of call stack introspection in strict mode (`js/strict-mode-call-stack-introspection`) | Fewer false positive results | The query no longer flags expression statements. |
| Use of password hash with insufficient computational effort (`js/insufficient-password-hash`) | Fewer false positive results | This query now recognizes and excludes additional cases that do not require secure hashing. |
| Useless regular-expression character escape (`js/useless-regexp-character-escape`) | Fewer false positive results | This query now distinguishes between escapes in strings and regular expression literals. |
## Changes to libraries
* The predicates `RegExpTerm.getSuccessor` and `RegExpTerm.getPredecessor` have been changed to reflect textual, not operational, matching order. This only makes a difference in lookbehind assertions, which are operationally matched backwards. Previously, `getSuccessor` would mimick this, so in an assertion `(?<=ab)` the term `b` would be considered the predecessor, not the successor, of `a`. Textually, however, `a` is still matched before `b`, and this is the order we now follow.
* An extensible model of the `EventEmitter` pattern has been implemented.
* Taint-tracking configurations now interact differently with the `data` flow label, which may affect queries
that combine taint-tracking and flow labels.
- Sources added by the 1-argument `isSource` predicate are associated with the `taint` label now, instead of the `data` label.
- Sanitizers now only block the `taint` label. As a result, sanitizers no longer block the flow of tainted values wrapped inside a property of an object.
To retain the old behavior, instead use a barrier, or block the `data` flow label using a labeled sanitizer.
| Arbitrary file write during tarfile extraction (`py/tarslip`) | Fewer false negative results | Negations are now handled correctly in conditional expressions that may sanitize tainted values. |
| First parameter of a method is not named 'self' (`py/not-named-self`) | Fewer false positive results | `__class_getitem__` is now recognized as a class method. |
| Import of deprecated module (`py/import-deprecated-module`) | Fewer false positive results | Deprecated modules that are used to provide backwards compatibility are no longer reported.|
| Module imports itself (`py/import-own-module`) | Fewer false positive results | Imports local to a given package are no longer classified as self-imports. |
| Uncontrolled command line (`py/command-line-injection`) | More results | We now model the `fabric` and `invoke` packages for command execution. |
### Web framework support
The CodeQL library has improved support for the web frameworks: Bottle, CherryPy, Falcon, Pyramid, TurboGears, Tornado, and Twisted. They now provide a proper `HttpRequestTaintSource`, instead of a `TaintSource`. This will enable results for the following queries:
- `py/path-injection`
- `py/command-line-injection`
- `py/reflective-xss`
- `py/sql-injection`
- `py/code-injection`
- `py/unsafe-deserialization`
- `py/url-redirection`
The library also has improved support for the web framework Twisted. It now provides a proper
`HttpResponseTaintSink`, instead of a `TaintSink`. This will enable results for the following
queries:
- `py/reflective-xss`
- `py/stack-trace-exposure`
## Changes to libraries
### Taint tracking
- The `urlsplit` and `urlparse` functions now propagate taint appropriately.
- HTTP requests using the `requests` library are now modeled.
* The library `VCS.qll` and all queries that imported it have been removed.
* The data-flow library has been improved, which affects most security queries by potentially
adding more results. Flow through functions now takes nested field reads/writes into account.
For example, the library is able to track flow from `taint()` to `sink()` via the method
`getf2f1()` in
```c
struct C {
int f1;
};
struct C2
{
C f2;
int getf2f1() {
return f2.f1; // Nested field read
}
void m() {
f2.f1 = taint();
sink(getf2f1()); // NEW: taint() reaches here
}
};
```
* The security pack taint tracking library (`semmle.code.cpp.security.TaintTracking`) now considers that equality checks may block the flow of taint. This results in fewer false positive results from queries that use this library.
* The length of a tainted string (such as the return value of a call to `strlen` or `strftime` with tainted parameters) is no longer itself considered tainted by the `models` library. This leads to fewer false positive results in queries that use any of our taint libraries.
| Cross-site scripting through DOM (`js/xss-through-dom`) | security, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights potential XSS vulnerabilities where existing text from the DOM is used as HTML. Results are not shown on LGTM by default. |
| Incomplete HTML attribute sanitization (`js/incomplete-html-attribute-sanitization`) | security, external/cwe/cwe-20, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights potential XSS vulnerabilities due to incomplete sanitization of HTML meta-characters. Results are shown on LGTM by default. |
| Unsafe expansion of self-closing HTML tag (`js/unsafe-html-expansion`) | security, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights potential XSS vulnerabilities caused by unsafe expansion of self-closing HTML tags. |
| Unsafe shell command constructed from library input (`js/shell-command-constructed-from-input`) | correctness, security, external/cwe/cwe-078, external/cwe/cwe-088 | Highlights potential command injections due to a shell command being constructed from library inputs. Results are shown on LGTM by default. |
| Client-side cross-site scripting (`js/xss`) | Fewer results | This query now recognizes additional safe patterns of constructing HTML. |
| Client-side URL redirect (`js/client-side-unvalidated-url-redirection`) | Fewer results | This query now recognizes additional safe patterns of doing URL redirects. |
| Code injection (`js/code-injection`) | More results | More potential vulnerabilities involving NoSQL code operators are now recognized. |
| Expression has no effect (`js/useless-expression`) | Fewer results | This query no longer flags an expression when that expression is the only content of the containing file. |
| Hard-coded credentials (`js/hardcoded-credentials`) | More results | This query now recognizes hard-coded credentials sent via HTTP authorization headers. |
| Incomplete URL scheme check (`js/incomplete-url-scheme-check`) | More results | This query now recognizes additional url scheme checks. |
| Misspelled variable name (`js/misspelled-variable-name`) | Message changed | The message for this query now correctly identifies the misspelled variable in additional cases. |
| Non-linear pattern (`js/non-linear-pattern`) | Fewer duplicates and message changed | This query now generates fewer duplicate alerts and has a clearer explanation in case of type annotations used in a pattern. |
| Prototype pollution in utility function (`js/prototype-pollution-utility`) | More results | This query now recognizes additional utility functions as vulnerable to prototype polution. |
| Uncontrolled command line (`js/command-line-injection`) | More results | This query now recognizes additional command execution calls. |
| Uncontrolled data used in path expression (`js/path-injection`) | More results | This query now recognizes additional file system calls. |
| Unknown directive (`js/unknown-directive`) | Fewer results | This query no longer flags directives generated by the Babel compiler. |
| Unused property (`js/unused-property`) | Fewer results | This query no longer flags properties of objects that are operands of `yield` expressions. |
| Zip Slip (`js/zipslip`) | More results | This query now recognizes additional vulnerabilities. |
The following low-precision queries are no longer run by default on LGTM (their results already were not displayed):
-`js/angular/dead-event-listener`
-`js/angular/unused-dependency`
-`js/bitwise-sign-check`
-`js/comparison-of-identical-expressions`
-`js/conflicting-html-attribute`
-`js/ignored-setter-parameter`
-`js/jsdoc/malformed-param-tag`
-`js/jsdoc/missing-parameter`
-`js/jsdoc/unknown-parameter`
-`js/json-in-javascript-file`
-`js/misspelled-identifier`
-`js/nested-loops-with-same-variable`
-`js/node/cyclic-import`
-`js/node/unused-npm-dependency`
-`js/omitted-array-element`
-`js/return-outside-function`
-`js/single-run-loop`
-`js/too-many-parameters`
-`js/unused-property`
-`js/useless-assignment-to-global`
## Changes to libraries
* A library `semmle.javascript.explore.CallGraph` has been added to help write queries for exploring the call graph.
* Added data flow for `Map` and `Set`, and added matching type-tracking steps that can accessed using the `CollectionsTypeTracking` module.
* The data-flow node representing a parameter or destructuring pattern is now always the `ValueNode` corresponding to that AST node. This has a few consequences:
-`Parameter.flow()` now gets the correct data flow node for a parameter. Previously this had a result, but the node was disconnected from the data flow graph.
-`ParameterNode.asExpr()` and `.getAstNode()` now gets the parameter's AST node, whereas previously it had no result.
-`Expr.flow()` now has a more meaningful result for destructuring patterns. Previously this node was disconnected from the data flow graph. Now it represents the values being destructured by the pattern.
* The global data-flow and taint-tracking libraries now model indirect parameter accesses through the `arguments` object in some cases, which may lead to additional results from some of the security queries, particularly "Prototype pollution in utility function".
* Importing `semmle.python.web.HttpRequest` will no longer import `UntrustedStringKind` transitively. `UntrustedStringKind` is the most commonly used non-abstract subclass of `ExternalStringKind`. If not imported (by one mean or another), taint-tracking queries that concern `ExternalStringKind` will not produce any results. Please ensure such queries contain an explicit import (`import semmle.python.security.strings.Untrusted`).
* @description Using the TLS or SSLv23 protocol from the boost::asio library, but not disabling deprecated protocols, or disabling minimum-recommended protocols.
<p>The code attempts to drop privilege in an incorrect order by
erroneous dropping user privilege before groups. This has security
impact if the return codes are not checked.</p>
<p>False positives include code performing negative checks, making
sure that setgid or setgroups does not work, meaning permissions are
dropped. Additionally, other forms of sandboxing may be present removing
any residual risk, for example a dedicated user namespace.</p>
</overview>
<recommendation>
<p>Set the new group ID, then set the target user's intended groups by
dropping previous supplemental source groups and initializing target
groups, and finally set the target user.</p>
</recommendation>
<example>
<p>The following example demonstrates out of order calls.</p>
<sample src="PrivilegeDroppingOutoforder.c" />
</example>
<references>
<li>CERT C Coding Standard:
<a href="https://wiki.sei.cmu.edu/confluence/display/c/POS37-C.+Ensure+that+privilege+relinquishment+is+successful">POS37-C. Ensure that privilege relinquishment is successful</a>.
* Provides a class and predicate for recognizing files that are likely to have been generated
* automatically.
*/
import semmle.code.cpp.Comments
import semmle.code.cpp.File
import semmle.code.cpp.Preprocessor
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.