The import doesn't actually work the intended way, so running
```
$ python python/ql/test/experimental/library-tests/CallGraph/test.py
```
will procude no output. but our extractor will extract the things we need, so
for a quick fix this will need to suffice.
Some method variants are captured by a super class. Added some comments
to indicate where this happens to make review of missing methods easier
in the future.
While most flow for a qualifierToArgumentStep goes through a variable use
this is not always the case. Therefore it is best to remove the restriction
to RValue to allow taint steps to use postupdate nodes.
See also: ba86dea657
Previously, we only considered an import relevant to portals if the path it imported was declared as a dependency. This falls down for deep imports where a specific module inside the package is imported rather than the default entry point, for imports of built-in modules like `fs`, and in cases where a developer simply forgets to declare a dependency.
So instead we now consider all imports relevant whose path does not start with a dot or a slash.
- The `UrlRedirect` class is renamed to `ServletUrlRedirect`.
- Abstract class `UrlRedirectSink` is defined that can be imported and
used to customise CWE-601 via Customizations.qll
On second thought, I'm going to make this apply only to the AutoBuilder directory. C# has it in the root of `csharp`, but they need it for their extractor as well.
C# has its own additional `.gitignore` to ignore the output files of the AutoBuilder build. Now that we have our own AutoBuilder in C++, we need the same thing.
Removes 'eclipse-cdt/cdt' and 'gradle/gradle' from the queried projects
because they cannot be queried currently, and instead queries all demo
projects which are currently available.
Removes 'gradle/gradle' from the queried projects because it cannot be
queried currently, and instead queries all demo projects which are currently
available.
Removes 'eclipse-cdt/cdt' and 'gradle/gradle' from the queried projects
because they cannot be queried currently, and instead queries all demo
projects which are currently available.
Removes 'eclipse-cdt/cdt' and 'gradle/gradle' from the queried projects
because they cannot be queried currently, and instead queries all demo
projects which are currently available.
Removes 'eclipse-cdt/cdt' and 'gradle/gradle' from the queried projects
because they cannot be queried currently, and instead queries all demo
projects which are currently available.
Removes 'eclipse-cdt/cdt' and 'gradle/gradle' from the queried projects
because they cannot be queried currently, and instead queries all demo
projects which are currently available.
Removes 'gradle/gradle' from the queried projects because it cannot be
queried currently, and instead queries all demo projects which are currently
available.
In this non-linear recursion, a `#prev` relation was joined earlier than
the `#prev_delta` relation. As a result, each iteration of the predicate
processes every tuple from previous iterations.
This quadratic behavior caused severe slowdowns on oneapi-src/oneDNN.
This relation was originally introduced to improve performance but may
no longer be necessary. The `localFlowStepPlus` predicate had an
explosion of tuples on oneapi-src/oneDNN for C++.
As discussed in today's C++ analysis team meeting. `Opcode` is rarely used directly, so we'll just refer to the documentation for the corresponding `Instruction` class.
I've preserved the script in case we want to do a bulk change of all of the `Opcode` comments, but I don't expect it will be needed if we just add a new `Opcode` or two.
Removed a dataflow cofiguration for setting a revocation checker.
Instead, the query just checks if addCertPathChecker() or setCertPathCheckers()
methods are called.
- Added experimental/Security/CWE/CWE-299/DisabledRevocationChecking.ql
The query looks for PKIXParameters.setRevocationEnabled(false) calls.
- Added RevocationCheckingLib.qll
- Added a qhelp file with examples
- Added tests in java/ql/test/experimental/Security/CWE/CWE-299
For every concrete `Opcode`, there is a corresponding `Instruction` class. Rather than duplicate all of the QLDoc by hand, I wrote a quick Python script to copy the QLDoc from `Instruction.qll` to `Opcode.qll`. I don't expect that we will need to do this often, so I'm not hooking it up to a PR check or anything like that, but I did commit the script itself in case we need it again.
- Renamed `DynamicCastToVoid` to the more descriptive `CompleteObjectAddress`
- Split verbose description from summary in a few Instruction QLDoc comments.
- Added `Instruction` classes for the few remaining `Opcode`s that didn't have one.
- Removed a use of "e.g."
#3546 changed the query but did not adjust the query link.
Additionally the old query could not be re-run because some of the projects it
targeted (gradle/gradle and eclipse-cdt/cdt) cannot be queried currently.
It now queries all available demo projects of the query console instead.
I think I've now documented every class and public predicate in `Instruction.qll` I've tried to include detailed semantics of each instruction where appropriate.
Rather, fulfiling the Liskov substitution principle is an opinionated
recommendation. Looking at `py/inheritance/incorrect-overridden-signature` and
`py/mixed-tuple-returns`, it seems very appropriate that this should have
`@severity recommendation`, and `@sub-severity high`.
There was unfortunately a semantic merge conflict between #3419 and
#3587 that caused a performance regression on (at least) OpenJDK.
This reverts commit 982fb38807, reversing
changes made to b841cacb83.
Replace most direct calls to `TRawInstruction()` with calls to `getInstructionTranslatedElement()` and `getInstructionTag()`, matching existing practice. One tiny RA diff in an inconsequential join order in `getInstructionVariable`.
This noopt predicate is no longer necessary. It's equivalent to `instruction = TRawInstruction(element, tag)`, which is already materialized and has a more favorable column order anyway.
These predicates are only used within the new single IR stage, so there's no need to cache them beyond that. RA diffs are trivial. Where previously many of the predicate on `Instruction` were inline wrappers around cached predicates from `IRConstruction`, now the predicates from `IRConstruction` get inlined into the `Instruction` predicates, and the `Instruction` predicates get materialized. The net amount of work is the same, but now it's not getting cached unnecessarily.
I think these queries have excellent results on lgtm.com. Many of the
results come from projects that use `sprintf` like it's a templating
engine, trusting that values from `argv` or `getenv` contain the correct
number of `%s`. I think we want to flag that.
The structure of the change note is modeled after 91af51cf46.
Methods annotated with a produces field which indicates a safe
content-type should not be considered XSS sinks. For example:
@RequestMapping(..., produces = "application/json")
When VS Code detects a Maven project, it automatically generates
a bunch of Eclipse files to describe the project.
These are now ignored in order to not pollute the repository
We were creating a `TranslatedFunction` even for functions that were not from source code, but then telling the IR package that those functions didn't have IR. This resulted in having prologue/epilogue instructions (e.g. `EnterFunction`, `ExitFunction`) with no enclosing `IRFunction`.
Some of our IR consistency failure query predicates already produced results in the schema as an `@kind problem` query, including `$@` replacements for the enclosing `IRFunction` to make it easier to figure out which function to dump when debugging. This change moves the rest of the query predicates in `IRConsistency.qll` to do the same. In addition, it wraps each call to `getEnclosingIRFunction()` to return an `OptionalIRFunction`, which can be either a real `IRFunction` or a placeholder in case `getEnclosingIRFunction()` returned no results. This exposes a couple new consistency failures in `syntax-zoo`, which will be fixed in a subsequent commit.
This change also deals with consistency failures when the enclosing `IRFunction` has more than one `Function` or `Location`. For multiple `Function`s, we concatenate the function names. For multiple `Location`s, we pick the first one in lexicographical order. This changes the number of results produced in the existing tests, but does't change the actual number of problems.
Generalize logic by recognizing not only calls to
`Microsoft.AspNetCore.Mvc.ViewFeatures.HtmlHelper.Raw()`, but calls to all `Raw()`
methods that implement `Microsoft.AspNetCore.Mvc.Rendering.IHtmlHelper.Raw()`.
Instead of a vague reference to a code comment for another language, the
`controlsBlock` predicate now has the whole comment in it directly.
I've adjusted the wording so it should be reasonably correct for C/C++.
As with the other comments in this file, I don't distinguish between the
condition and its block. I think that makes the explanation clearer
without losing any detail we care about.
To make the code fit the wording of the comment, I changed the
`hasBranchEdge/2` predicate into `getBranchSuccessor/1`.
On kamailio/kamailio the `DataFlowUtil::simpleInstructionLocalFlowStep`
predicate was slow because of the case for single-field structs, where
there was a large tuple-count bulge when joining with
`getFieldSizeOfClass`:
3552902 ~2% {2} r1 = SCAN Instruction::CopyInstruction::getSourceValueOperand_dispred#3#ff AS I OUTPUT I.<1>, I.<0>
2065347 ~2% {2} r35 = JOIN r1 WITH Operand::NonPhiMemoryOperand::getAnyDef_dispred#3#ff AS R ON FIRST 1 OUTPUT r1.<1>, R.<1>
2065827 ~2% {3} r36 = JOIN r35 WITH Instruction::Instruction::getResultType_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r35.<1>, r35.<0>
2065825 ~3% {3} r37 = JOIN r36 WITH Type::Type::getSize_dispred#ff AS R ON FIRST 1 OUTPUT r36.<1>, r36.<2>, R.<1>
2068334 ~2% {4} r38 = JOIN r37 WITH Instruction::Instruction::getResultType_dispred#3#ff AS R ON FIRST 1 OUTPUT R.<1>, r37.<2>, r37.<0>, r37.<1>
314603817 ~0% {3} r39 = JOIN r38 WITH DataFlowUtil::getFieldSizeOfClass#fff_120#join_rhs AS R ON FIRST 2 OUTPUT r38.<3>, R.<2>, r38.<2>
8 ~0% {2} r40 = JOIN r39 WITH Instruction::Instruction::getResultType_dispred#3#ff AS R ON FIRST 2 OUTPUT r39.<2>, r39.<0>
That's 314M tuples.
Strangely, there is no such bulge on more well-behaved snapshots like
mysql/mysql-server.
With this commit the explosion is gone:
...
2065825 ~0% {4} r37 = JOIN r36 WITH Type::Type::getSize_dispred#ff AS R ON FIRST 1 OUTPUT r36.<0>, R.<1>, r36.<1>, r36.<2>
1521 ~1% {3} r38 = JOIN r37 WITH DataFlowUtil::getFieldSizeOfClass#fff_021#join_rhs AS R ON FIRST 2 OUTPUT r37.<2>, R.<2>, r37.<3>
8 ~0% {2} r39 = JOIN r38 WITH Instruction::Instruction::getResultType_dispred#3#ff AS R ON FIRST 2 OUTPUT r38.<0>, r38.<2>
The `controlsBlock` predicate had some dramatic bulges in its tuple
counts. To make matters worse, those bulges were in materialized
intermediate predicates like `#shared` and `#antijoin_rhs`, not just in
the middle of a pipeline.
The problem was particularly evident on kamailio/kamailio, where
`controlsBlock` was the slowest predicate in the IR libraries:
IRGuards::IRGuardCondition::controlsBlock_dispred#fff#shared#4 ........ 58.8s
IRGuards::IRGuardCondition::controlsBlock_dispred#fff#antijoin_rhs .... 33.4s
IRGuards::IRGuardCondition::controlsBlock_dispred#fff#antijoin_rhs#1 .. 26.7s
The first of the above relations had 201M rows, and the others
had intermediate bulges of similar size.
The bulges could be observed even on small projects although they did
not cause measurable performance issues there. The
`controlsBlock_dispred#fff#shared#4` relation had 3M rows on git/git,
which is a lot for a project with only 1.5M IR instructions.
This commit borrows an efficient implementation from Java's
`Guards.qll`, tweaking it slightly to fit into `IRGuards`. Performance
is now much better:
IRGuards::IRGuardCondition::controlsBlock_dispred#fff ................... 6.1s
IRGuards::IRGuardCondition::hasDominatingEdgeTo_dispred#ff .............. 616ms
IRGuards::IRGuardCondition::hasDominatingEdgeTo_dispred#ff#antijoin_rhs . 540ms
After this commit, the biggest bulge in `controlsBlock` is the size of
`IRBlock::dominates`. On kamailio/kamailio this is an intermediate tuple
count of 18M rows in the calculation of `controlsBlock`, which in the
end produces 11M rows.
The optimizer picked a terrible join order in `VirtualDispatch::DataSensitiveCall::flowsFrom()`. Telling it that `getAnOutNode()` has a unique result convinces it to join first on the `Callable`, rather than on the `ReturnKind`.
There wasn't a good join order for the "store to global var" case in the
virtual dispatch library. When a global variable had millions of
accesses but few stores to it, the `flowsFrom` predicate would join to
see all those millions of accesses before filtering down to stores only.
The solution is to pull out a `storeIntoGlobal` helper predicate that
pre-computes which accesses are stores.
To make the code clearer, I've also pulled out a repeated chunk of code
into a new `addressOfGlobal` helper predicate.
For the kamailio/kamailio project, these are the tuple counts before:
Starting to evaluate predicate DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#cur_delta/3[3]@21a1df (iteration 3)
Tuple counts for DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#cur_delta:
...
59002 ~0% {3} r17 = SCAN DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#prev_delta AS I OUTPUT I.<1>, true, I.<0>
58260 ~1% {3} r31 = JOIN r17 WITH DataFlowUtil::Node::asVariable_dispred#fb AS R ON FIRST 1 OUTPUT R.<1>, true, r17.<2>
2536187389 ~6% {3} r32 = JOIN r31 WITH Instruction::VariableInstruction::getASTVariable_dispred#fb_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, true, r31.<2>
2536187389 ~6% {3} r33 = JOIN r32 WITH project#Instruction::VariableAddressInstruction#class#3#ff AS R ON FIRST 1 OUTPUT r32.<0>, true, r32.<2>
58208 ~0% {3} r34 = JOIN r33 WITH Instruction::StoreInstruction::getDestinationAddress_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, true, r33.<2>
Tuple counts after:
Starting to evaluate predicate DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#cur_delta/3[3]@6073c5 (iteration 3)
Tuple counts for DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#cur_delta:
...
59002 ~0% {3} r17 = SCAN DataFlowDispatch::VirtualDispatch::DataSensitiveCall::flowsFrom#fff#prev_delta AS I OUTPUT I.<1>, true, I.<0>
58260 ~1% {3} r23 = JOIN r17 WITH DataFlowUtil::Node::asVariable_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, true, r17.<2>
58208 ~0% {3} r24 = JOIN r23 WITH DataFlowDispatch::VirtualDispatch::storeIntoGlobal#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, true, r23.<2>
58208 ~0% {3} r25 = JOIN r24 WITH DataFlowUtil::InstructionNode#ff_10#join_rhs AS R ON FIRST 1 OUTPUT true, r24.<2>, R.<1>
Notice that the final tuple count, 58208, is the same before and after.
The kamailio/kamailio project seems to have been affected by this issue
because it has global variables to do with logging policy, and these
variables are loaded from in every place where their logging macro is
used.
The four cached predicates used to access common properties of instructions took a `TStageInstruction` as a parameter. This requires the calling code, in `Instruction.qll`, to then join the results with `hasInstruction()` to filter out results for `TRawInstruction`s that were discarded as unreachable. By simply switching the parameter types to `Instruction`, we can force that join to happen in the cached predicate itself. This makes the various accessor predicates on `Instruction` trivially inlinable to the cached predicate, instead of being joins of two huge relations that might have to be recomputed in later stages.
Most of the predicates on `Instruction` are thin wrappers around cached predicates in the `IRConstruction` or `SSAConstruction` modules. However, `getResultIRType()` has to join `Construction::getInstructionResultType()` with `LanguageType::getIRType()`. `getResultIRType()` is called frequently both within the IR code and by IR consumers, and that's a big join to have to repeat in multiple stages.
I looked at most of the other predicates in `Instruction.qll`, and didn't see any other predicates that met all of the criteria of "large, commonly called, and not already inline".
There were a few places in the IR itself where we use `Instruction.getResultType()`, which returns the C++ `Type` of the result, instead of `Instruction.getResultIRType()`, which returns the language-neutral `IRType` of the result. By removing this usage, we can avoid evaluating `getResultType()` at all.
There are still other uses of `Instruction.getResultType()` in other libraries. We should switch those as well.
Before this change, evaluation of the IR was spread out across about 5 stages. This resulted in a lot of redundant evaluation, especially tuple numbering of large IPA types like `TInstruction`. This change makes two small changes that, when combined, ensure that the IR is evaluated all in one stage:
First, we mark `TInstruction` as `cached`. This collapses all of the work to create instructions, across all three IR phases, into a single phase.
Second, we make the `SSA` module in `SSAConstruction.qll` just contain aliases to `cached` predicates defined in the `Cached` module. This ensures that all of the `Operand`-related SSA computation happens in the same stage as all of the `Instruction`-related SSA computation.
- Added experimental/Security/CWE/CWE-094/MvelInjection.ql
- Added experimental/Security/CWE/CWE-094/MvelInjectionLib.qll
- Added a qhelp file with an example of vulnerable code
- Added tests and stubs for mvel2-2.4.7
When running `sync-files` (or `sync-identical-files`) with the `--latest` switch, if one or more of the files in a group does not exist, the script will crash. This happens all the time when I add a new group, or add a new file path in an existing group. This has bothered me for a long time, so I finally fixed it when I ran into it again today.
I've changed the script as follows:
- If _none_ of the paths in the group exist, print an error message listing the paths in the group. This happens with or without `--latest`.
- If `--latest` is specified, copy the master file to the paths of the missing files.
This updates C#'s IR to share `TInstruction` across stages the same way C++ does. The only interesting part is that, since we have not yet ported full alias analysis to C#, I stubbed out the required parts of the aliased SSA interface in `AliasedSSAStub.qll`.
Making these part of the IPA object identity changes the failure mode for cases where we assign multiple result types to an instruction. Previously, we would just have one instruction with two result types, but now we'd have two instructions, which breaks things worse. This change goes back to how things were before, to avoid any new surprises on real-world code with invalid ASTs or IR.
The conflicts came from how `this` is now a parameter but not a
`Parameter` on `master`.
Conflicts:
cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/defaulttainttracking.cpp
cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/tainted.expected
cpp/ql/test/library-tests/dataflow/DefaultTaintTracking/test_diff.expected
cpp/ql/test/library-tests/dataflow/dataflow-tests/dataflow-ir-consistency.expected
cpp/ql/test/library-tests/dataflow/fields/ir-flow.expected
cpp/ql/test/library-tests/syntax-zoo/dataflow-ir-consistency.expected
The taint-tracking configuration in `ExposureOfPrivateInformation.ql`
overlaps with the XSS taint-tracking configuration, as witnessed by this import chain:
```
semmle.code.csharp.security.dataflow.ExposureOfPrivateInformation.qll imports
semmle.code.csharp.security.dataflow.flowsinks.ExternalLocationSink imports
semmle.code.csharp.security.dataflow.flowsinks.Remote imports
semmle.code.csharp.security.dataflow.XSS
```
(The same for `CleartextStorage.qll` and `LogForging.ql`.)
The fix is to use `TaintTracking2` for the XSS configuration.
Now that `TInstruction` is shared between IR stages, several of the per-stage IR construction predicates can now be moved into the `Raw` interface exposed only by the initial construction of IR from the ASTs. This also removed a couple predicates that were not used previously at all.
Each stage of the IR reuses the majority of the instructions from previous stages. Previously, we've been wrapping each reused old instruction in a branch of the `TInstruction` type for the next stage. This causes use to create roughly three times as many `TInstruction` objects as we actually need.
Now that IPA union types are supported in the compiler, we can share a single `TInstruction` IPA type across stages. We create a single `TInstruction` IPA type, with individual branches of this type for instructions created directly from the AST (`TRawInstruction`) and for instructions added by each stage of SSA construction (`T*PhiInstruction`, `T*ChiInstruction`, `T*UnreachedInstruction`). Each stage then defines a `TStageInstruction` type that is a union of all of the branches that can appear in that particular stage. The public `Instruction` class for each phase extends the `TStageInstruction` type for that stage.
The interface that each stage exposes to the pyrameterized modules in the IR is now split into three pieces:
- The `Raw` module, exposed only by the original IR construction stage. This module identifies which functions have IR, which `TRawInstruction`s exist, and which `IRVariable`s exist.
- The `SSA` module, exposed only by the two SSA construction stages. This identifiers which `Phi`, `Chi`, and `Unreached` instructions exist.
- The global module, exposed by all three stages. This module has all of the predicates whose implementation is different for each stage, like gathering definitions of `MemoryOperand`s.
Similarly, there is now a single `TIRFunction` IPA type that is shared across all three stages. There is a single `IRFunctionBase` class that exposes the stage-indepdendent predicates; the `IRFunction` class for each stage extends `IRFunctionBase`.
Most of the other changes are largely mechanical.
- Added experimental/Security/CWE/CWE-094/SpelInjection.ql
and a couple of libraries
- Added a qhelp file with a few examples
- Added tests and stubs for Spring
This QL
```codeql
import python
import semmle.python.dataflow.TaintTracking
import semmle.python.security.strings.Untrusted
from CollectionKind ck
where
ck.(DictKind).getMember() instanceof StringKind
or
ck.getMember().(DictKind).getMember() instanceof StringKind
select ck, ck.getAQlClass(), ck.getMember().getAQlClass()
```
generates these 6 results.
```
1 {externally controlled string} ExternalStringDictKind UntrustedStringKind
2 {externally controlled string} StringDictKind UntrustedStringKind
3 [{externally controlled string}] SequenceKind ExternalStringDictKind
4 [{externally controlled string}] SequenceKind StringDictKind
5 {{externally controlled string}} DictKind ExternalStringDictKind
6 {{externally controlled string}} DictKind StringDictKind
```
StringDictKind was only used in *one* place in our library code. As illustrated
above, it pollutes our set of TaintKinds. Effectively, every time we make a
flow-step for dictionaries with tainted strings as values, we do it TWICE --
once for ExternalStringDictKind, and once for StringDictKind... that is just a
waste.
If I wanted to use my own TaintKind and not have any interaction with
`UntrustedStringKind` that wouldn't be possible today since these standard http
libraries import it directly. (also, I wouldn't get any sources of my custom
TaintKind from turbogears or bottle). I changed them to use the same pattern of
`ExternalStringKind` as everything else does.
While investigating a bug with `TInstruction` sharing, I discovered that we had a case where alias analysis could create two `VirtualVariable`s for the same `Allocation`. For an indirect parameter allocation, we were using the type of the pointer variable as the type of the indirect allocation, instead of just `Unknown`. If the `IRType` of the pointer variable was the same type as the type of at least one access to the indirect allocation, we'd create both an `EntireAllocationVirtualVariable` and a `VariableVirtualVariable` for the allocation.
I added a new consistency test to guard against this in the future. This also turned out to be the root cause of the one existing known consistency failure in the IR tests.
Since HttpResponse is not *only* used for XSS, it is still valuable to know the
content is send as part of the response.
The *proper* solution to this problem of not all HttpResponses being vulnerable
to XSS is probably to define a new abstract class in Http.qll called
HttpResponseXSSVulnerableSink (or similar). I would like to model a few more
libraries/frameworks before fully comitting to an approach though.
Currently, only `arguments[c]` for a constant value `c` is supported.
This allows us to detect the prototype-pollution vulnerabilities in (old versions of) `extend`, `jquery`, and `node.extend`.
All these queries have been deprecated since 2018. There is
unfortunately no way to deprecate a library, but it's been years since
we populated any databases using the VCS library, so nobody should be
using it.
I took most of the docs from the corresponding predicates in
JavaScript's `CodeDuplication.qll`. Where JavaScript had a corresponding
predicate but didn't have QLDoc, I added new QLDoc to both.
The `cpp/too-few-arguments` query produced alerts for ambiguous
databases where a function had multiple possible declarations, with some
declarations having the right number of parameters and some having too
many. With this change, the query errs on the side of caution in those
cases and does not produce an alert.
This fixes false positives on racket/racket.
The new `hasDefiniteNumberOfParameters` is exactly the negation of the
old `hasZeroParamDecl`.
I got my eyes on this one since it was using a deprecated method, BUT it was
also doing the thing, since File.getName() is the same as
File.getAbsolutePath(), and that doesn't match the description :\
In a near-future release overriding a deprecated predicate without making as
deprecated would give a compiler warning.
Not fixing the XML one. [I can see that this shouldn't be reported
anymore](https://github.com/github/codeql/pull/3520#issuecomment-631552943), and
it's not safe to remove since it was only marked as deprecated in
e6425bb4cf.
In a couple of cases, we use `glval<unknown>` as the result type of an instruction because we can't come up with anything better. Two examples are the result of `VariableAddress[#ellipsis]`, and the address of the temp variable that holds the lvalue result of the conditional operator in `(a ? b : c) = y`. In both cases, we call `getTypeForGLValue(any(UnknownType t))`, but that would have multiple results because `result.hasType(any(UnknownType t), true)` also holds for `CppFunctionGLValueType`. I tightened the result type to ensure we get the right one.
The QL compiler is about to be changed to emit a warning when overriding a deprecated predicate. This PR marks the existing overrides of deprecated predicates as `deprecated` themselves, which avoids the warning.
The `Print.qll` models seem to preserve the `isWideCharDefault()` predicate for backwards compatibility, so we can't remove them and must continue overriding them.
The `XML.qll` override is necessary because both superclasses declare the `getName()` predicate. One is `deprecated`, and the other is `abstract`, so we have to have an override.
These queries are currently run by default, but don't have their results displayed.
Looking through results on LGTM.com, they are either false positives (e.g., `BitwiseSignCheck` which flags many perfectly harmless operations and `CompareIdenticalValues` which mostly flags NaN checks) or harmless results that developers are unlikely to care about (e.g., `EmptyArrayInit` or `MisspelledIdentifier`).
With this PR, the only queries that are still run but not displayed are security queries, where different considerations may apply.
No evidence that Spring Actuators are being used, e.g. `http.authorizeRequests().anyRequest().permitAll()`
Only safe Actuators are enabled, e.g. `EndpointRequest.to("health", "info")`
The virtual-dispatch code for globals was missing any relationship
between the union field access and the global variable, which meant it
propagated function-pointer flow between any two fields of a global
struct. This resulted in false positives from
`cpp/tainted-format-string` on projects using SDL, such as
WohlSoft/PGE-Project.
In addition to fixing that bug, this commit also brings the code up to
date with the new style of modeling flow through global variables:
`DataFlow::Node.asVariable()`.
This test demonstrates that IR data flow conflates unrelated fields of a
global struct-typed variable and that this bug is not present in the old
AST-based implementation of `semmle.code.cpp.security.TaintTracking`.
This cleans up the test results, which were confusing because functions
like `sink` had multiple locations.
There are some additional results now involving casts to `const char *`
because previously it varied whether `sink` used `const`, and now it
always does.
The previous changes made the optimizer choose a bad join order for the RHS of the antijoin in `addressOperandAllocationAndOffset`. Once again, `unique` to the rescue.
After #3382 changed the escape analysis to model qualifiers as escaping,
there was an imbalance in the SSA library, where `addressTakenVariable`
excludes variables from SSA analysis if they have their address taken
but are _not_ passed by reference. This showed up as a missing result in
`TOCTOUFilesystemRace.ql`, demonstrated with a test case in #3432.
This commit changes the definition of "pass by reference" to include
call qualifiers, which allows SSA modeling of variables that have member
function calls on them.
Adding a new test case leads to changes in all `.expected` files in its
directory.
The new results show that the `DefinitionsAndUses` library does not
model `std::addressof` correctly, but that library is not intended to be
used for new code.
I didn't add this support in `AddressConstantExpression.qll` since I
think it would require extra work and testing to get the constexprness
right. My long-term plan for `AddressConstantExpression.qll` is to move
its functionality to the extractor.
It does so by first defining what a pointer dereference is (on the IR
`Instruction` level), and then using the array length analysis and the range
analysis together to prove that some of these pointer dereferences are safe.
The class org.apache.commons.io.serialization.ValidatingObjectInputStream
is an implementation of ObjectInputStream that validates the deserialized
classes against a white list. Therefore, this class should not be considered an
unsafe deserialization sink.
These already have results for BoundMethodValue, although
1) it's a bit strange that `getParameter(-1)` has results
2) why does `Method(Function C.n, class C)` exists? this would only be relevant
if `n` was a classmethod, but it isn't. It's not a problem that it exsits per
se, but curious.
The `fieldFlow` predicate contained a fragile join that has become
ordered wrong recently, either as result of an unrelated change in the
data-flow library or as part of the stats change for the last dbscheme
change.
The minimal fix is to use `getEnclosingCallable` instead of
`getFunction` since the former uses `unique` to ensure good join
ordering in its callers. A longer-term fix should be applied to the AST
base libraries, but this will be invasive and require independent
testing.
Tuple counts on Wireshark before (cancelled after a few minutes):
(747s) Starting to evaluate predicate DataFlowUtil::localFlowStep#ff/2@bdba82
(848s) Tuple counts for DataFlowUtil::localFlowStep#ff:
1766640980 ~1% {2} r1 = JOIN DataFlowUtil::Node::getFunction_dispred#ff_10#join_rhs AS L WITH DataFlowUtil::Node::getFunction_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT L.<1>, R.<1>
1327 ~0% {2} r2 = JOIN r1 WITH project#DataFlowImplLocal::Configuration::hasFlow#fbb AS R ON FIRST 2 OUTPUT r1.<0>, r1.<1>
9691232 ~0% {2} r3 = DataFlowUtil::simpleLocalFlowStep#ff@staged_ext \/ r2
return r3
After:
(0s) Starting to evaluate predicate DataFlowUtil::localFlowStep#ff/2@a852a0
(0s) Tuple counts for DataFlowUtil::localFlowStep#ff:
49017 ~4% {3} r1 = JOIN project#DataFlowImplLocal::Configuration::hasFlow#fff AS L WITH DataFlowUtil::Node::getEnclosingCallable_dispred#ff AS R ON FIRST 1 OUTPUT L.<1>, R.<1>, R.<0>
42359 ~0% {2} r2 = JOIN r1 WITH DataFlowUtil::Node::getEnclosingCallable_dispred#ff AS R ON FIRST 2 OUTPUT r1.<2>, r1.<0>
9732264 ~0% {2} r3 = DataFlowUtil::simpleLocalFlowStep#ff@staged_ext \/ r2
return r3
Report an alert in _any_ `Web.config` file, as long as it does not have an
`X-Frame-Options` entry (as opposed to only reporting alerts when _all_
`Web.config` files lack the entry).
This gives a slight speedup, and I think it makes the code shorter and
clearer.
On Wireshark, the time from the beginning of the `IRBlock` stage until
just before evaluation of `getInstruction` drops from 44s to 34s.
Flow from a definition by reference of a field into its object was
working inconsistently and in a very syntax-dependent way. For a
function `f` receiving a reference, `f(a->x)` could propagate data back
to `a` via the _reverse read_ mechanism in the shared data-flow library,
but for a function `g` receiving a pointer, `g(&a->x)` would not work.
And `f((*a).x)` would not work either.
In all cases, the issue was that the shared data-flow library propagates
data backwards between `PostUpdateNode`s only, but there is no
`PostUpdateNode` for `a->x` in `g(&a->x)`. This pull request inserts
such post-update nodes where appropriate and links them to their
neighbors. In this exapmle, flow back from the output parameter of `g`
passes first to the `PostUpdateNode` of `&`, then to the (new)
`PostUpdateNode` of `a->x`, and finally, as a _reverse read_ with the
appropriate field projection, to `a`.
This case was added in dccc0f4db. The surrounding code has changed a lot
since then, and the case no longer seems to have an effect except to
create some dead ends and possibly cycles in the local flow graph.
Also clarify the docs on `Call` to decrease the likelyhood of such an
omission happening again.
The updated test reflects that `f1.operator()` lets the address of `f1`
escape from the caller.
For each pointer, we start tracking (starting from the allocation or an array declaration)
1) how long is the chunk of memory allocated
2) where the current pointer is in this chunk of memory.
This information might not always exist, but when it does, it is reliable.
Currently only works intraprocedurally.
Default values for positional arugments follow a rule, so if an argument has a
default value, later positional arguments must also have default values.
The database only stores the actual default values, and nothing about the
arguments that doesn't have default values.
This turns out to be a major problem for Argument.getKwDefault(i), since default
values for keyword-only arguments doesn't have the same rule. So if you know
there is one default value, you can't tell if it is associated with `foo` or
`bar`, as in the examples below:
```
def a(*, foo=None, bar):
pass
def b(*, foo, bar=None):
pass
```
We already modelled usage of the protected `MessageDigest(String algo)`
constructor as a crypto algorithm specification. For some reason we did
not model the more commonly used public `MessageDigest.getInstance` method.
I've added a `.vscode/extensions.json` file that will automatically recommend the CodeQL for Visual Studio Code extension to anyone who opens the repo in VS Code (without the extension already installed).
This `select` clause had become very slow after we started caching
`ElementBase::toString` because the query used string concatenation to
produce alert messages, and those string concatenations were done very
early in the pipeline, producing lots of strings that would be discarded
moments later.
By using `$@` to interpolate elements into strings, the concatenation is
done outside of QL.
Testing on a Chromium snapshot, this commit takes us from
#select#ff ................ 6m2s
to
#select#cpe#134#fff ....... 15.2s
If you're developing one of the libraries that has muiltiple copies auto-generated by `sync-files.py`, you can now run `sync-files.py --latest` by going to the `Terminal | Run Task...` menu in VS Code and selecting the `Sync Identical Files` task. You can set a keyboard binding to run this task for quicker access.
Logic is hard, and I made a mistake inverting the formula for the second case, so the
predicate never held for a sizeExpr like sizeof(int)*sizeof(void).
Now, this case is correctly handled by the fallback.
Based on PR feedback. This will avoid a syntactic wart and make the
invocation do the right thing both with and without
`language[monotonicAggregates]`.
Since this test inheriently has `--max-import-depth=1`, by using six, we would
never look at the actual source-code of urllib.parse/urlparse and therefore the
test would never show if we understood the library code good enough that we
could propagate taint out-of-the-box.
All tests moved by one line... that is why the diff is so big
This yields more precise size information in a lot of the common cases of C allocation code,
as the common pattern malloc(count * sizeof(type)) is now understood.
We don't hyphenate "QL-library" and there were a few typos. Feel free to further revise this if I've changed the meaning too much.
As discussed separately, I was unable to raise this as a PR in GitHub.com and had to resort to a direct commit.
(cherry picked from commit e29468135d)
We don't hyphenate "QL-library" and there were a few typos. Feel free to further revise this if I've changed the meaning too much.
As discussed separately, I was unable to raise this as a PR in GitHub.com and had to resort to a direct commit.
This predicate is effectively a Cartesian product between all enum
types. It's infeasible to compute it in full, so luckily the optimizer
has been able to apply enough magic to make it feasible. That's not a
robust solution, and it has indeed broken on at least one version of the
1.24 release candidate.
On a Chromium snapshot where I ran the LGTM suite overnight, the
`m#MistypedFunctionArguments::arithTypesMatch#bb` predicate (magic for
`arithTypesMatch`) took 170m5s. That was commit b69fdf5 from the
internal repo. I tried to reproduce it in VSCode, this time with commit
646646, but it wasn't quite as bad: the predicate took only 38 seconds.
In any case, making the problematic predicate `pragma[inline]` removes
the slow magic and makes the `MistypedFunctionArguments.ql` query
faster.
IR generation was not handling the special two-operand flavor of the `?:` operator that GCC supports as an extension. The extractor doesn't quite give us enough information to do this correctly (see github/codeql-c-extractor-team#67), but we can get pretty close.
About half of the code could be shared between the two-operand and three-operand flavors. The main differences for the two-operand flavor are:
1. The "then" operand isn't a child of the `ConditionalExpr`. Instead, we just reuse the original value of the "condition" operand, skipping any implicit cast to `bool` (see comment for rationale).
2. For the three-operand flavor, we generate the condition as control flow rather than the computation of a `bool` value, to avoid creating unnecessarily complicated branching. For the two-operand version, we just compute the value, since we have to reuse that value in the "then" branch anyway.
I've added IR tests for these new cases. I've also updated the expectations for `SignAnalysis.ql` based on the fix. @rdmarsh2, can you please double-check that these diffs look correct? I believe they do, but you're the range/sign analysis expert.
Includes a fairly exhaustive test case for arithmetic operations involving `_Complex` and/or `_Imaginary` types. Thanks to these new tests, I discovered that the extractor treats certain arithmetic operations on `_Imaginary` types as separate expression kinds, so I added support for those kinds in IR construction.
`getVariableType()` is used to compute the actual semantic type of a variable from its declared type. That's where we handle pointer and function decay for parameters, and it's also where we handle arrays of unknown bound initialized with an initializer of known bound.
Previously, even if neither of the above situations applied, the type that we returned was the `getUnspecifiedType()` of the variable. This meant that, for example, `const char* p` would be treated as `char *`. This is inconsistent with how we handle types elsewhere in IR construction, where we preserve typedefs and cv-qualifiers when creating the `CppType` of an `IRVariable`, `Instruction`, or `Operand`.
The only visible effect this fix has is to fix the inferred result type for `Phi` instructions for variables affect by this change in `getVariableType()` behavior. Previously, we would see the variable accessed as both `const char*` and as `char*`, so we'd fall back to the canonical pointer type, which is `decltype(nullptr)`. Now, we see the same type for all accesses to the variable, so we use that type as the type of the SSA memory location and as the result type of the `Phi` instruction.
When `PrintSSA.qll` is imported, IR dumps will be annotated with the alias analysis information used during SSA construction. When printing this information, we incorrectly treated instructions at offset -1, which should only be `Phi` instructions, as `Chi` instructions for the instruction at offset 0. This produced phantom annotations, but did not affect the correctness of the actual IR.
- Merged the two notes for `cpp/uncontrolled-allocation-size` into one.
- Added note about renaming of a query id.
- Moved the use of IR in queries from the library section to the queries
section, rephrasing the note in terms of query results/performance
rather than library implementation.
- Grouped, without text changes, the three notes about the `Allocation`
library
- Grouped all the notes about standard-library models, abbreviating them
to eliminate the common text.
- Removed the note about `strlen` (#2647) since that should no longer
affect the results of queries or IR data flow after we started using
unsound IR for data flow.
The IR generation for `InitializeIndirection` currently connects its load operand to the result of the corresponding `InitializeParameter` instruction. This isn't exactly wrong, but it doesn't fit the IR invariant of "All unmodeled uses consume `UnmodeledDefinition`". Our current code doesn't care, because we just throw away all of the existing def-use information, modeled or otherwise, when we build unaliased SSA. However, some upcoming SSA changes don't work correctly if this invariant is broken.
I've added the trivial IR generation change, along with a new sanity query.
When the extractor can't prove that control flow will never reach the end of a non-`void`-returning function without reaching an explicit `return` statement, it inserts an implicit `return` without an operand. If control actually reaches this point, the behavior is undefined.
We were previously generating invalid IR for these implicit `return` statements, because the lack of an operand meant that there was no definition of the return value variable along that path. Instead, I've changed the IR generation to emit an `Unreached` instruction for the implicit `return`. This ensures that we don't create a control flow edge from the end of the body to the function epilogue.
The change to the range analysis test avoids having that test depend on the previous bad IR behavior, while still preserving the original spirit of the test.
These classes/predicates are not used by anything in our codebase, and is using
deprecated classes/predicates, so I think it's safe to assume they should also
have been marked with the deprecated annotation.
Changes the QL compiler warnings with:
-WARNING: Type Configuration has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/dataflow/TaintTracking.qll:663,50-63)
-WARNING: Type Configuration has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/dataflow/TaintTracking.qll:666,19-32)
-WARNING: Type Configuration has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/dataflow/TaintTracking.qll:671,19-32)
-WARNING: Type Configuration has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/dataflow/TaintTracking.qll:733,16-39)
-WARNING: Type CustomPointsToAttribute has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:181,28-51)
-WARNING: Type CustomPointsToFact has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:155,60-78)
-WARNING: Type CustomPointsToFact has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:159,19-37)
-WARNING: Type CustomPointsToFact has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:41,33-51)
+WARNING: Type CustomPointsToFact has been deprecated and may be removed in future (/home/rasmus/code/ql/python/ql/src/semmle/python/types/Extensions.qll:41,44-62)
Flow through partial chi-instruction operands was introduced to make
definition-by-reference work, but its implementation also allowed all
other partial writes to propagate. In particular, tainting a field would
taint the whole struct, which in turn led to taint propagating across
unrelated fields of a struct.
The security test `CWE-134/semmle/argv/argvLocal.c` shows that we also
want to propagate taint from an array element to the whole array, and it
also seems right to propagate taint from a union member to the whole
union.
I was considering if this was actually something different than
Value.hasAttribute, and the names were just accidentially the same. But after
looking at the definition for Value, I'm happy about marking this as an
override (I did not test whether it was neede though):
```codeql
class Value extends TObject {
...
/** Holds if this value has the attribute `name` */
predicate hasAttribute(string name) { this.(ObjectInternal).hasAttribute(name) }
```
This workaround from `DataFlowUtil.qll` should be useful for any query
that selects an `Expr`. In particular, it's useful for IR data flow.
This commit does not include test changes.
Every data-flow node should have a unique enclosing function (_callable_
in the terminology of the data-flow library), but this was not evident
for the optimizer, and it led to a bad join order in `pathStep`. This
commit fixes the join order for C++ AST data flow. All other copies of
data flow seem to be fine.
These are the tuple counts for OpenJDK before this commit:
(231s) Tuple counts for DataFlowImplLocal::pathStep#fffff#cur_delta:
5882 ~0% {6} r1 = SCAN DataFlowImplLocal::PathNodeMid#class#ffffff#prev_delta AS I OUTPUT I.<2>, I.<0>, I.<1>, I.<3>, I.<4>, I.<5>
1063406780 ~0% {7} r2 = JOIN r1 WITH DataFlowImplCommon::CallContext::relevantFor_dispred#ff AS R ON FIRST 1 OUTPUT r1.<2>, R.<1>, r1.<1>, r1.<0>, r1.<3>, r1.<4>, r1.<5>
5882 ~1% {6} r3 = JOIN r2 WITH DataFlowUtil::Node::getFunction_dispred#ff AS R ON FIRST 2 OUTPUT r2.<0>, r2.<6>, r2.<2>, r2.<3>, r2.<4>, r2.<5>
105 ~0% {5} r4 = JOIN r3 WITH project#DataFlowImplLocal::LocalFlowBigStep::localFlowBigStep#ffffff_021#join_rhs AS R ON FIRST 2 OUTPUT r3.<2>, r3.<3>, r3.<4>, r3.<5>, R.<2>
5882 ~1% {6} r5 = JOIN r2 WITH DataFlowUtil::Node::getFunction_dispred#ff AS R ON FIRST 2 OUTPUT r2.<5>, r2.<2>, r2.<0>, r2.<3>, r2.<4>, r2.<6>
5882 ~0% {6} r6 = JOIN r5 WITH DataFlowImplLocal::TNil#ff_1#join_rhs AS R ON FIRST 1 OUTPUT r5.<2>, false, r5.<5>, r5.<1>, r5.<3>, r5.<4>
0 ~0% {5} r7 = JOIN r6 WITH DataFlowImplLocal::LocalFlowBigStep::localFlowBigStep#ffffff_02413#join_rhs AS R ON FIRST 3 OUTPUT R.<4>, r6.<3>, r6.<4>, r6.<5>, R.<3>
0 ~0% {5} r8 = JOIN r7 WITH DataFlowImplLocal::TNil#ff AS R ON FIRST 1 OUTPUT r7.<1>, r7.<2>, r7.<3>, R.<1>, r7.<4>
105 ~0% {5} r9 = r4 \/ r8
The problem is that `DataFlowUtil::Node::getFunction_dispred#ff`
(`getEnclosingCallable`) is joined too late.
After this commit, the tuple counts look like this:
(13s) Tuple counts for DataFlowImplLocal::pathStep#fffff#cur_delta:
5882 ~1% {6} r1 = SCAN DataFlowImplLocal::PathNodeMid#class#ffffff#prev_delta AS I OUTPUT I.<1>, I.<0>, I.<2>, I.<3>, I.<4>, I.<5>
5882 ~3% {7} r2 = JOIN r1 WITH DataFlowUtil::Node::getEnclosingCallable_dispred#ff AS R ON FIRST 1 OUTPUT r1.<2>, R.<1>, r1.<1>, r1.<0>, r1.<3>, r1.<4>, r1.<5>
5882 ~1% {6} r3 = JOIN r2 WITH DataFlowImplCommon::CallContext::relevantFor_dispred#ff AS R ON FIRST 2 OUTPUT r2.<3>, r2.<6>, r2.<2>, r2.<0>, r2.<4>, r2.<5>
105 ~0% {5} r4 = JOIN r3 WITH project#DataFlowImplLocal::LocalFlowBigStep::localFlowBigStep#ffffff_021#join_rhs AS R ON FIRST 2 OUTPUT r3.<2>, r3.<3>, r3.<4>, r3.<5>, R.<2>
5882 ~1% {6} r5 = JOIN r2 WITH DataFlowImplCommon::CallContext::relevantFor_dispred#ff AS R ON FIRST 2 OUTPUT r2.<5>, r2.<2>, r2.<3>, r2.<0>, r2.<4>, r2.<6>
5882 ~0% {6} r6 = JOIN r5 WITH DataFlowImplLocal::TNil#ff_1#join_rhs AS R ON FIRST 1 OUTPUT r5.<2>, false, r5.<5>, r5.<1>, r5.<3>, r5.<4>
0 ~0% {5} r7 = JOIN r6 WITH DataFlowImplLocal::LocalFlowBigStep::localFlowBigStep#ffffff_02413#join_rhs AS R ON FIRST 3 OUTPUT R.<4>, r6.<3>, r6.<4>, r6.<5>, R.<3>
0 ~0% {5} r8 = JOIN r7 WITH DataFlowImplLocal::TNil#ff AS R ON FIRST 1 OUTPUT r7.<1>, r7.<2>, r7.<3>, R.<1>, r7.<4>
105 ~0% {5} r9 = r4 \/ r8
There is a slight slowdown coming from the introduction of a new
predicate `DataFlowImplLocal::pathStep#fffff#join_rhs`, which is used
only in the standard order:
(12s) Tuple counts for DataFlowImplLocal::pathStep#fffff#join_rhs:
282057 ~0% {2} r1 = SCAN DataFlowImplCommon::CallContext::relevantFor_dispred#ff AS I OUTPUT I.<1>, I.<0>
9159890 ~1% {2} r2 = JOIN r1 WITH DataFlowUtil::Node::getEnclosingCallable_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>
return r2
The evaluation of `unique` is cheap but not free:
DataFlowUtil::Node::getEnclosingCallable_dispred#ff .............. 3.9s
DataFlowUtil::Node::getEnclosingCallable_dispred#ff_10#join_rhs .. 3.5s
The first of these two predicates evaluates `unique`, and the second
simply reorders columns. They take about the same time, which suggests
that `unique` is about as fast as it can be, given the number of tuples
it needs to push around. Note that the column reordering predicate is
only needed because of the standard order.
This revert brings back the performance problems in
`DataFlowImplLocal.qll` so they can be fixed in a different way. The fix
in #2872 was asymptotically good but had undesired overhead because it
introduced another predicate in the SCC that existed purely for join
ordering.
I did the revert by inlining the helper predicate, eliminating the
`enclosing` variable, and re-ordering the resulting lines to what they
were before #2872.
For some reason I thought that these two queries were special because
they manipulate `SecurityOptions` to change the taint-tracking sources.
It turns out it was just the opposite: the queries used to be special
because they invalidated the cache for the `tainted` predicate, but that
predicate is no longer used, so these queries are no longer special.
This consistency check seems to have value for AST data flow, but I've
disabled it on the IR for now.
This commit also includes two unrelated changes that seem to fix a
semantic merge conflict.
How could the tests fail because of autoformatting, you may ask?
The answer is deprecation warnings. These specify the location of the deprecated
entity, and due to autoformatting these moved around.
We currently use a script to keep certain duplicate QL files in sync across the repo. For historical reasons, this script has lived in the private repo alongside the rest of CodeQL, even though it's only used for files in the public `ql` repo. This PR moves the script into the public `ql` repo. It is still invoked by Jenkins scripts that live in the private repo during CI, but it can also be invoked directly without having a checkout of the private repo. This is useful for anyone who is modifying the dataflow or IR libraries with only a QL checkout.
This PR adds better support for differentiating complex and imaginary floating-point types from real floating-point types, in both the AST and in the IR type system.
*AST Changes*
- Introduces the new class `TypeDomain`, which can be either `RealDomain`, `ImaginaryDomain` or `ComplexDomain`. "type domain" is the term used for this concept in the C standard, and I couldn't think of a better one.
- Introduces `FloatingPointType.getDomain()`, to get the type domain of the type.
- Introduces `FloatingPointType.getBase()`, to get the numeric base of the type (either 2 or 10).
- Introduces three new subtypes of `FloatingPointType`: `RealNumberType`, `ComplexNumberType`, and `ImaginaryNumberType`, which differentiate between the types based on their type domain. Note that the decimal types (e.g., `_Decimal32`) are included in `RealNumberType`.
- Introduces two new subtypes of `FloatingPointType`: `BinaryFloatingPointType` and `DecimalFloatingPointType`, which differentiate between the types based on their numeric base, independent of type domain.
*IR Changes*
- `IRFloatingPointType` now has two additional parameters: the base and the type domain.
- New test that ensures that C++ types get mapped to the correct IR types.
- New IR test that verifies the IR for some basic usage of complex FP types.
With the new `unique` aggregate added to QL, we can express directly
what the "min = max" pattern emulates.
Replacing "min and max" with `unique` might in general lead to fewer
results, but that happens only in cases where the aggregate expression
has multiple values. For the three predicates changed in this commit,
that should only happen on malformed databases.
When running ql/python/ql/src/Security/CWE-079/ReflectedXss.ql against the
database for flask.
Iitially there were 10 million result-tuples for iterable_unpacking_descent.
With this change, we're down to roughly 2100,
I had included `InitializeNonLocal` in the recursion because it made
everything look better in the presence of a bug that's since been fixed.
Taking it out means the sanity test is again aligned with the old
`isChiForAllAliasedMemory`.
`Instruction.getDefinitionOverlap()` depends on `SSAConstruction::getMemoryOperandDefinition()`, which in turn depends on `SSAConstruction::hasMemoryOperandDefinition()`. When the definition in question came from a `Chi` instruction, `hasMemoryOperandDefinition()` incorrectly bound `overlap` to the overlap relationship between the original (non-`Chi`) instruction and the use. The fix is to make use of the `actualDefLocation` parameter to `getDefinitionOrChiInstruction()`, which specifies the location for the result of the `Chi` in that case.
The result of `getDefinitionOverlap()` should never be `MayPartiallyOverlap`, because if that were the case, we should have inserted as `Chi` instruction and hooked the definition up to that instead.
There are quite a few existing failures.
This predicate replaces `isChiForAllAliasedMemory`, which was always
intended to be temporary. A test is added to `IRSanity.qll` to verify
that the new predicate corresponds exactly with (a fixed version of) the
old one.
The implementation of the new predicate,
`Cached::hasConflatedMemoryResult` in `SSAConstruction.qll`, is faster
to compute than the old `isChiForAllAliasedMemory` because it uses
information that's readily available during SSA construction.
In the Unix ABI, `std::va_list` is defined as `typedef struct __va_list_tag { ... } va_list[1];`, which means that any `std::va_list` used as a function parameter decays to `struct __va_list_tag*`. Handling this actually made the QL code slightly cleaner. The only tricky bit is that we have to determine what type to use as the actual `va_list` type when loading, storing, or modifying a `std::va_list`. To do this, we look at the type of the argument to the `va_*` macro. A detailed QLDoc comment explains the details.
I added a test case for passing a `va_list` as an argument, and then manipulating that `va_list` in the callee.
This PR changes the IR we generate for functions that accept a variable argument list. Rather than simply using `BuiltInOperationInstruction` to model the various `va_*` macros as mysterious function-like operations, we now model them in more detail. The intent is to enable better alias analysis and taint flow through varargs.
The `va_start` macro now generates a unary `VarArgsStart` instruction that takes the address of the ellipsis pseudo-parameter as its operand, and returns a value of type `std::va_list`. This value is then stored into the actual `std::va_list` variable via a regular `Store`.
The `va_arg` macro now loads the `std::va_list` argument, then emits a `VarArg` instruction on the result. This returns the address of the vararg argument to be loaded. That address is later used as the address operand of a regular `Load` to return the value of the argument. To model the side effect of moving to the next argument, we emit a `NextVarArg` instruction that takes the previous `std::va_list` value and returns an updated one, which is then stored back into the `std::va_list` variable.
The `va_end` macro just emits a `VarArgsEnd` unary instruction that takes the address of the `std::va_list` argument and does nothing, since `va_end` doesn't really do anything on most compiler implementations anyway.
The `va_copy` macro is just modeled as a plain copy.
Since `runtimeExprInStaticInitializer` only looks at expressions at the
top level of an initializer or directly below some number of top-level
aggregate literals, there is no need for `inStaticInitializer` to
include expressions strictly below those in the AST.
I tested this on Wireshark, which has very large static initializers,
but found no measureable difference in run time. There are some
differences in tuple counts and iteration counts, though:
- `inStaticInitializer` changes from 6,241,153 rows (86 iterations) to
5,031,617 rows (7 iterations).
- `runtimeExprInStaticInitializer` changes from 386,350 rows to 4,705
rows.
- `hasDynamicInitialization` has 410 rows both before and after, which
suggests that this change does not affect results.
Even though there is no impact on this snapshot at this time, things
might look different if/when the restriction on aggregate literals to
100 children is removed in the extractor.
This change introduces a new synthesized `IRVariable` in every varargs function. This variable represents the entire set of arguments passed to the ellipsis by the caller. We give it an opaque type big enough hold all of the arguments passed by the largest vararg call in the database. It is treated just like any other parameter. It is initialized the same, it has indirect buffers, etc.
I had to introduce a couple new APIs to `Call` and `Function`. The QLDoc comments should explain these. I added tests for these new APIs as well.
The next step will be to change the IR generation for the `va_*` macros to manipulate the ellipsis parameter.
The spaceship (<=>) operator adds a new row to the C++ precendence
table. In preparation for that shift the necessary precedences up one
to create a suitable hole.
Note: In investigations I belive precedence 14 was not used. However,
in order to make review easier I have kept that gap.
- Introduce `ReadTaintNode` and `TaintStoreNode` to simplify logic for taint
getters and taint setters, respectively.
- `nodeCandFwd2`: Restrict `stored` column after a read, based on what it might
be before a store of the same field.
- `nodeCand2`: Restrict `read` column (renamed from `stored`) after a store, based
on what it might be after a read of the same field.
- Move big step predicates into a `LocalFlowBigStep` module.
- Define predicates by dispatch in `AccessPath[Front]` class.
- `flowCandFwd0`: Restrict `apf` column after a read, as it should be able to match
a Boolean `read` column from `nodeCand2`.
- `flowFwd0`: Restrict columns `ap` and `apf` after a read, by introducing a
`flowConsCandFwd` predicate (similar to what is done in the previous pruning steps).
- `flowFwd0`: Restrict columns `ap` and `apf` after a store, by introducing a
`flowConsCand` predicate (similar to what is done in the previous pruning steps).
Added a new `StaticLocalVariable` class, which made several other pieces of the original change a bit cleaner.
Fixed test failures due to a mistake in the original `CFG.qll` change.
Added a test case for static local variables with constructors.
Removed the `Uninitialized` instruction from the initialization of a static local, because all objects with static storage duration are zero-initialized at startup.
Fixed expectations for `SignAnalysis.ql` to reflect that a bad result is now fixed.
Previously, the IR for the initialization of a static local variable ran the initialization unconditionally, every time the declaration was reached during execution. This means that we don't model the possibility that an access to the static variable fetches a value that was set on a previous execution of the function.
I've added some simple modelling of the correct behavior to the IR. For each static local variable that has a dynamic initializer, we synthesize a (static) `bool` variable to hold whether the initializer for the original variable has executed. When executing a declaration, we check the value of the synthesized variable, and skip the initialization code if it is `true`. If it is `false`, we execute the initialization code as before, and then set the flag to `true`. This doesn't capture the thread-safe nature of static initialization, but I think it's more than enough to handle anything we're likely to care about for the foreseeable future.
In `TranslatedDeclarationEntry.qll`, I split the translation of a static local variable declaration into two `TranslatedElement`s: one for the declaration itself, and one for the initialization. The declaration part handles the checking and setting of the flag; the initialization just does the initialization as before.
I've added an IR test case that has static variables with constant, zero, and dynamic initialization. I've also verified the new IR generated for @jbj's previous test cases for constant initialization.
I inverted the sense of the `hasConstantInitialization()` predicate to be `hasDynamicInitialization()`. Mostly this just made more sense to me, but I think it also fixed a potential bug where `hasConstantInitialization()` would not hold for a zero-initialized variable. Technically, constant initialization isn't the same as zero initialization, but I believe that most code really cares about the distinction between dynamic initialization and static initialization, where static initialization includes both constant and zero initialization.
I've fixed up the C# side of IR generation to continue working, but it doesn't use any of the dynamic initialization stuff. In theory, it could use something similar to model the initialization of static fields.
For syntax errors, we simply report the major version.
For unused imports, we were getting a result for `typing.py` when run under
Python 3.7.3. To prevent this import from being considered, I've set the maximum
import depth to `0`.
The only Semmle references now left in the public Markdown files are in
URLs and in legal text. There are also two Semmle references left in
`docs/language/vale-styles/README.md` because I didn't understand them
well enough to change them.
Since this code is shared between the AST CFG and the IR construction,
it seems right to have only one copy. That copy lives on a new class
`StaticStorageDurationVariable`, which may prove useful on its own.
Obviously the result module shouldn't be a package 🤦 I was confusing
myself, since I wanted to say that `Module::named("Crypto.Cipher")` should be a package :D
The problem with the deleted code is that it would add flow to what might be an
unrelated ControlFlowNode, which is illustrated in the query below (that gives
results on flask)
from ControlFlowNode arg, CallNode call, CallNode other_call
where
call.getNode().getAKeyword().getValue() = arg.getNode() and
not call.getAnArg() = arg and
other_call.getAnArg() = arg and
not other_call = call
select call, arg, other_call
The problem here was that in the base case, there was no relationship between
`left_parent` and `right_parent`. These could be any two tuples or lists, even
if they were not part of an assignment statement.
To fix this, we add a bit of manual "magic", requiring that both of these
arguments must belong to the left and right-hand sides of the same assignment
statement.
(Note that this is in principle _still_ a gross overapproximation, but since
assignment statements are usually quite restricted in size, I don't expect this
to be a major problem.)
Having that module in `Instruction.qll` slowed down the parsing of that
file both humans and the compiler.
This commit moves the `InstructionSanity` module to `IRSanity.qll`
without making any changes to its contents apart from adding some
imports.
On Wireshark with 6GB RAM, I've observed `definitionReachesRank` to be
the slowest predicate in the IR. It seems that the implementation was
slow because the optimizer failed to eliminate the common
`reachesRank - 1` subexpression. This led to context being pushed into
the `not`, which got implemented as `MATERIALIZE`. That wouldn't
normally be a disaster, but this is one of the largest predicates in the
IR SSA construction, and iteration 2 was very slow.
Before:
(1505s) Starting to evaluate predicate SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta/4[1]@93f592 (iteration 1)
(1535s) Tuple counts for SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta:
130670697 ~0% {4} r1 = SCAN project#SSAConstruction::DefUse::hasDefinitionAtRank#fffff AS I OUTPUT I.<0>, I.<1>, I.<2>, (I.<2> + 1)
130670697 ~6% {5} r2 = JOIN r1 WITH SSAConstruction::DefUse::exitRank#fff AS R ON FIRST 2 OUTPUT r1.<0>, r1.<1>, r1.<2>, r1.<3>, R.<2>
130670697 ~6% {5} r3 = SELECT r2 ON r2.<3> <= r2.<4>
130670697 ~0% {4} r4 = SCAN r3 OUTPUT r3.<0>, r3.<1>, r3.<2>, r3.<3>
return r4
(1535s) - SSAConstruction::DefUse::definitionReachesRank#ffff_delta has 130670697 rows (order for disjuncts: delta=<standard>).
(1535s) Starting to evaluate predicate SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta/4[2]@866c14 (iteration 2)
(1626s) Tuple counts for SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta:
261341394 ~107% {4} r1 = JOIN SSAConstruction::DefUse::definitionReachesRank#ffff#prev_delta AS L WITH SSAConstruction::DefUse::definitionReachesRank#ffff#join_rhs AS R ON FIRST 3 OUTPUT R.<0>, R.<1>, R.<2>, (1 + L.<3>)
261341394 ~107% {4} r2 = r1 AND NOT SSAConstruction::DefUse::definitionReachesRank#ffff#prev AS R(r1.<0>, r1.<1>, r1.<2>, r1.<3>)
130670697 ~0% {5} r3 = SCAN r2 OUTPUT r2.<0>, r2.<1>, (r2.<3> - 1), r2.<2>, r2.<3>
106034590 ~1% {4} r4 = JOIN r3 WITH project#SSAConstruction::DefUse::hasDefinitionAtRank#fffff AS R ON FIRST 3 OUTPUT r3.<0>, r3.<1>, r3.<3>, r3.<4>
106034590 {4} r5 = MATERIALIZE r4 AS antijoin_rhs
24636107 ~3% {4} r6 = r2 AND NOT r5(r2.<0>, r2.<1>, r2.<2>, r2.<3>)
24636107 ~0% {5} r7 = JOIN r6 WITH SSAConstruction::DefUse::exitRank#fff AS R ON FIRST 2 OUTPUT r6.<0>, r6.<1>, r6.<2>, r6.<3>, R.<2>
2749441 ~0% {5} r8 = SELECT r7 ON r7.<3> <= r7.<4>
2749441 ~4% {4} r9 = SCAN r8 OUTPUT r8.<0>, r8.<1>, r8.<2>, r8.<3>
return r9
(1626s) - SSAConstruction::DefUse::definitionReachesRank#ffff_delta has 2749441 rows (order for disjuncts: delta=<standard>).
After:
(12s) Tuple counts for SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta:
130670697 ~0% {4} r1 = SCAN project#SSAConstruction::DefUse::hasDefinitionAtRank#fffff AS I OUTPUT I.<0>, I.<1>, I.<2>, (I.<2> + 1)
return r1
(12s) - SSAConstruction::DefUse::definitionReachesRank#ffff_delta has 130670697 rows (order for disjuncts: delta=<standard>).
(12s) Starting to evaluate predicate SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta/4[2]@fff64c (iteration 2)
(34s) Tuple counts for SSAConstruction::DefUse::definitionReachesRank#ffff#cur_delta:
108784031 ~0% {4} r1 = SSAConstruction::DefUse::definitionReachesRank#ffff#prev_delta AS L AND NOT SSAConstruction::DefUse::exitRank#fff AS R(L.<0>, L.<1>, L.<3>)
2749441 ~5% {4} r2 = r1 AND NOT project#SSAConstruction::DefUse::hasDefinitionAtRank#fffff AS R(r1.<0>, r1.<1>, r1.<3>)
2749441 ~4% {4} r3 = SCAN r2 OUTPUT r2.<0>, r2.<1>, r2.<2>, (r2.<3> + 1)
2749441 ~4% {4} r4 = r3 AND NOT SSAConstruction::DefUse::definitionReachesRank#ffff#prev AS R(r3.<0>, r3.<1>, r3.<2>, r3.<3>)
return r4
(34s) - SSAConstruction::DefUse::definitionReachesRank#ffff_delta has 2749441 rows (order for disjuncts: delta=<standard>).
Note that the row counts are exactly the same before and after.
The charpred of class `GVN` in `ASTValueNumbering.qll` got inlined into
the member predicate `getAnInstruction` and caused a tuple explosion on
Wireshark in the query `StrncpyFlippedArgs.ql`.
I interrupted the predicate after 10 minutes and got these intermediate
tuple counts:
(5208s) Tuple counts for ASTValueNumbering::GVN::getAnInstruction_dispred#ff:
8754900909 ~5% {3} r1 = JOIN ValueNumberingInternal::tvalueNumber#ff_10#join_rhs AS L WITH ValueNumberingInternal::tvalueNumber#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, L.<1>, L.<0>
4390274632 ~150085% {2} r2 = JOIN r1 WITH project#SSAConstruction::Cached::getInstructionUnconvertedResultExpression AS R ON FIRST 1 OUTPUT r1.<2>, r1.<1>
return r2
After this change, the `getAnInstruction` predicate is itself inlined,
like it should be. The new non-inlined charpred takes 2.1s and has these
tuple counts:
(2s) Tuple counts for ASTValueNumbering::GVN#f:
9158442 ~117% {1} r1 = JOIN project#SSAConstruction::Cached::getInstructionUnconvertedResultExpression AS L WITH ValueNumberingInternal::tvalueNumber#ff@staged_ext AS R ON FIRST 1 OUTPUT R.<1>
return r1
The code is now quadratic in the number of statements in a basic block,
whereas before it was quadratic in the number of _control-flow nodes_ in
a basic block.
The `iDominates` relation is directly on control-flow nodes, and its
transitive closure is far too large. It got compiled into a recursion
rather than `fastTC`, and I've observed that recursion to take about an
hour on a medium-size customer snapshot.
The fix is to check for dominance at the basic-block level.
Also rewrites `multi_assignment_points_to` to be a bit more readable.
I'm not entirely sure that we want an unknown instance of `object` rather than
just `UnknownInternal`. The latter gets filtered out in the characteristic
predicate for `Value`, though, so I opted for the slightly more permissive
variant.
The `cpp/missing-case-in-switch` performed badly on some snapshots, to
the extent where it was as slow as the most expensive IR stages
(example: ChakraCore). This commit makes it faster, removing a
`pragma[noopt]` along the way.
The intermediate tuple counts on a customer codebase drop from 84M to
3M, while the content hash of `getAMissingCase` is the same.
Before:
(124s) Tuple counts for Stmt::EnumSwitch::getAMissingCase#ff#antijoin_rhs:
20867789 ~0% {3} r1 = JOIN Stmt::SwitchStmt::getASwitchCase_dispred#ff AS L WITH Stmt::EnumSwitch::getAMissingCase#ff#shared AS R ON FIRST 1 OUTPUT L.<1>, R.<0>, R.<1>
20122830 ~0% {3} r2 = JOIN r1 WITH Stmt::SwitchCase::getExpr_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<2>
20122830 ~0% {3} r3 = JOIN r2 WITH Expr::Expr::getValue_dispred#ff AS R ON FIRST 1 OUTPUT r2.<2>, r2.<1>, R.<1>
83961918 ~0% {4} r4 = JOIN r3 WITH Enum::EnumConstant::getInitializer_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r3.<1>, r3.<0>, r3.<2>
83961918 ~0% {4} r5 = JOIN r4 WITH initialisers AS R ON FIRST 1 OUTPUT R.<2>, r4.<3>, r4.<1>, r4.<2>
234348 ~185% {2} r6 = JOIN r5 WITH Expr::Expr::getValue_dispred#ff AS R ON FIRST 2 OUTPUT r5.<2>, r5.<3>
return r6
...
(124s) Tuple counts for Stmt::EnumSwitch::getAMissingCase#ff:
663127 ~4% {2} r1 = Stmt::EnumSwitch::getAMissingCase#ff#shared AS L AND NOT Stmt::EnumSwitch::getAMissingCase#ff#antijoin_rhs AS R(L.<0>, L.<1>)
return r1
(124s) Registering Stmt::EnumSwitch::getAMissingCase#ff + [] with content 2060ff326cvhihcsvoph6k9divuv4
(124s) >>> Wrote relation Stmt::EnumSwitch::getAMissingCase#ff with 663127 rows and 2 columns.
After:
(5s) Tuple counts for Stmt::EnumSwitch::getAMissingCase_dispred#ff#antijoin_rhs:
746029 ~0% {2} r1 = JOIN Stmt::EnumSwitch::getAMissingCase_dispred#ff#shared AS L WITH Enum::Enum::getAnEnumConstant_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, L.<1>
3116197 ~2% {3} r2 = JOIN r1 WITH Enum::EnumConstant::getInitializer_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<0>
3116197 ~0% {3} r3 = JOIN r2 WITH initialisers AS R ON FIRST 1 OUTPUT R.<2>, r2.<1>, r2.<2>
3116197 ~311% {3} r4 = JOIN r3 WITH Expr::Expr::getValue_dispred#ff AS R ON FIRST 1 OUTPUT r3.<1>, R.<1>, r3.<2>
234348 ~185% {2} r5 = JOIN r4 WITH Stmt::EnumSwitch::matchesValue#ff AS R ON FIRST 2 OUTPUT r4.<0>, r4.<2>
return r5
(5s) Registering Stmt::EnumSwitch::getAMissingCase_dispred#ff#antijoin_rhs + [] with content 173483d71508vl534mvlr1g0ehi12
(5s) >>> Wrote relation Stmt::EnumSwitch::getAMissingCase_dispred#ff#antijoin_rhs with 82902 rows and 2 columns.
(5s) Starting to evaluate predicate Stmt::EnumSwitch::getAMissingCase_dispred#ff/2@ae4c0b
(5s) Tuple counts for Stmt::EnumSwitch::getAMissingCase_dispred#ff:
746029 ~2% {2} r1 = JOIN Stmt::EnumSwitch::getAMissingCase_dispred#ff#shared AS L WITH Enum::Enum::getAnEnumConstant_dispred#ff AS R ON FIRST 1 OUTPUT L.<1>, R.<1>
663127 ~4% {2} r2 = r1 AND NOT Stmt::EnumSwitch::getAMissingCase_dispred#ff#antijoin_rhs AS R(r1.<0>, r1.<1>)
return r2
(5s) Registering Stmt::EnumSwitch::getAMissingCase_dispred#ff + [] with content 2060ff326cvhihcsvoph6k9divuv4
(5s) >>> Wrote relation Stmt::EnumSwitch::getAMissingCase_dispred#ff with 663127 rows and 2 columns.
This turned out to be a fairly simple but easy to make bug. When we want to
figure out the value pointed-to in a multi-assignment, we look at the left hand
side to see what value from the right hand side we should assign. Unfortunately,
we accidentally attempted to look up this information in the _left hand side_ of
the assignment, resulting in no points-to information at all. The only thing
needed to fix this was to properly link up the left and right hand sides: using
the left hand side to figure out what index to look at, and then looking up the
points-to information for the corresponding place in the right hand side.
This query used two fastTC operations that were already somewhat
inefficient on their own but could send the evaluator into an OOM loop
when run in parallel without enough RAM.
The fix is to recurse manually, starting just from the expressions that
are potential candidates for alerts.
Move the stdlib tests from test/{2,3}/library-tests/ into /test/library-tests/,
and deal with version by using sys.version_info (results should be the same for
both versions).
six tests were moved from /library-tests/web/client/stdlib => /library-tests/web/client/six
To ease the rollout of this test, currently we only report missing points-to
information for nodes that either
- appear as an argument in a call to a function named `check`, or
- appear inside a scope where the first line is annotated with a comment ending
in "check".
The idea behind the second version is that once we have points-to running at a
level where no node inside a scope that _ought_ to have points-to is missing
this information, we can simply remove all uses of `check(...)` from inside this
scope, and annotate the entire scope with `# check`. Once this has been done for
the entire file, we can then remove all the comments and just require
_everything_ to be checked.
Note that I don't expect all nodes to have the need for points-to information.
For instance, there are nodes representing scope entry and exit, and for these
it doesn't make sense to require that they "point-to" anything. Similarly,
`NameNode` appearing in a "store" (i.e. as the left hand side of an assignment)
do not strictly need to have points-to information, although it might be more
intuitive if they did.
Thus, the `relevant_node` predicate will almost certainly need to be extended to
exclude these kinds of nodes.
Fixes issue https://github.com/github/codeql-c-analysis-team/issues/20
This change undoes the workaround in https://github.com/Semmle/ql/pull/2736, and replaces it with a fix for the underlying cause. The problem was that the IR construction code for side effects incorrectly assumed that `BufferAccessOpcode` included `SizedBufferAccessOpcode`. I think that was actually a perfectly reasonable assumption to make, so I changed the `Opcode` hierarchy to make it true.
Following the setup I invented for library-tests/taint/unpacking.
TestStep is still a bit annoying, since the output is not easy to eyeball; but
for now I guess we can live with it :)
I honestly didn't get the point of DistinctStringKinds.ql, other than showing we
can handle multiple taint kinds
This is a temporary fix!
Added minimal working example (MWE) as a regression, so it's easier to fix the
real problem.
only Python 3 is facing the problem -- and without --max-import-depth=1 the test
times out at 10 minutes :O
Introduced a regression, since the old code was:
```
predicate is_a_string_type(ClassObject seqtype) {
seqtype = theBytesType() and major_version() = 2
or
seqtype = theUnicodeType()
}
```
but *now* we're good!
+ Extend PropertyInternal.getSetter to handle non-decorator
+ Add PropertyInternal.getDeleter
It seems like a bit hacky way to do things, since we're not using the
PropertySetterOrDeleter class at all, but for now I'll leave it be.
e.(StrConst).isDocString() can only hold if e instanceof StrConst, since we have
that condition on the line above, we can safely remove this condition.
* Adds the...Type() predicates as foresight modernizations.
* Removes predicates that are not currently ported/portable
* Adds range types
* Update python/ql/src/semmle/python/objects/ObjectAPI.qll
Co-Authored-By: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
* Update python/ql/src/semmle/python/objects/ObjectAPI.qll
Co-Authored-By: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
* Swaps xType for just x, at least when it's new
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
The `isInCycle` predicate would take a long time on Wireshark with 6GB
RAM, sometimes OOMing in the fastTC HOP. Analyzing wireshark with 6GB is
important because that's the standard configuration on our Jenkins
workers. With this commit, I can analyze Wireshark with 6GB on my
laptop.
The `getNonPhiOperandDef` predicate on Wireshark is 34M tuples, while
`getDefIfHasNeighbors` is 11M tuples, and the TC of
`getDefIfHasNeighbors` is 23M tuples (487 MB).
When multiple results are available, we usually name the function
`getAnArgument` or `getASomething`. The support for django copied the way bottle
did things, so this commits cleans up both
Without this fix, running the full LGTM suite would get the IR evaluated
twice. That's because we have multiple IPA types and constructors with
the same name (like `TInstruction` and `MkIRFunction`), and the QL
compiler chooses how to disambiguate those names differently depending
on import order.
I've tested that the IR is only evaluated once now by running the whole
suite on a tiny project (jbj/magicrescue) and looking at the output of
perl -ne 'print if /^RESULTS IN:/ .. /^\[/ and not /^\[/' runSnapshotQueries-debug.log | sort |uniq -c |sort -n |less
Instructions that are removed from the normal value numbering recursion
because they have a duplicated type or AST element get unique value
numbers rather than going unnumbered. This ensures comparisons of value
numbers using `!=` hold for filtered instructions.
This entry in `.codeqlmanifest.json` was intended to allow
unpacking the CodeQL CLI as a subdirectory of `ql`, and things
would Just Work.
However, it is not necessary anymore because recent releases of
the CLI will search their own directory as a fallback
_independently_ of the parent directory.
On the contrary, removing this link will make internal testing
easier because you then run a test build of the CLI with
`--search-path` pointing to the `ql` checkout without inadvertently
making extractors in a _different_ build that is unpacked there visible.
On some snapshots, notably ffmpeg, the IR `ValueNumbering` recursion
would generate billions of tuples and eventually run out of space.
It turns out it was fairly common for an `Instruction` to get more than
one `ValueNumber` in the base cases for `VariableAddressInstruction` and
`InitializeParameterInstruction`, and it could also happen in an
instruction with more than one operand of the same `OperandTag`. When a
binary operation was applied to an instruction with `m` value numbers
and another instruction with `n` value numbers, the result would get
`m * n` value numbers. This led to doubly-exponential growth in the
number of value numbers in rare cases.
The underlying reason why a `VariableAddressInstruction` could get
multiple value numbers is that it was keyed on the associated
`IRVariable`, and the `IRVariable` is defined in part by the type of its
underlying `Variable` (or other AST element). If the extractor defines a
variable to have multiple types because of linker ambiguity, this leads
to the creation of multiple `IRVariable`s. That should ideally be solved
in `TIRVariable.qll`, but for now I've put a workaround in
`ValueNumberingInternal.qll` instead.
To remove the problem with instructions having multiple operands, the
construction in `Operand.qll` will now filter out any such operand. It
wasn't enough to apply that filter to the `raw` stage, so I've applied
it to all three stages.
Before I added `--max-import-depth=2`, there was a bit of trouble, where it
would alert on `from pkg_ok import foo2` -- since all the `pkg_ok.foo<n>`
modules were missing, I guess the analysis didn't make any assumptions on
whether `foo2` is a module or a regular attribute.
So I've been thinking a bit about import pkg_ok.foo1 after reading the Python
references for imports of submodules
https://docs.python.org/3/reference/import.html#submodules
> When a submodule is loaded using any mechanism (...) a binding is placed in the
parent module’s namespace to the submodule object. For example, if package spam
has a submodule foo, after importing spam.foo, spam will have an attribute foo
which is bound to the submodule.
That does at least explain what is going on here.
I feel that import pkg_ok.foo1 might be a very contrived example. In principle
it should be an alert, since the module pkg_ok ends up with an import of itself,
but my gut feeling is that in practice it's not a very important piece of code
to give alerts for. if we really care about giving these import related alerts,
we could probably add a new query for this pattern, as it's kind of surprising
that it works when you're just an ordinary python programmer.
This predicate got an unfortunate join order, leading to these tuple
counts on ElektraInitiative/libelektra:
(290s) Tuple counts for AliasedSSA::VariableMemoryLocation::coversEntireVariable_dispred#f:
57117 ~0% {3} r1 = SCAN IRType::IRType::getByteSize_dispred#ff AS I OUTPUT 0, (I.<1> * 8), I.<0>
421445272 ~0% {3} r2 = JOIN r1 WITH AliasedSSA::VariableMemoryLocation#fffffff_5601#join_rhs AS R ON FIRST 2 OUTPUT R.<3>, r1.<2>, R.<2>
103282 ~2% {1} r3 = JOIN r2 WITH AliasConfiguration::Allocation::getIRType_dispred#ff AS R ON FIRST 2 OUTPUT r2.<2>
return r3
With this commit, we get these tuple counts instead:
(0s) Tuple counts for AliasedSSA::VariableMemoryLocation::varIRTypeHasBitRange#bff:
361874 ~0% {3} r1 = SCAN AliasedSSA::VariableMemoryLocation#fffffff AS I OUTPUT I.<1>, 0, I.<0>
361874 ~0% {3} r2 = JOIN r1 WITH AliasConfiguration::Allocation::getIRType_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, 0, r1.<2>
361874 ~1% {3} r3 = JOIN r2 WITH IRType::IRType::getByteSize_dispred#ff AS R ON FIRST 1 OUTPUT r2.<2>, 0, (R.<1> * 8)
return r3
(0s) Tuple counts for AliasedSSA::VariableMemoryLocation::coversEntireVariable_dispred#f:
103282 ~2% {1} r1 = JOIN AliasedSSA::VariableMemoryLocation#fffffff_056#join_rhs AS L WITH AliasedSSA::VariableMemoryLocation::varIRTypeHasBitRange#bff AS R ON FIRST 3 OUTPUT L.<0>
103282 ~2% {1} r2 = STREAM DEDUP r1
return r2
We had these tuple counts on ElektraInitiative/libelektra (note that the
`modelFlow` predicate got inlined into
`simpleInstructionLocalFlowStep`):
(652s) Tuple counts for DataFlowUtil::simpleInstructionLocalFlowStep#ff:
...
19701 ~1% {4} r27 = JOIN r26 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r26.<2>, r26.<1>, r26.<0>
7908 ~0% {3} r28 = JOIN r27 WITH SSAConstruction::Cached::getInstructionIndex#ff@staged_ext AS R ON FIRST 2 OUTPUT r27.<0>, r27.<2>, r27.<3>
4023 ~0% {3} r29 = JOIN r28 WITH Instruction::WriteSideEffectInstruction#class#ff AS R ON FIRST 1 OUTPUT r28.<1>, r28.<2>, r28.<0>
...
1060807009 ~3% {3} r34 = JOIN r33 WITH SSAConstruction::Cached::getInstructionIndex#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r33.<1>, r33.<2>
15670 ~5% {2} r35 = JOIN r34 WITH Instruction::SideEffectInstruction::getPrimaryInstruction_dispred#3#ff AS R ON FIRST 2 OUTPUT r34.<0>, r34.<2>
7973 ~0% {2} r36 = JOIN r35 WITH Instruction::ReadSideEffectInstruction::getSideEffectOperand_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r35.<1>
...
In this predicate there are two cases (`WriteSideEffectInstruction` and
`ReadSideEffectInstruction`) where we need to join on both the call and
the argument index of a side effect. It works well enough for the first
case, `WriteSideEffectInstruction`, where the call is joined on before
the index, but it explodes in the second case,
`ReadSideEffectInstruction`, where the index is joined first. To fix the
second case, and to guard against future optimizer accidents in the
first case, this commit changes both of those cases to use a new helper
predicate that makes it possible to join on both columns at once. The
resulting tuple counts are:
(3s) Tuple counts for DataFlowUtil::simpleInstructionLocalFlowStep#ff:
...
7908 ~0% {3} r27 = JOIN r26 WITH DataFlowUtil::getSideEffectFor#fff AS R ON FIRST 2 OUTPUT R.<2>, r26.<2>, r26.<0>
4023 ~0% {3} r28 = JOIN r27 WITH Instruction::WriteSideEffectInstruction#class#ff AS R ON FIRST 1 OUTPUT r27.<1>, r27.<2>, r27.<0>
...
15670 ~5% {2} r33 = JOIN r32 WITH DataFlowUtil::getSideEffectFor#fff AS R ON FIRST 2 OUTPUT R.<2>, r32.<2>
7973 ~0% {2} r34 = JOIN r33 WITH Instruction::ReadSideEffectInstruction::getSideEffectOperand_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r33.<1>
...
The bulge is now limited to a factor of two, and that's just because I
didn't write separate versions of `getSideEffectFor` for
`ReadSideEffectInstruction` and `WriteSideEffectInstruction`.
Avoids a Cartesian product on nodes:
```
[2020-02-07 11:01:22] (432s) Tuple counts for dom#DataFlowImpl::TPathNodeSink#ff:
0 ~0% {2} r1 = JOIN DataFlowImpl::Configuration::isSource_dispred#ff AS L WITH DataFlowImpl::Configuration::isSink_dispred#ff AS R ON FIRST 2 OUTPUT R.<1>, R.<0>
101611 ~0% {2} r2 = SCAN DataFlowImpl::PathNodeMid#class#ffffff AS I OUTPUT I.<5>, I.<0>
3534537047 ~3% {3} r3 = JOIN r2 WITH DataFlowImpl::Configuration::isSink_dispred#ff AS R ON FIRST 1 OUTPUT r2.<1>, R.<1>, R.<0>
251 ~41% {3} r4 = JOIN r3 WITH project#DataFlowImpl::pathStep#fffff AS R ON FIRST 2 OUTPUT R.<2>, r3.<2>, r3.<1>
251 ~50% {2} r5 = JOIN r4 WITH DataFlowImpl::TNil#ff_1#join_rhs AS R ON FIRST 1 OUTPUT r4.<2>, r4.<1>
251 ~50% {2} r6 = r1 \/ r5
323 ~67% {3} r7 = JOIN r6 WITH DataFlowImpl::flow#ff AS R ON FIRST 1 OUTPUT r6.<1>, r6.<0>, R.<1>
288 ~58% {3} r8 = SELECT r7 ON r7.<2> >= r7.<0>
251 ~53% {3} r9 = SELECT r8 ON r8.<2> <= r8.<0>
251 ~50% {2} r10 = SCAN r9 OUTPUT r9.<1>, r9.<0>
```
Joining on the enclosing callable before the kind is crucial, as witnessed by this pipeline:
```
[2020-02-06 17:58:21] (1086s) Starting to evaluate predicate DataFlowImplCommon::getReturnPosition#ff/2@83c546
[2020-02-06 18:53:16] (4382s) Tuple counts for DataFlowImplCommon::getReturnPosition#ff:
385478 ~1% {3} r1 = SCAN DataFlowImplCommon::Cached::TReturnPosition0#fff@staged_ext AS I OUTPUT I.<2>, I.<0>, I.<1>
385478 ~2% {3} r2 = JOIN r1 WITH DataFlowImplCommon::Cached::TReturnPosition0#fff_2#join_rhs AS R ON FIRST 1 OUTPUT r1.<2>, r1.<1>, r1.<0>
58638116860 ~0% {3} r3 = JOIN r2 WITH DataFlowImplCommon::ReturnNodeExt::getKind_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r2.<1>, r2.<2>
914049 ~0% {2} r4 = JOIN r3 WITH DataFlowImplCommon::returnNodeGetEnclosingCallable#ff AS R ON FIRST 2 OUTPUT r3.<0>, r3.<2>
return r4
```
We can't understand the real `six.py` file, so we have some internal plumbing
that enables us to handle six anyway. While updating that, I had a hell of a lot
of trouble with these tests.
What we actually want, is to see that we can understand what the values imported
from six are (i.e., their points-to information). I added a few more, that I
think would be useful. If we can figure out all of these, I don't actually care
if we're doing it by understanding the real `six.py` file, or by some internal
trick.
I verified that we don't get results with the real `six.py` file by disabling
our internal tricks, and putting a copy of six.py just next to test.py.
We used to have an other file that would list all the properties we knew and
their value, but that turned out to be a fragile and annoying test, since the
results differed from which version of python you ran it with (3.5 vs 3.8) and
which machine you ran it on (my machien vs jenkins). I don't care about the
results in this file, and I can certainly not eyeball it to see if it's correct
or not.
Also fixed a minor bug where we should have been treating `AllNonLocalMemory` as _totally_ overlapping an access to a non-local variable, rather than _partially_ overlapping it. This fix is exhibited both in the new test case and in a couple existing test functions in `ssa.cpp`.
The predicate
```
argumentValueFlowsThrough(ArgumentNode arg, OutNode out, CallContext cc)
```
has been generalized to
```
argumentValueFlowsThrough(
DataFlowCall call, ArgumentNode arg, Node out, ContentOption contentIn,
ContentOption contentOut
)
```
This enables us to summarize normal flow-through (as before), getters, setters,
as well as getter-setters.
Replacing `Value.booleanValue`. We wanted to match `Object.booleanValue` that
only gives a result if it is either `true` or `false`, but also wanted to keep
the flexibility to see if the Value _could_ be `true`/`false`. We don't have a
motivating usecase, so let's see if we ever need it :P
+ fix modernisation regression on py/jinja2/autoescape-false
Since `IntObjectInternal` extends `TInt`, and `TInt` is defined for all
instances of `Builtin.intValue`, and `Builtin.intValue` includes both `int` and
`long`, we don't need to handles Longs in a special manner, as we did in NumericObject.
It turns out that the evaluator will evaluate the GVN stage even when no
predicate from it is needed after optimization of the subsequent stages.
The GVN library is expensive to evaluate, and it'll become even more
expensive when we switch its implementation to IR.
This PR disables the use of GVN in `DataFlow::BarrierNode` for the AST
data-flow library, which should improve performance when evaluating a
single data-flow query on a snapshot with no cache. Precision decreases
slightly, leading to a new FP in the qltests.
There is no corresponding change for the IR data-flow library since IR
GVN is not very expensive.
We were hitting a combinatorial explosion in `hasDefinitionAtRank` for functions that contain a large number of string literals. The problem was that every `Chi` instruction for `AliasedVirtualVariable` was treated as a definition of every string literal. We already mark string literals as `isReadOnly()`, but we were allowing `AliasedVirtualVariable` to define read-only locations so that the `AliasedDefinition` instruction would provide the initial definition for all string literals.
To fix this, I've introduced the new `InitializeNonLocal` instruction, which is inserted in the prologue of every function right after `AliasedDefinition`. It provides the initial definition for every non-stack memory location, including read-only locations, but is never written to anywhere else. It is the conterpart of the `AliasedUse` instruction in the function epilogue, which represents the use of all non-stack memory after the function returns. I considered renaming `AliasedUse` to `ReturnNonLocal`, to match the `InitializeXXX`/`ReturnXXX` pattern we already use for parameters and indirections, but held off to avoid unnecessary churn. Any thoughts on whether I should make this name change?
This change has a significant speedup in evaluation time for a few of our troublesome databases:
`attnam/ivan`: 13%
`awslabs/s2n`: 26%
`SinaMostafanejad/OpenRDM`: 7%
`zcoinofficial/zcoin`: 8%
As a temporary workaround in the `DefaultTaintTracking` library, we
funnel flow across calls by conflating pointer and object both at the
caller and the callee.
The three cases in `adjustedSink` were deleted because they are now
covered by the one case for `ReadSideEffectInstruction` in
`instructionTaintStep`.
When enabling `DefaultTaintTracking`, this commit on top of #2736 has
the effect effect of recovering two lost results:
--- a/cpp/ql/test/query-tests/Security/CWE/CWE-119/semmle/tests/OverflowDestination.expected
+++ b/cpp/ql/test/query-tests/Security/CWE/CWE-119/semmle/tests/OverflowDestination.expected
@@ -1,2 +1,4 @@
| overflowdestination.cpp:30:2:30:8 | call to strncpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
| overflowdestination.cpp:46:2:46:7 | call to memcpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
+| overflowdestination.cpp:53:2:53:7 | call to memcpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
+| overflowdestination.cpp:64:2:64:7 | call to memcpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
In the internal repo, we recover one lost result. Additionally, there
are two queries that gain an extra source for an existing sink. I'll
classify that as noise. The new results look like this:
foo(argv); // this `argv` is a new source for the sink in `bar`
bar(argv); // this `argv` is the existing source for the sink in `bar`
The type of the source argument to `memcpy` is `void *`, and somehow
that meant that the copied object itself got type `void`. Since that has
size 0, the SSA construction did not model it as reading from the last
write.
This is probably not the right fix, but maybe it's good enough for now.
The right fix would ensure that the type reported by
`hasOperandMemoryAccess` is `UnknownType`.
When `DefaultTaintTracking.qll` is enabled, this commit has the effect
of restoring a lost results:
--- a/cpp/ql/test/query-tests/Security/CWE/CWE-119/semmle/tests/OverflowDestination.expected
+++ b/cpp/ql/test/query-tests/Security/CWE/CWE-119/semmle/tests/OverflowDestination.expected
@@ -1 +1,2 @@
| overflowdestination.cpp:30:2:30:8 | call to strncpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
+| overflowdestination.cpp:46:2:46:7 | call to memcpy | To avoid overflow, this operation should be bounded by destination-buffer size, not source-buffer size. |
Adds
- `StringValue` as a new class,
- `Value::booleanValue` which returns the boolean interpretation of the given
value, and
- `ClassValue::str` which returns the value of the `str` class, depending on the
Python version.
This changes the flow to be taint rather than data flow, and it extends
it to include chi instructions with unknown type as long as they're not
for the `AliasedVirtualVariable`.
We're losing three good test results because these tests are not
affected by `DefaultTaintTracking.qll`. The taint step added here can
later be ported to `TaintTrackingUtil.qll` to recover these results, but
we probably want a better API than transitive-closure search through
instructions before doing that.
This change ensures that the diff test will show the difference between
the old and the new library even after we switch the default
implementation of `security.TaintTracking` to be the new one.
The code contained the remains of how `isUserInput` in `Security.qll`
used to be ported to IR. It's wrong to use that port since many queries
call `userInput` directly to get the "cause" string.
This introduces a new type of `MemoryLocation`: `EntireAllocationMemoryLocation`, representing an entire contiguous allocation whose size is not known. This is used to model the memory accesses on `InitializeIndirection` and `ReturnIndirection`.
Instead of tracking `IRVariable`s directly, alias analysis now tracks instances of the `Allocation` type provided by its `Configuration` parameter. For unaliased SSA, an `Allocation` is just an `IRAutomaticVariable`. For aliased SSA, an `Allocation` is either an `IRVariable` or the memory pointed to by an indirect parameter.
Some of the tests currently fail, since they can't reproduce the old tests
results (since the sinks/sources defined in the library code are not
HttpResponseTaintSink/HttpRequestTaintSource)
Our definition of `toString` for the internal tuple objects we create during the
points-to analysis may have been a _tad_ too ambitious. In particular, it can
easily lead to non-termination, e.g. using the following piece of code:
```python
x = ()
while True:
x = (x, x)
```
This commit cuts off the infinite recursion by replacing _nested_ tuples with
the string "...". In particular this means even non-recursive tuples will be cut
off at that point, so that the following tuples
```python
(1, "2")
((3, 4), [5, 6])
(1, 2, 3, 4, 5)
```
Get the following string representations.
```
"(int 1, '2', )"
"(..., List, )"
"(int 1, int 2, int 3, 2 more...)"
```
They are not tainted in assignment, only in use.
I also adopted an attempt at a better test-setup, where it's easy to see if
everything is the way you hoped for, instead of browsing through 100 of lines of
taint-step output :P
We designed the IR's `DataFlow::Node.asExpr` very carefully so that it's
suitable for taint tracking, but then we didn't use it in
`DefaultTaintTracking.qll`. This meant that the sources in
`ArithmeticWithExtremeValues.ql` didn't get associated with any
`Instruction` and thus didn't propagate anywhere.
With this commit, the mapping of `Expr`-based sources to IR data-flow
nodes uses `asExpr`.
This commit undoes the code sharing between `TranslatedAssignExpr` (`=`)
and `TranslatedAssignOperation` (`+=`, `<<=`, ...). In the next commit,
when we change how the `Load` works on the LHS of
`TranslatedAssignOperation`, these classes will become so different that
sharing is no longer helpful.
Previously, the `Load` would be associated with the `CrementOperation`
rather than its operand, which gave surprising results when mapping
taint sinks back to `Expr`.
The changes in `raw_ir.expected` are to add `Copy` operations on the
`x++` in code like `y = x++`. This is now needed because the result that
`x++` would otherwise have (the Load) no longer belongs to the `++`
expression. Copies are inserted to ensure that all expressions are
associated with an `Instruction` result.
The changes in `*aliased_ssa_ir.expected` appear to be just wobble.
There was already a `WriteSideEffectInstruction` class that served as a
superclass for all the specific write side effects. This new class
serves the same purpose for read side effects.
When building SSA, we'll be assuming that stack variables do not escape, at least until we improve our alias analysis. I've added a new `IREscapeAnalysisConfiguration` class to allow the query to control this, and a new `UseSoundEscapeAnalysis.qll` module that can be imported to switch to the sound escape analysis. I've cloned the existing IR and SSA tests to have both sound and unsound versions. There were relatively few diffs in the IR dump tests, and the sanity tests still give the same results after one change described below.
Assuming that stack variables do not escape exposed an existing bug where we do not emit an `Uninitialized` instruction for the temporary variables used by `return` statements and `throw` expressions, even if the initializer is a constructor call or array initializer. I've refactored the code for handling elements that initialize a variable to share a common base class. I added a test case for returning an object initialized by constructor call, and ensured that the IR diffs for the existing `throw` test cases are correct.
The IR data flow library now supports virtual dispatch with a library
that's similar to `security.TaintTracking`. In particular, it should
have the same performance characteristics. The main difference is that
non-recursive callers of `flowsFrom` now pass `_` instead of `true` for
`boolean allowFromArg`. This change allows flow through `return` to
actually work.
The aliased SSA code was assuming that, for every automatic variable, there would be at least one memory access that reads or writes the entire variable. We've encountered a couple cases where that isn't true due to extractor issues. As a workaround, we now always create the `VariableMemoryLocation` for every local variable.
I've also added a sanity test to detect this condition in the future.
Along the way, I had to fix a perf issue in the PrintIR code. When determining the ID of a result based on line number, we were considering all `Instruction`s generated for a particular line, regardless of whether they were all in the same `IRFunction`. In addition, the predicate had what appeared to be a bad join order that made it take forever on large snapshots. I've scoped it down to just consider `Instruction`s in the same function, and outlined that predicate to fix the join order issue. This causes some numbering changes, but they're for the better. I don't think there was actually any nondeterminism there before, but now the numbering won't depend on the number of instantiations of a template, either.
The `overlappingVariableMemoryLocations` predicate was a helper
predicate introduced to fix a join-order issue in
`overlappingIRVariableMemoryLocations`. Unfortunately it caused a
performance issue of its own because it could grow too large. On the
small project (38MB zip) awslabs/s2n there were 181M rows in
`overlappingVariableMemoryLocations`, and it took 134s to evaluate.
The fix is to collapse the two predicates into one and fix join ordering
by including an extra column in the predicates being joined.
In addition, some parameters were reordered to avoid the overhead of
auto-generated `join_rhs` predicates.
Tuple counts of `overlappingVariableMemoryLocations` before:
623285 ~176% {2} r1 = JOIN AliasedSSA::isCoveredOffset#fff_120#join_rhs AS L WITH AliasedSSA::isCoveredOffset#fff_120#join_rhs AS R ON FIRST 2 OUTPUT L.<2>, R.<2>
119138 ~3% {2} r2 = SCAN AliasedSSA::VariableMemoryLocation::getVirtualVariable_dispred#ff AS I OUTPUT I.<1>, I.<0>
172192346 ~0% {2} r3 = JOIN r2 WITH AliasedSSA::hasUnknownOffset#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r2.<1>
172815631 ~0% {2} r4 = r1 \/ r3
172192346 ~0% {2} r5 = JOIN r2 WITH AliasedSSA::hasUnknownOffset#ff_10#join_rhs AS R ON FIRST 1 OUTPUT r2.<1>, R.<1>
345007977 ~87% {2} r6 = r4 \/ r5
return r6
Tuple counts of `overlappingIRVariableMemoryLocations` after:
117021 ~134% {2} r1 = JOIN AliasedSSA::isCoveredOffset#ffff AS L WITH AliasedSSA::isCoveredOffset#ffff AS R ON FIRST 3 OUTPUT L.<3>, R.<3>
201486 ~1% {2} r2 = JOIN AliasedSSA::hasUnknownOffset#fff AS L WITH AliasedSSA::hasVariableAndVirtualVariable#fff AS R ON FIRST 2 OUTPUT L.<2>, R.<2>
318507 ~26% {2} r3 = r1 \/ r2
201486 ~3% {2} r4 = JOIN AliasedSSA::hasUnknownOffset#fff AS L WITH AliasedSSA::hasVariableAndVirtualVariable#fff AS R ON FIRST 2 OUTPUT R.<2>, L.<2>
519993 ~92% {2} r5 = r3 \/ r4
return r5
Previously, we had several predicates on `Instruction` and `Operand` whose values were determined solely by the opcode of the instruction. For large snapshots, this meant that we would populate large tables mapping each of the millions of `Instruction`s to the appropriate value, times three (once for each IR flavor).
This change moves all of these opcode properties onto `Opcode` itself, with inline wrapper predicates on `Instruction` and `Operand` where necessary. On smaller snapshots, like ChakraCore, performance is a wash, but this did speed up Wireshark by about 4%.
Even ignoring the modest performance benefit, having these properties defined on `Opcode` seems like a better organization than having them on `Instruction` and `Operand`.
Original code:
exists(Expr e, For forloop | forloop = loop and e.pointsTo(_, _, capturing) |
not loop.contains(e)
)
The new version will preserve the same semantics. The problem with the first
rewrite was that `not loop.(For).somethingMore` would hold for any AstNode that
was not a For
Our database representation of ASTs does not use assignment patterns, instead encoding the relevant information directly in the associated function/loop/assignment. We convert from an AST with assignment patterns to one without during parsing, so the extractor does not expect any assignment patterns to be present in the AST.
Due to a bug in the parser, this can currently happen for malformed programs. While we should fix that bug once it gets fixed in Acorn, it also makes sense for the extractor to be more robust, so this PR teaches the `ASTExtractor` pass to raise a parse error when it encounters an assignment pattern, and all other passes to simply ignore them.
The `overlappingVariableMemoryLocations` predicate was a helper
predicate introduced to fix a join-order issue in
`overlappingIRVariableMemoryLocations`. Unfortunately it caused a
performance issue of its own because it could grow too large. On the
small project (38MB zip) awslabs/s2n there were 181M rows in
`overlappingVariableMemoryLocations`, and it took 134s to evaluate.
The fix is to collapse the two predicates into one and fix join ordering
by including an extra column in the predicates being joined.
In addition, some parameters were reordered to avoid the overhead of
auto-generated `join_rhs` predicates.
Tuple counts of `overlappingVariableMemoryLocations` before:
623285 ~176% {2} r1 = JOIN AliasedSSA::isCoveredOffset#fff_120#join_rhs AS L WITH AliasedSSA::isCoveredOffset#fff_120#join_rhs AS R ON FIRST 2 OUTPUT L.<2>, R.<2>
119138 ~3% {2} r2 = SCAN AliasedSSA::VariableMemoryLocation::getVirtualVariable_dispred#ff AS I OUTPUT I.<1>, I.<0>
172192346 ~0% {2} r3 = JOIN r2 WITH AliasedSSA::hasUnknownOffset#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r2.<1>
172815631 ~0% {2} r4 = r1 \/ r3
172192346 ~0% {2} r5 = JOIN r2 WITH AliasedSSA::hasUnknownOffset#ff_10#join_rhs AS R ON FIRST 1 OUTPUT r2.<1>, R.<1>
345007977 ~87% {2} r6 = r4 \/ r5
return r6
Tuple counts of `overlappingIRVariableMemoryLocations` after:
117021 ~134% {2} r1 = JOIN AliasedSSA::isCoveredOffset#ffff AS L WITH AliasedSSA::isCoveredOffset#ffff AS R ON FIRST 3 OUTPUT L.<3>, R.<3>
201486 ~1% {2} r2 = JOIN AliasedSSA::hasUnknownOffset#fff AS L WITH AliasedSSA::hasVariableAndVirtualVariable#fff AS R ON FIRST 2 OUTPUT L.<2>, R.<2>
318507 ~26% {2} r3 = r1 \/ r2
201486 ~3% {2} r4 = JOIN AliasedSSA::hasUnknownOffset#fff AS L WITH AliasedSSA::hasVariableAndVirtualVariable#fff AS R ON FIRST 2 OUTPUT R.<2>, L.<2>
519993 ~92% {2} r5 = r3 \/ r4
return r5
Performance on `taers232c/GAMADV-X` (which exhibited pathological behaviour in
the most recent dist upgrade) went from ~670s to ~313s on
`py/hardcoded-credentials`.
There are still a few tuple counts in the 10-100 million range, but this commit
takes care of all of the ones that numbered in the billions. (A single tuple
count in the 100-1000 million range remains, but it appears to be less critical,
taking only two seconds to calculate.)
Ternary conditionals `b ? x : y` mistakenly had taint-tracking steps from both
`b`, `x`, and `y` to the conditional expression itself. Flow from `b` was not
intented, and flow from `x` and `y` is already part of ordinary data flow.
The database type `@xmlparent` is defined a bit too loosely in that it includes all of `@file`, not just XML files. Fixing that would involve fiddling with the extractor/dbscheme, so I have opted to fix it at the QL level instead.
This used to take 30s on `cpython`.
```
Tuple counts for StatementNoEffect::side_effecting_binary#f:
46522 ~0% {2} r1 = ClassObject::ClassObject::hasAttribute_dispred#fb AS L AND NOT StatementNoEffect::side_effecting_binary#f#antijoin_rhs AS R(L.<0>, L.<1>)
46522 ~2% {2} r2 = SCAN r1 OUTPUT r1.<1>, r1.<0>
950960 ~2% {2} r3 = JOIN r2 WITH Operations::Operator::getSpecialMethodName_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r2.<1>
950960 ~2% {2} r4 = JOIN r3 WITH py_operators AS R ON FIRST 1 OUTPUT R.<2>, r3.<1>
950960 ~0% {3} r5 = JOIN r4 WITH AstGenerated::BinaryExpr_::getLeft_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r4.<1>, r4.<0>
122934382 ~0% {2} r6 = JOIN r2 WITH Operations::Cmpop::getSpecialMethodName_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r2.<1>
122934382 ~3% {3} r7 = JOIN r6 WITH project#Operations::Compare::compares_dispred#ffff#3_201#join_rhs AS R ON FIRST 1 OUTPUT R.<2>, r6.<1>, R.<1>
123885342 ~3% {3} r8 = r5 \/ r7
300 ~8% {1} r9 = JOIN r8 WITH project#Exprs::Expr::refersTo_dispred#ffff AS R ON FIRST 2 OUTPUT r8.<2>
return r9
```
With this commit, it takes a few milliseconds.
On wireshark/wireshark, `isInCycle` ran into a low-memory loop on the
`aliased_ssa` stage. It shouldn't be necessary to detect cycles after
the `raw` stage, so this commit moves cycle detection into the
`Construction` modules and makes it a no-op in `SSAConstruction.qll`.
This commit adds type information to data flow paths, by mapping node types onto
the smaller set of GVN types, and implementing `ppReprType()`.
The effect is a mere change in `DataFlow::PathNode::toString()`; no type-based
pruning is done yet.
The `queries.xml` file defines which extractor the `codeql test` runner will use
to extract databases for the tests. In the future one will be able to write this
information in `qlpack.yml`, but we can't do that immediately because the
_existing_ CodeQL tooling would refuse to parse a `qlpack.yml` that has
the new field in it.
The `queries.xml` file defines which extractor the `codeql test` runner will use
to extract databases for the tests. In the future one will be able to write this
information in `qlpack.yml`, but we can't do that immediately because the
_existing_ CodeQL tooling would refuse to parse a `qlpack.yml` that has the new
field in it.
Adding a queries.xml file means that the normalization of file names in the test
output changes even with the old QLTest, so there are a number of consequential
updates of expected output files.
With the coming `codeql test` support, the `predefined_macros` file will not
necessarily be located under a `tools` directory. Change the test to hide more
of its actual path, so it will work in both cases.
This is a test of an internal query for the Semmle repository. It cannot
run against the public QL repository alone, and therefore should not be
tested here.
https://git.semmle.com/Semmle/code/pull/35690 adds the test back to the
internal repo.
Previously, we would consider an HTML file to be a dependent of all scripts embedded in it. Now we instead consider each JavaScript toplevel inside the HTML file to be a dependent, which is more sensible anyway.
It looks like allowing statics in `Nullness.qll` is fine since it's a
"may be null" analysis rather than a "must be null" analysis.
This reverts commit f5b9837e19.
These were originally meant to give you the term that is textually matched right before/right after the receiver. When I introduced support for lookbehinds, I changed the behaviour to give you the term that is _operationally_ matched before/after the receiver (remember that lookbehinds are implemented by reverse-matching).
However, I think that's rarely ever what you want, and is wrong for the only two uses of these predicates, where it's the textual matching order that we are after, not the operational order.
Consequently, I've changed the semantics back and updated the comments to hopefully clarify the intention.
- `false positives` becomes `false positive results`
- Items are listed alphabetically.
- Query IDs are listed.
Also, some of the queries had the wrong name (query message rather than the
actual query name). These have been fixed.
In the IR, some memory accesses are "must" accesses (the entire memory location is always read or written), and some are "may" accesses (some, all, or none of the bits in the location are written). We previously had to special case specific "may" accesses in a few places. This change regularizes our handling of "may" accesses.
The `MemoryAccessKind` enumeration now describes only the extent of the access (the set of locations potentially accessed), but does not distinguish "must" from "may". The new predicates `Operand.hasMayMemoryAccess()` and `Instruction.hasResultMayMemoryAccess()` hold when the access is a "may" access.
Unaliased SSA now correctly ignores variables that are ever accessed via a "may" access.
Aliased SSA now distinguishes `MemoryLocation`s for "may" and "must" accesses. I've refactored `getOverlap()` into the core `getExtentOverlap()`, which considers only the extent, but not the "may" vs. "must", and `getOverlap()`, which tweaks the result of `getExtentOverlap()` based on "may" vs. "must" and read-only locations.
When determining the overlap between a `Phi` operand and its definition, we now use the result of the defining `Chi` instruction, if one exists. This gives exact definitions for `Phi` operands for virtual variables.
Should fix#2426.
Essentially, we disregard expressions used inside annotations, if these
annotations occur in a file that has `from __future__ import annotations`, as
this prevents the annotations from being evaluated.
The `loopVariant` predicate in `ComparisonWithWiderType.ql` is intended
to identify loop counters, but it was too much of a stretch to apply it
to any subexpression of the small side of the comparison.
This change fixes two false positives on arvidn/libtorrent and many
others seen in the wild (on Linux, CoreCLR, ffmpeg, ...).
* Adds fix for __init_subclass__ bug.
* Adds test case.
* Move test on name.
I think it makes more sense here, alongside the other "special" method names.
Fixes#1969.
The points-to analysis does not know that the assignment `input = raw_input`
cannot fail under Python 2, and so there are two possible values that `input`
could point-to after exiting the exception handler: the built-in `input`, or the
built-in `raw_input`. In the latter case we do not want to report the alert, and
so adding a check that the given function does not point-to the built-in
`raw_input` suffices.
Files have strange `:0:0:0:0` locations for... reasons. This makes the predicates inherited from `Locatable` meaningless. A particularly bad case is `getNumLines()`, which will always return one. The right predicate to use is, of course, `getNumberOfLines()`, which is defined in `File` itself.
This fixes a failing qltest and makes the exclusion similar to what's in
`PointerOverflow.ql`. It's possible we should exclude based on both `+`
and `<`, but we can revisit that if false positives show up.
Before this change, evaluating `cpp/constant-comparison` followed by
`cpp/signed-overflow-check` would result in re-evaluation of almost all
the cached stages they share: CFG, basic blocks, SSA, and range
analysis. The same effect could be seen on `cpp/bad-strncpy-size`, which
also uses the GVN library.
We were privately importing `semmle.code.<lang>.ir.internal.Overlap`, but `PrintSSA.qll` was depending on it being public. This is made a little more complicated by the presence of cross-langage pyrameterized modules.
Should fix#1833, #2137, and #2187.
Internally, comprehensions are (at present) elaborated into local functions and
iterators as described in [PEP-289](https://www.python.org/dev/peps/pep-0289/).
That is, something like:
```
g = (x**2 for x in range(10))
```
becomes something akin to
```
def __gen(exp):
for x in exp:
yield x**2
g = __gen(iter(range(10)))
```
In the context of the top-level of a class, this means `__gen` looks as if it is
a method of the class, and in particular `exp` looks like it's the `self`
argument of this method, which leads the points-to analysis to think that `exp`
is an instance of the surrounding class itself.
The fix in this case is pretty simple: we look for occurrences of `exp` (in fact
called `.0` internally -- carefully chosen to _not_ be a valid Python
identifier) and explicitly exclude this parameter from being classified as a
`self` parameter.
The `SignedOverflowCheck.ql` query was very slow on certain snapshots
(jluttine/suitesparse and Chromium) due to bad magic in
`MacroInvocation::getAnAffectedElement_dispred#fb`. This commit doesn't
fix the bad magic but changes the exclusion mechanism to use a predicate
where we can better control the magic and optimization.
The query should also give more good results due to this new exclusion
mechanism, which is the same one used in its sibling,
`PointerOverflow.ql`.
Having `toString()` defined to be `none()` is a major headache when debugging,
as `toString`-less results are silently elided. This PR puts dummy `toString`s
in place of the `none()`s.
(I am mostly creating this to see if it impacts our tests and/or the
performance. If not, we may as well merge it.)
This might cause fewer variables to be analysed because not every use of
`LocalScopeVariable` was constrained by the def-use library. Hopefully
this leads to an improved nullness analysis since it avoids treating
`static T *x = nullptr;` the same as `static T *x; x = nullptr;`.
These changes should not affect semantics since these uses of
`LocalScopeVariable` were already constrained to stack variables by
their use of SSA or def-use.
In these files it was possible to remove calls to `isStatic` by
switching from `LocalScopeVariable` to `StackVariable`. This changes
semantics, hopefully for the better, to treat `thread_local` locals the
same as `static` locals.
This is now a copy of `LocalScopeVariableReachability.qll`, just with
`s/LocalScopeVariable/StackVariable/g`. It can be used as a drop-in
replacement since the `LocalScopeVariableReachability.qll` library
implementation was already restricted to `SemanticStackVariable`.
Fixes#1832.
In the taint sink, we add an additional check that the given control-flow node
can indeed point to a value that is mutable. This takes care of the guard on the
type.
If and when we get around to adding configurations for all of the taint
analyses, we may want to implement this as a barrier instead, pruning any steps
that go through a type test where the type is not mutable.
This commit adds support for (classical) NUnit assertions (see
https://github.com/nunit/docs/wiki/Assertions). Modern constraint-based assertions,
such as `Assert.That(o, Is.Not.Null)` are currently not supported, because they
would require a restructuring of the assertion library.
1. The two last examples were misleading at best. The first of those two
recommended casting to non-negative `int`s to `unsigned int` and then
checking if their addition would overflow, but overflow was
impossible because their sum (on 32-bit two's complement) could be at
most 2^32 - 2. The second example could lead to the wrong condition
(unsigned overflow) being checked if taken literally. Instead of
keeping that example, I reworeded the first paragraph of the
Recommendation section.
2. The assumptions about `delta` being positive was relaxed to
non-negative.
3. There was no need to assume that an unsigned short was non-negative.
4. Some of the suggestions were missing `i >`.
We no longer exclude macros based on their name, which means we can now
find results inside arguments to the `likely` macro in Linux (except
that Linux is compiled with `-fno-strict-overflow`).
This trick for excluding elements from macro bodies but not macro
arguments looks promising and should probably be used much more. With
this commit, it's now easy to use from any query.
Performance is still good because the new predicate gets appropriately
magiced.
Fixes https://github.com/Semmle/ql/issues/1572
Adjust mock so it's more aligned with what the flask code actually does. Tests
were passing before, even though we didn't handle the case in real code :\
Since we are now able to trace shared compilation builds on Linux and macOS
(starting from .NET Core 3), and always were able to on Windows, there is
no need to set `UseSharedCompilation=false` in those cases. This may have a
positive performance impact, as shared compilation is generally faster then
non-shared compilation.
Like the one existing in ControlFlowNode.
This is useful for checking class of value being poitned to, as
expr.pointsTo().getClass() = someClass
Without this you need to do
exists(Value v | v.getClass() = someClass | expr.pointsTo(v))
This adds IndirectMustWriteSideEffects for constructor qualifiers. The
introduced sanity failures result from constructor calls without qualifier
operands in the IR
The `toString` for IR data-flow nodes are now similar to AST data-flow
nodes. This should make it easier to use the IR as a drop-in replacement
in the future. There are still differences because the IR data flow
library takes conversions into account.
I did not attempt to align the new nodes we use for field flow. That can
come later, when we add field flow to IR data flow.
This string looked out of place compared to `ExplicitParameterNode`,
whose string is simply the name of the parameter and therefore
indistinguishable from an access to the parameter without looking at the
location also. This has not been a problem so far, and if we want to
distinguish more clearly between initial values and accesses at some
point, we should do it for `ExplicitParameterNode` and
`UninitializedNode` too.
This code of conduct outlines expectations for participation in the Semmle open source community, including any open source repositories on GitHub.com, as well as steps for reporting unacceptable behavior. We are committed to providing a welcoming and inspiring community for all.
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
People violating this code of conduct may be banned from the community.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
Our community strives to:
* Be friendly and patient: Remember you might not be communicating in someone else’s primary spoken or programming language, and others may not have your level of understanding.
* Be welcoming: Our community welcomes and supports people of all backgrounds and identities. This includes, but is not limited to members of any race, ethnicity, culture, national origin, color, immigration status, social and economic class, educational level, sex, sexual orientation, gender identity and expression, age, size, family status, political belief, religion, and mental and physical ability.
* Be respectful: We are a world-wide community of professionals, and we conduct ourselves professionally. Disagreement is no excuse for poor behavior and poor manners. Disrespectful and unacceptable behavior includes, but is not limited to:
* Violent threats or language.
* Discriminatory or derogatory jokes and language.
* Posting sexually explicit or violent material.
* Posting, or threatening to post, people’s personally identifying information (“doxing”).
* Insults, especially those using discriminatory terms or slurs.
* Behavior that could be perceived as sexual attention.
* Advocating for or encouraging any of the above behaviors.
* Understand disagreements: Disagreements, both social and technical, are useful learning opportunities. Seek to understand others’ viewpoints and resolve differences constructively.
## Our Standards
This code is not exhaustive or complete. It serves to capture our common understanding of a productive, collaborative environment. We expect the code to be followed in spirit as much as in the letter.
Examples of behavior that contributes to a positive environment for our
community include:
# Scope
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
This code of conduct applies to all repositories and communities for Semmle open source projects, regardless of whether or not the repository explicitly calls out its use of this code. The code also applies in public spaces when an individual is representing the Semmle open source community. Examples include using an official project email address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
# Reporting Code of Conduct Issues
We encourage members of the community to resolve issues on their own whenever possible. This builds a broader and deeper understanding and ultimately a healthier interaction. In the event that an issue cannot be resolved locally, please feel free to report your concerns by contacting code-of-conduct@semmle.com.
In your report please include:
* Your contact information.
* Names (real, usernames or pseudonyms) of any individuals involved. If there are additional witnesses, please include them as well.
* Your account of what occurred, and if you believe the incident is ongoing. If there is a publicly available record (e.g. a mailing list archive or a public chat log), please include a link or attachment.
* Any additional information that may be helpful.
## Enforcement Responsibilities
All reports will be reviewed by a multi-person team and will result in a response that is deemed necessary and appropriate to the circumstances. Where additional perspectives are needed, the team may seek insight from others with relevant expertise or experience. The confidentiality of the person reporting the incident will be kept at all times. Involved parties are never part of the review team.
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Anyone asked to stop unacceptable behavior is expected to comply immediately. If an individual engages in unacceptable behavior, the review team may take any action they deem appropriate, including a permanent ban from the community.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
*This text is licensed under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. It is based on a template established by the [TODO Group](http://todogroup.org/) and variants thereof used by numerous other large communities (e.g., [Microsoft](https://microsoft.github.io/codeofconduct/), [Facebook](https://code.fb.com/codeofconduct/), [Yahoo](https://yahoo.github.io/codeofconduct), [Twitter](https://github.com/twitter/code-of-conduct), [GitHub](https://blog.github.com/2015-07-20-adopting-the-open-code-of-conduct/)) and the Scope section from the [Contributor Covenant version 1.4](http://contributor-covenant.org/version/1/4/).*
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
opensource@github.com.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
We welcome contributions to our standard library and standard checks. Got an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request!
We welcome contributions to our CodeQL libraries and queries. Got an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Contributions to this project are [released](https://help.github.com/articles/github-terms-of-service/#6-contributions-under-repository-license) to the public under the [project's open source license](LICENSE).
Before we accept your pull request, we require that you have agreed to our Contributor License Agreement, this is not something that you need to do before you submit your pull request, but until you've done so, we will be unable to accept your contribution.
There is lots of useful documentation to help you write queries, ranging from information about query file structure to tutorials for specific target languages. For more information on the documentation available, see [CodeQL queries](https://help.semmle.com/QL/learn-ql/writing-queries/writing-queries.html) on [help.semmle.com](https://help.semmle.com).
## Adding a new query
If you have an idea for a query that you would like to share with other Semmle users, please open a pull request to add it to this repository.
Follow the steps below to help other users understand what your query does, and to ensure that your query is consistent with the other Semmle queries.
## Submitting a new experimental query
1.**Consult the documentation for query writers**
If you have an idea for a query that you would like to share with other CodeQL users, please open a pull request to add it to this repository. New queries start out in a `<language>/ql/src/experimental` directory, to which they can be merged when they meet the following requirements.
There is lots of useful documentation to help you write queries, ranging from information about query file structure to tutorials for specific target languages. For more information on the documentation available, see [Writing CodeQL queries](https://help.semmle.com/QL/learn-ql/writing-queries/writing-queries.html) on [help.semmle.com](https://help.semmle.com).
1.**Directory structure**
2.**Format your code correctly**
There are five language-specific query directories in this repository:
All of Semmle's standard queries and libraries are uniformly formatted for clarity and consistency, so we strongly recommend that all contributions follow the same formatting guidelines. If you use QL for Eclipse, you can auto-format your query in the [QL editor](https://help.semmle.com/ql-for-eclipse/Content/WebHelp/ql-editor.html). For more information, see the [CodeQL style guide](https://github.com/Semmle/ql/blob/master/docs/ql-style-guide.md).
* C/C++: `cpp/ql/src`
* C#:`csharp/ql/src`
* Java: `java/ql/src`
* JavaScript: `javascript/ql/src`
* Python: `python/ql/src`
3.**Make sure your query has the correct metadata**
Each language-specific directory contains further subdirectories that group queries based on their `@tags` or purpose.
- Experimental queries and libraries are stored in the `experimental` subdirectory within each language-specific directory in the [CodeQL repository](https://github.com/github/codeql). For example, experimental Java queries and libraries are stored in `java/ql/src/experimental` and any corresponding tests in `java/ql/test/experimental`.
- The structure of an `experimental` subdirectory mirrors the structure of its parent directory.
- Select or create an appropriate directory in `experimental` based on the existing directory structure of `experimental` or its parent directory.
Query metadata is used by Semmle's analysis to identify your query and make sure the query results are displayed properly.
The most important metadata to include are the `@name`, `@description`, and the `@kind`.
Other metadata properties (`@precision`, `@severity`, and `@tags`) are usually added after the query has been reviewed by Semmle staff.
For more information on writing query metadata, see the [Query metadata style guide](https://github.com/Semmle/ql/blob/master/docs/query-metadata-style-guide.md).
2.**Query metadata**
4.**Make sure the `select` statement is compatible with the query type**
- The query `@id` must conform to all the requirements in the [guide on query metadata](docs/query-metadata-style-guide.md#query-id-id). In particular, it must not clash with any other queries in the repository, and it must start with the appropriate language-specific prefix.
- The query must have a `@name` and `@description` to explain its purpose.
- The query must have a `@kind` and `@problem.severity` as required by CodeQL tools.
The `select` statement of your query must be compatible with the query type (determined by the `@kind` metadata property) for alert or path results to be displayed correctly in LGTM and QL for Eclipse.
For more information on `select` statement format, see [Introduction to query files](https://help.semmle.com/QL/learn-ql/writing-queries/introduction-to-queries.html#select-clause) on help.semmle.com.
For details, see the [guide on query metadata](docs/query-metadata-style-guide.md).
5.**Save your query in a `.ql` file in the correct language directory in this repository**
Make sure the `select` statement is compatible with the query `@kind`. See [About CodeQL queries](https://help.semmle.com/QL/learn-ql/writing-queries/introduction-to-queries.html#select-clause) on help.semmle.com.
There are five language-specific directories in this repository:
* C/C++: `ql/cpp/ql/src`
* C#:`ql/csharp/ql/src`
* Java: `ql/java/ql/src`
* JavaScript: `ql/javascript/ql/src`
* Python: `ql/python/ql/src`
3.**Formatting**
Each language-specific directory contains further subdirectories that group queries based on their `@tags` properties or purpose. Select the appropriate subdirectory for your new query, or create a new one if necessary.
- The queries and libraries must be autoformatted, for example using the "Format Document" command in [CodeQL for Visual Studio Code](https://help.semmle.com/codeql/codeql-for-vscode/procedures/about-codeql-for-vscode.html).
6.**Write a query help file**
4.**Compilation**
Query help files explain the purpose of your query to other users. Write your query help in a `.qhelp` file and save it in the same directory as your new query.
For more information on writing query help, see the [Query help style guide](https://github.com/Semmle/ql/blob/master/docs/query-help-style-guide.md).
- Compilation of the query and any associated libraries and tests must be resilient to future development of the [supported](docs/supported-queries.md) libraries. This means that the functionality cannot use internal libraries, cannot depend on the output of `getAQlClass`, and cannot make use of regexp matching on `toString`.
- The query and any associated libraries and tests must not cause any compiler warnings to be emitted (such as use of deprecated functionality or missing `override` annotations).
5.**Results**
- The query must have at least one true positive result on some revision of a real project.
Experimental queries and libraries may not be actively maintained as the [supported](docs/supported-queries.md) libraries evolve. They may also be changed in backwards-incompatible ways or may be removed entirely in the future without deprecation warnings.
After the experimental query is merged, we welcome pull requests to improve it. Before a query can be moved out of the `experimental` subdirectory, it must satisfy [the requirements for being a supported query](docs/supported-queries.md).
## Using your personal data
If you contribute to this project, we will record your name and email
address (as provided by you with your contributions) as part of the code
repositories, which might be made public. We might also use this information
to contact you in relation to your contributions, as well as in the
normal course of software development. We also store records of your
CLA agreements. Under GDPR legislation, we do this
on the basis of our legitimate interest in creating the CodeQL product.
If you contribute to this project, we will record your name and email address (as provided by you with your contributions) as part of the code repositories, which are public. We might also use this information to contact you in relation to your contributions, as well as in the normal course of software development. We also store records of CLA agreements signed in the past, but no longer require contributors to sign a CLA. Under GDPR legislation, we do this on the basis of our legitimate interest in creating the CodeQL product.
Please do get in touch (privacy@semmle.com) if you have any questions about
this or our data protection policies.
## Contributor License Agreement
This Contributor License Agreement (“Agreement”) is entered into between Semmle Limited (“Semmle,” “we” or “us” etc.), and You (as defined and further identified below).
Accordingly, You hereby agree to the following terms for Your present and future Contributions submitted to Semmle:
1.**Definitions**.
* "You" (or "Your") shall mean the Contribution copyright owner (whether an individual or organization) or legal entity authorized by the copyright owner that is making this Agreement with Semmle. For legal entities, the entity making a Contribution and all other entities that control, are controlled by, or are under common control with that entity are considered to be a single Contributor. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
* "Contribution(s)" shall mean the code, documentation or other original works of authorship, including any modifications or additions to an existing work, submitted by You to Semmle for inclusion in, or documentation of, any of the products or projects owned or managed by Semmle (the "Work(s)"). For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to Semmle or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, Semmle for the purpose of discussing and/or improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by You as "Not a Contribution."
2.**Grant of Copyright License**. You hereby grant to Semmle and to recipients of software distributed by Semmle a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works.
3.**Grant of Patent License**. You hereby grant to Semmle and to recipients of software distributed by Semmle a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution(s) alone or by combination of Your Contribution(s) with the Work to which such Contribution(s) was submitted. If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that Your Contribution, or the Work to which You have contributed, constitutes direct or contributory patent infringement, then any patent licenses granted to that entity under this Agreement for that Contribution or Work shall terminate as of the date such litigation is filed.
4.**Ownership**. Except as set out above, You keep all right, title, and interest in Your Contribution. The rights that You grant to us under this Agreement are effective on the date You first submitted a Contribution to us, even if Your submission took place before the date You entered this Agreement.
5.**Representations**. You represent and warrant that: (i) the Contributions are an original work and that You can legally grant the rights set out in this Agreement; (ii) the Contributions and Semmle’s exercise of any license rights granted hereunder, does not and will not, infringe the rights of any third party; (iii) You are not aware of any pending or threatened claims, suits, actions, or charges pertaining to the Contributions, including without limitation any claims or allegations that any or all of the Contributions infringes, violates, or misappropriate the intellectual property rights of any third party (You further agree that You will notify Semmle immediately if You become aware of any such actual or potential claims, suits, actions, allegations or charges).
6.**Employer**. If Your employer(s) has rights to intellectual property that You create that includes Your Contributions, You represent and warrant that Your employer has waived such rights for Your Contributions to Semmle, or that You have received permission to make Contributions on behalf of that employer and that You are authorized to execute this Agreement on behalf of Your employer.
7.**Inclusion of Code**. We determine the code that is in our Works. You understand that the decision to include the Contribution in any project or source repository is entirely that of Semmle, and this agreement does not guarantee that the Contributions will be included in any product.
8.**Disclaimer**. You are not expected to provide support for Your Contributions, except to the extent You desire to provide support. You may provide support for free, for a fee, or not at all. Except as set forth herein, and unless required by applicable law or agreed to in writing, You provide Your Contributions on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND.
9.**General**. The failure of either party to enforce its rights under this Agreement for any period shall not be construed as a waiver of such rights. No changes or modifications or waivers to this Agreement will be effective unless in writing and signed by both parties. In the event that any provision of this Agreement shall be determined to be illegal or unenforceable, that provision will be limited or eliminated to the minimum extent necessary so that this Agreement shall otherwise remain in full force and effect and enforceable. This Agreement shall be governed by and construed in accordance with the laws of the State of California in the United States without regard to the conflicts of laws provisions thereof. In any action or proceeding to enforce rights under this Agreement, the prevailing party will be entitled to recover costs and attorneys’ fees.
Please do get in touch (privacy@github.com) if you have any questions about this or our data protection policies.
This open source repository contains the standard CodeQL libraries and queries that power [LGTM](https://lgtm.com), and the other products that [Semmle](https://semmle.com) makes available to its customers worldwide.
This open source repository contains the standard CodeQL libraries and queries that power [LGTM](https://lgtm.com) and the other CodeQL products that [GitHub](https://github.com) makes available to its customers worldwide. For the queries, libraries, and extractor that power Go analysis, visit the [CodeQL for Go repository](https://github.com/github/codeql-go).
## How do I learn CodeQL and run queries?
There is [extensive documentation](https://help.semmle.com/QL/learn-ql/) on getting started with writing CodeQL.
You can use the [interactive query console](https://lgtm.com/help/lgtm/using-query-console) on LGTM.com or the [QL for Eclipse](https://lgtm.com/help/lgtm/running-queries-ide) plugin to try out your queries on any open source project that's currently being analyzed.
You can use the [interactive query console](https://lgtm.com/help/lgtm/using-query-console) on LGTM.com or the [CodeQL for Visual Studio Code](https://help.semmle.com/codeql/codeql-for-vscode.html) extension to try out your queries on any open source project that's currently being analyzed.
## Contributing
We welcome contributions to our standard library and standard checks. Do you have an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Before you do, though, please take the time to read our [contributing guidelines](CONTRIBUTING.md). You can also consult our [style guides](https://github.com/Semmle/ql/tree/master/docs) to learn how to format your code for consistency and clarity, how to write query metadata, and how to write query help documentation for your query.
We welcome contributions to our standard library and standard checks. Do you have an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Before you do, though, please take the time to read our [contributing guidelines](CONTRIBUTING.md). You can also consult our [style guides](https://github.com/github/codeql/tree/master/docs) to learn how to format your code for consistency and clarity, how to write query metadata, and how to write query help documentation for your query.
## License
The code in this repository is licensed under [Apache License 2.0](LICENSE) by [Semmle](https://semmle.com).
The code in this repository is licensed under the [MIT License](LICENSE) by [GitHub](https://github.com).
## Visual Studio Code integration
If you use Visual Studio Code to work in this repository, there are a few integration features to make development easier.
### CodeQL for Visual Studio Code
You can install the [CodeQL for Visual Studio Code](https://marketplace.visualstudio.com/items?itemName=GitHub.vscode-codeql) extension to get syntax highlighting, IntelliSense, and code navigation for the QL language, as well as unit test support for testing CodeQL libraries and queries.
### Tasks
The `.vscode/tasks.json` file defines custom tasks specific to working in this repository. To invoke one of these tasks, select the `Terminal | Run Task...` menu option, and then select the desired task from the dropdown. You can also invoke the `Tasks: Run Task` command from the command palette.
| Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) | reliability, japanese-era | This query is a combination of two old queries that were identical in purpose but separate as an implementation detail. This new query replaces Hard-coded Japanese era start date in call (`cpp/japanese-era/constructor-or-method-with-exact-era-date`) and Hard-coded Japanese era start date in struct (`cpp/japanese-era/struct-with-exact-era-date`). |
| Signed overflow check (`cpp/signed-overflow-check`) | correctness, reliability | Finds overflow checks that rely on signed integer addition to overflow, which has undefined behavior. Example: `a + b < a`. |
| Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) | reliability, japanese-era | This query is a combination of two old queries that were identical in purpose but separate as an implementation detail. This new query replaces Hard-coded Japanese era start date in call (`cpp/japanese-era/constructor-or-method-with-exact-era-date`) and Hard-coded Japanese era start date in struct (`cpp/japanese-era/struct-with-exact-era-date`). Results are not shown on LGTM by default. |
| Pointer overflow check (`cpp/pointer-overflow-check`) | correctness, security | Finds overflow checks that rely on pointer addition to overflow, which has undefined behavior. Example: `ptr + a < ptr`. Results are shown on LGTM by default. |
| Signed overflow check (`cpp/signed-overflow-check`) | correctness, security | Finds overflow checks that rely on signed integer addition to overflow, which has undefined behavior. Example: `a + b < a`. Results are shown on LGTM by default. |
| Comparison of narrow type with wide type in loop condition (`cpp/comparison-with-wider-type`) | Higher precision | The precision of this query has been increased to "high" as the alerts from this query have proved to be valuable on real-world projects. With this precision, results are now displayed by default in LGTM. |
| Hard-coded Japanese era start date in call (`cpp/japanese-era/constructor-or-method-with-exact-era-date`) | Deprecated | This query has been deprecated. Use the new combined query Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) instead. |
| Hard-coded Japanese era start date in struct (`cpp/japanese-era/struct-with-exact-era-date`) | Deprecated | This query has been deprecated. Use the new combined query Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) instead. |
| Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) | More correct results | This query now checks for the beginning date of the Reiwa era (1st May 2019). |
| Non-constant format string (`cpp/non-constant-format`) | Fewer false positive results | Fixed false positive results triggrered by mismatching declarations of a formatting function. |
| Sign check of bitwise operation (`cpp/bitwise-sign-check`) | Fewer false positive results | Results involving `>=` or `<=` are no longer reported. |
| Too few arguments to formatting function (`cpp/wrong-number-format-arguments`) | Fewer false positive results | Fixed false positives resulting from mistmatching declarations of a formatting function. |
| Too many arguments to formatting function (`cpp/too-many-format-arguments`) | Fewer false positive results | Fixed false positives resulting from mistmatching declarations of a formatting function. |
| Unclear comparison precedence (`cpp/comparison-precedence`) | Fewer false positive results | False positives involving template classes and functions have been fixed. |
| Comparison of narrow type with wide type in loop condition (`cpp/comparison-with-wider-type`) | Higher precision | The precision of this query has been increased to "high" as the alerts from this query have proved to be valuable on real-world projects. With this precision, results are now displayed by default in LGTM. |
| Non-constant format string (`cpp/non-constant-format`) | Fewer false positive results | Fixed false positives resulting from mistmatching declarations of a formatting function. |
| Wrong type of arguments to formatting function (`cpp/wrong-type-format-argument`) | More correct results and fewer false positive results | This query now understands explicitly specified argument numbers in format strings, such as the `1$` in `%1$s`. |
| Too few arguments to formatting function (`cpp/wrong-number-format-arguments`) | Fewer false positive results | Fixed false positive results triggered by mismatching declarations of a formatting function. |
| Too many arguments to formatting function (`cpp/too-many-format-arguments`) | Fewer false positive results | Fixed false positive results triggered by mismatching declarations of a formatting function. |
| Unclear comparison precedence (`cpp/comparison-precedence`) | Fewer false positive results | False positive results involving template classes and functions have been fixed. |
| Wrong type of arguments to formatting function (`cpp/wrong-type-format-argument`) | More correct results and fewer false positive results | This query now understands explicitly-specified argument numbers in format strings, such as the `1$` in `%1$s`. |
## Changes to libraries
* The data-flow library has been extended with a new feature to aid debugging.
Instead of specifying `isSink(Node n) { any() }` on a configuration to
explore the possible flow from a source, it is recommended to use the new
`Configuration::hasPartialFlow` predicate, as this gives a more complete
picture of the partial flow paths from a given source. The feature is
disabled by default and can be enabled for individual configurations by
overriding `int explorationLimit()`.
* The data-flow library now supports flow out of C++ reference parameters.
* The data-flow library now allows flow through the address-of operator (`&`).
* The`DataFlow::DefinitionByReferenceNode` class now considers `f(x)` to be a
definition of `x` when `x` is a variable of pointer type. It no longer
considers deep paths such as `f(&x.myField)` to be definitions of `x`. These
changes are in line with the user expectations we've observed.
* The data-flow library now makes it easier to specify barriers/sanitizers
arising from guards by overriding the predicate
`isBarrierGuard`/`isSanitizerGuard` on data-flow and taint-tracking
configurations respectively.
* There is now a `DataFlow::localExprFlow` predicate and a
`TaintTracking::localExprTaint` predicate to make it easy to use the most
common case of local data flow and taint: from one `Expr` to another.
* The data-flow library in `semmle.code.cpp.dataflow.DataFlow` and
`semmle.code.cpp.dataflow.TaintTracking` have had extensive changes:
* Data flow through fields is now more complete and reliable.
* The data-flow library has been extended with a new feature to aid debugging.
Previously, to explore the possible flow from all sources you could specify `isSink(Node n) { any() }` on a configuration.
Now you can use the new `Configuration::hasPartialFlow` predicate,
which gives a more complete picture of the partial flow paths from a given source, including flow that doesn't reach any sink.
The feature is disabled by default and can be enabled for individual configurations by overriding `int explorationLimit()`.
* There is now flow out of C++ reference parameters.
* There is now flow through the address-of operator (`&`).
* The `DataFlow::DefinitionByReferenceNode` class now considers `f(x)` to be a
definition of `x` when `x` is a variable of pointer type. It no longer
considers deep paths such as `f(&x.myField)` to be definitions of `x`. These
changes are in line with the user expectations we've observed.
* It's now easier to specify barriers/sanitizers
arising from guards by overriding the predicate
`isBarrierGuard`/`isSanitizerGuard` on data-flow and taint-tracking
configurations respectively.
* There is now a `DataFlow::localExprFlow` predicate and a
`TaintTracking::localExprTaint` predicate to make it easy to use the most
common case of local data flow and taint: from one `Expr` to another.
* The member predicates of the `FunctionInput` and `FunctionOutput` classes have been renamed for
clarity (e.g.`isOutReturnPointer()` to `isReturnValueDeref()`). The existing member predicates
clarity (for example,`isOutReturnPointer()` to `isReturnValueDeref()`). The existing member predicates
have been deprecated, and will be removed in a future release. Code that uses the old member
predicates should be updated to use the corresponding new member predicate.
* The predicates `Declaration.hasStdName()` and`Declaration.hasGlobalOrStdName`
have been added, simplifying handling of C++ standard library functions.
* The predicate `Declaration.hasGlobalOrStdName` has been added, making it
easier to recognize C library functions called from C++.
* The control-flow graph is now computed in QL, not in the extractor. This can
lead to regressions (or improvements) in how queries are optimized because
lead to changes in how queries are optimized because
optimization in QL relies on static size estimates, and the control-flow edge
relations will now have different size estimates than before.
* Support has been added for non-type template arguments. This means that the
return type of `Declaration::getTemplateArgument()` and
`Declaration::getATemplateArgument` have changed to `Locatable`. See the
documentation for `Declaration::getTemplateArgument()` and
`Declaration::getTemplateArgumentKind()` for details.
`Declaration::getATemplateArgument` have changed to `Locatable`. For details, see the
CodeQL library documentation for `Declaration::getTemplateArgument()` and
| Deserialization of untrusted data (`cs/unsafe-deserialization-untrusted-input`) | security, external/cwe/cwe-502 | Finds flow of untrusted input to calls to unsafe deserializers. |
| Unsafe year argument for 'DateTime' constructor (`cs/unsafe-year-construction`) | reliability, date-time | Finds incorrect manipulation of `DateTime` values, which could lead to invalid dates. |
| Mishandling the Japanese era start date (`cs/mishandling-japanese-era`) | reliability, date-time | Finds hard-coded Japanese era start dates that could be invalid. |
| Deserialized delegate (`cs/deserialized-delegate`) | security, external/cwe/cwe-502 | Finds unsafe deserialization of delegate types. Results are shown on LGTM by default. |
| Deserialization of untrusted data (`cs/unsafe-deserialization-untrusted-input`) | security, external/cwe/cwe-502 | Finds flow of untrusted input to calls to unsafe deserializers. Results are shown on LGTM by default. |
| Mishandling the Japanese era start date (`cs/mishandling-japanese-era`) | reliability, date-time | Finds hard-coded Japanese era start dates that could be invalid. Results are not shown on LGTM by default. |
| Unsafe year argument for 'DateTime' constructor (`cs/unsafe-year-construction`) | reliability, date-time | Finds incorrect manipulation of `DateTime` values, which could lead to invalid dates. Results are not shown on LGTM by default. |
| Unsafe deserializer (`cs/unsafe-deserialization`) | security, external/cwe/cwe-502 | Finds calls to unsafe deserializers. By default, the query is not run on LGTM. |
| Dereferenced variable may be null (`cs/dereferenced-value-may-be-null`) | Fewer false positive results | More `null` checks are now taken into account, including `null` checks for `dynamic` expressions and `null` checks such as `object alwaysNull = null; if (x != alwaysNull) ...`. |
| Missing Dispose call on local IDisposable (`cs/local-not-disposed`) | Fewer false positive results | The query has been rewritten in order to identify more dispose patterns. For example, a local `IDisposable` that is disposed of by passing through a fluent API is no longer reported. |
## Removal of old queries
| Missing Dispose call on local IDisposable (`cs/local-not-disposed`) | Fewer false positive results | The query has been rewritten in order to identify more dispose patterns. For example, a local `IDisposable` that is disposed of by passing through a fluent API is no longer reported as missing a dispose call. |
## Changes to code extraction
@@ -31,22 +27,19 @@ The following changes in version 1.23 affect C# analysis in all applications.
* The new class `NamespaceAccess` models accesses to namespaces, for example in `nameof` expressions.
* The data-flow library now makes it easier to specify barriers/sanitizers
arising from guards by overriding the predicate
arising from guards. You can override the predicate
`isBarrierGuard`/`isSanitizerGuard` on data-flow and taint-tracking
configurations respectively.
* The data-flow library has been extended with a new feature to aid debugging.
Instead of specifying`isSink(Node n) { any() }` on a configuration to
explore the possible flow from a source, it is recommended to use the new
`Configuration::hasPartialFlow` predicate, as this gives a more complete
picture of the partial flow paths from a given source. The feature is
disabled by default and can be enabled for individual configurations by
Previously, to explore the possible flow from all sources you could specify `isSink(Node n) { any() }` on a configuration.
Now you can use the new `Configuration::hasPartialFlow` predicate,
which gives a more complete picture of the partial flow paths from a given source, including flow that doesn't reach any sink.
The feature is disabled by default and can be enabled for individual configurations by
overriding `int explorationLimit()`.
*`foreach` statements where the body is guaranteed to be executed at least once, such as `foreach (var x in new string[]{ "a", "b", "c" }) { ... }`, are now recognized by all analyses based on the controlflow graph (such as SSA, data flow and taint tracking).
* Fixed the controlflow graph for `switch` statements where the `default` case was not the last case. This had caused the remaining cases to be unreachable. `SwitchStmt.getCase(int i)` now puts the `default` case last.
*`foreach` statements where the body is guaranteed to be executed at least once, such as `foreach (var x in new string[]{ "a", "b", "c" }) { ... }`, are now recognized by all analyses based on the control-flow graph (such as SSA, data flow and taint tracking).
* Fixed the control-flow graph for `switch` statements where the `default` case was not the last case. This had caused the remaining cases to be unreachable. `SwitchStmt.getCase(int i)` now puts the `default` case last.
* There is now a `DataFlow::localExprFlow` predicate and a
`TaintTracking::localExprTaint` predicate to make it easy to use the most
common case of local data flow and taint: from one `Expr` to another.
* Data is now tracked through null-coalescing expressions (`??`).
* A new library `semmle.code.csharp.Unification` has been added. This library exposes two predicates `unifiable` and `subsumes` for calculating type unification and type subsumption, respectively.
| Continue statement that does not continue (`java/continue-in-false-loop`) | correctness | Finds `continue` statements in `do { ... } while (false)` loops. |
| Continue statement that does not continue (`java/continue-in-false-loop`) | correctness | Finds `continue` statements in `do { ... } while (false)` loops. Results are shown on LGTM by default. |
| Disabled Netty HTTP header validation (`java/netty-http-response-splitting`) | security, external/cwe/cwe-113 | Finds response-splitting vulnerabilities due to Netty HTTP header validation being disabled. Results are shown on LGTM by default. |
| Dereferenced variable may be null (`java/dereferenced-value-may-be-null`) | Fewer false positives | Certain indirect null guards involving two auxiliary variables known to be equal can now be detected. |
| Non-synchronized override of synchronized method (`java/non-sync-override`) | Fewer false positives | Results are now only reported if the immediately overridden method is synchronized. |
| Query built from user-controlled sources (`java/sql-injection`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as SQL expressions sinks. |
| Query built from local-user-controlled sources (`java/sql-injection-local`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as SQL expressions sinks. |
| Query built without neutralizing special characters (`java/concatenated-sql-query`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as SQL expressions sinks. |
| Useless comparison test (`java/constant-comparison`) | Fewer false positives | Additional overflow check patterns are now recognized and no longer reported. |
| Dereferenced variable may be null (`java/dereferenced-value-may-be-null`) | Fewer false positive results | Additional indirect null guards are detected, where two auxiliary variables are known to be equal. |
| Non-synchronized override of synchronized method (`java/non-sync-override`) | Fewer false positive results | Results are now only reported if the immediately overridden method is synchronized. |
| Query built from local-user-controlled sources (`java/sql-injection-local`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as sinks for SQL expressions. |
| Query built from user-controlled sources (`java/sql-injection`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as sinks for SQL expressions. |
| Query built without neutralizing special characters (`java/concatenated-sql-query`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as sinks for SQL expressions. |
| Useless comparison test (`java/constant-comparison`) | Fewer false positive results | Additional overflow check patterns are now recognized and no longer reported. Also, a few bug fixes in the range analysis for floating-point variables gives a further reduction in false positive results. |
## Changes to libraries
*The data-flow library has been extended with a new feature to aid debugging.
Instead of specifying`isSink(Node n) { any() }` on a configuration to
explore the possible flow from a source, it is recommended to use the new
`Configuration::hasPartialFlow` predicate, as this gives a more complete
picture of the partial flow paths from a given source. The feature is
disabled by default and can be enabled for individual configurations by
overriding `int explorationLimit()`.
The data-flow library has been extended with a new feature to aid debugging.
Previously, to explore the possible flow from all sources you could specify `isSink(Node n) { any() }` on a configuration.
Now you can use the new `Configuration::hasPartialFlow` predicate,
which gives a more complete picture of the partial flow paths from a given source, including flow that doesn't reach any sink.
The feature is disabled by default and can be enabled for individual configurations by overriding `int explorationLimit()`.
| Unused index variable (`js/unused-index-variable`) | correctness | Highlights loops that iterate over an array, but do not use the index variable to access array elements, indicating a possible typo or logic error. Results are shown on LGTM by default. |
| Loop bound injection (`js/loop-bound-injection`) | security, external/cwe/cwe-834 | Highlights loops where a user-controlled object with an arbitrary .length value can trick the server to loop indefinitely. Results are shown on LGTM by default. |
| Suspicious method name (`js/suspicious-method-name-declaration`) | correctness, typescript, methods | Highlights suspiciously named methods where the developer likely meant to write a constructor or function. Results are shown on LGTM by default. |
| Ignoring result from pure array method (`js/ignore-array-result`) | maintainability, correctness | Highlights calls to array methods without side effects where the return value is ignored. Results are shown on LGTM by default. |
| Incomplete URL scheme check (`js/incomplete-url-scheme-check`) | security, correctness, external/cwe/cwe-020 | Highlights checks for `javascript:` URLs that do not take `data:` or `vbscript:` URLs into account. Results are shown on LGTM by default. |
| Loop bound injection (`js/loop-bound-injection`) | security, external/cwe/cwe-834 | Highlights loops where a user-controlled object with an arbitrary `.length` value can trick the server into looping indefinitely. Results are shown on LGTM by default. |
| Shell command built from environment values (`js/shell-command-injection-from-environment`) | correctness, security, external/cwe/cwe-078, external/cwe/cwe-088 | Highlights shell commands that may change behavior inadvertently depending on the execution environment, indicating a possible violation of [CWE-78](https://cwe.mitre.org/data/definitions/78.html). Results are shown on LGTM by default.|
| Suspicious method name (`js/suspicious-method-name-declaration`) | correctness, typescript, methods | Highlights suspiciously named methods where the developer likely meant to write a constructor or function. Results are shown on LGTM by default. |
| Unreachable method overloads (`js/unreachable-method-overloads`) | correctness, typescript | Highlights method overloads that are impossible to use from client code. Results are shown on LGTM by default. |
| Unused index variable (`js/unused-index-variable`) | correctness | Highlights loops that iterate over an array, but do not use the index variable to access array elements, indicating a possible typo or logic error. Results are shown on LGTM by default. |
| Use of returnless function (`js/use-of-returnless-function`) | maintainability, correctness | Highlights calls where the return value is used, but the callee never returns a value. Results are shown on LGTM by default. |
| Useless regular expression character escape (`js/useless-regexp-character-escape`) | correctness, security, external/cwe/cwe-20 | Highlights regular expression strings with useless character escapes, indicating a possible violation of [CWE-20](https://cwe.mitre.org/data/definitions/20.html). Results are shown on LGTM by default. |
| Unreachable method overloads (`js/unreachable-method-overloads`) | correctness, typescript | Highlights method overloads that are impossible to use from client code. Results are shown on LGTM by default. |
| Ignoring result from pure array method (`js/ignore-array-result`) | maintainability, correctness | Highlights calls to array methods without side effects where the return value is ignored. Results are shown on LGTM by default. |
| Incomplete string escaping or encoding (`js/incomplete-sanitization`) | Fewer false-positive results | This rule now recognizes additional ways delimiters can be stripped away. |
| Client-side cross-site scripting (`js/xss`) | More results, fewer false-positive results | More potential vulnerabilities involving functions that manipulate DOM attributes are now recognized, and more sanitizers are detected. |
| Client-side cross-site scripting (`js/xss`) | More results, fewer falsepositive results | More potential vulnerabilities involving functions that manipulate DOM attributes are now recognized, and more sanitizers are detected. |
| Code injection (`js/code-injection`) | More results | More potential vulnerabilities involving functions that manipulate DOM event handler attributes are now recognized. |
| Hard-coded credentials (`js/hardcoded-credentials`) | Fewer false-positive results | This rule now flags fewer password examples. |
| Illegal invocation (`js/illegal-invocation`) | Fewer false-positive results | This rule now correctly handles methods named `call` and `apply`. |
| Incorrect suffix check (`js/incorrect-suffix-check`) | Fewer false-positive results | The query recognizes valid checks in more cases. |
| Network data written to file (`js/http-to-file-access`) | Fewer false-positive results | This query has been renamed to better match its intended purpose, and now only considers network data untrusted. |
| Password in configuration file (`js/password-in-configuration-file`) | Fewer false-positive results | This rule now flags fewer password examples. |
| Hard-coded credentials (`js/hardcoded-credentials`) | Fewer falsepositive results | This rule now flags fewer password examples. |
| Illegal invocation (`js/illegal-invocation`) | Fewer falsepositive results | This rule now correctly handles methods named `call` and `apply`. |
| Incomplete string escaping or encoding (`js/incomplete-sanitization`) | Fewer falsepositive results | This rule now recognizes additional ways delimiters can be stripped away. |
| Incorrect suffix check (`js/incorrect-suffix-check`) | Fewer falsepositive results | The query recognizes valid checks in more cases. |
| Network data written to file (`js/http-to-file-access`) | Fewer falsepositive results | This query has been renamed to better match its intended purpose, and now only considers network data untrusted. |
| Password in configuration file (`js/password-in-configuration-file`) | Fewer false positive results | This rule now flags fewer password examples. |
| Prototype pollution (`js/prototype-pollution`) | More results | The query now highlights vulnerable uses of jQuery and Angular, and the results are shown on LGTM by default. |
| Reflected cross-site scripting (`js/reflected-xss`) | Fewer false-positive results | The query now recognizes more sanitizers. |
| Stored cross-site scripting (`js/stored-xss`) | Fewer false-positive results | The query now recognizes more sanitizers. |
| Reflected cross-site scripting (`js/reflected-xss`) | Fewer falsepositive results | The query now recognizes more sanitizers. |
| Stored cross-site scripting (`js/stored-xss`) | Fewer falsepositive results | The query now recognizes more sanitizers. |
| Uncontrolled command line (`js/command-line-injection`) | More results | This query now treats responses from servers as untrusted. |
| Uncontrolled data used in path expression (`js/path-injection`) | Fewer false-positive results | This query now recognizes calls to Express `sendFile` as safe in some cases. |
| Uncontrolled data used in path expression (`js/path-injection`) | Fewer falsepositive results | This query now recognizes calls to Express `sendFile` as safe in some cases. |
| Unknown directive (`js/unknown-directive`) | Fewer false positive results | This query no longer flags uses of ":", which is sometimes used like a directive. |
## Changes to libraries
*`Expr.getDocumentation()` now handles chain assignments.
* String literals are now parsed as regular expressions.
Consequently, a `RegExpTerm` may occur as part of a string literal or
as a regular expression literal. Queries that search for regular expressions may need to
use `RegExpTerm.isPartOfRegExpLiteral` or `RegExpTerm.isUsedAsRegExp` to restrict the search.
A regular expression AST can be obtained from a string literal using `StringLiteral.asRegExp`.
## Removal of deprecated queries
The following queries (deprecated since 1.17) are no longer available in the distribution:
| Unreachable code | Fewer false positives | Analysis now accounts for uses of `contextlib.suppress` to suppress exceptions. |
| `__iter__` method returns a non-iterator | Better alert message | Alert now highlights which class is expected to be an iterator. |
| Explicit export is undefined (`py/undefined-export`) | Fewer false positive results | Instances where an exported value may be defined in a module that lacks points-to information are no longer flagged. |
| Module-level cyclic import (`py/unsafe-cyclic-import`) | Fewer false positive results | Instances where one of the links in an import cycle is never actually executed are no longer flagged. |
| Non-iterable used in for loop (`py/non-iterable-in-for-loop`) | Fewer false positive results | `__aiter__` is now recognized as an iterator method. |
| Unreachable code (`py/unreachable-statement`) | Fewer false positive results | Analysis now accounts for uses of `contextlib.suppress` to suppress exceptions. |
| Unreachable code (`py/unreachable-statement`) | Fewer false positive results | Unreachable `else` branches that do nothing but `assert` their non-reachability are no longer flagged. |
| Unused import (`py/unused-import`) | Fewer false positive results | Instances where a module is used in a forward-referenced type annotation, or only during type checking are no longer flagged. |
| `__iter__` method returns a non-iterator (`py/iter-returns-non-iterator`) | Better alert message | Alert now highlights which class is expected to be an iterator. |
| `__init__` method returns a value (`py/explicit-return-in-init`) | Fewer false positive results | Instances where the `__init__` method returns the value of a call to a procedure are no longer flagged. |
## Changes to QL libraries
* Django library now recognizes positional arguments from a `django.conf.urls.url` regex (Django version 1.x)
* Instances of the `Value` class now support the `isAbsent` method, indicating
whether that `Value` lacks points-to information, but inference
suggests that it exists. For instance, if a file contains `import
django`, but `django` was not extracted properly, there will be a
`ModuleValue` corresponding to this "unknown" module, and the `isAbsent`
method will hold for this `ModuleValue`.
* The `Expr` class now has a nullary method `pointsTo` that returns the possible
instances of `Value` that this expression may have.
* Asynchronous generator methods are now parsed correctly and no longer cause a spurious syntax error.
* Files in `node_modules` and `bower_components` folders are no longer extracted by default. If you still want to extract files from these folders, you can add the following filters to your `lgtm.yml` file (or add them to existing filters):
```yaml
extraction:
javascript:
index:
filters:
- include:"**/node_modules"
- include:"**/bower_components"
```
```yaml
extraction:
javascript:
index:
filters:
- include: "**/node_modules"
- include: "**/bower_components"
```
* Additional [Flow](https://flow.org/) syntax is now supported.
* Recognition of CommonJS modules has improved. As a result, some files that were previously extracted as
global scripts are now extracted as modules.
* Top-level `await` is now supported.
*A bug was fixed in how the TypeScript extractor handles default-exported anonymous classes.
* A bug was fixed in how the TypeScript extractor handles computed instance field names.
*Bugs were fixed in how the TypeScript extractor handles default-exported anonymous classes and computed-instance field names.
| Implicit function declarations (`cpp/Likely Bugs/Underspecified Functions/ImplicitFunctionDeclaration.ql`) | correctness, maintainability | This query finds calls to undeclared functions that are compiled by a C compiler. Results are shown on LGTM by default. |
## Changes to existing queries
A new taint-tracking library is used by all the security queries that track tainted values
| Boost\_asio TLS Settings Misconfiguration (`cpp/boost/tls-settings-misconfiguration`) | Query id change | The identifier was updated to use dashes in place of underscores (previous identifier `cpp/boost/tls_settings_misconfiguration`). |
| Buffer not sufficient for string (`cpp/overflow-calculated`) | More true positive results | This query now identifies a wider variety of buffer allocations using the `semmle.code.cpp.models.interfaces.Allocation` library. |
| Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) | | This query is no longer run on LGTM. |
| Memory is never freed (`cpp/memory-never-freed`) | More true positive results | This query now identifies a wider variety of buffer allocations using the `semmle.code.cpp.models.interfaces.Allocation` library. |
| Memory may not be freed (`cpp/memory-may-not-be-freed`) | More true positive results | This query now identifies a wider variety of buffer allocations using the `semmle.code.cpp.models.interfaces.Allocation` library. |
| Missing return statement (`cpp/missing-return`) | Fewer false positive results and more accurate locations | Functions containing `asm` statements are no longer highlighted by this query. The locations reported by this query are now more accurate in some cases. |
| No space for zero terminator (`cpp/no-space-for-terminator`) | More results with greater precision | The query gives more precise results for a wider variety of buffer allocations. String arguments to formatting functions are now (usually) expected to be null terminated strings. Use of the `semmle.code.cpp.models.interfaces.Allocation` library identifies problems with a wider variety of buffer allocations. This query is also more conservative when identifying which pointers point to null-terminated strings. |
| Overflow in uncontrolled allocation size (`cpp/uncontrolled-allocation-size`) | Fewer false positive results | The query now produces fewer, more accurate results. Cases where the tainted allocation size is range checked are more reliably excluded. |
| Overloaded assignment does not return 'this' (`cpp/assignment-does-not-return-this`) | Fewer false positive results | This query no longer reports incorrect results in template classes. |
| Pointer overflow check (`cpp/pointer-overflow-check`),<br> Possibly wrong buffer size in string copy (`cpp/bad-strncpy-size`),<br> Signed overflow check (`cpp/signed-overflow-check`) | More correct results | A new library is used for determining which expressions have identical value, giving more precise results. There is a performance cost to this, and the LGTM suite will overall run slower than before. |
| Unsafe array for days of the year (`cpp/leap-year/unsafe-array-for-days-of-the-year`) | | This query is no longer run on LGTM. |
| Unsigned comparison to zero (`cpp/unsigned-comparison-zero`) | More correct results | This query now also looks for comparisons of the form `0 <= x`. |
## Changes to libraries
* The built-in C++20 "spaceship operator" (`<=>`) is now supported via the QL
class `SpaceshipExpr`. Overloaded forms are modeled as calls to functions
named `operator<=>`.
* The data-flow library (`semmle.code.cpp.dataflow.DataFlow` and
`semmle.code.cpp.dataflow.TaintTracking`) has been improved, which affects
and improves some security queries. The improvements are:
- Track flow through functions that combine taint tracking with flow through fields.
- Track flow through clone-like functions, that is, functions that read contents of a field from a
parameter and stores the value in the field of a returned object.
* The security pack taint tracking library
(`semmle.code.cpp.security.TaintTracking`) uses a new intermediate
representation. This provides a more precise analysis of flow through
parameters and pointers. For new queries, however, we continue to recommend
using `semmle.code.cpp.dataflow.TaintTracking`.
* The global value numbering library
(`semmle.code.cpp.valuenumbering.GlobalValueNumbering`) uses a new
intermediate representation to provide a more precise analysis of
heap-allocated memory and pointers to stack variables.
* New libraries have been created to provide a more consistent and useful interface
for modeling allocation and deallocation. These replace the old
`semmle.code.cpp.commons.Alloc` library.
* The new `semmle.code.cpp.models.interfaces.Allocation` library models
allocations, such as `new` expressions and calls to `malloc`.
* The new `semmle.code.cpp.models.interfaces.Deallocation` library
models deallocations, such as `delete` expressions and calls to `free`.
* The predicate `freeCall` in `semmle.code.cpp.commons.Alloc` has been
deprecated. The `Allocation` and `Deallocation` models in
`semmle.code.cpp.models.interfaces` should be used instead.
* The new class `StackVariable` should be used in place of `LocalScopeVariable`
in most cases. The difference is that `StackVariable` does not include
variables declared with `static` or `thread_local`.
* As a rule of thumb, custom queries about the _values_ of variables should
be changed from `LocalScopeVariable` to `StackVariable`, while queries
about the _name or scope_ of variables should remain unchanged.
* The `LocalScopeVariableReachability` library is deprecated in favor of
`StackVariableReachability`. The functionality is the same.
* Taint tracking and data flow now features better modeling of commonly-used
| Assembly path injection (`cs/assembly-path-injection`) | security, external/cwe/cwe-114 | Finds user-controlled data used to load an assembly. Results are shown on LGTM by default. |
| Insecure configuration for ASP.NET requestValidationMode (`cs/insecure-request-validation-mode`) | security, external/cwe/cwe-016 | Finds where this attribute has been set to a value less than 4.5, which turns off some validation features and makes the application less secure. By default, the query is not run on LGTM. |
| Insecure SQL connection (`cs/insecure-sql-connection`) | security, external/cwe/cwe-327 | Finds unencrypted SQL connection strings. Results are not shown on LGTM by default. |
| Page request validation is disabled (`cs/web/request-validation-disabled`) | security, frameworks/asp.net, external/cwe/cwe-016 | Finds where ASP.NET page request validation has been disabled, which could make the application less secure. By default, the query is not run on LGTM. |
| Serialization check bypass (`cs/serialization-check-bypass`) | security, external/cwe/cwe-20 | Finds where data is not validated in a deserialization method. Results are not shown on LGTM by default. |
| XML injection (`cs/xml-injection`) | security, external/cwe/cwe-091 | Finds user-controlled data that is used to write directly to an XML document. Results are shown on LGTM by default. |
| Dereferenced variable may be null (`cs/dereferenced-value-may-be-null`) | More results | Results are reported from parameters with a default value of `null`. |
| Information exposure through an exception (`cs/information-exposure-through-exception`) | More results | The query now recognizes writes to cookies, writes to ASP.NET (`Inner`)`Text` properties, and email contents as additional sinks. |
| Information exposure through transmitted data (`cs/sensitive-data-transmission`) | More results | The query now recognizes writes to cookies and writes to ASP.NET (`Inner`)`Text` properties as additional sinks. |
| Potentially dangerous use of non-short-circuit logic (`cs/non-short-circuit`) | Fewer false positive results | Results have been removed when the expression contains an `out` parameter. |
| Useless assignment to local variable (`cs/useless-assignment-to-local`) | Fewer false positive results | Results have been removed when the value assigned is an (implicitly or explicitly) cast default-like value. For example, `var s = (string)null` and `string s = default`. Results have also been removed when the variable is named `_` in a `foreach` statement. |
| XPath injection (`cs/xml/xpath-injection`) | More results | The query now recognizes calls to methods on `System.Xml.XPath.XPathNavigator` objects. |
## Changes to code extraction
* Tuple expressions, for example `(int,bool)` in `default((int,bool))` are now extracted correctly.
* Expression nullability flow state is extracted.
* Implicitly typed `stackalloc` expressions are now extracted correctly.
* The difference between `stackalloc` array creations and normal array creations is extracted.
## Changes to libraries
* The data-flow library has been improved, which affects and improves most security queries. The improvements are:
- Track flow through methods that combine taint tracking with flow through fields.
- Track flow through clone-like methods, that is, methods that read the contents of a field from a
parameter and store the value in the field of a returned object.
* The taint tracking library now tracks flow through (implicit or explicit) conversion operator calls.
* [Code contracts](https://docs.microsoft.com/en-us/dotnet/framework/debug-trace-profile/code-contracts) are now recognized, and are treated like any other assertion methods.
* Expression nullability flow state is given by the predicates `Expr.hasNotNullFlowState()` and `Expr.hasMaybeNullFlowState()`.
*`stackalloc` array creations are now represented by the QL class `Stackalloc`. Previously they were represented by the class `ArrayCreation`.
* A new class `RemoteFlowSink` has been added to model sinks where data might be exposed to external users. Examples include web page output, emails, and cookies.
| Disabled Spring CSRF protection (`java/spring-disabled-csrf-protection`) | security, external/cwe/cwe-352 | Finds disabled Cross-Site Request Forgery (CSRF) protection in Spring. Results are shown on LGTM by default. |
| Failure to use HTTPS or SFTP URL in Maven artifact upload/download (`java/maven/non-https-url`) | security, external/cwe/cwe-300, external/cwe/cwe-319, external/cwe/cwe-494, external/cwe/cwe-829 | Finds use of insecure protocols during Maven dependency resolution. Results are shown on LGTM by default. |
| LDAP query built from user-controlled sources (`java/ldap-injection`) | security, external/cwe/cwe-090 | Finds LDAP queries vulnerable to injection of unsanitized user-controlled input. Results are shown on LGTM by default. |
| Left shift by more than the type width (`java/lshift-larger-than-type-width`) | correctness | Finds left shifts of ints by 32 bits or more and left shifts of longs by 64 bits or more. Results are shown on LGTM by default. |
| Suspicious date format (`java/suspicious-date-format`) | correctness | Finds date format patterns that use placeholders that are likely to be incorrect. Results are shown on LGTM by default. |
| Dereferenced variable may be null (`java/dereferenced-value-may-be-null`) | Fewer false positive results | Final fields with a non-null initializer are no longer reported. |
| Expression always evaluates to the same value (`java/evaluation-to-constant`) | Fewer false positive results | Expressions of the form `0 * x` are usually intended and no longer reported. Also left shift of ints by 32 bits and longs by 64 bits are no longer reported as they are not constant, these results are instead reported by the new query `java/lshift-larger-than-type-width`. |
| Useless null check (`java/useless-null-check`) | More true positive results | Useless checks on final fields with a non-null initializer are now reported. |
## Changes to libraries
* The data-flow library has been improved, which affects and improves most security queries. The improvements are:
- Track flow through methods that combine taint tracking with flow through fields.
- Track flow through clone-like methods, that is, methods that read contents of a field from a
parameter and stores the value in the field of a returned object.
* Identification of test classes has been improved. Previously, one of the
match conditions would classify any class with a name containing the string
"Test" as a test class, but now this matching has been replaced with one that
looks for the occurrence of actual unit-test annotations. This affects the
general file classification mechanism and thus suppression of alerts, and
also any security queries using taint tracking, as test classes act as
default barriers stopping taint flow.
* Parentheses are now no longer modeled directly in the AST, that is, the
`ParExpr` class is empty. Instead, a parenthesized expression can be
identified with the `Expr.isParenthesized()` member predicate.
| Cross-site scripting through exception (`js/xss-through-exception`) | security, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights potential XSS vulnerabilities where an exception is written to the DOM. Results are not shown on LGTM by default. |
| Missing await (`js/missing-await`) | correctness | Highlights expressions that operate directly on a promise object in a nonsensical way, instead of awaiting its result. Results are shown on LGTM by default. |
| Polynomial regular expression used on uncontrolled data (`js/polynomial-redos`) | security, external/cwe/cwe-730, external/cwe/cwe-400 | Highlights expensive regular expressions that may be used on malicious input. Results are shown on LGTM by default. |
| Prototype pollution in utility function (`js/prototype-pollution-utility`) | security, external/cwe/cwe-400, external/cwe/cwe-471 | Highlights recursive assignment operations that are susceptible to prototype pollution. Results are shown on LGTM by default. |
| Regular expression always matches (`js/regex/always-matches`) | correctness, regular-expressions | Highlights regular expression checks that trivially succeed by matching an empty substring. Results are shown on LGTM by default. |
| Unsafe jQuery plugin (`js/unsafe-jquery-plugin`) | | Highlights potential XSS vulnerabilities in unsafely designed jQuery plugins. Results are shown on LGTM by default. |
| Unnecessary use of `cat` process (`js/unnecessary-use-of-cat`) | correctness, security, maintainability | Highlights command executions of `cat` where the fs API should be used instead. Results are shown on LGTM by default. |
| Clear-text logging of sensitive information (`js/clear-text-logging`) | More results | More results involving `process.env` and indirect calls to logging methods are recognized. |
| Duplicate parameter names (`js/duplicate-parameter-name`) | Fewer results | This query now ignores additional parameters that reasonably can have duplicated names. |
| Expression has no effect (`js/useless-expression`) | Fewer false positive results | The query now recognizes block-level flow type annotations and ignores the first statement of a try block. |
| Identical operands (`js/redundant-operation`) | Fewer results | This query now excludes cases where the operands change a value using ++/-- expressions. |
| Incomplete string escaping or encoding (`js/incomplete-sanitization`) | Fewer false positive results | This query now recognizes and excludes additional cases where a single replacement is likely to be intentional. |
| Incomplete URL scheme check (`js/incomplete-url-scheme-check`) | More results | This query now recognizes additional variations of URL scheme checks. |
| Missing CSRF middleware (`js/missing-token-validation`) | Fewer false positive results | The query reports fewer duplicates and only flags handlers that explicitly access cookie data. |
| Superfluous trailing arguments (`js/superfluous-trailing-arguments`) | Fewer results | This query now excludes cases where a function uses the `Function.arguments` value to process a variable number of parameters. |
| Syntax error (`js/syntax-error`) | Lower severity | This results of this query are now displayed with lower severity. |
| Unbound event handler receiver (`js/unbound-event-handler-receiver`) | Fewer false positive results | This query now recognizes additional ways event handler receivers can be bound. |
| Uncontrolled command line (`js/command-line-injection`) | More results | This query now recognizes additional ways of constructing arguments to `cmd.exe` and `/bin/sh`. |
| Uncontrolled data used in path expression (`js/path-injection`) | More results | This query now recognizes additional ways dangerous paths can be constructed and used. |
| Use of call stack introspection in strict mode (`js/strict-mode-call-stack-introspection`) | Fewer false positive results | The query no longer flags expression statements. |
| Use of password hash with insufficient computational effort (`js/insufficient-password-hash`) | Fewer false positive results | This query now recognizes and excludes additional cases that do not require secure hashing. |
| Useless regular-expression character escape (`js/useless-regexp-character-escape`) | Fewer false positive results | This query now distinguishes between escapes in strings and regular expression literals. |
## Changes to libraries
* The predicates `RegExpTerm.getSuccessor` and `RegExpTerm.getPredecessor` have been changed to reflect textual, not operational, matching order. This only makes a difference in lookbehind assertions, which are operationally matched backwards. Previously, `getSuccessor` would mimick this, so in an assertion `(?<=ab)` the term `b` would be considered the predecessor, not the successor, of `a`. Textually, however, `a` is still matched before `b`, and this is the order we now follow.
* An extensible model of the `EventEmitter` pattern has been implemented.
* Taint-tracking configurations now interact differently with the `data` flow label, which may affect queries
that combine taint-tracking and flow labels.
- Sources added by the 1-argument `isSource` predicate are associated with the `taint` label now, instead of the `data` label.
- Sanitizers now only block the `taint` label. As a result, sanitizers no longer block the flow of tainted values wrapped inside a property of an object.
To retain the old behavior, instead use a barrier, or block the `data` flow label using a labeled sanitizer.
| Arbitrary file write during tarfile extraction (`py/tarslip`) | Fewer false negative results | Negations are now handled correctly in conditional expressions that may sanitize tainted values. |
| First parameter of a method is not named 'self' (`py/not-named-self`) | Fewer false positive results | `__class_getitem__` is now recognized as a class method. |
| Import of deprecated module (`py/import-deprecated-module`) | Fewer false positive results | Deprecated modules that are used to provide backwards compatibility are no longer reported.|
| Module imports itself (`py/import-own-module`) | Fewer false positive results | Imports local to a given package are no longer classified as self-imports. |
| Uncontrolled command line (`py/command-line-injection`) | More results | We now model the `fabric` and `invoke` packages for command execution. |
### Web framework support
The CodeQL library has improved support for the web frameworks: Bottle, CherryPy, Falcon, Pyramid, TurboGears, Tornado, and Twisted. They now provide a proper `HttpRequestTaintSource`, instead of a `TaintSource`. This will enable results for the following queries:
- `py/path-injection`
- `py/command-line-injection`
- `py/reflective-xss`
- `py/sql-injection`
- `py/code-injection`
- `py/unsafe-deserialization`
- `py/url-redirection`
The library also has improved support for the web framework Twisted. It now provides a proper
`HttpResponseTaintSink`, instead of a `TaintSink`. This will enable results for the following
queries:
- `py/reflective-xss`
- `py/stack-trace-exposure`
## Changes to libraries
### Taint tracking
- The `urlsplit` and `urlparse` functions now propagate taint appropriately.
- HTTP requests using the `requests` library are now modeled.
| Uncontrolled format string (`cpp/tainted-format-string`) | | This query is now displayed by default on LGTM. |
| Uncontrolled format string (through global variable) (`cpp/tainted-format-string-through-global`) | | This query is now displayed by default on LGTM. |
## Changes to libraries
* The library `VCS.qll` and all queries that imported it have been removed.
* The data-flow library has been improved, which affects most security queries by potentially
adding more results. Flow through functions now takes nested field reads/writes into account.
For example, the library is able to track flow from `taint()` to `sink()` via the method
`getf2f1()` in
```c
struct C {
int f1;
};
struct C2
{
C f2;
int getf2f1() {
return f2.f1; // Nested field read
}
void m() {
f2.f1 = taint();
sink(getf2f1()); // NEW: taint() reaches here
}
};
```
* The security pack taint tracking library (`semmle.code.cpp.security.TaintTracking`) now considers that equality checks may block the flow of taint. This results in fewer false positive results from queries that use this library.
* The length of a tainted string (such as the return value of a call to `strlen` or `strftime` with tainted parameters) is no longer itself considered tainted by the `models` library. This leads to fewer false positive results in queries that use any of our taint libraries.
| DOM text reinterpreted as HTML (`js/xss-through-dom`) | security, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights potential XSS vulnerabilities where existing text from the DOM is used as HTML. Results are shown on LGTM by default. |
| Incomplete HTML attribute sanitization (`js/incomplete-html-attribute-sanitization`) | security, external/cwe/cwe-20, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights potential XSS vulnerabilities due to incomplete sanitization of HTML meta-characters. Results are shown on LGTM by default. |
| Unsafe expansion of self-closing HTML tag (`js/unsafe-html-expansion`) | security, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights potential XSS vulnerabilities caused by unsafe expansion of self-closing HTML tags. |
| Unsafe shell command constructed from library input (`js/shell-command-constructed-from-input`) | correctness, security, external/cwe/cwe-078, external/cwe/cwe-088 | Highlights potential command injections due to a shell command being constructed from library inputs. Results are shown on LGTM by default. |
| Download of sensitive file through insecure connection (`js/insecure-download`) | security, external/cwe/cwe-829 | Highlights downloads of sensitive files through an unencrypted protocol. Results are shown on LGTM by default. |
| Exposure of private files (`js/exposure-of-private-files`) | security, external/cwe/cwe-200 | Highlights servers that serve private files. Results are shown on LGTM by default. |
| Creating biased random numbers from a cryptographically secure source (`js/biased-cryptographic-random`) | security, external/cwe/cwe-327 | Highlights mathematical operations on cryptographically secure numbers that can create biased results. Results are shown on LGTM by default. |
| Storage of sensitive information in build artifact (`js/build-artifact-leak`) | security, external/cwe/cwe-312 | Highlights storage of sensitive information in build artifacts. Results are shown on LGTM by default. |
| Improper code sanitization (`js/bad-code-sanitization`) | security, external/cwe/cwe-094, external/cwe/cwe-079, external/cwe/cwe-116 | Highlights string concatenation where code is constructed without proper sanitization. Results are shown on LGTM by default. |
| Disabling certificate validation (`js/disabling-certificate-validation`) | security, external/cwe-295 | Highlights locations where SSL certificate validation is disabled. Results are shown on LGTM by default. |
| Incomplete multi-character sanitization (`js/incomplete-multi-character-sanitization`) | correctness, security, external/cwe/cwe-20, external/cwe/cwe-116 | Highlights sanitizers that fail to remove dangerous substrings completely. Results are shown on LGTM by default. |
| Client-side cross-site scripting (`js/xss`) | Fewer results | This query now recognizes additional safe patterns of constructing HTML. |
| Client-side URL redirect (`js/client-side-unvalidated-url-redirection`) | Fewer results | This query now recognizes additional safe patterns of doing URL redirects. |
| Code injection (`js/code-injection`) | More results | More potential vulnerabilities involving NoSQL code operators are now recognized. |
| Exception text reinterpreted as HTML (`js/exception-xss`) | Rephrased and changed visibility | Rephrased name and alert message. Severity lowered from error to warning. Results are now shown on LGTM by default. |
| Expression has no effect (`js/useless-expression`) | Fewer results | This query no longer flags an expression when that expression is the only content of the containing file. |
| Hard-coded credentials (`js/hardcoded-credentials`) | More results | This query now recognizes hard-coded credentials sent via HTTP authorization headers. |
| Incomplete URL scheme check (`js/incomplete-url-scheme-check`) | More results | This query now recognizes additional url scheme checks. |
| Insecure randomness (`js/insecure-randomness`) | Fewer results | This query now recognizes when an insecure random value is used as a fallback when secure random values are unsupported. |
| Misspelled variable name (`js/misspelled-variable-name`) | Message changed | The message for this query now correctly identifies the misspelled variable in additional cases. |
| Non-linear pattern (`js/non-linear-pattern`) | Fewer duplicates and message changed | This query now generates fewer duplicate alerts and has a clearer explanation in case of type annotations used in a pattern. |
| Prototype pollution in utility function (`js/prototype-pollution-utility`) | More results | This query now recognizes additional utility functions as vulnerable to prototype polution. |
| Uncontrolled command line (`js/command-line-injection`) | More results | This query now recognizes additional command execution calls. |
| Uncontrolled data used in path expression (`js/path-injection`) | More results | This query now recognizes additional file system calls. |
| Uncontrolled data used in path expression (`js/path-injection`) | Fewer results | This query no longer flags paths that have been checked to be part of a collection. |
| Unknown directive (`js/unknown-directive`) | Fewer results | This query no longer flags directives generated by the Babel compiler. |
| Unneeded defensive code (`js/unneeded-defensive-code`) | Fewer false-positive results | This query now recognizes checks meant to handle the `document.all` object. |
| Unused property (`js/unused-property`) | Fewer results | This query no longer flags properties of objects that are operands of `yield` expressions. |
| Zip Slip (`js/zipslip`) | More results | This query now recognizes additional vulnerabilities. |
The following low-precision queries are no longer run by default on LGTM (their results already were not displayed):
-`js/angular/dead-event-listener`
-`js/angular/unused-dependency`
-`js/bitwise-sign-check`
-`js/comparison-of-identical-expressions`
-`js/conflicting-html-attribute`
-`js/ignored-setter-parameter`
-`js/jsdoc/malformed-param-tag`
-`js/jsdoc/missing-parameter`
-`js/jsdoc/unknown-parameter`
-`js/json-in-javascript-file`
-`js/misspelled-identifier`
-`js/nested-loops-with-same-variable`
-`js/node/cyclic-import`
-`js/node/unused-npm-dependency`
-`js/omitted-array-element`
-`js/return-outside-function`
-`js/single-run-loop`
-`js/too-many-parameters`
-`js/unused-property`
-`js/useless-assignment-to-global`
## Changes to libraries
* A library `semmle.javascript.explore.CallGraph` has been added to help write queries for exploring the call graph.
* Added data flow for `Map` and `Set`, and added matching type-tracking steps that can accessed using the `CollectionsTypeTracking` module.
* The data-flow node representing a parameter or destructuring pattern is now always the `ValueNode` corresponding to that AST node. This has a few consequences:
-`Parameter.flow()` now gets the correct data flow node for a parameter. Previously this had a result, but the node was disconnected from the data flow graph.
-`ParameterNode.asExpr()` and `.getAstNode()` now gets the parameter's AST node, whereas previously it had no result.
-`Expr.flow()` now has a more meaningful result for destructuring patterns. Previously this node was disconnected from the data flow graph. Now it represents the values being destructured by the pattern.
* The global data-flow and taint-tracking libraries now model indirect parameter accesses through the `arguments` object in some cases, which may lead to additional results from some of the security queries, particularly "Prototype pollution in utility function".
* The predicates `Type.getProperty()` and variants of `Type.getMethod()` have been deprecated due to lack of use-cases. Looking up a named property of a static type is no longer supported, favoring faster extraction times instead.
* Importing `semmle.python.web.HttpRequest` will no longer import `UntrustedStringKind` transitively. `UntrustedStringKind` is the most commonly used non-abstract subclass of `ExternalStringKind`. If not imported (by one mean or another), taint-tracking queries that concern `ExternalStringKind` will not produce any results. Please ensure such queries contain an explicit import (`import semmle.python.security.strings.Untrusted`).
* @description A function that uses more functions and variables from another file than functions and variables from its own file. This function might be better placed in the other file, to avoid exposing internals of the file it depends on.
* @description Two files share too much information about each other (accessing many operations or variables in both directions). It would be better to invert some of the dependencies to reduce the coupling between the two files.
select op, "This '==' operator has no effect. The assignment ('=') operator was probably intended."
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.