The `AliasedUse` instruction is supposed to represent future uses of aliased memory after the function returns. Since local variables from that function are no longer allocated after the function returns, the `AliasedUse` instruction should access only the set of aliased locations that does not include locals from the current stack frame.
It's not used anywhere outside `VoidContext.qll`, where it was defined.
The use in `VoidContext.qll` is 10 years old and was a workaround for an
extractor bug that no longer exists.
This new instruction is the dual of the existing `AliasedDefinition` instruction. Whereas that instruction defines the contents of aliased memory before the function was called, `AliasedUse` represents the potential use of all aliased memory after the function returns. This ensures that writes to aliased memory do not appear "dead", even if there are no further reads from aliased memory within the function itself.
The one interesting piece that needed to be fixed up was the type of an `Indirect[Read|Write]SideEffect` operand/result. If the parameter type is a pointer or reference to an incomplete type, we need to set the type of the side effect memory access to `Unknown`, because we don't model incomplete types in the IR type system.
I also added minimal support for `__assume` (generated as a `NoOp`), because lack of `__assume` support got in the way of debugging the other issue above.
This change gives a slight performance improvement and makes the QL code
shorter. It introduces some magic numbers in the code, but those are
confined to the `Pos` and `Spec` classes.
We get a speed-up because the evaluator has built-in support for integer
literals in the `OUTPUT` of `JOIN` operations, whereas `newtype`s have
to be explicitly joined on. As a result, a predicate like
`CFG::straightLineSparse#ffff` drops from 262 pipeline nodes to 242.
I measured performance on https://github.com/jluttine/suitesparse, which
is one of the projects that had the biggest slowdown when enabling the
QL CFG on lgtm.com. I took two measurements before this change and two
after. The `CFG.qll` stage took 117s and 112s before, and it took 106s
and 107s after.
Eventually these files will subsume the current `queries.xml` files
at the top of query-containing and library directories. For now they're
just here to support internal testing of the tooling support for them
we're writing on.
Format and contents is a work in progress. If you're not in Semmle,
don't depend on anything here making sense (or staying stable) until
you see the version tags increase to something nonzero.
Added some examples and reworded portions of guards.rst. There's still
more to do - examples for ensures and compares predicates, and possibly
rewording the description of the compares predicates
We now detect patterns like
f(bool cond){
if(cond)
then A
else B
and prune branches for calls like f(true) or f(false).
This pruning is done both in the local (bigstep) flow graph
as well as in the inter-procedural dataflow graph.
This test used `getAQlClass`, which caused it to break when new classes
were added anywhere in the libraries. That's now avoided by switching to
`getCanonicalQLClass`. It turns out that `getCanonicalQLClass` didn't
support arithmetic expressions on complex numbers, so that support had
to be added.
This fixes the test `cpp/ql/test/library-tests/constexpr_if/cfg.ql`,
which broke when the QL CFG was enabled.
The new cases are just copy-pastes of the `IfStmt` cases (they don't
share a useful common superclass) with added checks for whether their
constant value equals 0.
Hopefully it does not make a difference in practice whether
uninstantiated template functions are considered to have control flow
through initializers of their static variables.
This change is needed when enabling the QL CFG on certain snapshots such
as notaz/picodrive. It removes the `bbNotInLoop` predicate, which was
always a liability because it's inherently quadratic. The real slowdown
came in `skipLoop`, where all true-upon-entry loops were crossed with
all definitions of variables that should take their definition from the
loop body.
The `Operation` class is abstract, and extending it caused cached stages
to be recomputed all the way down to the AST. This meant that the leap
year queries evaluated their own copy of SSA and data flow.
The translation of static fields now uses `VariableAddress` instead of `FieldAddress`. This fixes the logic as well as the "field address without qualifier address" sanity check.
Fixed a problem when the translating the compiler generated constructors that caused some sanity errors (since they have no body, when translating the constructor block fragmentation happened). Fixed this by skipping the translation of the body, if it does not exist (when translating a function).
This commit implements the language-neutral IR type system for C#. It mostly follows the same pattern as C++, modified to fit the C# type system. All object references, pointers, and lvalues are represented as `IRAddress` types. All structs and generic parameters are implemented as `IRBlobType`. Function addresses get a single `IRFunctionAddressType`.
I had to fix a couple places in the original IR type system where I didn't realize I was still depending on language-specific types. As part of this, `CSharpType` and `CppType` now have a `hasUnspecifiedType()` predicate, which is equivalent to `hasType()`, except that it holds only for the unspecified version of the type. This predicate can go away once we remove the IR's references to the underlying `Type` objects.
All C# IR tests pass without modification, but only because this commit continues to print the name of `IRUnknownType` as `null`, and `IRFunctionAddressType` as `glval<null>`. These will be fixed separately in a subsequent commit in this PR.
A constructed type, `C<T>`, where `T` is the type parameter of `C`, is represented
in the database as the corresponding unbound generict type `C<>`. Consequently, the
type conversion library, which only considers `ConstructedType`s, does not handle
all implicit conversions. For example, in
```
interface I<in T1, T2> where T1 : C
```
there should be an implicit conversion from `I<C, T2>` to `I<T1, T2>` (=`I<>`).
We could find no reason for using `Builtin::builtin` instead of
`Builtin::special`. Since all the other base types use `special`, and the old
Object API is using `special`, let's also do that :)
The C++ IR currently has a very clunky way of specifying the type of an IR entity (`Instruction`, `Operand`, `IRVariable`, etc.). There are three separate predicates: `getType()`, `isGLValue()`, and `getSize()`. All three are necessary, rather than just having a `getType()` predicate, because some IR entities have types that are not represented via an existing `Type` object in the AST. Examples include the type for an lvalue returned from a `VariableAddress` instruction, the type for an array slice being zero-initialized in a variable initializer, and several others. It is very easy for QL code to just check the `getType()` predicate, while forgetting to use `isGLValue()` to determine if that type is the actual type of the entity (the prvalue case) or the type referred to by a glvalue entity. Furthermore, the C++ type system creates potentially many different `Type` objects for the same underlying type (e.g. typedefs, using declarations, `const`/`volatile` qualifiers, etc.), making it more difficult to tell when two entities have semantically equivalent types.
In addition, other languages for which we want to enable the IR have somewhat different type systems. The various language type systems differ in their structure, although they tend to share the basic building blocks necessary for the IR.
To address all of the above problems, I've introduced a new class hierarchy, rooted at the class `IRType`, that represents a bare-bones type system that is independent of source language (at least across C/C++/C#/Java). A type's identity is based on its kind (signed integer, unsigned integer, floating-point, Boolean, blob, etc.), size and in the case of blob types, a "tag" to differentiate between different classes and structs. No distinction is made between, say `signed int` and plain `int`, or between different language integer types that have the same signedness and size (e.g. `unsigned int` vs. `wchar_t` on Linux). `IRType` is intended for use by language-agnostic IR-based analyses, including range analysis, dataflow, SSA construction, and alias analysis. The set of available `IRType`s is determined by predicate provided by the language library implementation (e.g. `hasSignedIntegerType(int byteSize)`.
In addition to `IRType`, each language now defines a type alias named `LanguageType`, representing the type of an IR entity in more language-specific terms. The only predicate requried on `LanguageType` is `getIRType()`, which returns the single `IRType` object for the language-neutral representation of that `LanguageType`. All other predicates on and subclasses of `LanguageType` are language-specific. There may be many instances of `LanguageType` that map to a given `IRType`, to allow for typedefs, etc.
Most of the changes are mechanical changes in the IR construction code, to return the correct type for each IR entity. SSA construction has also been updated to avoid dependencies on language-specific types.
I have not yet removed the original `getType()` predicates that just return `Type`. These can be removed once we move the remaining existing libraries to use `IRType`.
Test results are, by design, pretty much unchanged. Once case changed for inline asm, because the previously IR generation for it played a little fast and loose with the input/output expressions. The test case now includes both input and output variables. The generated IR for `Conditional_LValue` is now more correct, because we now have a way to represent an lvalue of an lvalue. `syntax-zoo` is still a hot mess. Most of the changed outputs are due to wobble from having multiple functions with the same name, but with a slightly different order of evaluation due to the type changes. Others are wobble from already-invalid IR. A couple non-wobbly places have improved slightly, though.
The C# part of this change is waiting for #2005 to be merged, since that has some of the necessary C# implementation.
The translation of `IsExpr` created a sanity check to fail since it generated
a Phi node that had only one source: if a variable was declared as part of the `IsExpr`, a conditional branch was generated, and the variable was defined only in the true successor; this has been changes so that the declaration happens before the conditional branch, and the variable is uninitialized (this removed the need for the `isInitializedByElement` predicate from `TranslatedDeclarationBase`, so that has been removed) and only the assignment happens in the true successor block (so now the two inputs of the Phi node are the result of the `Uninitialized` instruction and the `Store` instruction from the true successor block).
Generation of IDs for namespace members has been fixed to generate
unique IDs for variables of the same name but in different namespaces.
Update the same_name test to validate this.
More accurate type sizes using language specific predicates from `IRCSharpLanguage.qll`.
Added immediate operands for some `PointerX` (add, sub) instructions.
Some other minor consistency fixes.
This fixes false positives that arise when a call such as `f.apply` can either be interpreted as a reflective invocation of `f`, or a normal call to method `apply` of `f`.
The key insight here is that `HC_FieldCons` and `HC_Array` are
functionally determined by the things that arise in another
recursive call. Lifting them to their own predicate, therefore,
reduces nonlinearity and constrains the join order in a way that
cannot be asymptotically bad -- and, indeed, makes quite a big
difference in practice.
Switching `security.TaintTracking` to use `DefaultTaintTracking` causes
us to lose a result from `UnboundedWrite.ql`, while this commit restores
it:
diff --git a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
index 1eba0e52f0e..d947b33b9d9 100644
--- a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
+++ b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
@@ -1,2 +1,3 @@
+| main.c:54:7:54:12 | call to strcat | This 'call to strcat' with input from $@ may overflow the destination. | main.c:93:15:93:18 | argv | argv |
| main.c:99:9:99:12 | call to gets | This 'call to gets' with input from $@ may overflow the destination. | main.c:99:9:99:12 | call to gets | call to gets |
| main.c:213:17:213:19 | buf | This 'scanf string argument' with input from $@ may overflow the destination. | main.c:213:17:213:19 | buf | buf |
This is one step toward implementing the taint-tracking wrapper in terms
of `Instruction` rather than `Expr`.
This leads to a few duplicate results in `TaintedAllocationSize.ql`
because the library now considers `sizeof(int)` to be just as
predictable as `4`, whereas the `security.TaintTracking` library does
not consider `sizeof` to be predictable. I think it's simpler to accept
the duplicate results since they are ultimately a quirk of the query,
not the library.
The following is the diff between (a) replacing `TaintTracking.qll` with
a link to `DefaultTaintTracking.qll` and (b) additionally applying this
commit.
diff --git a b
--- a/cpp/ql/test/query-tests/Security/CWE/CWE-190/semmle/TaintedAllocationSize/TaintedAllocationSize.expected
+++ b/cpp/ql/test/query-tests/Security/CWE/CWE-190/semmle/TaintedAllocationSize/TaintedAllocationSize.expected
@@ -1,5 +1,8 @@
| test.cpp:42:31:42:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:43:31:43:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
| test.cpp:43:38:43:63 | ... * ... | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:45:31:45:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
| test.cpp:48:25:48:30 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
| test.cpp:49:17:49:30 | new[] | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:52:21:52:27 | call to realloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
| test.cpp:52:35:52:60 | ... * ... | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
--- a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-190/CERT/INT04-C/int04.expected
+++ b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-190/CERT/INT04-C/int04.expected
@@ -1 +1,2 @@
| int04c.c:21:29:21:51 | ... * ... | This allocation size is derived from $@ and might overflow | int04c.c:14:30:14:35 | call to getenv | user input (getenv) |
+| int04c.c:22:33:22:38 | call to malloc | This allocation size is derived from $@ and might overflow | int04c.c:14:30:14:35 | call to getenv | user input (getenv) |
Added support for pointers and pointer operations and made sure all loads are correct.
Added support for the unsafe stmt.
Added basic support for the fixed stmt (for now we ignore the pinning).
Fixed a bug where assignments of the form `Object obj1 = obj2` would not generate a load instruction for `obj2` (see `raw_ir.expected`).
Added an extra `Load` for object creations that involve structs. This is because the variable that represents the struct should hold the actual struct, not a reference to it.
Refactored the piece of code that decided if a particular expr needs a load instruction and improved the code sharing between `TranslatedExpr.qll` and `TranslatedElement.qll` by creating 2 predicates that tell if a certain expr does or does not need a load.
Added support for `in`, `out` and `ref` parameters.
The user knows that an expression functionally determines its
hashCons value, and that an expression functionally determines
its number of children, but this is not provable from the
definitions, and so not usable by the optimiser. By storing
the result of those known-functional calls in a variable,
rather than repeating the call, we enable better join orders.
In theory this bug could associated CaptureOutNodes with the wrong transitively called
callable. However, in practice I could not create a test case that revealed incorrect
behaviour. I've included one such test case in the commit.
I believe that the cause of this is that OutNode::getACall() is not actually used in the
data flow libraries. Instead, DataFlowDispatch::Cached::getAnOutNode is the predicate
which is used to associated OutNode's with DataFlowCall's in practice, and that is always
used in a context that correctly binds the runtime target of the call.
The most common reason for the C# autobuilder to fail is because it
cannot determine a single unique .sln or .proj file to build, instead
reporting multiple sln or proj files at the same shortest depth. This
commit changes this to build all such files, rather than reporting an
error.
Added basic support for the using stmt, checked stmt, unchecked stmt
Note that the translations do not use the compiler generated element framework and hence they are just rough approximations. For accuracy, in the future their translation should use it.
These new results seem better than the previous ones, but the previous
ones are still there. Perhaps the `Buffer.qll` library could use some
adjustment, but this seems like an improvement in isolation.
The data flow library conflates pointers and their objects in some
places but not others. For example, a member function call `x.f()` will
cause flow from `x` of type `T` to `this` of type `T*` inside `f`. It
might be ideal to avoid that conflation, but that's not realistic
without using the IR.
We've had good experience in the taint tracking library with conflating
pointers and objects, and it improves results for field flow, so perhaps
it's time to try it out for all data flow.
TTransitiveCaptureCall represents a control flow node that may
transitively call many different callables which capture a variable from
the current scope. Captured variables are represented as synthetic
parameters to the callable, at negative indices. However, each of the
different targets may capture a different subset of variables from the
enclosing scope, so we must include the target along side the CFN in
order to prevent incorrect capture flow.
This is for symmetry with `exprNode` etc., and it should be handy for
the same reasons. I found one caller of `asInstruction` that got simpler
by using the new predicate instead.
Added support for continue stmt.
Minimal refactoring of the `TranslatedSpecificJump` classes.
Added a new test file, `jumps.cs` and updated the expected output.
This currently relies on the fact that qltest includes `ql/csharp/ql/src/Metrics` in addition to `ql/csharp/ql/src` on its search path when run internally, which is inconsistent with the other languages. Since this is the only test that relies on it, I'd like to update it and get rid of the extra search root eventually.
This commit adds a `semmle.code.cpp.ir.dataflow.DefaultTaintTracking`
library that's API-compatible with the
`semmle.code.cpp.security.TaintTracking` library. The new library is
implemented on top of the IR data flow library.
The idea is to evolve this library until it can replace
`semmle.code.cpp.security.TaintTracking` without decreasing our SAMATE
score. Then we'll have the IR in production use, and we will have one
less taint-tracking library in production.
The way object creation was translated has been changed: now creations are treated as expressions.
The main motivation for this was the inability to have creation expressions as arguments to
function calls (a test case has been added to showcase this).
All code that dealt with creation expressions has been moved from `TranslatedInitialization.qll` to `TranslatedExpr.qll`.
Some light refactoring has also been done, mainly removing code that was useless after the changes mentioned above.
These partial defs don't do any harm, but they could hurt performance.
In typical C++ snapshots, between 5% and 20% of all calls are to `const`
functions.
Comments like these will make the autoformatter produce bad indentation.
For the record (not for explainability), these issues were found with
git grep -P -A1 '^( */\*| +\*( |$))(.(?!\*/))*$' cpp/ql/src/'**/*.ql*' |grep -B10 'qll\?- [^*]*$'
It superficially looks like `@param` is supported in QLDoc, but this is
mostly an accident of how its parser works. Attributes starting with `@`
are only intended to be used in the top-level QLDoc of a query, and
there can only be one of each attribute. If there are multiple `@param`
entries, the QLDoc parser will only keep the first one.
Even though `parseConvSpec` in `Scanf.qll` documented multiple
parameters, only the first one would be shown in an IDE. The
corresponding predicate in `Print.qll` documented only its first
parameter, perhaps because of an autoformatting accident earlier in
time. I've attempted to reconstruct documentation for its other
parameters based on its sibling in `Scanf.qll`.
These functions were overly complicated, and the comments explaining the
complications did not auto-format well. A reference type cannot have
specifiers on it, so it's fine to call `getUnspecifiedType` before
checking if it's a reference type.
The autoformatter is opinionated about comment styles and assumes that
"short" comments attach to the following item while "long" comments are
items themselves. I found top-level short comments with the following
two commands and then searched the output for empty lines that came
after the comment.
git grep -A1 '^/\* .*\*/' cpp/ql/src
git grep -A1 '^//' 'cpp/ql/src/**/*.ql*'
The foreach was erroneously labelling the `True` and `False` edges as backedges.
Added a case for the compiler generated while in the predicate `getInstructionBackEdgeSuccessor/2`
from the file `IRConstruction.qll` so that only the edges from inside the body are labeled as back edges.
Added a framework for the translation of compiler generated elements, so that the process of adding a new desugaring process is almost mechanical.
The files in `internal` serve as the superclasses for all the compiler generated elements.
The file `Common.qll` captures common patterns for the compiler generated code to improve code sharing (by pattern I mean an element that appears in multiple desugarings). For example the `try...finally` pattern appears in the desugaring process of both the `lock` and the `foreach` stmts, so a class the provides a blueprint for this pattern is exposed. Several other patterns are present.
The expected output has also been updated (after a rebase) and it should be ignored.
The `raw/internal` folder has been restructured to better enhance code sharing between compiler generated elements and AST generated elements.
The translated calls classes have been refactored to better fit the C# library.
A new folder has been added, `common` that provides blueprints for the classes that deal with translations of calls, declarations, exprs and conditions.
Several `TranslatedX.qll` files have been modified so that they use those blueprint classes.
We haven't come to a conclusion on whether these two types will remain
identical forever. To make sure we're able to change it in the future,
this change makes it impossible to cast between the two types. Callers
must use the `asInstruction` member predicate to convert.
g++ doesn't support this code:
sorry, unimplemented: non-trivial designated initializers not supported
twoIntFields sSwapped = { .m2 = source(), .m1 = 0 };
so we need to build it in clang mode.
Data flow probably never worked when a variable declared in a
`ConditionDeclExpr` was modeled with `BlockVar`. That combination did
not come up in testing before the last commit.
The data flow library conflates pointers and objects enough for the
`definitionByReference` predicate to be too strict in some cases. It was
too permissive in other cases that are now (or will be) handled better
by field flow.
See also the change note entry.
Partial data flow had a semantic merge conflict with this branch. The
problem is that partial data flow doesn't (and shouldn't) cause the
initial pruning steps to run, but the length-2 access paths depend on
the `consCand` information that comes from that initial pruning. The
solution is to restore the old `AccessPath` class, now called
`PartialAccessPath` for use only by partial data flow.
With this change, partial data flow will in some cases allow more field
flow than non-partial data flow.
This commit removes fields from the responsibilities of `FlowVar.qll`.
The treatment of fields in that file was slow and imprecise.
It then adds another copy of the shared global data flow library, used
only to find local field flow, and it exposes that local field flow
through `localFlow` and `localFlowStep`.
This has a performance cost. It adds two cached stages to any query that
uses `localFlow`: the stage from `DataFlowImplCommon`, which is shared
with all queries that use global data flow, and a new stage just for
`localFlowStep`.
Fixed a bug in `TranslatedArrayExpr` that would prevent the element to produce the correct instruction result, hence creating problems with loads and stores.
`ElementsAddress` opcode now inherits from the `UnaryOpcode`, as it should.
Began working o inheritance, polymorphism and constructor init. Correct code is produced for them (though some more work is needed to accurately treat conversions between classes).
Removed commented code.
Added classes to properly deal with constructor init and modified and refactored TranslatedFunction to accomodate for the changes.
Properties and property access produce correct code.
Fixed a function qualifier bug in `TranslatedCall.qll`.
Added a new class to translate `ExprStmt`s whose expr is an `AssignExpr` whose lvalue is an accessor call: we translate only the accessor call in for the translated AST.
Broke down the class `TranslatedJump` to have more control on the IR control flow.
Now GotoLabelStmt, GotoCaseStmt, GotoDefaultStmt and BreakStmt are translated separately.
This also fixes an issue when having a switch as the last statement of a void function would create an incorrect CFG.
Correct code is now produced for increment and decrement expressions
Modified producesExprResult() and TTranslatedLoad() so that no loads are done from outside the crement exprs and that the VariableAddress generated from the access of the operator variable is recognized as an expr that produces result.
Throw statements now give correct code, apart from the case of rethrows: need to make explicit the fact that a finally block is executed even if stack unwinding happens.
Added 2 new classes to TranslatedStmt.qll, one for throws that have an exception, one for rethrows.
Fixed a bug in TranslatedDeclarationEntry.qll where some local declaration would be missed.
Changed toString into getQualifiedName for more clarity when generating the instructions in Instruction.qll.
Some general refactoring in TranslatedExpr.qll and TranslatedStmt.qll.
Added tests to showcase the instructions generated for object creation and object initialization
Updated raw_ir.expected
PrintIR now uses the qualified name (with types) when printing the IR for more clarity
Correct code is now generated from ObjectCreation exprs and ObjectInitializer exprs.
Removed TranslatedFieldInitialization and its subclasses and further refactored TranslatedInitialization
Introduced 2 new tags to support multidimensional arrays
Multidimensional arrays produce correct code
All types of initializations for arrays work correctly
Correct code is now generated for array initialization and element access.
Created a new binary Opcode, `IndexedElementAddress`, used to get the address of an array element, similar to how CIL does it.
Fixed simple variable initialization.
Now variables addressing correctly gets translated
Added a new test case to showcase this
Changed VoidType to ObjectType for the type of the 2 instructions
generated by as the prelude of a translated function
(UnmodeledDefinition and AliasedDefinition)
Ported basic functionalities from the C++ IR
Added a simple test that passes the IR sanity check and produces
sensible IR (together with the .expected files) to the C# test folder
This groups the change notes that concern the `DataFlow` library and
clarifies the change notes that concern the two different
`TaintTracking` libraries.
When this query was run as an upgrade script, the optimizer picked a bad
join order, making the upgrade very slow on large databases. It picked a
bad join order because upgrade scripts are run with no stats.
Lack of support for the GCC vector extensions was causing a bunch of sanity failures in the syntax zoo. This PR adds minimal IR generation support for these types.
Added `VectorAggregateLiteral`, and factored most of `ArrayAggregateLiteral` out into the common base class `ArrayOrVectorAggregateLiteral`. I'd be happy to merge these all into `ArrayAggregateLiteral` if we don't care about the distinction.
Made a few tweaks to `TranslatedArrayExpr` to compute the element type by looking at the result type of the `ArrayExpr`, not the type of the base operand. Note that this means that for `T a[10]; a[i] = foo;`, the result of the `PointerAdd` for `a[i]` will now be `glvalue<T>`, not `T*`. This is actually more faithful to the source language, and has no semantic difference on the IR.
Added some missing `getInstructionElementSize()` overrides.
Added the new `BuiltIn` opcode, renamed the existing `BuiltInInstruction` to `BuiltInOperationInstruction`, and made any `BuiltInOperation` that we don't specifically handle translate to `BuiltIn`. `BuiltInOperationInstruction` now has a way to get the specific `BuiltInOperation`.
Added `getCanonicalQLClass()` overrides for `GNUVectorType` and `BuiltInOperation`.
Added a simple IR test for vector types.
docs: better responsive behaviour
docs: improve c/c++ slides
docs: titles and fonts
docs: tidy up layout and css
docs: update layout to scale font-size by slide height
docs: tidy up templates and fix font headings
This commit adds field initializers to the CFG for non-static constructors. For
example, in
```
class C
{
int Field1 = 0;
int Field2 = Field1 + 1;
int Field3;
public C()
{
Field3 = 2;
}
public C(int i)
{
Field3 = 3;
}
}
```
the initializer expressions `Field1 = 0` and `Field2 = Field1 + 1` are added
to the two constructors, mimicking
```
public C()
{
Field1 = 0;
Field2 = Field1 + 1;
Field3 = 2;
}
```
and
```
public C()
{
Field1 = 0;
Field2 = Field1 + 1;
Field3 = 3;
}
```
respectively. This means that we no longer have to synthesize calls, callables,
parameters, and arguments in the data flow library, so much of the work from
d1755500e4 can be simplified.
This file is now identical in all languages. Unifying this file led to
the following changes:
- The documentation spelling fixes and example from the C++ version
were copied to the other versions and updated.
- The steps through `NonLocalJumpNode` from C# were abstracted into a
`globalAdditionalTaintStep` predicate that's empty for C++ and Java.
- The `defaultTaintBarrier` predicate from Java is now present but empty
on C++ and C#.
- The C++ `isAdditionalFlowStep` predicate on
`TaintTracking::Configuration` no longer includes `localFlowStep`.
That should avoid some unnecessary tuple copying.
This class was copy-pasted in all `DataFlowN.qll` files without using
the identical-files system to keep the copies in sync. The class is now
moved to the `DataFlowImplN.qll` files.
This also has the effect of preventing recursion through first data flow
library copy for C/C++. Such recursion has been deprecated for over a
year, and some forms of recursions are already ruled out by the library
implementation.
To compensate for the lack of field flow, the taint tracking library has
previously considered taint to flow from fields to their containing
structs and back again from the structs to any of their fields. This
leads to false flow between unrelated fields and is not needed now that
we have proper flow through fields.
There's now a `localFlowStep` predicate for use directly in queries and
other libraries and a `simpleLocalFlowStep` for use only by the global
data flow library. The former predicate is intended to include field
flow, but the latter may not.
This will let Java and C# (and possibly C++ IR) avoid getting two kinds
of field flow at the same time, both from SSA and from the global data
flow library. It should let C++ AST add some form of field flow to
`localFlowStep` without making it an input to the global data flow
library.
To keep the code changes minimal, and to keep the implementation similar
to C++ and Java, the `TaintTracking{Public,Private}` files are now
imported together through `TaintTrackingUtil`. This has the side effect
of exposing `localAdditionalTaintStep`. The corresponding predicate for
Java was already exposed.
This is more consistent with the other JSDoc queries. Results are still not shown on LGTM by default, but the query can now be enabled selectively for projects that care about JSDoc.
Initial implementation of data flow through fields, using the algorithm of the
shared data flow implementation. Fields (and field-like properties) are covered,
and stores can be either
- ordinary assignments, `Foo = x`,
- object initializers, `new C() { Foo = x }`, or
- field initializers, `int Foo = x`.
For field initializers, we need to synthesize calls (`SynthesizedCall`),
callables (`SynthesizedCallable`), parameters (`InstanceParameterNode`), and
arguments (`SynthesizedThisArgumentNode`), as the C# extractor does not (yet)
extract such entities. For example, in
```
class C
{
int Field1 = 1;
int Field2 = 2;
C() { }
}
```
there is a synthesized call from the constructor `C`, with a synthesized `this`
argument, and the targets of that call are two synthesized callables with bodies
`this.Field1 = 1` and `this.Field2 = 2`, respectively.
A consequence of this is that `DataFlowCallable` is no longer an alias for
`DotNet::Callable`, but instead an IPA type.
Because `ConstructorFieldInit` (member initializer lists) are not part
of the control flow graph, there was no data flow from the initial value
of parameters to their uses in member initializers. This commit adds the
necessary flow under the assumption that parameters are not overwritten
in member initializers.
This allows a member initializer list to be seen as a sequence of field
assignments. For example, the constructor
C() : a(taint()) { }
now has data flow similar to
C() { this.a = taint(); }
This brings the annotation style in sync with how we annotate new tests
these days. I also changed a few annotations to have different expected
outcome based on my understanding of the code.
This commit changes C++ `ConstructorCall` to behave like
`new`-expressions in Java: they are both `ExprNode`s and
`PostUpdateNodes`, and there's a "pre-update node" (here called
`PreConstructorCallNode`) to play the role of the qualifier argument
when calling a constructor.
On a Postgres snapshot, where the `getAReachedBlockVarSBB` predicate
performs badly because of a Yacc-generated 20,000-line parser loop, that
predicate is reduced from 4m22s to 1m32s plus 5.2s for the live
variables analysis.
This change removes 17,142 rows from `BlockVar.getAnAccess` on Postgres.
I sampled some of them, and they were all of the following form:
while (...) {
T x;
f1(&x); // access
f2(&x); // definition
}
Such accesses are ruled out now because we deliberately lose track of
variables when they go out of scope.
- Speedup the `varBlockReaches()` predicate, by restricting to basic blocks
in which a given SSA definition may still be live, in constrast to just
being able to reach *any* access (read or write) to the underlying source
variable.
- Account for some missing cases in the `lastRead()` predicate.
This removes a lot of flow steps, but it all seems to be flow that was
present twice: both exiting a `PartialDefNode` and a
`DefinitionByReferenceNode`. All `DefinitionByReferenceNode`s are now
`PartialDefNode`s.
There were two problems here.
1. The inline predicates `isInitialized` and `isValueInitialized` on
`ArrayAggregateLiteral` caused their callers to materialize every
`int` that was a valid index into the array. This was slow on huge
value-initialized arrays.
2. The `isInitialized` predicate was used in the `TInstructionTag` IPA
type, creating a numbered tuple for each integer in it. This seemed
to be entirely unnecessary since the `TranslatedElement`s using those
tags were already indexed appropriately.
Mostly just adding backticks in QLDoc comments. I'm trying out the edit-in-github workflow @jbj showed me, which seems like it will be a quicker way to do minor changes like these.
The predicate `maxSplits()` was previously applied dynamically to ensure that
any control flow node would keep track of at most `maxSplits()` number of splits.
However, there was no guarantee that two different copies of the same AST element
wouldn't contain different splits, so in general the number of copies for a given
AST element `e` could be on the order `$\binom{n}{k}c^k$`, where `n` is the total
number of splits that apply to `e`, `k = maxSplits()`, and `c` is a constant.
With this change, the relevant splits for `e` are instead computed statically,
meaning that the order is instead `$c^k$`.
With flow labels it often makes more sense to use a `DataFlow::Configuration` rather than a `TaintTracking::Configuration`, so flow summaries should support both.
My original fix in https://github.com/Semmle/ql/pull/1661 fixed my minimal test case, but did not fix the original failure in a Linux snapshot. The real fix is to simply not create a `TranslatedDeclarationEntry` for an extern declaration, and have `TranslatedDeclStmt` skip any such declarations. I've added a regression test for that case (multiple extern declarations with same location in a macro expansion, with control flow between them). I did verify that it generates correct IR, and that it fixes all of the "use not dominated by definition" failures in Linux.
The underlying extractor bug, that caused the above issue also caused PrintAST to print garbage. I've worked around the bug in PrintAST.qll.
I've also fixed a bug in the control flow for `try`/`catch`, where there was missing flow from the `CatchByType` of the last handler of a `try` to the enclosing handler (or `Unwind`). Hat tip to @AndreiDiaconu1 for spotting this bug.
Previously, where we had a function-scoped `DeclarationEntry` for an extern variable or function, we would generate a `NoOp` instruction for it. There's nothing wrong with this by itself, although it was unnecessary. However, I've hit an extractor issue (Jira ticket already opened) that commonly causes multiple `DeclStmt`s to share a single `DeclarationEntry` child on extern declarations, so removing the `NoOp` instructions is an easy way to work around the extractor issue.
We informally deprecated them in 1.21, this commit deprecates them properly and removes support from the implementation. The predicates themselves will be removed in a future release.
Rather than have IR.qll (which depends on the flavor) import EdgeKind.qll (which does not) with an non-relative import, I've moved these imports into internal.IRImports relative to IR.qll. These imports files can be shared across flavors within one language, but are different between C# and C++ due to the difference in paths.
This is still quadratic in the number of MemoryLocations for a vvar, but
only for a single pipeline step, which is not materialized. It seems to be
fast enough in practice for the IR.
Now that `PhiInstruction.getAnInput` only has results for congruent
operands, a previous optimization I made to `getConstantValue` is no
longer sound. We have to check that all phi inputs give the same value,
not just the congruent ones. After this change, if there are any
non-congruent operands on a phi instruction, the whole aggregate will
have no result.
In the `SignAnalysis` abstract interpretation, "unknown sign"
corresponds to the set of _all_ `Sign`, but using `getDef` leads to the
operand having _no_ `Sign`. To fix that, we assign all signs to inexact
operands.
The issue is now fixed in the extractor, and I've confirmed that the
workaround is no longer needed for g/an-tao/drogon.
This reverts commit 48a3385809.
This query seems to have been de-optimized by recent optimizer or stats
changes. On libretro/libretro-uae, the query took 1 second on a warm
cache with dist 89ad5f1 but took 9979 seconds with dist a3b9b6eb9.
The slowness was due to a Cartesian product in
`illDefined{Decr,Incr}ForStmt` between all the definitions and all the
uses of `Variable v`. This would be no problem with the right join
order, but that has apparently been lost. This commit factors out a pair
of `pragma[noinline]` helper predicates to make sure the definitions
(`v.getAnAssignedValue()`) and the uses (`v.getAnAccess()`) are queried
and filtered in separate predicates.
The performance problem can be seen in the tuple counts of this pipeline
I interrupted during evaluation of
`inconsistentLoopDirection::illDefinedDecrForStmt#ffff#shared`:
89716 ~3% {2} r1 = SCAN Variable::Variable::getAnAssignedValue_dispred#ff OUTPUT FIELDS {Variable::Variable::getAnAssignedValue_dispred#ff.<1>,Variable::Variable::getAnAssignedValue_dispred#ff.<0>}
89716 ~0% {3} r2 = JOIN r1 WITH DataFlowUtil::TExprNode#ff@staged_ext ON r1.<0>=DataFlowUtil::TExprNode#ff@staged_ext.<0> OUTPUT FIELDS {r1.<1>,DataFlowUtil::TExprNode#ff@staged_ext.<0>,DataFlowUtil::TExprNode#ff@staged_ext.<1>}
502539405 ~0% {4} r3 = JOIN r2 WITH Variable::Variable::getAnAccess_dispred#fb ON r2.<0>=Variable::Variable::getAnAccess_dispred#fb.<0> OUTPUT FIELDS {Variable::Variable::getAnAccess_dispred#fb.<1>,r2.<1>,r2.<2>,r2.<0>}
return r3
The last commit introduced calls to two predicates that did not exist. I
created `Instruction.getResultAddress` so it now exists and changed the
other call back to use the predicate that does exist.
This query used `getASuccessor()` on the CFG, which worked in many cases
but became quadratic on certain projects including PostgreSQL and
MySQL. The problem was that there was just enough context for magic to
apply to the transitive closure, but the use of magic meant that the
fast transitive closure algorithm wasn't used. In projects where the
magic had little effect, that led to the
`#ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus`
predicate taking quadratic time and space.
This commit changes the query to use basic blocks to find successors,
which is much faster because (1) there are many more `ControlFlowNode`s
than `BasicBlocks`, and (2) the optimizer does not apply magic but uses
fast transitive closure instead.
Behavior changes slightly in the `isUsedInCorrectLeapYearCheck` case: we
now accept a `yfacheck` that comes _before_ `yfa` if they are in the
same basic block. I don't think that matters in practice.
Without this `pragma[noopt]`, `isInCycle` gets compiled into RA that
unpacks every tuple of the fast TC:
0 ~0% {2} r1 = SELECT #Operand::getNonPhiOperandDef#3#ffPlus ON FIELDS #Operand::getNonPhiOperandDef#3#ffPlus.<0>=#Operand::getNonPhiOperandDef#3#ffPlus.<1>
0 ~0% {1} r2 = SCAN r1 OUTPUT FIELDS {r1.<0>}
return r2
With this change, it just becomes one lookup in the fast TC data
structure per instruction.
Just like `TInstruction` is cached to prevent re-numbering its tuples in
every IR query, I think `TOperand` should be cached too. I tested it on
the small comdb2 snapshot, where it only saves one second of work when
running a second IR query, but the savings should grow when snapshots
are larger and when there are more IR queries in a suite. Tuple
numbering is mildly quadratic, so it should be good to avoid repeating
it.
Adding these annotations adds three cached stages to the existing four
cached stages of the IR. The new cached stages are small and do not
appear to repeat any work from the other stages, so I see no advantage
to merging them with the existing stages.
The previous version of the test used `0 = 1;` to test an lvalue-typed
`ErrorExpr`, but the extractor replaced the whole assignment expression
with `ErrorExpr` instead of just the LHS. This variation of the test
only leads to an `ErrorExpr` for the part of the syntax that's supposed
to be an lvalue-typed expression, so that's an improvement.
Unfortunately it still doesn't demonstrate that we can `Store` into an
address computed by an `ErrorExpr`.
This doesn't fix the underlying problem that for some reason there are
cycles in the operand graph on our snapshots of the Linux kernel, but it
ensures that the cycles don't lead to non-termination of
`ConstantAnalysis` and `ValueNumbering`.
The `ValueNumbering` library is supposed to propagate value numberings
through a `CopyInstruction` only when it's _congruent_, meaning it must
have exact overlap with its source. A `CopyInstruction` can be a
`LoadInstruction`, a `StoreInstruction`, or a `CopyValueInstruction`.
The latter is also a `UnaryInstruction`, and the value numbering rule
for `UnaryInstruction` applied to it as well.
This meant that value numbering would propagate even through a
non-congruent `CopyValueInstruction`. That's semantically wrong but
probably only an issue in very rare circumstances, and it should get
corrected when we change the definition of `getUnary` to require
congruence.
What's worse is the performance implications. It meant that the value
numbering IPA witness could take two different paths through every
`CopyValueInstruction`. If multiple `CopyValueInstruction`s were
chained, this would lead to an exponential number of variable numbers
for the same `Instruction`, and we would run out of time and space
while performing value numbering.
This fixes the performance of `ValueNumbering.qll` on
https://github.com/asterisk/asterisk, although this project might also
require a separate change for fixing an infinite loop in the IR constant
analysis.
Before this change, `delete` and `delete[]` expressions had no control
flow after them, which caused the reachability analysis to remove all
code after a delete expression. This commit adds placeholder support for
delete expression by translating them to `NoOp` instructions so their
presence doesn't cause large chunks of the program to be removed.
We were previously missing a data-flow edge from reflected calls to the corresponding reflective call, that is, for `f.call(...)` we didn't have a flow edge from the implicit call to `f` to the result of `f.call(...)`.
This changes all the getters on `Instruction` to use `getDef` instead of
`getAnyDef`, with the result that these getters now only have a result
if the definition is exact.
This is a backwards-INCOMPATIBLE change.
These are the hand-written changes that complete the automatic changes
from the previous commit.
- Add deprecated compatibility wrappers for the renamed predicates.
- Add a new `Operand.getDef` predicate.
- Clarify the QLDoc for all these predicates.
This renames `getDefinitionInstruction` to `getAnyDef`, reflecting that
it includes definitions without exact overlap. It renames
`getUseInstruction` to `getUse` for consistency.
perl -p -i -e 's/\bgetUseInstruction\b/getUse/g; s/\bgetDefinitionInstruction\b/getAnyDef/g' \
cpp/ql/src/semmle/code/cpp/ir/**/*.ql* \
cpp/ql/test/**/*.ql* \
cpp/ql/src/semmle/code/cpp/rangeanalysis/**/*.ql*
The predicates `getter` and `setter` in `StructLikeClass.qll` were very
slow on some snapshots. On https://github.com/dotnet/coreclr they had
this performance:
StructLikeClass::getter#fff#antijoin_rhs ........... 3m55s
Variable::Variable::getAnAssignedValue_dispred#bb .. 3m36s
StructLikeClass::setter#fff#antijoin_rhs ........... 20.5s
The `getAnAssignedValue_dispred` predicate in the middle was slow due to
magic propagated from `setter`.
With this commit, performance is instead:
StructLikeClass::getter#fff#antijoin_rhs ........... 497ms
Variable::Variable::getAnAssignedValue_dispred#ff .. 617ms
StructLikeClass::setter#fff#antijoin_rhs ........... 158ms
Instead of hand-optimizing the QL for performance, I simplified `setter`
and `getter` to require slightly stronger conditions. Previously, a
function was only considered a setter if it had no writes to other
fields on the same class. That requirement is now relaxed by dropping
the "on the same class" part. I made the corresponding change for what
defines a getter. I think that still captures the spirit of what getters
and setters are.
I also changed the double-negation with `exists` into a `forall`.
The `sameBaseType` predicate was fundamentally quadratic, and this blew
up on large C++ code bases. Replacing it with calls to `Type.stripType`
fixes performance and does not affect the qltests. It looks like
`sameBaseType` was used purely an ad hoc heuristic, so I'm not worried
about the slight semantic difference between `sameBaseType` and
`stripType`.
In particular, `await`, `yield` and dynamic `import` expressions are now source nodes, as well as a few other experimental and legacy language features involving non-local flow.
The constraint `exists(callback.getParameter(i))` was getting pushed into `higherOrderCall`, which isn't a bad thing to do. However, this then led to a join on `i`, which is a very bad thing to do.
When completions are inherited by elements inside `finally` blocks, we previously
threw away the underlying completion. For example, in
```
try
{
if (b)
throw new Exception();
}
finally
{
if (b)
...
}
```
the completions for `b` inside the `finally` block are `true` and `throw(Exception)`,
where the latter is inherited from the `try` block, with an underlying `false`
completion. Throwing away the `false` completion meant that we were unable to prune
the `false` edge (Boolean CFG splitting).
- Make `InstructionViolation` abstract to avoid computing `getInstructionsUpTo()`
for all instructions in the database.
- Enable `consistency.ql`, which reports all consistency violations, and remove
all other specialized tests.
This is a workaround for an extractor issue where expressions in a
defaulted function are not always marked as generated. I haven't yet been
able to reproduce the issue in a test case.
To speed up the taint analysis in `NonConstantFormat.ql` and to remove
FPs that were due to taint spreading from `i` to `a[i]`, this commit
stops the taint tracking in `NonConstantFormat.ql` at every node that
could not possibly contain a string.
I tested performance on Wireshark, and it's fine. Pulling out the
`isSanitizerNode` prevented `isSanitizer` from turning into four
half-slow RA predicates due to both CPE and `#antijoin_rhs`
transformations happening.
- Split up the `last` predicate into a non-recursive part `lastNonRec` and a recursive
part `last`.
- Almost all syntactic constructs have a very simple `last` definition; a set of
disjuncts with exactly one recursive call -- those are defined in `lastNonRec`.
- `try` statements and (last) `catch` clauses require multiple recursive calls in
the same disjunct, and are therefore handled in the `last` predicate (as before).
- The benefit is that we only need to take care of the join order in the recursive
call (for non-`try`/`catch` statements) in one place (the predicate `lastRec`),
so we can get rid of many `nomagic`'ed `last`-specialisations.
- Add `Caching.qll` for controlling caching across multiple files.
- Move `isUncertainRefCall()` out of cached module in `Assignable.qll` to avoid
collapsing with CFG stage.
- Remove dependency on `AlwaysNullExpr` in `NullValue::getAnExpr()` to avoid
collapsing with CFG stage.
- Avoid caching pre-SSA library as it should only be used during the CFG construction
stage.
It's sometimes faster but sometimes up to 2x slower to use plain
recursion here. On the other hand, plain recursion won't run out of Java
heap space, and it won't make unrelated computation slower by forcing
all RAM data out to disk.
The use of transitive closure for BB index calculation has been the
cause of an out-of-memory error. This commit switches the calculation to
use the `shortestDistances` HOP, which still has the problem that the
result needs to fit in RAM, but at least the RAM requirements are sure
to be linear in the size of the result. The `shortestDistances` HOP is
already used for BB index calculation for the C++ IR and for C#.
We could guard even better against OOM by switching the calculation to
use manual recursion, but that would undo the much-needed performance
improvements we got from #123.
This change improves performance on Wireshark, which is notorious for
having long basic blocks. When I benchmarked `shortestDistances`
for #123, it was slower than TC. With the current evaluator, it looks
like `shortestDistances` is faster. Performance before was:
PrimitiveBasicBlocks::Cached::getMemberIndex#ff ................... 9.7s (executed 8027 times)
#PrimitiveBasicBlocks::Cached::member_step#ffPlus ................. 6.6s
PrimitiveBasicBlocks::Cached::primitive_basic_block_entry_node#f .. 3.5s
PrimitiveBasicBlocks::Cached::primitive_basic_block_member#fff .... 2.3s
Performance with this commit is:
PrimitiveBasicBlocks::Cached::primitive_basic_block_entry_node#f ................................................................... 3.5s
shortestDistances@PrimitiveBasicBlocks::Cached::primitive_basic_block_entry_node#1@PrimitiveBasicBlocks::Cached::member_step#2#fff . 3s
PrimitiveBasicBlocks::Cached::primitive_basic_block_member#fff ..................................................................... 963ms
This query is not producing good enough results to justify `@precision
high`. It's fundamentally looking for a pattern that should correlate
with memory management errors, but it doesn't look for the errors
themselves.
This query is supposed to look for constructors that unintentionally qualify as copy constructors due to default arguments. There are quite a few real-world projects that define such constructors intentionally. I've reduced the severity to "warning" and the precision to "low" due to the high false positive rate.
This works around BDD node exhaustion we get due to the complex type
hierarchy caused by importing many configurations at once. I've also
renamed the library accordingly.
Querying for overlap type wasn't possible when this library was first
written. This change fixes FPs in `RedundantNullCheckSimple.ql` on
Wireshark and other real-world projects.
IR construction was missing support for C++ 11 range-based `for` loops. The extractor generates ASTs for the compiler-generated implementation already, so I had enough information to generate IR. I've expanded on some of the predicates in `RangeBasedForStmt` to access the desugared information.
One complication was that the `DeclStmt`s for the compiler-generated variables seem to have results for `getDeclaration()` but not for `getDeclarationEntry()`. This required handling these slightly differently than we do for other `DeclStmt`s.
The flow for range-based `for` is actually easier than for a regular `for`, because all three components (init, condition, and update) are always present.
It turns out we didn't have to move the `getName` implementation into
the mirror classes in `QualifiedName`. Doing so only made it harder for
the optimiser to specialize calls to `getName` on various kinds of
`Declaration`.
In particular, the autobuilder will no longer succeed for projects that
contain HTML or YAML files but no JS/TS code. Further down the line,
this prevents LGTM.com from classifying such projects as "JavaScript"
projects.
For methods compiled without optimization (and possibly also with optimization),
it is possible for a variable update to have multiple possible assigned values.
For example, the non-optimized CIL for
```
return cond ? null : "not null"
```
is
```
0: nop
1: ldarg.0
2: ldfld cond
3: brtrue.s 6:
4: ldstr "not null"
5: br.s 7:
6: ldnull
7: stloc.0 L0 // stores either `null` or "not null"
8: br.s 9:
9: ldloc.0
10: ret
```
Consequently, an existential in `CallableReturns.qll` must be a `forex`.
- Extract names of properties in a propery match, using the `exprorstmt_name` relation.
- Simplify extraction of properties by not distinguishing between top-level patterns
and nested patterns.
- Introduce `PatternExpr` to capture patterns in `is` expressions, `case` statements,
and `switch` expression arms.
- Generalize `IsTypeExpr`, `IsPatternExpr`, `IsRecursivePatternExpr`, and `IsConstantExpr`
to just `IsExpr` with a member predicate `PatternExpr getPattern()`.
- Generalize `TypeCase`, `RecursivePatternCase`, and `ConstCase` to just `CaseStmt` with
a member predicate `PatternExpr getPattern()`.
- Introduce classes `Switch` and `Case` as base classes of switch statements/expressions
and case statements/switch expression arms, respectively.
- Simplify CFG logic using the generalized classes.
- Generalize guards library to cover `switch` expressions tests.
- Generalize data flow library to cover `switch` expression assignments.
This gains some performance by not computing locations for all
expressions since we are only interested in calls and variable accesses.
The `Top::hasLocationInfo` predicate goes from 2m28s to 1m32s on
Chromium.
This predicate was slow, mostly because it's just very large. A manual
join order cuts the run time on Chromium from
definitions::Top::hasLocationInfo_dispred#ffffff ..................... 3m23s
definitions::MacroAccessWithHasLocationInfo::hasLocationInfo#ffffff .. 1m56s
to
definitions::Top::hasLocationInfo#ffffff .... 2m28s
The main slowdown was the two uses of `SCAN` to reorder columns in the
RA.
In its present form, `getAnUndefinedReturn` does not handle `finally`
blocks correctly. For example, in this snippet
```
try {
return 42;
} finally {
cleanup();
}
```
the call to `cleanup` is erroneously considered an undefined return.
We currently don't use the predicate anywhere, so it seems best to back
it out for the time being.
The `reachable` predicate is large and slow to compute. It's part of a
mutual recursion that's non-linear, meaning it has a recursive call on
both sides of an `and`.
This change removes a part of the base case that has no effect on
recursive cases. The removed part is added back after the recursion has
finished.
Before, on Wireshark:
ControlFlowGraph::Cached::reachable#f .......... 20.8s (executed 9800 times)
ConstantExprs::successors_adapted#ff ........... 4.2s (executed 615 times)
ConstantExprs::potentiallyReturningFunction#f .. 3.9s (executed 9799 times)
ConstantExprs::possiblePredecessor#f ........... 2.9s (executed 788 times)
After, on Wireshark:
ConstantExprs::reachableRecursive#f ............ 13.2s (executed 9800 times)
ConstantExprs::successors_adapted#ff ........... 4.2s (executed 615 times)
ConstantExprs::potentiallyReturningFunction#f .. 4.3s (executed 9799 times)
ConstantExprs::possiblePredecessor#f ........... 2.6s (executed 788 times)
I've verified that this change doesn't change what's computed by
checking that the output of the following query is unchanged:
import cpp
import semmle.code.cpp.controlflow.internal.ConstantExprs
select
strictcount(ControlFlowNode n | reachable(n)) as reachable,
strictcount(ControlFlowNode n1, ControlFlowNode n2 | n2 = n1.getASuccessor()) as edges,
strictcount(FunctionCall c | aborting(c)) as abortingCall,
strictcount(Function f | abortingFunction(f)) as abortingFunction
This change only moves code around -- there are no changes to predicate
bodies or signatures.
The predicates that go in `ConstantExprs.Cached` after this change were
already cached in the same stage or, in the case of the `aborting*`
predicates, did not need to be cached. This is a fortunate consequence
of how the mutual recursion between the predicates happens to work, and
it's not going to be the case after the next commit.
The `addressConstantVariable` predicate was the slowest single predicate
when running the full LGTM suite on Chromium. Fortunately it's only
executed once, but it could be easily made faster by using the new
`Variable.isConstexpr` predicate instead of the slow workaround that was
in its place.
- General refactoring to fit with the shared data flow implementation.
- Move CFG splitting logic into `ControlFlowReachability.qll`.
- Replace `isAdditionalFlowStepIntoCall()` with `TaintedParameterNode`.
- Redefine `ReturnNode` to be the actual values that are returned, which should
yield better path information.
- No longer consider overrides in CIL calls.
Since the types in `QualifiedName.qll` are raw db types, callers need to
use `underlyingElement` and `unresolveElement` as appropriate. This has
no effect right now but will be needed when we switch the AST type
hierarchy to `newtype`s.
This removes all uses of `Declaration.getQualifiedName` that I think can
be removed without changing any behaviour. The following uses in the
LGTM default suite remain:
* `cpp/ql/src/Security/CWE/CWE-121/UnterminatedVarargsCall.ql` (in `select`).
* `cpp/ql/src/semmle/code/cpp/dataflow/internal/DataFlowDispatch.qll` (needs template args).
* `cpp/ql/src/semmle/code/cpp/security/FunctionWithWrappers.qll` (used for alert messages).
This imports `QualifiedName.qll` from
2f74a456290b9e0850b7308582e07f5d68de3a36 and makes minimal changes so it
compiles.
Original author: Ian Lynagh <ian@semmle.com>
I kept forgetting which operand on a Chi instruction was which, so I added dump labels. I added labels for the function target of a `Call`, for positional arguments, and for address operands as well.
Using `variableAccessedAsValue` fixes a FP because we can now
distinguish modifications to the parameter from modifications to data
_reachable from_ the parameter.
Now that the query has both tests and qhelp, we can use it on LGTM. This
commit also adds a change note.
I renamed the query to reduce confusion from the lower-case unquoted
word "alloca".
The `IRBlock::backEdgeSuccessor` predicate, in its three copies, had
become slow:
6:IRBlock::Cached::backEdgeSuccessor#fff ...... 1m1s
7:IRBlock::Cached::backEdgeSuccessor#2#fff .... 52.3s
8:IRBlock::Cached::backEdgeSuccessor#3#fff .... 26.4s
The slow part was finding all the nodes involved in cycles in the
`forwardEdgeRaw` graph. This was done with `forwardEdgeRaw+(pred, pred)`,
but that got compiled into a materialization of `forwardEdgeRaw+`, which
is a huge relation with 1,816,752,107 rows on Wireshark:
(1474s) Starting to evaluate predicate IRBlock::Cached::backEdgeSuccessor#3#fff
(1501s) Tuple counts:
0 ~0% {2} r1 = SELECT #IRBlock::Cached::forwardEdgeRaw#3#ffPlus ON FIELDS #IRBlock::Cached::forwardEdgeRaw#3#ffPlus.<0>=#IRBlock::Cached::forwardEdgeRaw#3#ffPlus.<1>
0 ~0% {1} r2 = SCAN r1 OUTPUT FIELDS {r1.<0>}
0 ~0% {3} r3 = JOIN r2 WITH IRBlock::Cached::blockSuccessor#6#fff ON r2.<0>=IRBlock::Cached::blockSuccessor#6#fff.<0> OUTPUT FIELDS {r2.<0>,IRBlock::Cached::blockSuccessor#6#fff.<1>,IRBlock::Cached::blockSuccessor#6#fff.<2>}
12411 ~7% {3} r4 = IRBlock::Cached::backEdgeSuccessorRaw#3#fff \/ r3
return r4
(1501s) >>> Relation IRBlock::Cached::backEdgeSuccessor#3#fff: 12411 rows using 0 MB
The problem is the `SELECT`. It's fast to join on a fastTC result once
we know what we're looking for, so this fix materializes the identity
relation on `IRBlock` and joins with that so the fastTC ends up on the
RHS of a join, where it's fast. I had to introduce a helper predicate
because even with `noopt` I couldn't get `pred = pred2` to come _before_
`forwardEdgeRaw+(pred, pred2)`. The predicate now takes less than a
second to evaluate:
(539s) Starting to evaluate predicate IRBlock::Cached::backEdgeSuccessor#fff
(539s) >>> Relation IRBlock::Cached::blockImmediatelyDominates#ff: 574677 rows using 0 MB
(539s) ... created with 574677 rows and 2 columns.
(539s) Tuple counts:
702445 ~1% {2} r1 = SELECT IRBlock::Cached::blockIdentity#ff ON FIELDS IRBlock::Cached::blockIdentity#ff.<0>=IRBlock::Cached::blockIdentity#ff.<1>
702445 ~1% {2} r2 = SCAN r1 OUTPUT FIELDS {r1.<0>,r1.<0>}
0 ~0% {1} r3 = JOIN r2 WITH #IRBlock::Cached::forwardEdgeRaw#ffPlus ON r2.<0>=#IRBlock::Cached::forwardEdgeRaw#ffPlus.<0> AND r2.<1>=#IRBlock::Cached::forwardEdgeRaw#ffPlus.<1> OUTPUT FIELDS {r2.<0>}
0 ~0% {3} r4 = JOIN r3 WITH IRBlock::Cached::blockSuccessor#2#fff ON r3.<0>=IRBlock::Cached::blockSuccessor#2#fff.<0> OUTPUT FIELDS {r3.<0>,IRBlock::Cached::blockSuccessor#2#fff.<1>,IRBlock::Cached::blockSuccessor#2#fff.<2>}
20487 ~0% {3} r5 = IRBlock::Cached::backEdgeSuccessorRaw#fff \/ r4
return r5
(539s) >>> Relation IRBlock::Cached::backEdgeSuccessor#fff: 20487 rows using 0 MB
Use the iterated dominance frontier algorithm to speed up dominance
frontier calculations. The implementation is copied from d310338c9b.
Before this change, the SSA calculations for unaliased and aliased SSA
used 169.9 seconds in total on these predicates:
7:Dominance::getDominanceFrontier#2#ff .. 49s
7:Dominance::blockDominates#2#ff ........ 47.5s
8:Dominance::getDominanceFrontier#ff .... 44.4s
8:Dominance::blockDominates#ff .......... 29s
After this change, the above predicates are replaced by two copies of
`getDominanceFrontier`, each of which takes less than a second.
This fixes `PointlessComparison.ql` on https://github.com/an-tao/drogon.
The QL is a bit obfuscated because it looks for a pattern that's
impossible according to the dbscheme. There is no accompanying test
because we haven't been able to boil this problem down to a simple test
case. If we could, we'd fix it directly in the extractor instead.
In addition to treating comparisons with literals as sanitisers, we now
also treat comparisons with variables that have a single assignment as
sanitisers.
Proving that such a variable is actually a constant is not easy, but for
this use case a simple approximation works fine.
`SEMMLE_TYPESCRIPT_NODE_RUNTIME` can be used to provide the path to the Node.js runtime executable.
If this is omitted, the extractor defaults to the current behaviour of looking for `node` on the PATH.
`SEMMLE_TYPESCRIPT_NODE_RUNTIME_EXTRA_ARGS` can be used to provide additional arguments to the
Node.js runtime. These are passed first, before the arguments supplied by the extractor.
These changes are designed to allow TypeScript extraction in controlled customer environments where
we cannot control the PATH, or must use custom Node.js executables with certain arguments set.
This test case exposes two bugs in our data flow library (fixed by the
two previous commits):
- the charpreds of `SourcePathNode` and `SinkPathNode` only ensured
that they were on a path from a source to a sink, not that they
actually were the source/sink themselves;
- function summarization would allow for non-level paths; in the
test case, this meant that one of the summaries for `source`
represented the path returning from `source` on line 13 and then
flowing back into the call on line 15, in the process transforming
the parity of the flow label and hence causing a spurious flow.
Their charpreds previously only ensured that they were on a path from a
source to a sink, not that they actually were the source and sink,
respectively. See two commits further for a test case.
It now also has `step` and `smallstep` predicates. In the usual case,
however, I think I prefer the `SourceNode::track` API, so I left the
recommended style in the qldoc alone (and adjusted the one for
`TypeBackTracker` to match).
We avoid putting a variable into SSA if its address is ever taken in a way that could allow mutation of the variable via indirection. We currently just look to see if the address is either "pointer to non-const" or "reference to non-const". However, if the address was cast to an integral type (e.g. `uintptr_t n = (uintptr_t)&x;`), we were treating it as unescaped. This change makes the conservative assumption that casting a pointer to an integer may result in the pointed-to value being modified later.
This fixes a customer-reported false positive (#2 from https://discuss.lgtm.com/t/2-false-positives-in-c-for-comparison-is-always-same/1943)
Requiring strict inclusion between types turned out to cause false
positives in `SnprintfOverflow`, which relied indirectly on
`RangeAnalysisUtils::linearAccessImpl` to identify acceptable bounds
checks. This query was particularly affected because `snprintf` returns
`int` (signed) but takes `size_t` (unsigned), so conversions are bound
to happen.
This query uses data flow for nullness analysis, which is always going
to be a large overapproximation. The overapproximation became too big
for one of the test cases after the recent change to make data flow go
across assignment by reference.
To make this query more conservative, it will now only report that the
`pDacl` argument can be null if there isn't also evidence that it can be
non-null.
This commit changes how data flow works in the following code.
MyType x = source();
defineByReference(&x);
sink(x);
The question here is whether there should be flow from `source` to
`sink`. Such flow is desirable if `defineByReference` doesn't write to
all of `x`, but it's undesirable if `defineByReference` is a typical
init function in `C` that writes to every field or if
`defineByReference` is `memcpy` or `memset` on the full range.
Before 1.20.0, there would be flow from `source` to `sink` in case `x`
happened to be modeled with `BlockVar` but not in case `x` happened to
be modelled with SSA. The choice of modelling depends on an analysis of
how `x` is used elsewhere in the function, and it's supposed to be an
internal implementation detail that there are two ways to model
variables. In 1.20.0, I changed the `BlockVar` behavior so it worked the
same as SSA, never allowing that flow. It turns out that this change
broke a customer's query.
This commit reverts `BlockVar` to its old behavior of letting flow
propagate past the `defineByReference` call and then regains consistency
by changing all variables that are ever defined by reference to be
modelled with `BlockVar` instead of SSA. This means we now get too much
flow in certain cases, but that appears to be better overall than
getting too little flow. See also the discussion in CPP-336.
compilers do. Various integral and floating-point types
are treated as mutually implicitly convertible. Remaining
warnings deal with misuse of pointer and array types.
This makes sure we can conclude from `(int)myShort == 0` that `myShort`
is 0 even though we can no longer conclude from `(short)myInt == 0` that
`myInt` is 0. Without this, we lost a good result in the test for
`InfiniteLoopWithUnsatisfiableExitCondition.ql`.
Reported by an LGTM customer here: https://discuss.lgtm.com/t/2-false-positives-in-c-for-comparison-is-always-same/1943.
Even though the comparison is pointless in the preprocessor configuration in effect during extraction, it is not pointless in other preprocessor configurations. Similar to ExprHasNoEffect, we'll now exclude results in functions that contain preprocessor-excluded code. I factored the similar code already used in ExprHasNoEffect in a non-recursive version into Preprocessor.qll, leaving the recursive version in ExprHasNoEffect.ql. I believe the recursive version is too aggressive for PointerlessComparison, which does no interprocedural analysis.
C promotion rules. The following issues are now flagged:
(1) passing a larger type than the receiver can accept
(e.g., long long -> int)
(2) passing a type of different signedness than the
parameter specified.
The IR construction code wasn't handling lambda expressions, so I added `TranslatedLambdaExpression`. It's pretty straightforward: it creates a temporary variable, initializes it with an `Uninitialized` instruction, then initializes the individual captured fields with the initializer list supplied in the AST.
When testing the case of a lambda with no captures, I noticed that we weren't handling initialization of empty structs with an initializer list correctly, so I fixed that along the way.
I was getting confused by the bad indentation for wrapped lines in
TranslatedInitialization.qll, so I fixed that up in a separate commit.
- Cache predicates in the same stage using a cached module.
- Introduce `DefUse::defUseVariableUpdate()` and use in `CallableReturns.qll`.
The updated file `csharp/ql/test/library-tests/cil/dataflow/Nullness.expected`
demonstrates why this is needed.
- Utilize CIL analysis in `Guards::nonNullValue()`.
- Analyze SSA definitions in `AlwaysNullExpr`, similar to `NonNullExpr`.
The `Expr.getType` predicate returns a pointer type since that's the
type of the `new`-expression as a whole. To find the class type, we use
`NewExpr.getAllocatedType`.
This commit reduces the number of alerts in a Qt snapshot from 229 to
51, and it removes the two false positives in
https://github.com/Subsurface-divelog/subsurface.
Previously, `AllConfigurations.qll` would pull in (almost) all taint
tracking configurations, which has started causing OOMEs during
compilation.
I've pruned it down to only the most interesting configurations. Since
flow summaries are experimental at this point and require a bit of manual
configuration anyway, this shouldn't be much of an issue in practice.
My recent changes to suppress FPs in `ReturnStackAllocatedMemory.ql`
caused us to lose all results where there was a `Conversion` at the
initial address escape. We cannot handle conversions in general, but
this commit restores the good results for the trivial types of
conversion that we can handle.
The main source of slowness in `BrokenCryptoAlgorithm.ql` was that the
regexp on function (macro) names was evaluated once per call
(invocation) instead of once per name. Factoring out separate predicates
for the problematic functions (macros) fixes this.
On https://github.com/ericniebler/range-v3, this change reduces the run
time of the two slowest predicates from
BrokenCryptoAlgorithm::InsecureMacroSpec#class#f .... 35.1s
BrokenCryptoAlgorithm::InsecureFunctionCall#class#f . 12.8s
to
BrokenCryptoAlgorithm::getAnInsecureFunction#f . 1.2s
BrokenCryptoAlgorithm::getAnInsecureMacro#f .... 12ms
Instead of `algorithmBlacklistRegex` having 2 * 5 results, it now has
only one result, which is a single regex that represents the union of
the previous 2 * 5 regexes. This means that `BrokenCryptoAlgorithm.ql`
has much less regex matching to do.
On https://github.com/ericniebler/range-v3, this change reduces the run
time of the two slowest predicates from
BrokenCryptoAlgorithm::InsecureMacroSpec#class#f .... 2m21s
BrokenCryptoAlgorithm::InsecureFunctionCall#class#f . 54.5s
to
BrokenCryptoAlgorithm::InsecureMacroSpec#class#f .... 35.1s
BrokenCryptoAlgorithm::InsecureFunctionCall#class#f . 12.8s
Since we cannot track data flow from a fully-converted expression but
only the unconverted expression, we should check whether the address
initially escapes into the unconverted expression, not the
fully-converted one.
This fixes most of the false positives observed on lgtm.com.
`yield import(...)` previously caused a syntax error, now it is parsed
correctly.
`parseYield` is the only place where the value of `startsExpr` matters,
so this change should not affect anything else.
- Restrict `OverridableCallable::getAnOverrider(ValueOrRefType t)` to types `t`
that are sub types of the callable's declaring type.
- Use explicit recursion in `OverridableCallable::getInherited()`.
Also added `PhiInstruction::getAnInputOperand()`, and renamed `PhiInstruction::getAnOperandDefinitionInstruction()` to `getAnInput()` for consistency with other `Instruction` classes.
Please let em know if this pattern works, and I can add other mechanisms to start new threads with a shared object.
Please also let me know what other mechanisms would you like me to add, I would like to focus on the most commonly used ones first. Thanks
Performance has been improved via suitable predicate folding, and correctness
has been improved as the line
```
result = getType().(RefType).getAnIndexer().getAnAccessor().getACall()
```
was missing a `getABaseType*()` (now using the simpler `hasMember()` predicate
instead).
The `FlowVar::toString` predicate is purely a debugging aid, but
unfortunately it has to be `cached` because it's in a `cached` class.
Before this commit, it caused `Expr::toString` to be evaluated in full.
The `automaticVariableAddressEscapes` predicate got join-ordered badly
in its `unaliased_ssa` version. These are the tuple counts on Wireshark,
where one pipeline step is seen to have 716 million tuples:
```
[2019-03-02 11:29:41] (42s) Starting to evaluate predicate AliasAnalysis::automaticVariableAddressEscapes#2#f
[2019-03-02 11:30:06] (67s) Tuple counts:
353419 ~0% {1} r1 = JOIN project#Instruction::VariableAddressInstruction#class#2#ff WITH AliasAnalysis::resultEscapesNonReturn#2#f ON project#Instruction::VariableAddressInstruction#class#2#ff.<0>=AliasAnalysis::resultEscapesNonReturn#2#f.<0> OUTPUT FIELDS {AliasAnalysis::resultEscapesNonReturn#2#f.<0>}
353419 ~0% {2} r2 = JOIN r1 WITH IRConstruction::Cached::getInstructionEnclosingFunctionIR#ff@staged_ext ON r1.<0>=IRConstruction::Cached::getInstructionEnclosingFunctionIR#ff@staged_ext.<0> OUTPUT FIELDS {IRConstruction::Cached::getInstructionEnclosingFunctionIR#ff@staged_ext.<1>,r1.<0>}
353419 ~0% {2} r3 = JOIN r2 WITH FunctionIR::FunctionIR::getFunction_dispred#3#ff ON r2.<0>=FunctionIR::FunctionIR::getFunction_dispred#3#ff.<0> OUTPUT FIELDS {FunctionIR::FunctionIR::getFunction_dispred#3#ff.<1>,r2.<1>}
716040298 ~0% {2} r4 = JOIN r3 WITH IRVariable::IRVariable#class#3#ff_10#join_rhs ON r3.<0>=IRVariable::IRVariable#class#3#ff_10#join_rhs.<0> OUTPUT FIELDS {IRVariable::IRVariable#class#3#ff_10#join_rhs.<1>,r3.<1>}
4480139 ~0% {2} r5 = JOIN r4 WITH IRVariable::IRAutomaticVariable#class#3#ff ON r4.<0>=IRVariable::IRAutomaticVariable#class#3#ff.<0> OUTPUT FIELDS {r4.<1>,r4.<0>}
66760 ~91% {1} r6 = JOIN r5 WITH Instruction::VariableInstruction::getVariable_dispred#2#ff ON r5.<0>=Instruction::VariableInstruction::getVariable_dispred#2#ff.<0> AND r5.<1>=Instruction::VariableInstruction::getVariable_dispred#2#ff.<1> OUTPUT FIELDS {r5.<1>}
return r6
[2019-03-02 11:30:06] (67s) >>> Relation AliasAnalysis::automaticVariableAddressEscapes#2#f: 35531 rows using 0 MB
```
The predicate contained a cyclic join, which is always hard to optimize.
I couldn't see a reason to join the `FunctionIR`, so I removed that
part. The predicate is now fast, and there are no changes in the tests.
Before this change,
```
flowOutOfCallableStep(CallNode call, ReturnNode ret, OutNode out, CallContext cc)
```
would compute all combinations of call sites `call` and returned expressions `ret`
up front.
Now, we instead introduce explicit return nodes, so each callable has exactly
one return node (as well as one for each `out`/`ref` parameter). There is then
local flow from a returned expression to the relevant return node, and
`flowOutOfCallableStep()` computes combinations of call sites and return nodes.
Not only does this result in better performance, it also makes `flowOutOfCallableStep()`
symmetric to `flowIntoCallableStep()`, where each argument is mapped to a parameter,
and not to all reads of that parameter.
Javadoc claims not to be able to resolve this link, while Eclipse manages to do so without any problems, failing an internal PR check.
It's only in a test, though, so I just removed it.
When determining which core library a "tried control flow element" is compiled against,
first look at exceptions caught by the surrounding `try` block, then look at assembly
attributes, and finally choose (randomly) the core library with the highest lexicographic
order.
This implementation is borrowed from Java's QL library and offers a
great performance improvement. For example, on Wireshark the performance
goes from
Dominance::bbDominates#ff ....... 40.3s
SSAUtils::dominanceFrontier#ff .. 30s
to
SSAUtils::dominanceFrontier#ff .. 418ms (executed 67 times)
The big performance problem before was the need to materialize
`bbDominates`, which is the reflexive-transitive "basic block dominates"
relation. It had 79 million rows on Wireshark.
This was previously flagged if `exports` wasn't used any further. While it's true that the assignment to `exports` is redundant in this case, the assignment is also flagged by DeadStorOfLocal, so there is no point in InvalidExport flagging it as well.
This query is only appropriate for setuid programs. Since such programs
are at most 0.1% of all code we analyse, I would say this query has a
precision of at most 0.1%.
Instead of having many small independent tests, we now just have a single test that pulls in all the individual tests and runs them together.
Concretely, each `.ql` file has been turned into a `.qll` file with a query predicate corresponding to the original `select` clause and named after the original `.ql` file, plus a prefix `test_`.
The newly added `tests.ql` imports all these `.qll`s.
The individual `.expected` files have been concatenated together into `tests.expected`, each prefixed with the name of the corresponding query predicate. (This is the format that qltest produces for tests with multiple query predicates.)
Some of these stubs were quite slow to evaluate. It's possible they
could be optimised, but it seems pointless as long as we don't have
call-context-sensitive virtual dispatch in the C++ library.
These two elements weren't cached, which meant that local data flow was
recalculated in every query that used data flow. They are also cached in
the Java version of `DataFlowUtil.qll`.
This commits also adds a test that uses `getParameter`. The new tests
demonstrate that support for array-to-pointer decay works, but we get
data flow to the array rather than its contents.
This accessor may not be forward-compatible with an IR-based version,
and it's unclear whether it has any use. The `VariableAccess` remains in
the `TDefinitionByReferenceNode` constructor since it's used to
implement `getType`.
Every once in a while we encounter projects using some custom file extension for files that we could in principle extract, but since the extractor doesn't know about the extension the files are skipped.
To handle this, the legacy extractor has a `--file-type` option that one can use to specify a file type to use for all files in that particular extraction. So far, `AutoBuild` has nothing of the sort.
This PR proposes to introduce an environment variable `LGTM_INDEX_FILETYPES` to allow a similar customisation. In the fullness of time, this variable would be set through `lgtm.yml` in the usual way, but for now it is undocumented and for internal use only.
Specifically, `LGTM_INDEX_FILETYPES` is a newline-separated list of ".extension:filetype" pairs, specifying that files with the given `.extension` should be extracted as type `filetype`, where
`filetype` is one of `js`, `html`, `json`, `typescript` or `yaml`.
For example, `.jsm:js` causes all `.jsm` files to be extracted as JavaScript.
This can also be used to override default file types: for example, by specifying `.js:typescript` all JavaScript files will be extracted as TypeScript.
The configuration in `DefaultOptions.qll` assumed that a call to any
top-level function named `error` would exit the program. This is not
true.
The assumption was probably about `error(3)`, which is a GNU extension.
It only exits if its first argument it not 0. Furthermore, projects such
as openssh may define their own function named `error` with different
behaviour. Because the GNU `error` function is non-standard, it's
perfectly fine to shadow it with a project-specific definition.
This change removes two FPs from `PointlessComparison.qll` on
https://github.com/openssh/openssh-portable.
In the precense of multiple core libraries, `getAThrownException()` would return
multiple copies of the same exception, say `System.OverflowException`, one for each
core library. With this change we try to identify which core library a given control
flow element was compiled against, and only return the corresponding version.
A PR check was failing because this query was enabled on LGTM but had no
qhelp. I'm removing the `@precision` for now to take it off LGTM, and
then we can add it back when it has qhelp, tests, and change note.
I've eliminated the clumsily worded "client-side code" and "server-side code" distinction, not least because Electron fits neither of those categories.
While this file is part of the project used in the tutorial, it isn't necessary for the queries to work. It also specifies a dependency on a vulnerable version of Express, causing it to be (spuriously) flagged by security scanners.
Thanks to Sam Lanning (@samlanning) and Robert Marsh for taking the time to help
to make it possible. In fact, it was Robert Marsh who effectively
wrote the query and figured out that __builtin_alloca should be
used to also take functions like strdupa into account. I just
filled out the metadata :-)
Before this change, all the cached predicates in `IRGuards.qll` were in
separate cached stages, resulting in recomputation of most of the
library for each stage. This change groups the cached predicates in two
cached classes. A better grouping may be possible, but this grouping was
easy to do and seems to solve the problem.
Before this change, the `IRGuards` library accounted for five cached
stages when using the `RangeAnalysis` library. After this change, it
only accounts for one.
This suite isn't referenced from anywhere yet, but it'll be included in
a standard ODASA dist because the dist includes all files in the `c` and
`cpp` directories. We can modify the nightly test jobs to include the
experimental suite.
This new query is not written because it's the most interesting query we
could write but because it's an IR-based query whose results are easy to
verify.
After a `queries.xml` was added to the test directory,
`Container.getRelativePath` now considers source files to be relative to
the `cpp/test` directory rather than the directory of the `*.ql*` file.
This caused some benign test output changes, and it also caused an
unwanted alert for `test3.c:14` to appear in
`cpp/ql/test/query-tests/Security/CWE/CWE-190/semmle/tainted/IntegerOverflowTainted.expected`.
This alert came about because `inSystemMacroExpansion` holds for files
that don't have a relative path, but the pretend system header in
`../system_header` now does have a relative path because it's below the
`cpp/test` directory. The fix is to add another `queries.xml` just for
the directory with the affected test.
Write accesses in assignments, such as the access to `x` in `x = 0` are not
evaluated, so they should not have entries in the control flow graph. However,
qualifiers (and indexer arguments) should still be evaluated, for example in
```
x.Foo.Bar = 0;
```
the CFG should be `x --> x.Foo --> 0 --> x.Foo.Bar = 0` (as opposed to
`x --> x.Foo --> x.Foo.Bar --> 0 --> x.Foo.Bar = 0`, prior to this change).
A special case is assignments via acessors (properties, indexers, and event
adders), where we do want to include the access in the control flow graph,
as it represents the accessor call:
```
x.Prop = 0;
```
But instead of `x --> x.set_Prop --> 0 --> x.Prop = 0` the CFG should be
`x --> 0 --> x.set_Prop --> x.Prop = 0`, as the setter is called *after* the
assigned value has been evaluated.
An even more special case is tuple assignments via accessors:
```
(x.Prop1, y.Prop2) = (0, 1);
```
Here the CFG should be
`x --> y --> 0 --> 1 --> x.set_Prop1 --> y.set_Prop2 --> (x.Prop1, y.Prop2) = (0, 1)`.
This change fixes a few key problems with the existing SSA implementations:
For unaliased SSA, we were incorrectly choosing to model a local variable that had accesses that did not cover the entire variable. This has been changed to ensure that all accesses to the variable are at offset zero and have the same type as the variable itself. This was only possible to fix now that every `MemoryOperand` has its own type.
For aliased SSA, we now correctly track the offset and size of each memory access using an interval of bit offsets covered by the access. The offset interval makes the overlap computation more straightforward. Again, this is only possible now that operands have types.
The `getXXXMemoryAccess` predicates are now driven by the `MemoryAccessKind` on the operands and results, instead of by specific opcodes.
This change does fix an existing false negative in the IR dataflow tests.
I added a few simple test cases to the SSA IR tests, covering the various kinds of overlap (MustExcactly, MustTotally, and MayPartially).
I added "PrintSSA.qll", which can dump the SSA memory accesses as part of an IR dump.
For function parameters that are subject to "pointer decay", the database contains the type as originally declared (e.g. `T[]` instead of `T*`). The IR needs the actual type. Similarly, for variable declared as an array of unknown size, the actual size needs to be inferred from the initializer (e.g. `char a[] = "blah";` needs to have the type `char[5]`).
I've opened a ticket to have the extractor emit the actual type alongside the declared type, but for now, this workaround is enough to unblock progress for typical code.
Previously, we only excluded attributes where the value of the attribute itself suggests templating happening. Now we exclude all attributes in documents where _any_ attribute value suggests templating.
This change does some shuffling to make the distinction between memory operands and register operands more clear in the IR API. First, any given type that extends `Operand` is now either always a `MemoryOperand` or always a `RegisterOperand`. This required getting rid of `CopySourceOperand`, which was used for both the `CopyValue` instruction (as a `RegisterOperand`) and for the `Load` instruction (as a `MemoryOperand`). `CopyValue` is now just a `UnaryInstruction`, `Store` has a `StoreValueOperand` (`RegisterOperand`), and all of the instructions that read a value from memory indirectly (`Load`, `ReturnValue`, and `ThrowValue`) all now have a `LoadOperand` (`MemoryOperand`).
There are no diffs in the IR output for this commit, but this change is required for a subsequent commit that will make each `MemoryOperand` have a `Type`, which in turn is needed to fix a critical bug in aliased SSA construction.
The IR tests were getting kind of unwieldy. We were using "ir.cpp" to contain test cases that covered both IR construction (every language construct imaginable) and SSA construction. We would then build and dump all three flavors of IR. For IR construction tests, examining the SSA dumps when you add a new test case is tedious.
To make this easier to manage, I've split the SSA-specific test cases out into a separate directory. "ir.cpp" should now contain only IR construction test cases, and "ssa.cpp" should contain only SSA construction test cases. We dump just the raw IR for "ir.cpp", and just the two SSA flavors for "ssa.cpp". We still run all three flavors of the IR sanity tests for "ir.cpp", though.
I also removed the "ssa_block_count.ql" test, which wasn't really adding any coverage, because any change to the block count would be reflected in the dump as well.
As its first application, this library makes it possible for `StoredXss` to reuse the `Source` classes of `DomBasedXss` and `ReflectedXss` without having to pull in their libraries (which contain their `Configuration` classes, causing `StoredXss` to recompute all flow information for the other two queries).
We now highlight the `replace` call (instead of the regular expression), and the alert message for the case of missing backslash escapes clarifies that it is talking about failure to escape backslashes in the input, not in the replacement text.
We now classify sensitive expressions into four categories (secret, id, password, certificate). This allows queries more fine-grained control over what kinds of sensitive data they want to deal with: for clear-text storage, for instance, user ids aren't so much of a problem.
This PR adds new predicates to `Declaration` and `Type` to get a fully-qualified canonical name for the element, suitable for debugging and dumps. It includes template parameters, cv qualifiers, function parameter and return types, and fully-qualified names for all symbols. These strings are too large to compute in productions queries, so they should be used only for dumps and debugging. Feel free to suggest better names for these predicates.
I've updated PrintAST and PrintIR to use these instead of `Function.getFullSignature()`. The biggest advantage of the new predicates is that they handle lambdas and local classes, which `getQualifiedName` and `getFullSignature` do not. This makes IR and AST dumps much more usable for real-world snapshots.
Along the way, I cleaned up some of our handling of `IntegralType` to use a single table for tracking the signed, unsigned, and canonical versions of each type. The canonical part is new, and was necessary for `getTypeIdentityString` so that `signed int` and `int` both appear as `int`.
Methods that are mentioned in a member reference expression should count
as rootdefs for the unused parameter query. Such methods have to match
the functional interface of the reference expression, so it is to be
expected that they will sometimes have to declare parameters that they
don't actually use.
Using the class `ExceptionClass` in combination with a deliberate Cartesian
product can lead to bad join orders, for example
```
EVALUATE NONRECURSIVE RELATION:
Completion::TriedControlFlowElement::getAThrownException_dispred#ff(int this, int result) :-
{1} r1 = JOIN Expr::Expr::getType_dispred#ff_10#join_rhs WITH @integral_type#f ON Expr::Expr::getType_dispred#ff_10#join_rhs.<0>=@integral_type#f.<0> OUTPUT FIELDS {Expr::Expr::getType_dispred#ff_10#join_rhs.<1>}
{1} r2 = JOIN r1 WITH @un_op#f ON r1.<0>=@un_op#f.<0> OUTPUT FIELDS {r1.<0>}
{1} r3 = JOIN r2 WITH Stmt::TryStmt::getATriedElement#ff_1#join_rhs ON r2.<0>=Stmt::TryStmt::getATriedElement#ff_1#join_rhs.<0> OUTPUT FIELDS {r2.<0>}
{2} r4 = JOIN r3 WITH Stmt::ExceptionClass#f CARTESIAN PRODUCT OUTPUT FIELDS {Stmt::ExceptionClass#f.<0>,r3.<0>}
{2} r5 = JOIN r4 WITH System::SystemOverflowExceptionClass#class#f ON r4.<0>=System::SystemOverflowExceptionClass#class#f.<0> OUTPUT FIELDS {r4.<1>,r4.<0>}
```
where the CP is made with `ExceptionClass` rather than `SystemOverflowExceptionClass`
directly.
These predicates currently take a pair of `IRBlock`s - as it stands, at
most one edge can exist from one `IRBlock` to a given other `IRBlock`.
We may need to revisit that assumption and create an `IREdge` IPA type
at some future date
There are a few IR APIs that we've found to be confusingly named. This PR renames them to be more consistent within the IR and with the AST API:
`Instruction.getFunction` -> `Instruction.getEnclosingFunction`: This was especially confusing when you'd call `FunctionAddressInstruction.getFunction` to get the function whose address was taken, and wound up with the enclosing function instead.
`Instruction.getXXXOperand` -> `Instruction.getXXX`. Now that `Operand` is an exposed type, we want a way to get a specific `Operand` of an `Instruction`, but more often we want to get the definition instruction of that operand. Now, the pattern is that `getXXXOperand` returns the `Operand`, and `getXXX` is equivalent to `getXXXOperand().getDefinitionInstruction()`.
`Operand.getInstruction` -> `Operand.getUseInstruction`: More consistent with the existing `Operand.getDefinitionInstruction` predicate.
This adds `IntegerPartial.qll`, which is similar to
`IntegerConstant.qll` except that it contains partial functions on
integers instead of total functions on optional integers. This speeds up
the constant analysis so it takes 6.5s instead of 10.3s on comdb2.
This does to `SSAConstruction` what the previous commit did to
`IRConstruction`. An instruction in `SSAConstruction` is now defined in
terms of how it was created rather than what it can be queried for.
Effectively, this defines `TInstruction` as `TInstructionTag` was
defined before and then removes `TInstructionTag` from
`SSAConstruction`. This also has the benefit of removing the concept of
an instruction tag from the public predicates on `Instruction`.
This definition was denormalized to the extent that an instruction was
defined in terms of the six main attributes it could be queried for.
This made it possible to do multi-column joins on those six attributes,
but it doesn't appear that this feature was useful in practice. The main
multi-column join that was in use was on the pair of
(`TranslatedElement, InstructionTag`), but the `TranslatedElement` was
not part of the `TInstruction`.
This commit changes `TInstruction` to be defined in terms of what it's
_built from_ (`TranslatedElement, InstructionTag`) instead. This makes
it possible to do multi-column joins on those two components, and then
there are separate predicates (usually with two columns) to query
instruction attributes, replacing the many uncached projections from
`MkInstruction` that were generated before.
An immediate advantage is that an `Expr` with multiple types will no
longer give rise to multiple `Instruction`s, fixing most of the errors
from the sanity query `ambiguousSuccessors`. The code inside
`IRConstruction.qll` becomes simpler and hopefully faster as there is no
longer a translation from `TranslatedElement` to `Locatable` and back
again.
The previous commit had the side effect that `IRVariable`s were created
for all `Functions`, including those that did not have IR. This commit
restricts all `TIRVariable` constructors to functions that have IR.
The recent change to `AccessorCall` on dd99525566 resulted
in some bad join-orders, so I have (partly) reverted them. This means that the issues
orignally addressed by that change are now reintroduced, and I plan to instead apply a
fix to the CFG, which--unlike the original fix--should be able to handle multi-property-tuple
assignments.
This doesn't make it much faster, but it reduces the debug output
volume. It also simplifies the code.
I've found this change necessary when I compute the full IR on a
Wireshark snapshot in QL4E. Without it, Eclipse runs out of memory
because the console log is too large.
The new predicate `isOrphan` gets inlined into
`ignoreExprAndDescendants`, whose performance improves from
TranslatedElement::ignoreExprAndDescendants#f .. 23.4s (executed 9 times)
to
TranslatedElement::ignoreExprAndDescendants#f ... 4.3s (executed 9 times)
This dramatic improvement is not only due to eliminating a type check in
the recursive case. Removing the type check from the other base cases
also enabled them to get better join orders.
The previous reccomentation changed the behaviour of the code.
A user following the advice might have broken her/his code:
With call-by-value, the original parameter is not changed.
With a call-by-reference, however, it may be changed. To be sure,
nothing breaks by blindly following the advice, suggest to pass a
const reference.
The `SSAConstruction.getNewIRVariable` was very slow on Wireshark. This
was probably because it couldn't join on multiple columns
simultaneously. Instead of improving the join, I observed that the
`TIRVariable` type was the same between all three IR stages except for a
few occurrences of `FunctionIR` that could easily be changed to
`Function`. By sharing `TIRVariable` between all the stages, we avoid
recomputing it and translating it between every stage, turning the slow
`getNewIRVariable` predicate into a no-op.
This change means that later stages of the IR can't introduce new
variables, but that was already the case because
`config/identical-files.json` forced all three `IRVariable.qll` files to
be identical.
This predicate computed a local CP between all defs and uses of the same
virtual variable in a basic block. This wasn't a problem in
`unaliased_ssa`, but it became a huge problem in `aliased_ssa`, probably
because many variables can be modelled with a single virtual variable
there.
Before this commit, evaluation of `aliased_ssa`'s
`variableLiveOnEntryToBlock#ff#antijoin_rhs` on Wireshark took 80
_minutes_. After this commit, that predicate and its immediate
dependencies take around 5 _seconds_.
This relation was almost 40x the size it needed to be on Wireshark
because it lacked a restriction on the `tag` parameter. To implement
that restriction efficiently, I had to split the relation in two to
dictate the join order.
With the fix, `getInstruction` now computes the same as
`getInstructionTranslatedElementAndTag`, so the latter could be
simplified.
I made a corresponding change to `TranslatedElement.getTempVariable` for
the sake of consistency.
A part of `SSAConstruction.getInstructionOperandDefinition` was more
expensive than it had to be. On a ChakraCore snapshot, this changes the
tuple counts from
3020569 ~2% {3} r40 = JOIN OperandTag::TUnmodeledUseOperand#f WITH Instruction::Instruction::getFunction_dispred#ff CARTESIAN PRODUCT OUTPUT FIELDS {Instruction::Instruction::getFunction_dispred#ff.<0>,OperandTag::TUnmodeledUseOperand#f.<0>,Instruction::Instruction::getFunction_dispred#ff.<1>}
62405 ~0% {3} r41 = JOIN r40 WITH Instruction::UnmodeledUseInstruction#class#fffffff ON r40.<0>=Instruction::UnmodeledUseInstruction#class#fffffff.<0> OUTPUT FIELDS {r40.<2>,r40.<1>,r40.<0>}
2868421 ~1% {3} r42 = JOIN r41 WITH Instruction::Instruction::getFunction_dispred#ff_10#join_rhs ON r41.<0>=Instruction::Instruction::getFunction_dispred#ff_10#join_rhs.<0> OUTPUT FIELDS {Instruction::Instruction::getFunction_dispred#ff_10#join_rhs.<1>,r41.<1>,r41.<2>}
62405 ~0% {3} r43 = JOIN r42 WITH Instruction::UnmodeledDefinitionInstruction#class#fffffff ON r42.<0>=Instruction::UnmodeledDefinitionInstruction#class#fffffff.<0> OUTPUT FIELDS {r42.<2>,r42.<1>,r42.<0>}
to
(0s) Starting to evaluate predicate SSAConstruction::Cached::getUnmodeledUseInstruction#ff
(0s) Tuple counts:
62405 ~0% {2} r1 = JOIN Instruction::UnmodeledUseInstruction#class#fffffff WITH Instruction::Instruction::getFunction_dispred#ff ON Instruction::UnmodeledUseInstruction#class#fffffff.<0>=Instruction::Instruction::getFunction_dispred#ff.<0> OUTPUT FIELDS {Instruction::Instruction::getFunction_dispred#ff.<1>,Instruction::Instruction::getFunction_dispred#ff.<0>}
return r1
...
75716 ~0% {3} r40 = JOIN OperandTag::TUnmodeledUseOperand#f WITH FunctionIR::FunctionIR::getUnmodeledDefinitionInstruction#ff CARTESIAN PRODUCT OUTPUT FIELDS {FunctionIR::FunctionIR::getUnmodeledDefinitionInstruction#ff.<0>,OperandTag::TUnmodeledUseOperand#f.<0>,FunctionIR::FunctionIR::getUnmodeledDefinitionInstruction#ff.<1>}
62405 ~0% {3} r41 = JOIN r40 WITH FunctionIR::FunctionIR::getUnmodeledUseInstruction#ff ON r40.<0>=FunctionIR::FunctionIR::getUnmodeledUseInstruction#ff.<0> OUTPUT FIELDS {FunctionIR::FunctionIR::getUnmodeledUseInstruction#ff.<1>,r40.<1>,r40.<2>}
Now that we have `Expr.getParentWithConversions`, we can implement
`TranslatedElement.getRealParent` simpler. This implementation also
avoids recursion.
In theory this query will produce no results on C++ code; in practice, I
suspect the "cpp" suite is often run on code compiled as C, so it is
likely to be worth running anyways.
Rewrite the predicate `succSplits()` and the construction of the IPA type `TSplits`.
The two are now mutually dependent, see more in the comment for the module
`SuccSplits`.
These queries have no results on our test cases in the repo, but
`ambiguousSuccessors` has results on any large C++ code base, and
`unexplainedLoop` has results on Windows builds of ChakraCore.
The syntactic node assiociated with accessor calls was previously always the
underlying member access. For example, in
```
x.Prop = y.Prop;
```
the implicit call to `x.set_Prop()` was at the syntactic node `x.Prop`, while the
implicit call to `y.get_Prop()` was at the syntactic node `y.Prop`.
However, this breaks the invariant that arguments to calls dominate the call itself,
as the argument `y.Prop` for the implicit `value` parameter in `x.set_Prop()` will
be evaluated after the call (the left-hand side in an assignment is evaluated before
the right-hand side).
The solution is to redefine the access call to `x.set_Prop()` to point to the whole
assignment `x.Prop = y.Prop`, instead of the access `x.Prop`. For reads, we still want
to associate the accessor call with the member access.
A corner case arises when multiple setters are called in a tuple assignment:
```
(x.Prop1, x.Prop2) = (0, 1)
```
In this case, we cannot associate the assignment with both `x.set_Prop1()` and
`x.set_Prop2()`, so we instead revert to using the underlying member accesses as
before.
This predicate was unbearably slow on a ChakraCore snapshot (and
probably everywhere else):
ReachableBlock::getAFeasiblePredecessorBlock#2#ff#antijoin_rhs .. 1m6s
ReachableBlock::getAFeasiblePredecessorBlock#ff#antijoin_rhs .... 31.8s
With this change, the predicate is so fast that it doesn't even show up
in the clause timing report.
It's possible that we only tested this for performance in 1.18, and then
it has regressed in 1.19. Otherwise I can't explain how we've missed
this. I'm using QL for Eclipse 1.20.0.201901070127.
I looked through a few hundred results from this query on lgtm.com and
found that most of the FPs had to do with comment lines ending in `}`.
This change should fix most of them, at the cost of very few false
negatives.
On Wireshark, this query goes from 7,425 results to 6,686 results before
filtering for generated code. Almost all the lost results were FP,
except a handful of results involving initializer lists.
The existing `Node.asExpr` predicate changes semantics so it becomes the
one that most users should use when they don't want to think about
`Conversion`s. A new `Node.asConvertedExpr` predicate is added and has
the same semantics as the old `Node.asExpr` predicate. It's for advanced
users that know about `Conversion`s and want to account for them.
With this change, the `IRDataflowTestCommon.qll` and
`DataflowTestCommon.qll` files use the same definitions of sources and
sinks. Since the IR data flow library is meant to be compatible with the
AST data flow library, this is what we ought to be testing.
Two alerts change but not necessarily for the right reasons.
Data flow nodes for expressions do not take CFG splitting into account. Example:
```
if (b)
x = tainted;
x = x.ToLower();
if (!b)
Use(x);
```
Flow is incorrectly reported from `tainted` to `x` in `Use(x)`, because the step
from `tainted` to `x.ToLower()` throws away the information that `b = true`.
The solution is to remember the splitting in data flow expression nodes, that is,
to represent the exact control flow node instead of just the expression. With that
we get flow from `tainted` to `[b = true] x.ToLower()`, but not from `tainted` to
`[b = false] x.ToLower()`.
The data flow API remains unchanged, but in order for analyses to fully benefit from
CFG splitting, sanitizers in particular should be CFG-based instead of expression-based:
```
if (b)
x = tainted;
if (IsInvalid(x))
return;
Use(x);
```
If the call to `IsInvalid()` is a sanitizer, then defining an expression node to be
a sanitizer using `GuardedExpr` will be too conservative (`x` in `Use(x)` is in fact
not guarded). However, `[b = true] x` in `[b = true] Use(x)` is guarded, and to help
defining guard-based sanitizers, the class `GuardedDataFlowNode` has been introduced.
As we prepare to clarify how conversions are treated, we don't want a
`sink(...)` declaration where it's non-obvious which conversions are
applied to arguments.
This code was clearly using `isConstant` as an indirect way of checking
whether `getValue` would have a result. That's no longer valid, so I
changed it to check `getValue` directly.
These files had come out of sync due to 89148a9ec7 and 8c9c316e1b. I
synced the files by replaying the changes that those commits made in
`aliased_ssa/` to the two other copies.
The CFG construction code previously contained half of an approximation
of which address expressions are constant. Now this this property is
properly modelled by `Expr.isConstant`, we can remove this code.
This fixes most discrepancies between the QL-based CFG and the
extractor-based CFG on Wireshark.
Before this change, `Expr.isConstant` only was only true for those
constant expressions that could be represented as QL values: numbers,
Booleans, and string literals. It was not true for string literals
converted from arrays to pointers, and it was not true for addresses of
variables with static lifetime.
The concept of a "constant expression" varies between C and C++ and
between versions of the standard, but they all include addresses of data
with static lifetime. These are modelled by the new library
`AddressConstantExpression.qll`, which is based on the code in
`EscapesTree.qll` and modified for its new purpose.
I've tested the change for performance on Wireshark and for correctness
with the included tests. I've also checked on Wireshark that all static
initializers in C files are considered constant, which was not the case
before.
This change suppresses results from "Declaration hides parameter" where
the ParameterDeclarationEntry does not link up to the right
FunctionDeclarationEntry.
Bad magic ended up in `LocalVariable.getFunction` and effectively
created a Cartesian product. Before this change, the timing looked like
this:
Variable::LocalVariable::getFunction_dispred#bb ... 50.1s
#select#cpe#123#fff ............................... 20.6s
After this change, those predicates become much faster:
Variable::LocalVariable::getFunction_dispred#ff ... 121ms
DeclarationHidesParameter::localVariableNames#fff . 77ms
#select#cpe#123#fff ............................... 28ms
Introducing the predicate `localVariableNames` ensures that we can do
the main join on two columns simultaneously, so that's a change we
should keep even if we remove the `pragma[nomagic]` later.
This test was intended to catch regressions in the CFG, but it looks
like it's just catching insignificant extractor changes. The test has
started failing after some recent extractor changes, but I have no way
to pinpoint the failure and understand whether it's a problem or not, so
I think it's better to delete this test.
The remaining tests check whether the QL-based CFG generates the same
graph as the extractor-based CFG. Furthermore, the `successor-tests`
check that the extractor-based CFG works as intended.
We agreed in the review of the original PR that `getName` is more
appropriate here than `getQualifiedName`. Using `getName` ensures that
we also match the `std::`-prefixed versions of these functions as well
as user-defined versions.
This commit doesn't change any behavior but just uses the preferred
high-level predicates. The `getChild` predicate inspects the raw
database more or less directly, and the database layout could change in
the future.
This should make the documentation more in line with the documentation
for our other queries. The @name of the query is changed to "Use of
string copy function in a condition".
https://github.com/Semmle/ql/pull/698 removed `document.cookie` as a remote flow source, which some of the tests relied on. We now use `location.search` instead.
This recursive predicate is made faster by working around a known
optimizer problem (QL-796) that causes the optimizer to insert extra
type checks in recursive case even when they are only needed in the
base case.
If a summary does not specify a configuration, it is taken to apply to all configurations without custom sanitisers/barriers.
If a source summary does not specify a flow label, `data` is assumed.
If a sink summary does not specify a flow label, both `data` and `taint` are assumed.
Flow step summaries cannot omit flow labels.
Note that the standard extraction queries always provide explicit configurations and flow labels, and hence do not exercise this functionality.
This reduces the number of bounds computed, and will simplify use of the
library. The resulting locations in the tests may be slightly strange,
because the example `Instruction` for a `ValueNumber` is the first
appearing in the IR, regardless of source order, and may not be the most
closely related `Instruction` to the bounded value. I think that's worth
doing for the performance and usability benefits.
It now uses a facade pattern similar to `InvokeNode`: the range of the class is defined by an abstract class `DataFlow::SourceNode::Range`, while the actual behaviour is defined by the (no longer abstract) `SourceNode` class itself.
Clients that want to add new source nodes need to extend `DataFlow::SourceNode::Range`, those that want to refine the behaviour of existing source nodes should extend `DataFlow::SourceNode` itself.
While this is technically a breaking API change, I think separating the two aspects in this way is cleaner and makes it easier to use, and improves performance as well.
I changed to detect any logical operation usage (i.e. !, ==), but I kept usage in a conditional directly as a separate detection condition. I found no false positives on the projects you shared with me previously.
This implements calculation of the control-flow graph in QL. The new
code is not enabled yet as we'll need more extractor changes first.
The `SyntheticDestructorCalls.qll` file is a temporary solution that can
be removed when the extractor produces this information directly.
Instead of only considering a fixed set of paths for `dotnet` and `mono`,
first attempt to lookup the paths based on the `PATH` environment variable.
This change also fixes a potential `System.IO.DirectoryNotFoundException` exception,
which could be thrown when the `shared/Microsoft.NETCore.App` folder was not
present.
These queries are only tenuously security queries, and marking them as
security queries can cause them to have greater prominence than is
merited by the results that they report.
2018-10-03 11:48:27 +01:00
8690 changed files with 598556 additions and 272903 deletions
about: Tell us about an alert that shouldn't be reported
title: LGTM.com - false positive
labels: false-positive
assignees: ''
---
**Description of the false positive**
<!-- Please explain briefly why you think it shouldn't be included. -->
**URL to the alert on the project page on LGTM.com**
<!--
1. Open the project on LGTM.com.
For example, https://lgtm.com/projects/g/pallets/click/.
2. Switch to the `Alerts` tab. For example, https://lgtm.com/projects/g/pallets/click/alerts/.
3. Scroll to the alert that you would like to report.
4. Click on the right most icon `View this alert within the complete file`.
5. A new browser tab opens. Copy and paste the page URL here.
For example, https://lgtm.com/projects/g/pallets/click/snapshot/719fb7d8322b0767cdd1e5903ba3eb3233ba8dd5/files/click/_winconsole.py#xa08d213ab3289f87:1.
* Posting, or threatening to post, people’s personally identifying information (“doxing”).
* Insults, especially those using discriminatory terms or slurs.
* Behavior that could be perceived as sexual attention.
* Advocating for or encouraging any of the above behaviors.
* Advocating for or encouraging any of the above behaviors.
* Understand disagreements: Disagreements, both social and technical, are useful learning opportunities. Seek to understand others’ viewpoints and resolve differences constructively.
This code is not exhaustive or complete. It serves to capture our common understanding of a productive, collaborative environment. We expect the code to be followed in spirit as much as in the letter.
We welcome contributions to our standard library and standard checks, got an idea for a new check, or how to improve an existing query? Then please go ahead an open a Pull Request!
We welcome contributions to our standard library and standard checks. Got an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request!
Before we accept your pull request, we will require that you have agreed to our Contributor License Agreement, this is not something that you need to do before you submit your pull request, but until you've done so, we will be unable to accept your contribution.
Before we accept your pull request, we require that you have agreed to our Contributor License Agreement, this is not something that you need to do before you submit your pull request, but until you've done so, we will be unable to accept your contribution.
Please read our [QL Style Guide](docs/ql-style-guide.md) for information on how to format QL code in this repository.
## Adding a new query
If you have an idea for a query that you would like to share with other Semmle users, please open a pull request to add it to this repository.
Follow the steps below to help other users understand what your query does, and to ensure that your query is consistent with the other Semmle queries.
1.**Consult the QL documentation for query writers**
There is lots of useful documentation to help you write QL, ranging from information about query file structure to language-specific tutorials. For more information on the documentation available, see [Writing QL queries](https://help.semmle.com/QL/learn-ql/writing-queries/writing-queries.html) on [help.semmle.com](https://help.semmle.com).
2.**Format your QL correctly**
All of Semmle's standard QL queries and libraries are uniformly formatted for clarity and consistency, so we strongly recommend that all QL contributions follow the same formatting guidelines. If you use QL for Eclipse, you can auto-format your query in the [QL editor](https://help.semmle.com/ql-for-eclipse/Content/WebHelp/ql-editor.html). For more information, see the [QL style guide](https://github.com/Semmle/ql/blob/master/docs/ql-style-guide.md).
3.**Make sure your query has the correct metadata**
Query metadata is used by Semmle's analysis to identify your query and make sure the query results are displayed properly.
The most important metadata to include are the `@name`, `@description`, and the `@kind`.
Other metadata properties (`@precision`, `@severity`, and `@tags`) are usually added after the query has been reviewed by Semmle staff.
For more information on writing query metadata, see the [Query metadata style guide](https://github.com/Semmle/ql/blob/master/docs/query-metadata-style-guide.md).
4.**Make sure the `select` statement is compatible with the query type**
The `select` statement of your query must be compatible with the query type (determined by the `@kind` metadata property) for alert or path results to be displayed correctly in LGTM and QL for Eclipse.
For more information on `select` statement format, see [Introduction to query files](https://help.semmle.com/QL/learn-ql/writing-queries/introduction-to-queries.html#select-clause) on help.semmle.com.
5.**Save your query in a `.ql` file in correct language directory in this repository**
There are five language-specific directories in this repository:
* C/C++: `ql/cpp/ql/src`
* C#:`ql/csharp/ql/src`
* Java: `ql/java/ql/src`
* JavaScript: `ql/javascript/ql/src`
* Python: `ql/python/ql/src`
Each language-specific directory contains further subdirectories that group queries based on their `@tags` properties or purpose. Select the appropriate subdirectory for your new query, or create a new one if necessary.
6.**Write a query help file**
Query help files explain the purpose of your query to other users. Write your query help in a `.qhelp` file and save it in the same directory as your new query.
For more information on writing query help, see the [Query help style guide](https://github.com/Semmle/ql/blob/master/docs/query-help-style-guide.md).
@@ -4,13 +4,13 @@ This open source repository contains the standard QL libraries and queries that
## How do I learn QL and run queries?
LGTM has [extensive documentation](https://lgtm.com/help/ql/introduction-to-ql) on getting started with writing QL.
You can use the [interactive query console](https://lgtm.com/help/lgtm/using-query-console) or the [QL for Eclipse](https://lgtm.com/help/lgtm/running-queries-ide) plugin to try out your queries on any open-source project that's currently being analyzed.
There is [extensive documentation](https://help.semmle.com/QL/learn-ql/) on getting started with writing QL.
You can use the [interactive query console](https://lgtm.com/help/lgtm/using-query-console) on LGTM.com or the [QL for Eclipse](https://lgtm.com/help/lgtm/running-queries-ide) plugin to try out your queries on any open-source project that's currently being analyzed.
## Contributing
We welcome contributions to our standard library and standard checks. Do you have an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Before you do, though, please take the time to read our [contributing guidelines](CONTRIBUTING.md) and [QL style guide](docs/ql-style-guide.md).
We welcome contributions to our standard library and standard checks. Do you have an idea for a new check, or how to improve an existing query? Then please go ahead and open a pull request! Before you do, though, please take the time to read our [contributing guidelines](CONTRIBUTING.md). You can also consult our [style guides](https://github.com/Semmle/ql/tree/master/docs) to learn how to format your QL for consistency and clarity, how to write query metadata, and how to write query help documentation for your query.
## License
The LGTM queries are licensed under [Apache License 2.0](LICENSE) by [Semmle](https://semmle.com).
The QL queries in this repository are licensed under [Apache License 2.0](LICENSE) by [Semmle](https://semmle.com).
| Array argument size mismatch (`cpp/array-arg-size-mismatch`) | reliability | Finds function calls where the size of an array being passed is smaller than the array size of the declared parameter. Newly displayed on LGTM. |
| Lossy function result cast (`cpp/lossy-function-result-cast`) | correctness | Finds function calls whose result type is a floating point type, which are implicitly cast to an integral type. Newly available on LGTM but results not displayed by default. |
| Returning stack-allocated memory (`cpp/return-stack-allocated-memory`) | reliability, external/cwe/cwe-825 | Finds functions that may return a pointer or reference to stack-allocated memory. This query existed already but has been rewritten from scratch to make the error rate low enough for use on LGTM. Results displayed by default. |
| Use of string copy function in a condition (`cpp/string-copy-return-value-as-boolean`) | correctness | This query identifies calls to string copy functions used in conditions, where it's likely that a different function was intended to be called. Results are displayed by default on LGTM. |
| Suspicious pointer scaling (`cpp/suspicious-pointer-scaling`) | Fewer false positives | False positives involving types that are not uniquely named in the snapshot have been fixed. |
| Array argument size mismatch (`cpp/array-arg-size-mismatch`) | Fewer false positive results | An exception has been added to this query for variable sized arrays. |
| Call to memory access function may overflow buffer (`cpp/overflow-buffer`) | More correct results | This query now recognizes calls to `RtlCopyMemoryNonTemporal` and `RtlSecureZeroMemory`. |
| Call to memory access function may overflow buffer (`cpp/overflow-buffer`) | More correct results | Calls to `fread` are now examined by this query. |
| Lossy function result cast (`cpp/lossy-function-result-cast`) | Fewer false positive results | The whitelist of rounding functions built into this query has been expanded. |
| Memory is never freed (`cpp/memory-never-freed`) | More correct results | Support for more Microsoft-specific memory allocation/de-allocation functions has been added. |
| Memory may not be freed (`cpp/memory-may-not-be-freed`) | More correct results | Support for more Microsoft-specific memory allocation/de-allocation functions has been added. |
| Mismatching new/free or malloc/delete (`cpp/new-free-mismatch`) | More correct results | Data flow through global variables for this query has been improved. |
| 'new[]' array freed with 'delete' (`cpp/new-array-delete-mismatch`) | More correct results | Data flow through global variables for this query has been improved. |
| 'new' object freed with 'delete[]' (`cpp/new-delete-array-mismatch`) | More correct results | Data flow through global variables for this query has been improved. |
| Potential buffer overflow (`cpp/potential-buffer-overflow`) | Deprecated | This query has been deprecated. Use Potentially overrunning write (`cpp/overrunning-write`) and Potentially overrunning write with float to string conversion (`cpp/overrunning-write-with-float`) instead. |
| Resource not released in destructor (`cpp/resource-not-released-in-destructor`) | Fewer false positive results | The query no longer highlights code that releases a resource via a virtual method call, function pointer, or lambda. |
| Returning stack-allocated memory (`cpp/return-stack-allocated-memory`) | More correct results | Many more stack allocated expressions are now recognized. |
| Suspicious add with sizeof (`cpp/suspicious-add-sizeof`) | Fewer false positive results | Pointer arithmetic on `char * const` expressions (and other variations of `char *`) are now correctly excluded from the results. |
| Suspicious pointer scaling (`cpp/suspicious-pointer-scaling`) | Fewer false positive results | False positive results involving types that are not uniquely named in the snapshot have been fixed. |
| Unused static variable (`cpp/unused-static-variable`) | Fewer false positive results | Variables with the attribute `unused` are now excluded from the query. |
| Resource not released in destructor (`cpp/resource-not-released-in-destructor`) | Fewer false positive results | Fix false positives where a resource is released via a virtual method call. |
| Use of inherently dangerous function (`cpp/potential-buffer-overflow`) | Cleaned up | This query no longer catches uses of `gets`, and has been renamed 'Potential buffer overflow'. |
| Use of potentially dangerous function (`cpp/potentially-dangerous-function`) | More correct results | This query now catches uses of `gets`. |
## Changes to QL libraries
* The `semmle.code.cpp.dataflow.DataFlow` library now supports _definition by reference_ via output parameters of known functions.
* Data flows through `memcpy` and `memmove` by default.
* Custom flow into or out of arguments assigned by reference can be modeled with the new class `DataFlow::DefinitionByReferenceNode`.
* The data flow library adds flow through library functions that are modeled in `semmle.code.cpp.models.interfaces.DataFlow`. Queries can add subclasses of `DataFlowFunction` to specify additional flow.
* There is a new `Namespace.isInline()` predicate, which holds if the namespace was declared as `inline namespace`.
* The `Expr.isConstant()` predicate now also holds for _address constant expressions_, which are addresses that will be constant after the program has been linked. These address constants do not have a result for `Expr.getValue()`.
* There are new `Function.isDeclaredConstexpr()` and `Function.isConstexpr()` predicates. They can be used to tell whether a function was declared as `constexpr`, and whether it actually is `constexpr`.
* There is a new `Variable.isConstexpr()` predicate. It can be used to tell whether a variable is `constexpr`.
| Off-by-one comparison against container length (cs/index-out-of-bounds) | Fewer false positives | Results have been removed when there are additional guards on the index. |
| Dereferenced variable is always null (cs/dereferenced-value-is-always-null) | Improved results | The query has been rewritten from scratch, and the analysis is now based on static single assignment (SSA) forms. The query is now enabled by default in LGTM. |
| Dereferenced variable may be null (cs/dereferenced-value-may-be-null) | Improved results | The query has been rewritten from scratch, and the analysis is now based on static single assignment (SSA) forms. The query is now enabled by default in LGTM. |
| Clear text storage of sensitive information (`cs/cleartext-storage-of-sensitive-information`) | More results | Now includes data sources for user controls in `System.Windows.Forms`. |
| Dereferenced variable is always null (`cs/dereferenced-value-is-always-null`) | Improved results | The query has been rewritten from scratch, and the analysis is now based on static single assignment (SSA) forms. Results are now shown by default in LGTM. |
| Dereferenced variable may be null (`cs/dereferenced-value-may-be-null`) | Improved results | The query has been rewritten from scratch, and the analysis is now based on static single assignment (SSA) forms. Results are now shown by default in LGTM. |
| Double-checked lock is not thread-safe (`cs/unsafe-double-checked-lock`) | Fewer false positive and more true positive results | No longer highlights code where the underlying field was not updated in the `lock` statement, or where the field is a `struct`. Results have been added where there are other statements inside the `lock` statement. |
| Exposure of private information (`cs/exposure-of-sensitive-information`) | More results | Now includes data sources for user controls in `System.Windows.Forms`. |
| Improper control of generation of code (`cs/code-injection`) | More results | Now includes data sources for user controls in `System.Windows.Forms`. |
| Off-by-one comparison against container length (`cs/index-out-of-bounds`) | Fewer false positive results | No longer reports results when there are additional guards on the index. |
| SQL query built from user-controlled sources (`cs/sql-injection`) | More results | Now includes data sources for user controls in `System.Windows.Forms`. |
| Uncontrolled format string (`cs/uncontrolled-format-string`) | More results | Now includes data sources for user controls in `System.Windows.Forms`. |
| Unused format argument (`cs/format-argument-unused`) | Fewer false positive results | No longer reports results where the format string is empty. This is often used as a default value and is not an interesting result. |
| Use of default ToString() (`cs/call-to-object-tostring`) | Fewer false positive results | No longer reports results for `char` arrays passed to `StringBuilder.Append()`, which were incorrectly marked as using `ToString`. |
| Use of default ToString() (`cs/call-to-object-tostring`) | Fewer results | No longer reports results when the object is an interface or an abstract class. |
| Using a package with a known vulnerability (`cs/use-of-vulnerable-package`) | More results | This query detects packages vulnerable to CVE-2019-0657. |
## Changes to code extraction
@@ -22,4 +25,14 @@
## Changes to QL libraries
## Changes to the autobuilder
* The class `TrivialProperty` now includes library properties determined to be trivial using CIL analysis. This may increase the number of results for all queries that use data flow.
* Taint-tracking steps have been added for the `Json.NET` package. This will improve results for queries that use taint tracking.
* Support has been added for EntityFrameworkCore, including
- Stored data flow sources
- Sinks for SQL expressions
- Data flow through fields that are mapped to the database
* Support has been added for NHibernate-Core, including
- Stored data flow sources
- Sinks for SQL expressions
- Data flow through fields that are mapped to the database
| Arbitrary file write during archive extraction ("Zip Slip") (`java/zipslip`) | Fewer false positive results | Results involving a sanitization step that converts a destination `Path` to a `File` are no longer reported. |
| Result of multiplication cast to wider type (`java/integer-multiplication-cast-to-long`) | Fewer results | Results involving conversions to `float` or `double` are no longer reported, as they were almost exclusively false positives. |
## Changes to QL libraries
* The deprecated library `semmle.code.java.security.DataFlow` has been removed.
Improved data flow libraries have been available in
`semmle.code.java.dataflow.DataFlow`,
`semmle.code.java.dataflow.TaintTracking`, and
`semmle.code.java.dataflow.FlowSources` since 1.16.
* Taint tracking now includes additional default data-flow steps through
collections, maps, and iterators. This affects all security queries, which
can report more results based on such paths.
* The `FlowSources` and `TaintTracking` libraries are extended to cover additional remote user
input and taint steps from the following frameworks: Guice, Protobuf, Thrift and Struts.
This affects all security queries, which may yield additional results on projects
*The taint tracking library now recognizes flow through persistent storage, this may give more results for the security queries.
*File classification now recognizes additional generated files, for example, files from [HTML Tidy](html-tidy.org).
* The taint tracking library now recognizes flow through persistent storage, class fields, and callbacks in certain cases. Handling of regular expressions has also been improved. This may give more results for the security queries.
* Type inference for function calls has been improved. This may give additional results for queries that rely on type inference.
* The [Closure-Library](https://github.com/google/closure-library/wiki/goog.module:-an-ES6-module-like-alternative-to-goog.provide) module system is now supported.
| Arbitrary file write during archive extraction ("Zip Slip") (`js/zipslip`) | security, external/cwe/cwe-022 | Identifies extraction routines that allow arbitrary file overwrite vulnerabilities, indicating a possible violation of [CWE-022](https://cwe.mitre.org/data/definitions/22.html). Results are shown on LGTM by default. |
| Arrow method on Vue instance (`js/vue/arrow-method-on-vue-instance`) | reliability, frameworks/vue | Highlights arrow functions that are used as methods on Vue instances. Results are shown on LGTM by default.|
| Cross-window communication with unrestricted target origin (`js/cross-window-information-leak`) | security, external/cwe/201, external/cwe/359 | Highlights code that sends potentially sensitive information to another window without restricting the receiver window's origin, indicating a possible violation of [CWE-201](https://cwe.mitre.org/data/definitions/201.html). Results are shown on LGTM by default. |
| Double escaping or unescaping (`js/double-escaping`) | correctness, security, external/cwe/cwe-116 | Highlights potential double escaping or unescaping of special characters, indicating a possible violation of [CWE-116](https://cwe.mitre.org/data/definitions/116.html). Results are shown on LGTM by default. |
| Incomplete regular expression for hostnames (`js/incomplete-hostname-regexp`) | correctness, security, external/cwe/cwe-020 | Highlights hostname sanitizers that are likely to be incomplete, indicating a violation of [CWE-020](https://cwe.mitre.org/data/definitions/20.html). Results are shown on LGTM by default.|
| Incomplete URL substring sanitization | correctness, security, external/cwe/cwe-020 | Highlights URL sanitizers that are likely to be incomplete, indicating a violation of [CWE-020](https://cwe.mitre.org/data/definitions/20.html). Results shown on LGTM by default. |
| Incorrect suffix check (`js/incorrect-suffix-check`) | correctness, security, external/cwe/cwe-020 | Highlights error-prone suffix checks based on `indexOf`, indicating a potential violation of [CWE-20](https://cwe.mitre.org/data/definitions/20.html). Results are shown on LGTM by default. |
| Loop iteration skipped due to shifting (`js/loop-iteration-skipped-due-to-shifting`) | correctness | Highlights code that removes an element from an array while iterating over it, causing the loop to skip over some elements. Results are shown on LGTM by default. |
| Unused property (`js/unused-property`) | maintainability | Highlights properties that are unused. Results are shown on LGTM by default. |
| Useless comparison test (`js/useless-comparison-test`) | correctness | Highlights code that is unreachable due to a numeric comparison that is always true or always false. Results are shown on LGTM by default. |
| Client-side cross-site scripting | More true-positive results, fewer false-positive results. | This rule now recognizes WinJS functions that are vulnerable to HTML injection, and no longer flags certain safe uses of jQuery. |
| Insecure randomness | More results | This rule now flags insecure uses of `crypto.pseudoRandomBytes`. |
| Uncontrolled data used in network request | More results | This rule now recognizes host values that are vulnerable to injection. |
| Unused parameter | Fewer false-positive results | This rule no longer flags parameters with leading underscore. |
| Unused variable, import, function or class | Fewer false-positive results | This rule now flags fewer variables that are implictly used by JSX elements, and no longer flags variables with leading underscore. |
| Uncontrolled data used in path expression | Fewer false-positive results | This rule now recognizes the Express `root` option, which prevents path traversal. |
| Ambiguous HTML id attribute | Fewer falsepositive results | This query now treats templates more conservatively. Its precision has been revised to 'high'. |
| Assignment to exports variable | Fewer results | This query no longer flags code that is also flagged by the query "Useless assignment to local variable". |
| Client-side cross-site scripting | More true positive and fewer false positive results. | This query now recognizes WinJS functions that are vulnerable to HTML injection. It no longer flags certain safe uses of jQuery, and recognizes custom sanitizers. |
| Hard-coded credentials | Fewer falsepositive results | This query no longer flags the empty string as a hardcoded username. |
| Insecure randomness | More results | This query now flags insecure uses of `crypto.pseudoRandomBytes`. |
| Reflected cross-site scripting | Fewer falsepositive results. | This query now recognizes custom sanitizers. |
| Stored cross-site scripting | Fewer false positive results. | This query now recognizes custom sanitizers. |
| Unbound event handler receiver (`js/unbound-event-handler-receiver`) | Fewer false positive results | Additional ways that class methods can be bound are now recognized. |
| Uncontrolled data used in network request | More results | This query now recognizes host values that are vulnerable to injection. |
| Uncontrolled data used in path expression | Fewer false positive results | This query now recognizes the Express `root` option, which prevents path traversal. |
| Unneeded defensive code | More true positive and fewer false positive results | This query now recognizes additional defensive code patterns. |
| Unsafe dynamic method access | Fewer false positive results | This query no longer flags concatenated strings as unsafe method names. |
| Unused parameter | Fewer false positive results | This query no longer flags parameters with leading underscore. |
| Unused variable, import, function or class | Fewer false positive results | This query now flags fewer variables that are implictly used by JSX elements. It no longer flags variables with a leading underscore and variables in dead code. |
| Unvalidated dynamic method call | More true positive results | This query now flags concatenated strings as unvalidated method names in more cases. |
| Useless assignment to property. | Fewer false positive results | This query now treats assignments with complex right-hand sides correctly. |
| Useless conditional | Fewer results | Additional defensive coding patterns are now ignored. |
| Useless conditional | More true positive results | This query now flags additional uses of function call values. |
## Changes to QL libraries
*`DataFlow::SourceNode` is no longer an abstract class; to add new source nodes, extend `DataFlow::SourceNode::Range` instead.
* Subclasses of `DataFlow::PropRead` are no longer automatically made source nodes; you now need to additionally define a corresponding subclass of `DataFlow::SourceNode::Range` to achieve this.
* The deprecated libraries `semmle.javascript.DataFlow` and `semmle.javascript.dataflow.CallGraph` have been removed; they are both superseded by `semmle.javascript.dataflow.DataFlow`.
* Overriding `DataFlow::InvokeNode.getACallee()` no longer affects the call graph seen by the interprocedural data flow libraries. To do this, the 1-argument version `getACallee(int imprecision)` can be overridden instead.
* The predicate `DataFlow::returnedPropWrite` was intended for internal use only and is no longer available.
The extractor now parses all Python code from a single unified grammar. This means that almost all Python code will be successfully parsed, even if mutually incompatible Python code is present in the same project. This also means that Python code for any version can be correctly parsed on a worker running any other supported version of Python. For example, Python 3.7 code is parsed correctly, even if the installed version of Python is only 3.5. This will reduce the number of syntax errors found in many projects.
### Regular expression analysis improvements
The Python `re` (regular expressions) module library has a couple of constants called `MULTILINE` and `VERBOSE` which determine the parsing of regular expressions. Python 3.6 changed the implementation of these constants, which resulted in false positive results for some queries. The relevant QL libraries have been updated to support both implementations which will remove false positive results from projects that use Python 3.6 and later versions.
### API improvements
The API has been improved to declutter the global namespace and improve discoverability and readability.
* New predicates `ModuleObject::named(name)` and `ModuleObject.attr(name)` have been added, allowing more readable access to common objects. For example, `(any ModuleObject m | m.getName() = "sys").getAttribute("exit")` can be replaced with `ModuleObject::named("sys").attr("exit")`
* The API for accessing builtin functions has been improved. Predicates of the form `theXXXFunction()`, such as `theLenFunction()`, have been deprecated in favor of `Object::builtin(name)`.
* A configuration based API has been added for writing data flow and taint tracking queries. This is provided as a convenience for query authors who have written data flow or taint tracking queries for other languages, so they can use a similar format of query across multiple languages.
| Default version of SSL/TLS may be insecure (`py/insecure-default-protocol`) | security, external/cwe/cwe-327 | Finds instances where an insecure default protocol may be used. Results are shown on LGTM by default. |
| Incomplete regular expression for hostnames (`py/incomplete-hostname-regexp`) | security, external/cwe/cwe-020 | Finds instances where a hostname is incompletely sanitized because a regular expression contains an unescaped character. Results are shown on LGTM by default. |
| Incomplete URL substring sanitization (`py/incomplete-url-substring-sanitization`) | security, external/cwe/cwe-020 | Finds instances where a URL is incompletely sanitized due to insufficient checks. Results are shown on LGTM by default. |
| Insecure temporary file (`py/insecure-temporary-file`) | security, external/cwe/cwe-377 | Finds uses of the insecure and deprecated `tempfile.mktemp`, `os.tempnam`, and `os.tmpnam` functions. Results are shown on LGTM by default. |
| Overly permissive file permissions (`py/overly-permissive-file`) | security, external/cwe/cwe-732 | Finds instances where a file is created with overly permissive permissions. Results are not shown on LGTM by default. |
| Use of insecure SSL/TLS version (`py/insecure-protocol`) | security, external/cwe/cwe-327 | Finds instances where a known insecure protocol has been specified. Results are shown on LGTM by default. |
| Comparison using is when operands support \_\_eq\_\_ (`py/comparison-using-is`) | Fewer false positive results | Results where one of the objects being compared is an enum member are no longer reported. |
| Modification of parameter with default (`py/modification-of-default-value`) | More true positive results | Instances where the mutable default value is mutated inside other functions are now also reported. |
| Mutation of descriptor in \_\_get\_\_ or \_\_set\_\_ method (`py/mutable-descriptor`) | Fewer false positive results | Results where the mutation does not occur when calling one of the `__get__`, `__set__` or `__delete__` methods are no longer reported. |
| Redundant comparison (`py/redundant-comparison`) | Fewer false positive results | Results in chained comparisons are no longer reported. |
| Unused import (`py/unused-import`) | Fewer false positive results | Results where the imported module is used in a `doctest` string are no longer reported. |
| Unused import (`py/unused-import`) | Fewer false positive results | Results where the imported module is used in a type-hint comment are no longer reported. |
## Changes to QL libraries
* Added support for the `dill` pickle library.
* Added support for the `bottle` web framework.
* Added support for the `CherryPy` web framework.
* Added support for the `falcon` web API framework.
* Added support for the `turbogears` web framework.
> Please describe your changes in terms that are suitable for
> customers to read. These notes will have only minor tidying up
> before they are published as part of the release notes.
>
> This file is written for lgtm users and should contain *only*
> notes about changes that affect lgtm enterprise users. Add
> any other customer-facing changes to the `studio-java.md`
> file.
>
## General improvements
## Changes to code extraction
*The extractor now supports [Nullish Coalescing](https://github.com/tc39/proposal-nullish-coalescing) expressions.
*The TypeScript extractor now handles the control-flow of logical operators and destructuring assignments more accurately.
*Parallel extraction of JavaScript files (but not TypeScript files) on LGTM is now supported. If LGTM is configured to evaluate queries using multiple threads, then JavaScript files are also extracted using multiple threads.
*Experimental support for [E4X](https://developer.mozilla.org/en-US/docs/Archive/Web/E4X), a legacy language extension developed by Mozilla, is available.
* Additional [Flow](https://flow.org/) syntax is now supported.
* [Nullish Coalescing](https://github.com/tc39/proposal-nullish-coalescing) expressions are now supported.
* [TypeScript 3.2](https://www.typescriptlang.org/docs/handbook/release-notes/typescript-3-2.html) is now supported.
* The TypeScript extractor now handles the control flow of logical operators and destructuring assignments more accurately.
| Call to alloca in a loop (`cpp/alloca-in-loop`) | reliability, correctness, external/cwe/cwe-770 | Finds calls to `alloca` in loops, which can lead to stack overflow if the number of iterations is large. Newly displayed [on LGTM](https://lgtm.com/rules/1508831665988/). |
| Call to function with fewer arguments than declared parameters (`cpp/too-few-arguments`) | correctness, maintainability, security | Finds all cases where the number of arguments is fewer than the number of parameters of the function, provided the function is also properly declared/defined elsewhere. Results are displayed by default [on LGTM](https://lgtm.com/rules/1508860726279/). |
| Call to a function with one or more incompatible arguments (`cpp/mistyped-function-arguments`) | correctness, maintainability | Finds all cases where the types of arguments do not match the types of parameters of the function, provided the function is also properly declared/defined elsewhere. Results are not displayed by default [on LGTM](https://lgtm.com/rules/1508849286093/). |
| Use of dangerous function (`cpp/dangerous-function-overflow`) | reliability, security, external/cwe/cwe-242 | Finds calls to `gets`, which does not guard against buffer overflow. These results were previously detected by the `cpp/potentially-dangerous-function` query. Results for both queries are displayed by default on LGTM. |
| Buffer not sufficient for string (`cpp/overflow-calculated`) | Fewer results | This query no longer reports results that would be found by the 'No space for zero terminator' (`cpp/no-space-for-terminator`) query. |
| Call to function with extraneous arguments (`cpp/futile-params`) | Improved coverage | Query has been generalized to find all cases where the number of arguments exceeds the number of parameters of the function, provided the function is also properly declared/defined elsewhere. |
| Commented-out code (`cpp/commented-out-code`) | More correct results | Commented-out preprocessor code is now detected by this query. |
| Comparison result is always the same | Fewer false positive results | The range analysis library is now more conservative about floating point values being possibly `NaN`. |
| Constructor with default arguments will be used as a copy constructor (`cpp/constructor-used-as-copy-constructor`) | Lowered severity and precision | The severity and precision of this query have been reduced to "warning" and "low", respectively. This coding pattern is used intentionally and safely in a number of real-world projects. Results are no longer displayed on LGTM unless you choose to display them. |
| Dead code due to goto or break statement (`cpp/dead-code-goto`) | Fewer false positive results | Functions containing preprocessor logic are now excluded from this analysis. |
| Memory is never freed (`cpp/memory-never-freed`) | More correct results | Support added for more Microsoft-specific allocation functions, including `LocalAlloc`, `GlobalAlloc`, `HeapAlloc` and `CoTaskMemAlloc`. |
| Memory may not be freed (`cpp/memory-may-not-be-freed`) | More correct results | Support added for more Microsoft-specific allocation functions, including `LocalAlloc`, `GlobalAlloc`, `HeapAlloc` and `CoTaskMemAlloc`. |
| Mismatching new/free or malloc/delete (`cpp/new-free-mismatch`) | Fewer false positive results | Fixed an issue where functions were being identified as allocation functions inappropriately. This correction also affects `cpp/new-array-delete-mismatch` and `cpp/new-delete-array-mismatch`. |
| No space for zero terminator (`cpp/no-space-for-terminator`) | More correct results | This query now detects calls to `std::malloc`. |
| Overflow in uncontrolled allocation size (`cpp/uncontrolled-allocation-size`) | More correct results | This query has been reworked so that it can find a wider variety of results. |
| Resource not released in destructor (`cpp/resource-not-released-in-destructor`) | Fewer false positive results | Resource allocation and deallocation functions are now determined more accurately. |
| Use of potentially dangerous function | More correct results | Calls to `localtime`, `ctime` and `asctime` are now detected by this query. |
| Wrong type of arguments to formatting function (`cpp/wrong-type-format-argument`) | More correct results and fewer false positive results | This query now understands non-standard uses of `%L`. In addition, it more accurately identifies wide and non-wide string/character format arguments on different platforms. |
| Use of potentially dangerous function (`cpp/potentially-dangerous-function`) | Fewer results | Results relating to the standard library `gets` function have been moved into a new query (`cpp/dangerous-function-overflow`). |
## Changes to QL libraries
- The predicate `Declaration.hasGlobalName` now only holds for declarations that are not nested in a class. For example, it no longer holds for a member function `MyClass::myFunction` or a constructor `MyClass::MyClass`, whereas previously it would classify those two declarations as global names.
- In class `Declaration`, predicates `getQualifiedName/0` and `hasQualifiedName/1` are no longer recommended for finding functions by name. Instead, use `hasGlobalName/1` and the new `hasQualifiedName/2` and `hasQualifiedName/3` predicates. This improves performance and identifies names involving templates and inline namespaces more reliably.
- Additional support for definition by reference has been added to the `semmle.code.cpp.dataflow.TaintTracking` library, including:
- Taint-specific edges for functions modeled in `semmle.code.cpp.models.interfaces.DataFlow`.
- Flow through library functions that are modeled in `semmle.code.cpp.models.interfaces.Taint`. Queries can add subclasses of `TaintFunction` to specify additional flow.
- There is a new `FoldExpr` class, representing C++17 fold expressions.
- The member predicates `DeclarationEntry.getUnspecifiedType`, `Expr.getUnspecifiedType`, and `Variable.getUnspecifiedType` have been added. These should be preferred over the existing `getUnderlyingType` predicates.
C# analysis now supports the extraction and analysis of many C# 8 features. For details see [Changes to code extraction](#changes-to-code-extraction) and [Changes to QL libraries](#changes-to-ql-libraries) below.
| Thread-unsafe capturing of an ICryptoTransform object (`cs/thread-unsafe-icryptotransform-captured-in-lambda`) | concurrency, security, external/cwe/cwe-362 | Highlights instances of classes where a field of type `System.Security.Cryptography.ICryptoTransform` is captured by a lambda, and appears to be used in a thread initialization method. Results are not shown on [LGTM](https://lgtm.com/rules/1508141845995/) by default. |
| Constant condition (`cs/constant-condition`) | Fewer false positive results | The query now ignores code where the `null` value is in a conditional expression on the left hand side of a null-coalescing expression. For example, in `(a ? b : null) ?? c`, `null` is not considered to be a constant condition. |
| Thread-unsafe use of a static ICryptoTransform field (`cs/thread-unsafe-icryptotransform-field-in-class`) | Fewer false positive results | The criteria for a result has changed to include nested properties, nested fields, and collections. The format of the alert message has changed to highlight the static field. The query name has been updated. |
| Useless upcast (`cs/useless-upcast`) | Fewer false positive results | The query now ignores code where the upcast is used to disambiguate the target of a constructor call. |
## Changes to code extraction
* The following C# 8 features are now extracted:
- Range expressions
- Recursive patterns
- Using declaration statements
-`static` modifiers on local functions
- Null-coalescing assignment expressions
* The `unmanaged` type parameter constraint is also now extracted.
## Changes to QL libraries
* The class `Attribute` has two new predicates: `getConstructorArgument()` and `getNamedArgument()`. The first predicate returns arguments to the underlying constructor call and the second returns named arguments for initializing fields and properties.
* The class `TypeParameterConstraints` has a new predicate `hasUnmanagedTypeConstraint()`. This shows whether the type parameter has the `unmanaged` constraint.
* The following QL classes have been added to model C# 8 features:
- Class `AssignCoalesceExpr` models null-coalescing assignment, for example `x ??= y`
- Class `IndexExpr` models from-end index expressions, for example `^1`
- Class `PatternExpr` is an `Expr` that appears in a pattern. It has the new subclasses `DiscardPatternExpr`, `LabeledPatternExpr`, `RecursivePatternExpr`, `TypeAccessPatternExpr`, `TypePatternExpr`, and `VariablePatternExpr`.
- Class `PatternMatch` models a pattern being matched. It has the subclasses `Case` and `IsExpr`.
- Class `PositionalPatternExpr` models position patterns, for example `(int x, int y)`
- Class `PropertyPatternExpr` models property patterns, for example `Length: int len`
- Class `RangeExpr` models range expressions, for example `1..^1`
- Class `SwitchCaseExpr` models the arm of a switch expression, for example `(false, false) => true`
- Class `SwitchExpr` models `switch` expressions, for example `(a, b) switch { ... }`
- Classes `IsConstantExpr`, `IsTypeExpr` and `IsPatternExpr` are deprecated in favour of `IsExpr`
- Class `Switch` models both `SwitchExpr` and `SwitchStmt`
- Class `Case` models both `CaseStmt` and `SwitchCaseExpr`
- Class `UsingStmt` models both `UsingBlockStmt` and `UsingDeclStmt`
| Implicit conversion from array to string (`java/print-array`) | Fewer false positive results | Results in slf4j logging calls are no longer reported as slf4j supports array printing. |
| Result of multiplication cast to wider type (`java/integer-multiplication-cast-to-long`) | Fewer false positive results | Range analysis is now used to exclude results involving multiplication of small values that cannot overflow. |
## Changes to QL libraries
* The `Guards` library has been extended to account for method calls that check
conditions by conditionally throwing an exception. This includes the
`checkArgument` and `checkState` methods in
`com.google.common.base.Preconditions`, the `isTrue` and `validState` methods
in `org.apache.commons.lang3.Validate`, as well as any similar custom
methods. This means that more guards are recognized which improves the precision of a number of queries including `java/index-out-of-bounds`,
`java/dereferenced-value-may-be-null`, and `java/useless-null-check`.
* The default sanitizer in taint tracking has been made more precise. The
sanitizer works by looking for guards that inspect tainted strings. It
previously worked at the level of individual variables. Now it
uses the `Guards` library, such that only guarded variable accesses are
sanitized. This may give additional results for security queries.
* Spring framework support now takes into account additional
annotations that indicate remote user input. This affects all security
* The security queries now track data flow through Base64 decoders such as the Node.js `Buffer` class, the DOM function `atob`, and a number of npm packages including [`abab`](https://www.npmjs.com/package/abab), [`atob`](https://www.npmjs.com/package/atob), [`btoa`](https://www.npmjs.com/package/btoa), [`base-64`](https://www.npmjs.com/package/base-64), [`js-base64`](https://www.npmjs.com/package/js-base64), [`Base64.js`](https://www.npmjs.com/package/Base64) and [`base64-js`](https://www.npmjs.com/package/base64-js).
* The security queries now track data flow through exceptions.
* The security queries now treat comparisons with symbolic constants as sanitizers, resulting in fewer false positive results.
* TypeScript 3.5 is now supported.
* On LGTM, TypeScript projects now have static type information extracted by default, resulting in more security results.
Users of the command-line tools must still pass `--typescript-full` to the extractor to enable this.
| Missing regular expression anchor (`js/regex/missing-regexp-anchor`) | correctness, security, external/cwe/cwe-20 | Highlights regular expression patterns that may be missing an anchor, indicating a possible violation of [CWE-20](https://cwe.mitre.org/data/definitions/20.html). Results are not shown on LGTM by default. |
| Prototype pollution (`js/prototype-pollution`) | security, external/cwe-250, external/cwe-400 | Highlights code that allows an attacker to modify a built-in prototype object through an unsanitized recursive merge function. Results are not shown on [LGTM](https://lgtm.com/rules/1508857356317/) by default. |
| Arbitrary file write during zip extraction ("Zip Slip") | More results | This rule now considers more libraries, including tar as well as zip. |
| Client-side URL redirect | More results and fewer false-positive results | This rule now recognizes additional uses of the document URL. It also treats URLs as safe in more cases where the hostname cannot be tampered with. |
| Double escaping or unescaping | More results | This rule now considers the flow of regular expressions literals. |
| Expression has no effect | Fewer false-positive results | This rule now treats uses of `Object.defineProperty` more conservatively. |
| Incomplete regular expression for hostnames | More results | This rule now tracks regular expressions for host names further. |
| Incomplete string escaping or encoding | More results | This rule now considers the flow of regular expressions literals, and it no longer flags the removal of trailing newlines. |
| Incorrect suffix check | Fewer false-positive results | This rule now recognizes valid checks in more cases. |
| Password in configuration file | Fewer false positive results | This query now excludes passwords that are inserted into the configuration file using a templating mechanism or read from environment variables. Results are no longer shown on LGTM by default. |
| Replacement of a substring with itself | More results | This rule now considers the flow of regular expressions literals. |
| Server-side URL redirect | Fewer false-positive results | This rule now treats URLs as safe in more cases where the hostname cannot be tampered with. |
| Tainted path | More results and fewer false-positive results | This rule now analyzes path manipulation code more precisely. |
| Type confusion through parameter tampering | Fewer false-positive results | This rule now recognizes additional emptiness checks. |
| Useless assignment to property | Fewer false-positive results | This rule now ignores reads of additional getters. |
| Unreachable statement | Unreachable throws no longer give an alert | This ignores unreachable throws, as they could be intentional (for example, to placate the TS compiler). |
## Changes to QL libraries
*`RegExpLiteral` is now a `DataFlow::SourceNode`.
*`JSDocTypeExpr` now has source locations and is a subclass of `Locatable` and `TypeAnnotation`.
* The two-parameter versions of predicate `isBarrier` in `DataFlow::Configuration` and of predicate `isSanitizer` in `TaintTracking::Configuration` have been renamed to `isBarrierEdge` and `isSanitizerEdge`, respectively. The old names are maintained for backwards-compatibility in this version, but will be deprecated in the next version and subsequently removed.
* Various predicates named `getTypeAnnotation()` now return `TypeAnnotation` instead of `TypeExpr`.
In rare cases, this may cause compilation errors in existing code. Cast the result to `TypeExpr` if this happens.
* The `getALabel` predicate in `LabeledBarrierGuardNode` and `LabeledSanitizerGuardNode`
has been deprecated and overriding it no longer has any effect.
Instead use the 3-parameter version of `blocks` or `sanitizes`.
The old API is now deprecated, but will be continued to be supported for at least another year.
### Impact on existing queries.
As points-to analysis underpins many queries, and provides the call-graph and reachability analysis required for taint-tracking, the results of many queries may change.
The improved reachability analysis and non-local tracking of bound methods may identify new results.
The increased precision in tracking of values through `*` arguments may remove false positive results.
Overall the number of true positive results should increase and the number false negative results should decline.
We welcome feedback on the new implementation, particularly any surprising changes in results.
## New queries
| **Query** | **Tags** | **Purpose** |
|-----------|----------|-------------|
| Accepting unknown SSH host keys when using Paramiko (`py/paramiko-missing-host-key-validation`) | security, external/cwe/cwe-295 | Finds instances where Paramiko is configured to accept unknown host keys. Results are shown [on LGTM](https://lgtm.com/rules/1508297729270/) by default. |
| Pythagorean calculation with sub-optimal numerics (`py/pythagorean`) | accuracy | Finds instances of hypotenuse calculation using `math.sqrt` instead of `math.hypot`. Results are not shown on LGTM by default. |
| Use of 'return' or 'yield' outside a function (`py/return-or-yield-outside-function`) | reliability, correctness | Finds instances where `return`, `yield`, and `yield from` are used outside a function. Results are not shown on LGTM by default. |
## Changes to code extraction
* String literals as expressions within literal string interpolation (f-strings) are now handled correctly.
* The Python extractor now handles invalid input more robustly. In particular, it exits gracefully when:
* A non-existent file or directory is specified using the `--path` option, or as a file name.
* An invalid number is specified for the `--max-procs` option.
* Custom file types can now be specified using the `filetypes` property in the `extraction/javascript/index` section of `lgtm.yml`. The property should be a map from file extensions (including the dot) to file types. Valid file types are `html`, `js`, `json`, `typescript`, `xml` and `yaml`.
* ECMAScript 2019 support is now enabled by default.
* On LGTM, JavaScript extraction for projects that do not contain any JavaScript or TypeScript code will now fail, even if the project contains other file types (such as HTML or YAML) recognized by the JavaScript extractor.
* XML files can now be extracted on LGTM. To enable XML extraction, set the `xml_mode` property in the `extraction/javascript/index` section of your `lgtm.yml` file to `all`. The default value of this property is `disabled`, meaning that XML files will not be extracted. (Note, that the `xml_mode` property does not apply to files that you map to the `xml` file type using the `filetypes` property. LGTM will always extract these files.)
* YAML files are now extracted by default on LGTM. If required, you can specify exclusion filters in your `lgtm.yml` file to override this behavior.
For detailed information about customizing LGTM extraction, see [JavaScript extraction](https://help.semmle.com/lgtm-enterprise/user/help/javascript-extraction.html).
| Call to alloca in a loop (`cpp/alloca-in-loop`) | Fewer false positive results | The query no longer highlights code where the stack allocation could not be reached multiple times in the loop, typically due to a `break` or `return` statement. |
| Continue statement that does not continue (`cpp/continue-in-false-loop`) | Fewer false positive results | Analysis is now restricted to `do`-`while` loops. This query is now run and displayed by default on LGTM. |
| Expression has no effect (`cpp/useless-expression`) | Fewer false positive results | Calls to functions with the `weak` attribute are no longer considered to be side-effect free, because they could be overridden with a different implementation at link time. |
| No space for zero terminator (`cpp/no-space-for-terminator`) | Fewer false positive results | False positive results for strings that are not null-terminated have been excluded. |
| Non-constant format string (`cpp/non-constant-format`) | Fewer false positive results | The query was rewritten using the taint-tracking library. |
| Sign check of bitwise operation (`cpp/bitwise-sign-check`) | Fewer false positive and more true positive results | The query now understands the direction of each comparison, making it more accurate. |
| Suspicious pointer scaling (`cpp/suspicious-pointer-scaling`) | Lower precision | The precision of this query has been reduced to "medium". This coding pattern is used intentionally and safely in a number of real-world projects. Results are no longer displayed on LGTM unless you choose to display them. |
| Variable used in its own initializer (`cpp/use-in-own-initializer`) | Fewer false positive results | False positive results for constant variables with the same name in different namespaces have been removed. |
## Changes to QL libraries
- The data flow library (`semmle.code.cpp.dataflow.DataFlow`) has had the
following improvements, all of which benefit the taint tracking library
(`semmle.code.cpp.dataflow.TaintTracking`) as well.
- This release includes preliminary support for interprocedural flow through
fields (non-static data members). In some cases, data stored in a field in
one function can now flow to a read of the same field in a different
function.
- The possibility of specifying barrier edges using
`isBarrierEdge`/`isSanitizerEdge` in data-flow and taint-tracking
configurations has been replaced with the option of specifying in- and
out-barriers on nodes by overriding `isBarrierIn`/`isSanitizerIn` and
`isBarrierOut`/`isSanitizerOut`. This should be simpler to use effectively,
as it does not require knowledge about the actual edges used internally by
the library.
- The library now models data flow through `std::swap`.
- Recursion through the `DataFlow` library is now always a compile error. Such recursion has been deprecated since release 1.16 in March 2018. If one `DataFlow::Configuration` needs to depend on the results of another, switch one of them to use one of the `DataFlow2` through `DataFlow4` libraries.
- In the `semmle.code.cpp.dataflow.TaintTracking` library, the second copy of `Configuration` has been renamed from `TaintTracking::Configuration2` to `TaintTracking2::Configuration`, and the old name is now deprecated. Import `semmle.code.cpp.dataflow.TaintTracking2` to access the new name.
- The `semmle.code.cpp.security.TaintTracking` library now considers a pointer difference calculation as blocking taint flow.
- The predicate `Variable.getAnAssignedValue()` now reports assignments to fields resulting from aggregate initialization (` = {...}`).
- The predicate `TypeMention.toString()` has been simplified to always return the string "`type mention`". This may improve performance when using `Element.toString()` or its descendants.
- Fixed the `LocalScopeVariableReachability.qll` library's handling of loops where the entry condition is always true on first entry, and where there is more than one control flow path through the loop condition. This change increases the accuracy of the `LocalScopeVariableReachability.qll` library and queries that depend on it.
- There is a new `Variable.isThreadLocal()` predicate. It can be used to tell whether a variable is `thread_local`.
- C/C++ code examples have been added to QLDoc comments on many more classes in the QL libraries.
| Constant condition (`cs/constant-condition`) | Fewer false positive results | Results have been removed for default cases (`_`) in switch expressions. |
| Dispose may not be called if an exception is thrown during execution (`cs/dispose-not-called-on-throw`) | Fewer false positive results | Results have been removed where an object is disposed both by a `using` statement and a `Dispose` call. |
| Unchecked return value (`cs/unchecked-return-value`) | Fewer false positive results | Method calls that are expression bodies of `void` callables (for example, the call to `Foo` in `void Bar() => Foo()`) are no longer considered to use the return value. |
## Removal of old queries
The following historic queries are no longer available in the distribution:
* Added lines (`cs/vcs/added-lines-per-file`)
* Churned lines (`cs/vcs/churn-per-file`)
* Defect filter
* Defect from SVN
* Deleted lines (`cs/vcs/deleted-lines-per-file`)
* Files edited in pairs
* Filter: only files recently edited
* Large files currently edited
* Metric from SVN
* Number of authors (version control) (`cs/vcs/authors-per-file`)
* Number of file-level changes (`cs/vcs/commits-per-file`)
* Number of co-committed files (`cs/vcs/co-commits-per-file`)
* Number of file re-commits (`cs/vcs/recommits-per-file`)
* Number of recent file changes (`cs/vcs/recent-commits-per-file`)
* Number of authors
* Number of commits
* Poorly documented files with many authors
* Recent activity
## Changes to code extraction
* The following C# 8 features are now extracted:
- Suppress-nullable-warning expressions, for example `x!`
- Nullable reference types, for example `string?`
## Changes to QL libraries
* The new class `AnnotatedType` models types with type annotations, including nullability information, return kinds (`ref` and `readonly ref`), and parameter kinds (`in`, `out`, and `ref`).
- The new predicate `Assignable.getAnnotatedType()` gets the annotated type of an assignable (such as a variable or a property).
- The new predicates `Callable.getAnnotatedReturnType()` and `DelegateType.getAnnotatedReturnType()` gets the annotated type of the return value.
- The new predicate `ArrayType.getAnnotatedElementType()` gets the annotated type of the array element.
- The new predicate `ConstructedGeneric.getAnnotatedTypeArgument()` gets the annotated type of a type argument.
- The new predicate `TypeParameterConstraints.getAnAnnotatedTypeConstraint()` gets a type constraint with type annotations.
* The new class `SuppressNullableWarningExpr` models suppress-nullable-warning expressions such as `x!`.
* The data-flow and taint-tracking libraries now support flow through fields. All existing configurations will have field-flow enabled by default, but it can be disabled by adding `override int fieldFlowBranchLimit() { result = 0 }` to the configuration class. Field assignments, `this.Foo = x`, object initializers, `new C() { Foo = x }`, and field initializers `int Foo = 0` are supported.
* The possibility of specifying barrier edges using
`isBarrierEdge`/`isSanitizerEdge` in data-flow and taint-tracking
configurations has been replaced with the option of specifying in- and
out-barriers on nodes by overriding `isBarrierIn`/`isSanitizerIn` and
`isBarrierOut`/`isSanitizerOut`. This should be simpler to use effectively,
as it does not require knowledge about the actual edges used internally by
| Equals method does not inspect argument type (`java/unchecked-cast-in-equals`) | Fewer false positive and more true positive results | Precision has been improved by doing a bit of inter-procedural analysis and relying less on ad-hoc method names. |
| Uncontrolled data in arithmetic expression (`java/uncontrolled-arithmetic`) | Fewer false positive results | Precision has been improved in several ways, in particular, by better detection of guards along the data-flow path. |
| Uncontrolled data used in path expression (`java/path-injection`) | Fewer false positive results | The query no longer reports results guarded by `!var.contains("..")`. |
| User-controlled data in arithmetic expression (`java/tainted-arithmetic`) | Fewer false positive results | Precision has been improved in several ways, in particular, by better detection of guards along the data-flow path. |
## Changes to QL libraries
* The virtual dispatch library has been updated to give more precise dispatch
targets for `Object.toString()` calls. This affects all security queries and
removes false positive results that arose from paths through impossible `toString()`
calls.
* The library `VCS.qll` and all queries that imported it have been removed.
* The second copy of the interprocedural `TaintTracking` library has been
renamed from `TaintTracking::Configuration2` to
`TaintTracking2::Configuration`, and the old name is now deprecated. Import
`semmle.code.java.dataflow.TaintTracking2` to access the new name.
* The data-flow library now makes it easier to specify barriers/sanitizers
arising from guards by overriding the predicate
`isBarrierGuard`/`isSanitizerGuard` on data-flow and taint-tracking
configurations respectively.
* The possibility of specifying barrier edges using
`isBarrierEdge`/`isSanitizerEdge` in data-flow and taint-tracking
configurations has been replaced with the option of specifying in- and
out-barriers on nodes by overriding `isBarrierIn`/`isSanitizerIn` and
`isBarrierOut`/`isSanitizerOut`. This should be simpler to use effectively,
as it does not require knowledge about the actual edges used internally by
* Automatic classification of test files has been improved, in particular `__tests__` and `__mocks__` folders (as used by [Jest](https://jestjs.io)) are now recognized.
* Support for the following frameworks and libraries has been improved:
* Support for tracking data flow and taint through getter functions (that is, functions that return a property of one of their arguments) and through the receiver object of method calls has been improved. This may produce more security alerts.
* Taint tracking through object property names has been made more precise, resulting in fewer false positive results.
* Method calls are now resolved in more cases, due to improved class hierarchy analysis. This may produce more security alerts.
* Jump-to-definition now resolves calls to their definition in more cases, and supports jumping from a JSDoc type annotation to its definition.
| Indirect uncontrolled command line (`js/indirect-command-line-injection`) | correctness, security, external/cwe/cwe-078, external/cwe/cwe-088 | Highlights command-line invocations that may indirectly introduce a command-line injection vulnerability elsewhere, indicating a possible violation of [CWE-78](https://cwe.mitre.org/data/definitions/78.html). Results are not shown on LGTM by default. |
| Conflicting HTML element attributes (`js/conflicting-html-attribute`) | No changes to results | Results are no longer shown on LGTM by default. |
| Shift out of range (`js/shift-out-of-range`| Fewer false positive results | This rule now correctly handles BigInt shift operands. |
| Superfluous trailing arguments (`js/superfluous-trailing-arguments`) | Fewer false-positive results. | This rule no longer flags calls to placeholder functions that trivially throw an exception. |
| Undocumented parameter (`js/jsdoc/missing-parameter`) | No changes to results | This rule is now run on LGTM, although its results are still not shown by default. |
| Missing space in string concatenation (`js/missing-space-in-concatenation`) | Fewer false positive results | The rule now requires a word-like part exists in the string concatenation. |
## Changes to QL libraries
- The `getName()` predicate on functions and classes now gets a name that is
inferred from the context if the function or class was not declared with a name.
- The two-argument and three-argument variants of `DataFlow::Configuration::isBarrier` and
`TaintTracking::Configuration::isSanitizer` have been deprecated. Overriding them no
longer has any effect. Use `isBarrierEdge` and `isSanitizerEdge` instead.
- The QLDoc for most AST classes have been expanded with concrete syntax examples.
- Tutorials on how to use [flow labels](https://help.semmle.com/QL/learn-ql/javascript/flow-labels.html)
and [type tracking](https://help.semmle.com/QL/learn-ql/javascript/type-tracking.html) have been published,
as well as a [data flow cheat sheet](https://help.semmle.com/QL/learn-ql/javascript/dataflow-cheat-sheet.html) for quick reference.
Tracking of "unknown" values from modules that are absent from the database has been improved. Particularly when an "unknown" value is used as a decorator, the decorated function is tracked.
### Loop unrolling
The extractor now unrolls a single iteration of loops that are known to run at least once. This improves analysis in cases like the following
```python
ifseq:
forxinseq:
y=x
y# y is defined here
```
### Better API for function parameter annotations
Instances of the `Parameter` and `ParameterDefinition` class now have a `getAnnotation` method that returns the corresponding parameter annotation, if one exists.
### Improvements to the Value API
- The Value API has been extended with classes representing functions, classes, tuples, and other types.
-`Value::forInt(int x)` and `Value::forString(string s)` have been added to make it easier to refer to the `Value` entities for common constants.
### Other improvements
- Short flags for regexes (for example, `re.M` for multiline regexes) are now handled correctly.
- Modules with multiple import roots no longer get multiple names.
- A new `NegativeIntegerLiteral` class has been added as a subtype of `ImmutableLiteral`, so that `-1` is treated as an `ImmutableLiteral`. This means that queries looking for the use of constant integers will automatically handle negative numbers.
## New queries
| **Query** | **Tags** | **Purpose** |
|-----------|----------|-------------|
| Arbitrary file write during tarfile extraction (`py/tarslip`) | security, external/cwe/cwe-022 | Finds instances where extracting from a tar archive can result in arbitrary file writes. Results are not shown on LGTM by default. |
| Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) | reliability, japanese-era | This query is a combination of two old queries that were identical in purpose but separate as an implementation detail. This new query replaces Hard-coded Japanese era start date in call (`cpp/japanese-era/constructor-or-method-with-exact-era-date`) and Hard-coded Japanese era start date in struct (`cpp/japanese-era/struct-with-exact-era-date`). |
| Hard-coded Japanese era start date in call (`cpp/japanese-era/constructor-or-method-with-exact-era-date`) | Deprecated | This query has been deprecated. Use the new combined query Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) instead. |
| Hard-coded Japanese era start date in struct (`cpp/japanese-era/struct-with-exact-era-date`) | Deprecated | This query has been deprecated. Use the new combined query Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) instead. |
| Hard-coded Japanese era start date (`cpp/japanese-era/exact-era-date`) | More correct results | This query now checks for the beginning date of the Reiwa era (1st May 2019). |
| Sign check of bitwise operation (`cpp/bitwise-sign-check`) | Fewer false positive results | Results involving `>=` or `<=` are no longer reported. |
| Too few arguments to formatting function (`cpp/wrong-number-format-arguments`) | Fewer false positive results | Fixed false positives resulting from mistmatching declarations of a formatting function. |
| Too many arguments to formatting function (`cpp/too-many-format-arguments`) | Fewer false positive results | Fixed false positives resulting from mistmatching declarations of a formatting function. |
| Unclear comparison precedence (`cpp/comparison-precedence`) | Fewer false positive results | False positives involving template classes and functions have been fixed. |
## Changes to QL libraries
* The data-flow library has been extended with a new feature to aid debugging.
Instead of specifying `isSink(Node n) { any() }` on a configuration to
explore the possible flow from a source, it is recommended to use the new
`Configuration::hasPartialFlow` predicate, as this gives a more complete
picture of the partial flow paths from a given source. The feature is
disabled by default and can be enabled for individual configurations by
overriding `int explorationLimit()`.
* The data-flow library now supports flow out of C++ reference parameters.
* The data-flow library now allows flow through the address-of operator (`&`).
* The `DataFlow::DefinitionByReferenceNode` class now considers `f(x)` to be a
definition of `x` when `x` is a variable of pointer type. It no longer
considers deep paths such as `f(&x.myField)` to be definitions of `x`. These
changes are in line with the user expectations we've observed.
* There is now a `DataFlow::localExprFlow` predicate and a
`TaintTracking::localExprTaint` predicate to make it easy to use the most
common case of local data flow and taint: from one `Expr` to another.
* The member predicates of the `FunctionInput` and `FunctionOutput` classes have been renamed for
clarity (e.g. `isOutReturnPointer()` to `isReturnValueDeref()`). The existing member predicates
have been deprecated, and will be removed in a future release. Code that uses the old member
predicates should be updated to use the corresponding new member predicate.
* The control-flow graph is now computed in QL, not in the extractor. This can
lead to regressions (or improvements) in how queries are optimized because
optimization in QL relies on static size estimates, and the control-flow edge
relations will now have different size estimates than before.
| Unsafe year argument for 'DateTime' constructor (`cs/unsafe-year-construction`) | reliability, date-time | Finds incorrect manipulation of `DateTime` values, which could lead to invalid dates. |
| Mishandling the Japanese era start date (`cs/mishandling-japanese-era`) | reliability, date-time | Finds hard-coded Japanese era start dates that could be invalid. |
| Dereferenced variable may be null (`cs/dereferenced-value-may-be-null`) | Fewer false positive results | More `null` checks are now taken into account, including `null` checks for `dynamic` expressions and `null` checks such as `object alwaysNull = null; if (x != alwaysNull) ...`. |
| Missing Dispose call on local IDisposable (`cs/local-not-disposed`) | Fewer false positive results | The query has been rewritten in order to identify more dispose patterns. For example, a local `IDisposable` that is disposed of by passing through a fluent API is no longer reported. |
## Removal of old queries
## Changes to code extraction
*`nameof` expressions are now extracted correctly when the name is a namespace.
## Changes to QL libraries
* The new class `NamespaceAccess` models accesses to namespaces, for example in `nameof` expressions.
* The data-flow library now makes it easier to specify barriers/sanitizers
arising from guards by overriding the predicate
`isBarrierGuard`/`isSanitizerGuard` on data-flow and taint-tracking
configurations respectively.
* The data-flow library has been extended with a new feature to aid debugging.
Instead of specifying `isSink(Node n) { any() }` on a configuration to
explore the possible flow from a source, it is recommended to use the new
`Configuration::hasPartialFlow` predicate, as this gives a more complete
picture of the partial flow paths from a given source. The feature is
disabled by default and can be enabled for individual configurations by
overriding `int explorationLimit()`.
*`foreach` statements where the body is guaranteed to be executed at least once, such as `foreach (var x in new string[]{ "a", "b", "c" }) { ... }`, are now recognized by all analyses based on the control flow graph (such as SSA, data flow and taint tracking).
* Fixed the control flow graph for `switch` statements where the `default` case was not the last case. This had caused the remaining cases to be unreachable. `SwitchStmt.getCase(int i)` now puts the `default` case last.
* There is now a `DataFlow::localExprFlow` predicate and a
`TaintTracking::localExprTaint` predicate to make it easy to use the most
common case of local data flow and taint: from one `Expr` to another.
| Continue statement that does not continue (`java/continue-in-false-loop`) | correctness | Finds `continue` statements in `do { ... } while (false)` loops. |
| Dereferenced variable may be null (`java/dereferenced-value-may-be-null`) | Fewer false positives | Certain indirect null guards involving two auxiliary variables known to be equal can now be detected. |
| Non-synchronized override of synchronized method (`java/non-sync-override`) | Fewer false positives | Results are now only reported if the immediately overridden method is synchronized. |
| Query built from user-controlled sources (`java/sql-injection`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as SQL expressions sinks. |
| Query built from local-user-controlled sources (`java/sql-injection-local`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as SQL expressions sinks. |
| Query built without neutralizing special characters (`java/concatenated-sql-query`) | More results | The query now identifies arguments to `Statement.executeLargeUpdate` and `Connection.prepareCall` as SQL expressions sinks. |
| Useless comparison test (`java/constant-comparison`) | Fewer false positives | Additional overflow check patterns are now recognized and no longer reported. |
## Changes to QL libraries
* The data-flow library has been extended with a new feature to aid debugging.
Instead of specifying `isSink(Node n) { any() }` on a configuration to
explore the possible flow from a source, it is recommended to use the new
`Configuration::hasPartialFlow` predicate, as this gives a more complete
picture of the partial flow paths from a given source. The feature is
disabled by default and can be enabled for individual configurations by
| Unused index variable (`js/unused-index-variable`) | correctness | Highlights loops that iterate over an array, but do not use the index variable to access array elements, indicating a possible typo or logic error. Results are shown on LGTM by default. |
| Loop bound injection (`js/loop-bound-injection`) | security, external/cwe/cwe-834 | Highlights loops where a user-controlled object with an arbitrary .length value can trick the server to loop indefinitely. Results are shown on LGTM by default. |
| Suspicious method name (`js/suspicious-method-name-declaration`) | correctness, typescript, methods | Highlights suspiciously named methods where the developer likely meant to write a constructor or function. Results are shown on LGTM by default. |
| Shell command built from environment values (`js/shell-command-injection-from-environment`) | correctness, security, external/cwe/cwe-078, external/cwe/cwe-088 | Highlights shell commands that may change behavior inadvertently depending on the execution environment, indicating a possible violation of [CWE-78](https://cwe.mitre.org/data/definitions/78.html). Results are shown on LGTM by default.|
| Use of returnless function (`js/use-of-returnless-function`) | maintainability, correctness | Highlights calls where the return value is used, but the callee never returns a value. Results are shown on LGTM by default. |
| Useless regular expression character escape (`js/useless-regexp-character-escape`) | correctness, security, external/cwe/cwe-20 | Highlights regular expression strings with useless character escapes, indicating a possible violation of [CWE-20](https://cwe.mitre.org/data/definitions/20.html). Results are shown on LGTM by default. |
| Unreachable method overloads (`js/unreachable-method-overloads`) | correctness, typescript | Highlights method overloads that are impossible to use from client code. Results are shown on LGTM by default. |
| Incomplete string escaping or encoding (`js/incomplete-sanitization`) | Fewer false-positive results | This rule now recognizes additional ways delimiters can be stripped away. |
| Client-side cross-site scripting (`js/xss`) | More results, fewer false-positive results | More potential vulnerabilities involving functions that manipulate DOM attributes are now recognized, and more sanitizers are detected. |
| Code injection (`js/code-injection`) | More results | More potential vulnerabilities involving functions that manipulate DOM event handler attributes are now recognized. |
| Hard-coded credentials (`js/hardcoded-credentials`) | Fewer false-positive results | This rule now flags fewer password examples. |
| Illegal invocation (`js/illegal-invocation`) | Fewer false-positive results | This rule now correctly handles methods named `call` and `apply`. |
| Incorrect suffix check (`js/incorrect-suffix-check`) | Fewer false-positive results | The query recognizes valid checks in more cases. |
| Network data written to file (`js/http-to-file-access`) | Fewer false-positive results | This query has been renamed to better match its intended purpose, and now only considers network data untrusted. |
| Password in configuration file (`js/password-in-configuration-file`) | Fewer false-positive results | This rule now flags fewer password examples. |
| Prototype pollution (`js/prototype-pollution`) | More results | The query now highlights vulnerable uses of jQuery and Angular, and the results are shown on LGTM by default. |
| Reflected cross-site scripting (`js/reflected-xss`) | Fewer false-positive results | The query now recognizes more sanitizers. |
| Stored cross-site scripting (`js/stored-xss`) | Fewer false-positive results | The query now recognizes more sanitizers. |
| Uncontrolled command line (`js/command-line-injection`) | More results | This query now treats responses from servers as untrusted. |
## Changes to QL libraries
*`Expr.getDocumentation()` now handles chain assignments.
## Removal of deprecated queries
The following queries (deprecated since 1.17) are no longer available in the distribution:
| Clear-text logging of sensitive information (`py/clear-text-logging-sensitive-data`) | security, external/cwe/cwe-312 | Finds instances where sensitive information is logged without encryption or hashing. Results are shown on LGTM by default. |
| Clear-text storage of sensitive information (`py/clear-text-storage-sensitive-data`) | security, external/cwe/cwe-312 | Finds instances where sensitive information is stored without encryption or hashing. Results are shown on LGTM by default. |
| Binding a socket to all network interfaces (`py/bind-socket-all-network-interfaces`) | security | Finds instances where a socket is bound to all network interfaces. Results are shown on LGTM by default. |
exists(Block body | body = this.(Loop ).getStmt() or
body = this.(SwitchStmt).getStmt()
| strictcount(body.getAStmt+()) > 6)
and not exists (this.getGeneratingMacro())
exists(Block body |
body = this.(Loop).getStmt() or
body = this.(SwitchStmt).getStmt()
|
strictcount(body.getAStmt+()) > 6
) and
not exists(this.getGeneratingMacro())
}
}
from Block b, int n, ComplexStmt complexStmt
where n = strictcount(ComplexStmt s | s = b.getAStmt()) and n > 3
and complexStmt = b.getAStmt()
select b, "Block with too many statements (" + n.toString() + " complex statements in the block). Complex statements at: $@", complexStmt, complexStmt.toString()
where
n = strictcount(ComplexStmt s | s = b.getAStmt()) and
n > 3 and
complexStmt = b.getAStmt()
select b,
"Block with too many statements (" + n.toString() +
" complex statements in the block). Complex statements at: $@", complexStmt,
from Function f, LocalVariable lv, ParameterDeclarationEntry pde, string name
where
lv = localVariableNames(f, name) and
pde = functionParameterNames(f, name) and
not lv.isInMacroExpansion()
select lv, "Local variable '" + lv.getName() + "' hides a $@.", pde, "parameter of the same name"
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.