# Using the shared data-flow library

This document is aimed at language maintainers and contains implementation
details that should be mostly irrelevant to query writers.

## Overview

The shared data-flow library implements sophisticated global data flow on top
of a language-specific data-flow graph. The language-specific bits supply the
graph through a number of predicates and classes, and the shared implementation
takes care of matching call sites with returns and field writes with reads to
ensure that the generated paths are well-formed. The library also supports a
number of additional features for improving precision, for example pruning
infeasible paths based on type information.

## File organisation

The data-flow library consists of a number of files, typically located in
`<lang>/dataflow` and `<lang>/dataflow/internal`:

```
dataflow/DataFlow.qll
dataflow/internal/DataFlowImpl.qll
dataflow/internal/DataFlowCommon.qll
dataflow/internal/DataFlowImplSpecific.qll
```

`DataFlow.qll` provides the user interface for the library and consists of just
a few lines of code importing the implementation:

#### `DataFlow.qll`
```ql
import <lang>

module DataFlow {
  import semmle.code.java.dataflow.internal.DataFlowImpl
}
```

The `DataFlowImpl.qll` and `DataFlowCommon.qll` files contain the library code
that is shared across languages. These contain `Configuration`-specific and
`Configuration`-independent code, respectively. This organization allows
multiple copies of the library to exist without duplicating the
`Configuration`-independent predicates (for the use case where a query wants to
use two instances of global data flow and the configuration of one depends on
the results of the other). Using multiple copies just means duplicating
`DataFlow.qll` and `DataFlowImpl.qll`, for example as:

```
dataflow/DataFlow2.qll
dataflow/DataFlow3.qll
dataflow/internal/DataFlowImpl2.qll
dataflow/internal/DataFlowImpl3.qll
```

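For example, a second copy might look like the following sketch, mirroring the
Java-flavoured `DataFlow.qll` shown above; the exact import paths depend on the
language's package layout:

#### `DataFlow2.qll`
```ql
import <lang>

module DataFlow2 {
  import semmle.code.java.dataflow.internal.DataFlowImpl2
}
```
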
The file `DataFlowImplSpecific.qll` provides all the language-specific classes
and predicates that the library needs as input and is the topic of the rest of
this document.

This file must provide two modules named `Public` and `Private`, which the
shared library code will import publicly and privately, respectively, thus
allowing the language-specific part to choose which classes and predicates
should be exposed by `DataFlow.qll`.

A typical implementation looks as follows, organizing the predicates in two
files, which we'll assume in the rest of this document:

#### `DataFlowImplSpecific.qll`
```ql
module Private {
  import DataFlowPrivate
}

module Public {
  import DataFlowPublic
}
```

## Defining the data-flow graph

The main input to the library is the data-flow graph. One must define a class
`Node` and an edge relation `simpleLocalFlowStep(Node node1, Node node2)`. The
`Node` class should be in `DataFlowPublic`.

Recommendations:
* Make `Node` an IPA type (see the sketch below). There is commonly a need for
  defining various data-flow nodes that are not necessarily represented in the
  AST of the language.
* Define `predicate localFlowStep(Node node1, Node node2)` as an alias of
  `simpleLocalFlowStep` and expose it publicly. The reason for this indirection
  is that it gives the option of exposing local flow augmented with field flow.
  See the C/C++ implementation, which makes use of this feature. Another use of
  this indirection is to hide synthesized local steps that are only relevant
  for global flow. See the C# implementation for an example of this.
* Define `predicate localFlow(Node node1, Node node2) { localFlowStep*(node1, node2) }`.
* Make the local flow step relation in `simpleLocalFlowStep` follow
  def-to-first-use and use-to-next-use steps for SSA variables. Def-use steps
  also work, but the upside of `use-use` steps is that sources defined in terms
  of variable reads just work out of the box. It also makes certain
  barrier-implementations simpler.

The shared library does not use `localFlowStep` or `localFlow`, but users of
`DataFlow.qll` may expect the existence of `DataFlow::localFlowStep` and
`DataFlow::localFlow`.

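To make these recommendations concrete, here is a minimal, hypothetical sketch
of what the IPA type and the local-flow predicates could look like in
`DataFlowPublic.qll`. The constructor names `TExprNode`/`TParameterNode` are
placeholders for whatever language-specific branches are needed (including
synthetic nodes), and `simpleLocalFlowStep` is assumed to be defined in
`DataFlowPrivate.qll`:

```ql
// Hypothetical sketch only; adapt the branches to the language's AST.
private newtype TNode =
  TExprNode(DataFlowExpr e) or
  TParameterNode(Parameter p)

class Node extends TNode {
  string toString() { result = "data-flow node" }

  /** Gets the expression this node corresponds to, if any. */
  DataFlowExpr asExpr() { this = TExprNode(result) }

  /** Gets the parameter this node corresponds to, if any. */
  Parameter asParameter() { this = TParameterNode(result) }
}

/** An alias of `simpleLocalFlowStep`; the indirection leaves room to add field flow or hide steps. */
predicate localFlowStep(Node node1, Node node2) { simpleLocalFlowStep(node1, node2) }

/** Holds if data can flow from `node1` to `node2` using zero or more local steps. */
predicate localFlow(Node node1, Node node2) { localFlowStep*(node1, node2) }
```
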
### `Node` subclasses

The `Node` class needs a number of subclasses. As a minimum the following are needed:
```
ExprNode
ParameterNode
PostUpdateNode

OutNode
ArgumentNode
ReturnNode
CastNode
```
and possibly more depending on the language and its AST. Of the above, the
first 3 should be public, but the last 4 can be private. Also, the last 4 will
likely be subtypes of `ExprNode`. For further details about `ParameterNode`,
`ArgumentNode`, `ReturnNode`, and `OutNode` see [The call-graph](#the-call-graph)
below. For further details about `CastNode` see [Type pruning](#type-pruning) below.
For further details about `PostUpdateNode` see [Field flow](#field-flow) below.

Nodes corresponding to expressions and parameters are the most common for users
to interact with, so a couple of convenience predicates are generally included:
```
DataFlowExpr Node::asExpr()
Parameter Node::asParameter()
ExprNode exprNode(DataFlowExpr n)
ParameterNode parameterNode(Parameter n)
```
Here `DataFlowExpr` should be an alias for the language-specific class of
expressions (typically called `Expr`). Parameters do not need an alias for the
shared implementation to refer to, so here you can just use the
language-specific class name (typically called `Parameter`).

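Continuing the hypothetical sketch from earlier, the expression- and
parameter-based subclasses and the convenience predicates could be wired up
roughly like this (again, `TExprNode`/`TParameterNode` are placeholder
constructors, not part of the required interface):

```ql
// Hypothetical sketch building on the `TNode` constructors above.
class ExprNode extends Node, TExprNode {
  DataFlowExpr getExpr() { result = this.asExpr() }
}

class ParameterNode extends Node, TParameterNode {
  Parameter getParameter() { result = this.asParameter() }
}

/** Gets the node corresponding to expression `e`. */
ExprNode exprNode(DataFlowExpr e) { result.asExpr() = e }

/** Gets the node corresponding to parameter `p`. */
ParameterNode parameterNode(Parameter p) { result.asParameter() = p }
```
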
### The call-graph

In order to make inter-procedural flow work, a number of classes and predicates
must be provided.

First, two types, `DataFlowCall` and `DataFlowCallable`, must be defined. These
should be aliases for whatever language-specific classes represent calls and
callables (a "callable" is intended as a broad term covering functions,
methods, constructors, lambdas, etc.). It can also be useful to represent
`DataFlowCall` as an IPA type if implicit calls need to be modelled. The
call-graph should be defined as a predicate:
```ql
/** Gets a viable target for the call `c`. */
DataFlowCallable viableCallable(DataFlowCall c)
```
Furthermore, each `Node` must be associated with exactly one callable, and this
relation should be defined as:
```ql
/** Gets the callable in which node `n` occurs. */
DataFlowCallable nodeGetEnclosingCallable(Node n)
```

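For a language with only statically resolved calls, a minimal sketch could
simply delegate to the AST; here `Call`, `Callable`, `getTarget()`, and
`getEnclosingCallable()` are assumed, hypothetical language-specific classes
and accessors:

```ql
// Hypothetical sketch assuming language-specific `Call`/`Callable` classes.
class DataFlowCall = Call;

class DataFlowCallable = Callable;

/** Gets a viable target for the call `c`. */
DataFlowCallable viableCallable(DataFlowCall c) { result = c.getTarget() }

/** Gets the callable in which node `n` occurs. */
DataFlowCallable nodeGetEnclosingCallable(Node n) { result = n.getEnclosingCallable() }
```
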
In order to connect data-flow across calls, the 4 `Node` subclasses
`ArgumentNode`, `ParameterNode`, `ReturnNode`, and `OutNode` are used.
Flow into callables, from arguments to parameters, is matched up using
language-defined classes `ParameterPosition` and `ArgumentPosition`,
so these three predicates must be defined:
```ql
/** Holds if `p` is a `ParameterNode` of `c` with position `pos`. */
predicate isParameterNode(ParameterNode p, DataFlowCallable c, ParameterPosition pos)

/** Holds if `arg` is an `ArgumentNode` of `c` with position `pos`. */
predicate isArgumentNode(ArgumentNode arg, DataFlowCall c, ArgumentPosition pos)

/** Holds if arguments at position `apos` match parameters at position `ppos`. */
predicate parameterMatch(ParameterPosition ppos, ArgumentPosition apos)
```

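For many languages, positions are simply integer indices and the match is
equality. A minimal sketch could look like this; the upper bound and the use of
`-1` for an implicit `this`/qualifier position are illustrative assumptions:

```ql
// Hypothetical sketch: positions are integer indices; -1 can be reserved for
// an implicit `this`/qualifier position if the language needs one.
class ParameterPosition extends int {
  ParameterPosition() { this in [-1 .. 99] }
}

class ArgumentPosition extends int {
  ArgumentPosition() { this in [-1 .. 99] }
}

/** Holds if arguments at position `apos` match parameters at position `ppos`. */
predicate parameterMatch(ParameterPosition ppos, ArgumentPosition apos) { ppos = apos }
```
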
For most languages return-flow is simpler and merely consists of matching up a
`ReturnNode` with the data-flow node corresponding to the value of the call,
represented as `OutNode`. For this use-case we would define a singleton type
`ReturnKind`, a trivial `ReturnNode::getKind()`, and `getAnOutNode` to relate
calls and `OutNode`s:
```ql
private newtype TReturnKind = TNormalReturnKind()

/** Gets the kind of this return node. */
ReturnKind ReturnNode::getKind() { any() }

/**
 * Gets a node that can read the value returned from `call` with return kind
 * `kind`.
 */
OutNode getAnOutNode(DataFlowCall call, ReturnKind kind) {
  result = call.getNode() and
  kind = TNormalReturnKind()
}
```

For more complex use-cases, when a language allows a callable to return multiple
values, for example through `out` parameters in C#, the `ReturnKind` class can
be defined and used to match up different kinds of `ReturnNode`s with the
corresponding `OutNode`s.

#### First-class functions

For calls to first-class functions, the library supports built-in call resolution based on data flow between a function creation expression and a call. The interface that needs to be implemented is

```ql
class LambdaCallKind

/** Holds if `creation` is an expression that creates a lambda of kind `kind` for `c`. */
predicate lambdaCreation(Node creation, LambdaCallKind kind, DataFlowCallable c)

/** Holds if `call` is a lambda call of kind `kind` where `receiver` is the lambda expression. */
predicate lambdaCall(DataFlowCall call, LambdaCallKind kind, Node receiver)

/** Extra data-flow steps needed for lambda flow analysis. */
predicate additionalLambdaFlowStep(Node nodeFrom, Node nodeTo, boolean preservesValue)
```

with the semantics that `call` will resolve to `c` if there is a data-flow path from `creation` to `receiver`, with matching `kind`s.

The implementation keeps track of a one-level call context, which means that we are able to handle situations like this:
```csharp
Apply(f, x) { f(x); }

Apply(x => NonSink(x), "tainted"); // GOOD

Apply(x => Sink(x), "not tainted"); // GOOD
```

However, since we only track one level, the following example will have false-positive flow:
```csharp
Apply(f, x) { f(x); }

ApplyWrapper(f, x) { Apply(f, x) }

ApplyWrapper(x => NonSink(x), "tainted"); // GOOD (FALSE POSITIVE)

ApplyWrapper(x => Sink(x), "not tainted"); // GOOD (FALSE POSITIVE)
```

## Flow through global variables

Flow steps through global variables are called jump-steps, since such steps
essentially jump from one callable to another, completely discarding call
contexts.

Adding support for this type of flow is done with the following predicate:
```ql
predicate jumpStep(Node node1, Node node2)
```

If global variables are common and certain databases have many reads and writes
of the same global variable, then a direct step may have performance problems,
since the straightforward implementation is just a cartesian product of reads
and writes for each global variable. In this case it can be beneficial to
remove the cartesian product by introducing an intermediate `Node` for the
value of each global variable.

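A sketch of that intermediate-node approach might look as follows;
`GlobalVariable`, `getAnAssignedValue()`, `getARead()`, and the
`globalVariableNode` helper are all hypothetical, language-specific names used
only for illustration:

```ql
// Hypothetical sketch: route flow through one node per global variable, so the
// relation is writes + reads per variable rather than writes x reads.
predicate jumpStep(Node node1, Node node2) {
  // write of a global variable -> node representing the variable's value
  exists(GlobalVariable g |
    node1.asExpr() = g.getAnAssignedValue() and
    node2 = globalVariableNode(g)
  )
  or
  // node representing the variable's value -> read of the global variable
  exists(GlobalVariable g |
    node1 = globalVariableNode(g) and
    node2.asExpr() = g.getARead()
  )
}
```
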
Note that jump steps can also be used to implement other kinds of
cross-callable flow; as an example, Java uses this mechanism for variable
capture flow. But beware that jump steps lose the call context, so normal
inter-procedural flow should use argument-to-parameter and return-to-`OutNode`
flow as described above.

## Field flow

The library supports tracking flow through field stores and reads. In order to
support this, a class `Content` and two predicates
`storeStep(Node node1, Content f, Node node2)` and
`readStep(Node node1, Content f, Node node2)` must be defined. It generally
makes sense for stores to target `PostUpdateNode`s, but this is not a strict
requirement. Besides this, certain nodes must have associated
`PostUpdateNode`s. The node associated with a `PostUpdateNode` should be
defined by `PostUpdateNode::getPreUpdateNode()`.

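As a rough illustration, store and read steps for plain field accesses could be
defined along these lines; `FieldWrite`, `FieldRead`, their accessors, and
`Content.getField()` are hypothetical, language-specific names:

```ql
// Hypothetical sketch: `x.f = y` stores into the post-update node of `x`,
// and `x.f` reads from the node for `x`.
predicate storeStep(Node node1, Content f, Node node2) {
  exists(FieldWrite fw |
    node1.asExpr() = fw.getRhs() and
    f.getField() = fw.getField() and
    node2.(PostUpdateNode).getPreUpdateNode().asExpr() = fw.getQualifier()
  )
}

predicate readStep(Node node1, Content f, Node node2) {
  exists(FieldRead fr |
    node1.asExpr() = fr.getQualifier() and
    f.getField() = fr.getField() and
    node2.asExpr() = fr
  )
}
```
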
`PostUpdateNode`s are generally used when we need two data-flow nodes for a
single AST element in order to distinguish the value before and after some
side-effect (typically a field store, but it may also be the addition of taint
through an additional step targeting a `PostUpdateNode`).

It is recommended to introduce `PostUpdateNode`s for all `ArgumentNode`s (this
can be skipped for immutable arguments) and for all field qualifiers, in both
reads and stores.

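One way to synthesize these is with a dedicated IPA branch per argument; the
`TArgumentPostUpdateNode` constructor below is a hypothetical addition to the
`TNode` sketch earlier, and `getPreUpdateNode()` is assumed to be declared on
`PostUpdateNode` as described above:

```ql
// Hypothetical sketch: one synthesized post-update node per argument node.
class ArgumentPostUpdateNode extends PostUpdateNode, TArgumentPostUpdateNode {
  ArgumentNode arg;

  ArgumentPostUpdateNode() { this = TArgumentPostUpdateNode(arg) }

  override Node getPreUpdateNode() { result = arg }
}
```
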
Remember to define local flow for `PostUpdateNode`s as well in
`simpleLocalFlowStep`. In general, outgoing local flow from `PostUpdateNode`s
should be use-use flow, and there is generally no need for incoming local flow
edges for `PostUpdateNode`s.

We will illustrate how the shared library makes use of `PostUpdateNode`s
through a couple of examples.

### Example 1

Consider the following setter and its call:
```
setFoo(obj, x) {
  sink1(obj.foo);
  obj.foo = x;
}

setFoo(myobj, source);
sink2(myobj.foo);
```
Here `source` should flow to the argument of `sink2` but not the argument of
`sink1`. The shared library handles most of the complexity involved in this
flow path, but needs a little bit of help in terms of available nodes. In
particular it is important to be able to distinguish between the value of the
`myobj` argument to `setFoo` before the call and after the call, since without
this distinction it is hard to avoid also getting flow to `sink1`. The value
before the call should be the regular `ArgumentNode` (which will get flow into
the call), and the value after the call should be a `PostUpdateNode`. Thus a
`PostUpdateNode` should exist for the `myobj` argument with the `ArgumentNode`
as its pre-update node. In general `PostUpdateNode`s should exist for any
mutable `ArgumentNode`s to support flow returning through a side-effect
updating the argument.

This example also suggests how `simpleLocalFlowStep` should be implemented for
`PostUpdateNode`s: we need a local flow step between the `PostUpdateNode` for
the `myobj` argument and the following `myobj` in the qualifier of `myobj.foo`.

Inside `setFoo` the actual store should also target a
`PostUpdateNode` - in this case associated with the qualifier `obj` - as this
is the mechanism the shared library uses to identify side-effects that should
be reflected at call sites as setter-flow. The shared library uses the
following rule to identify setters: If the value of a parameter may flow to a
node that is the pre-update node of a `PostUpdateNode` that is reached by some
flow, then this represents an update to the parameter, which will be reflected
in flow continuing to the `PostUpdateNode` of the corresponding argument at
call sites.

### Example 2

In the following two lines we would like flow from `x` to reach the
`PostUpdateNode` of `a` through a sequence of two store steps, and this is
indeed handled automatically by the shared library.
```
a.b.c = x;
a.getB().c = x;
```
The only requirement for this to work is the existence of `PostUpdateNode`s.
For a specified read step (in `readStep(Node n1, Content f, Node n2)`) the
shared library will generate a store step in the reverse direction between the
corresponding `PostUpdateNode`s. A similar store-through-reverse-read will be
generated for calls that can be summarized by the shared library as getters.
This usage of `PostUpdateNode`s ensures that `x` will not flow into the `getB`
call after reaching `a`.

### Example 3

Consider a constructor and its call (for this example we will use Java, but the
idea should generalize):

```java
MyObj(Content content) {
  this.content = content;
}

obj = new MyObj(source);
sink(obj.content);
```

We would like the constructor call to act in the same way as a setter, and
indeed this is quite simple to achieve. We can introduce a synthetic data-flow
node associated with the constructor call, let us call it `MallocNode`, and
make this an `ArgumentNode` with position `-1` such that it hooks up with the
implicit `this`-parameter of the constructor body. Then we can set the
corresponding `PostUpdateNode` of the `MallocNode` to be the constructor call
itself, as this represents the value of the object after construction, that is,
after the constructor has run. With this setup of `ArgumentNode`s and
`PostUpdateNode`s we will achieve the desired flow from `source` to `sink`.

### Field flow barriers

Consider this field flow example:
```
obj.f = source;
obj.f = safeValue;
sink(obj.f);
```
or the similar case when field flow is used to model collection content:
```
obj.add(source);
obj.clear();
sink(obj.get(key));
```
Clearing a field or content like this should act as a barrier, and this can be
achieved by marking the relevant `Node`, `Content` pair as a clear operation in
the `clearsContent` predicate. A reasonable default implementation for fields
looks like this:
```ql
predicate clearsContent(Node n, Content c) {
  n = any(PostUpdateNode pun | storeStep(_, c, pun)).getPreUpdateNode()
}
```
However, this relies on the local step relation using the smallest possible
use-use steps. If local flow is implemented using def-use steps, then
`clearsContent` might not be easy to use.

## Type pruning

The library supports pruning paths when a sequence of value-preserving steps
originates in a node with one type but reaches a node with another, incompatible
type, thus making the path impossible.

The type system for this is specified with the class `DataFlowType` and the
compatibility relation `compatibleTypes(DataFlowType t1, DataFlowType t2)`.
Using a singleton type as `DataFlowType` means that this feature is effectively
disabled.

It can be useful to use a simpler type system for pruning than whatever type
system might come with the language, as collections of types that would
otherwise be equivalent with respect to compatibility can then be represented
as a single entity (this improves performance). As an example, Java uses erased
types for this purpose and a single equivalence class for all numeric types.

The type of a `Node` is given by the following predicate
```
DataFlowType getNodeType(Node n)
```
and every `Node` should have a type.

One also needs to define the string representation of a `DataFlowType`:
```
string ppReprType(DataFlowType t)
```
The `ppReprType` predicate is used for printing a type in the labels of
`PathNode`s; it can be defined as `none()` if type pruning is not used.

Finally, one must define `CastNode` as the subclass of `Node` consisting of
those nodes where types should be checked. Usually this will be things like
explicit casts. The shared library will also check types at `ParameterNode`s
and `OutNode`s without needing to include these in `CastNode`. It is
semantically perfectly valid to include all nodes in `CastNode`, but this can
hurt performance as it will reduce the opportunity for the library to compact
several local steps into one. It is also perfectly valid to leave `CastNode` as
the empty set, and this should be the default if type pruning is not used.

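If a language opts out of type pruning, the stubs can be kept trivial. The
following minimal sketch, using a hypothetical unit type, is consistent with
the defaults described above:

```ql
// Hypothetical sketch with type pruning effectively disabled.
private newtype TUnit = TMkUnit()

/** A singleton type, so all nodes are mutually compatible. */
class DataFlowType extends TUnit {
  string toString() { result = "" }
}

DataFlowType getNodeType(Node n) { exists(n) and result = TMkUnit() }

predicate compatibleTypes(DataFlowType t1, DataFlowType t2) { any() }

string ppReprType(DataFlowType t) { none() }

/** The empty set: no extra nodes at which types are checked. */
class CastNode extends Node {
  CastNode() { none() }
}
```
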
## Virtual dispatch with call context

Consider a virtual call that may dispatch to multiple different targets. If we
know the call context of the call then this can sometimes be used to reduce the
set of possible dispatch targets and thus eliminate impossible call chains.

The library supports a one-level call context for improving virtual dispatch.

Conceptually, the following predicate should be implemented like this:
```ql
DataFlowCallable viableImplInCallContext(DataFlowCall call, DataFlowCall ctx) {
  exists(DataFlowCallable enclosing |
    result = viableCallable(call) and
    enclosing = call.getEnclosingCallable() and
    enclosing = viableCallable(ctx) and
    not ... <`result` is an impossible target for `call` given `ctx`> ...
  )
}
```
However, joining the virtual dispatch relation with itself in this way is
usually far too large to be feasible. Instead, the relation above should only
be defined for those values of `call` for which the set of resulting dispatch
targets might be reduced. To do this, define the set of `call`s that might for
some reason benefit from a call context as the following predicate (the `c`
column should be `call.getEnclosingCallable()`):
```ql
predicate mayBenefitFromCallContext(DataFlowCall call, DataFlowCallable c)
```
And then define `DataFlowCallable viableImplInCallContext(DataFlowCall call,
DataFlowCall ctx)` as sketched above, but restricted to
`mayBenefitFromCallContext(call, _)`.

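For instance, a common heuristic is to restrict attention to calls whose
dispatch depends on a parameter of the enclosing callable, since only then can
knowing the caller narrow the possible targets; the `getQualifier()`,
`getAParameter()`, and `getAnAccess()` accessors below are hypothetical:

```ql
// Hypothetical sketch: a call may benefit from the call context when its
// dispatch qualifier is a read of one of the enclosing callable's parameters.
predicate mayBenefitFromCallContext(DataFlowCall call, DataFlowCallable c) {
  c = call.getEnclosingCallable() and
  call.getQualifier() = c.getAParameter().getAnAccess()
}
```
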
The shared implementation will then compare counts of virtual dispatch targets
using `viableCallable` and `viableImplInCallContext` for each `call` in
`mayBenefitFromCallContext(call, _)` and track call contexts during flow
calculation when differences in these counts show improved precision for
further calls.

## Additional features

### Access path length limit

The maximum length of an access path is the maximum number of nested stores
that can be tracked. This is given by the following predicate:
```ql
int accessPathLimit() { result = 5 }
```
We have traditionally used 5 as a default value here, and real examples have
been observed to require at least this much. Changing this value has a direct
impact on performance for large databases.

### Hidden nodes

Certain synthetic nodes can be hidden to exclude them from occurring in path
explanations. This is done through the following predicate:
```ql
predicate nodeIsHidden(Node n)
```

### Unreachable nodes

Consider:
```
foo(source1, false);
foo(source2, true);

foo(x, b) {
  if (b)
    sink(x);
}
```
Sometimes certain data-flow nodes can be unreachable based on the call context.
In the above example, only `source2` should be able to reach `sink`. This is
supported by the following predicate, where one can specify unreachable nodes
given a call context:
```ql
predicate isUnreachableInCall(Node n, DataFlowCall callcontext) { .. }
```
Note that while this is a simple interface, it can have scalability issues if
the number of unreachable nodes is large and is combined with many call sites.

### `BarrierGuard`s

The class `BarrierGuard` must be defined. See
https://github.com/github/codeql/pull/1718 for details.

### Consistency checks

The file `dataflow/internal/DataFlowImplConsistency.qll` contains a number of
consistency checks to verify that the language-specific parts satisfy the
invariants that are expected by the shared implementation. Run these queries to
check for inconsistencies.