Dataflow: Add documentation language maintainers.

This commit is contained in:
Anders Schack-Mulligen
2020-06-30 14:57:56 +02:00
parent b57cfc965a
commit 4dabbac19b

# Using the shared data-flow library
This document is aimed at language maintainers and contains implementation
details that should be mostly irrelevant to query writers.
## Overview
The shared data-flow library implements sophisticated global data flow on top
of a language-specific data-flow graph. The language-specific bits supply the
graph through a number of predicates and classes, and the shared implementation
takes care of matching call-sites with returns and field writes with reads to
ensure that the generated paths are well-formed. The library also supports a
number of additional features for improving precision, for example pruning
infeasible paths based on type information.
## File organisation
The data-flow library consists of a number of files typically located in
`<lang>/dataflow` and `<lang>/dataflow/internal`:
```
dataflow/DataFlow.qll
dataflow/internal/DataFlowImpl.qll
dataflow/internal/DataFlowCommon.qll
dataflow/internal/DataFlowImplSpecific.qll
```
`DataFlow.qll` provides the user interface for the library and consists of just
a few lines of code importing the implementation:
#### `DataFlow.qll`
```ql
import <lang>

module DataFlow {
  import semmle.code.<lang>.dataflow.internal.DataFlowImpl
}
```
The `DataFlowImpl.qll` and `DataFlowCommon.qll` files contain the library code
that is shared across languages. They contain the `Configuration`-specific and
`Configuration`-independent code, respectively. This organization allows
multiple copies of the library to coexist, for the use case where a query wants
to use two instances of global data flow and the configuration of one depends
on the results of the other. Using multiple copies just means duplicating
`DataFlow.qll` and `DataFlowImpl.qll`, for example as:
```
dataflow/DataFlow2.qll
dataflow/DataFlow3.qll
dataflow/internal/DataFlowImpl2.qll
dataflow/internal/DataFlowImpl3.qll
```
The `DataFlowImplSpecific.qll` file provides all the language-specific classes
and predicates that the library needs as input and is the topic of the rest of
this document.
It must provide two modules named `Public` and `Private`, which the
shared library code will import publicly and privately, respectively, thus
allowing the language-specific part to choose which classes and predicates
should be exposed by `DataFlow.qll`.
A typical implementation organizes the predicates into two files, which we will
assume in the rest of this document:
#### `DataFlowImplSpecific.qll`
```ql
module Private {
  import DataFlowPrivate
}

module Public {
  import DataFlowPublic
}
```
## Defining the data-flow graph
The main input to the library is the data-flow graph. One must define a class
`Node` and an edge relation `simpleLocalFlowStep(Node node1, Node node2)`. The
`Node` class should be in `DataFlowPublic`.
Recommendations:
* Make `Node` an IPA type. There is commonly a need for defining various
data-flow nodes that are not necessarily represented in the AST of the
language.
* Define `predicate localFlowStep(Node node1, Node node2)` as an alias of
`simpleLocalFlowStep` and expose it publicly. The reason for this indirection
is that it gives the option of exposing local flow augmented with field flow.
See the C/C++ implementation, which makes use of this feature.
* Define `predicate localFlow(Node node1, Node node2) { localFlowStep*(node1, node2) }`.
* Make the local flow step relation in `simpleLocalFlowStep` follow
def-to-first-use and use-to-next-use steps for SSA variables. Def-use steps
also work, but the upside of use-use steps is that sources defined in terms
of variable reads just work out of the box. It also makes certain
barrier implementations simpler.
The shared library does not use `localFlowStep` or `localFlow` itself, but
users of `DataFlow.qll` will expect `DataFlow::localFlowStep` and
`DataFlow::localFlow` to exist.
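Taken together, the recommendations above can be sketched as follows. Here `ssaFlowStep` stands in for a hypothetical language-specific relation providing def-to-first-use and use-to-next-use SSA steps; it is not part of the shared interface:
```ql
// In DataFlowPrivate.qll: the step relation consumed by the shared library.
predicate simpleLocalFlowStep(Node node1, Node node2) {
  ssaFlowStep(node1, node2) // hypothetical SSA-based step relation
}

// In DataFlowPublic.qll: the user-facing aliases.
predicate localFlowStep(Node node1, Node node2) {
  simpleLocalFlowStep(node1, node2)
}

predicate localFlow(Node node1, Node node2) { localFlowStep*(node1, node2) }
```
The indirection through `localFlowStep` leaves room for later augmenting the public relation, for example with field flow, without changing what the shared library consumes.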
### `Node` subclasses
The `Node` class needs a number of subclasses. As a minimum the following are needed:
```
ExprNode
ParameterNode
PostUpdateNode
OutNode
ArgumentNode
ReturnNode
CastNode
```
and possibly more depending on the language and its AST. Of the above, the
first 3 should be public, but the last 4 can be private. Also, the last 4 will
likely be subtypes of `ExprNode`. For further details about `ParameterNode`,
`ArgumentNode`, `ReturnNode`, and `OutNode` see [The call-graph](#the-call-graph)
below. For further details about `CastNode` see [Type pruning](#type-pruning) below.
For further details about `PostUpdateNode` see [Field flow](#field-flow) below.
Nodes corresponding to expressions and parameters are the most common for users
to interact with, so a couple of convenience predicates are generally included:
```
DataFlowExpr Node::asExpr()
Parameter Node::asParameter()
ExprNode exprNode(DataFlowExpr n)
ParameterNode parameterNode(Parameter n)
```
Here `DataFlowExpr` should be an alias for the language-specific class of
expressions (typically called `Expr`). Parameters do not need an alias for the
shared implementation to refer to, so here you can just use the
language-specific class name (typically called `Parameter`).
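Assuming `Node` is an IPA type with hypothetical injectors `TExprNode` and `TParameterNode`, these conveniences might be sketched as:
```ql
class Node extends TNode {
  /** Gets the expression corresponding to this node, if any. */
  DataFlowExpr asExpr() { this = TExprNode(result) }

  /** Gets the parameter corresponding to this node, if any. */
  Parameter asParameter() { this = TParameterNode(result) }
}

/** Gets the node corresponding to expression `e`. */
ExprNode exprNode(DataFlowExpr e) { result.asExpr() = e }

/** Gets the node corresponding to parameter `p`. */
ParameterNode parameterNode(Parameter p) { result.asParameter() = p }
```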
### The call-graph
In order to make inter-procedural flow work a number of classes and predicates
must be provided.
First, two types, `DataFlowCall` and `DataFlowCallable`, must be defined. These
should be aliases for whatever language-specific class represents calls and
callables (a "callable" is intended as a broad term covering functions,
methods, constructors, lambdas, etc.). The call-graph should be defined as a
predicate:
```ql
DataFlowCallable viableCallable(DataFlowCall c)
```
In order to connect data-flow across calls, the 4 `Node` subclasses
`ArgumentNode`, `ParameterNode`, `ReturnNode`, and `OutNode` are used.
Flow into callables from arguments to parameters is matched up using an
integer position, so these two classes must define:
```ql
ArgumentNode::argumentOf(DataFlowCall call, int pos)
ParameterNode::isParameterOf(DataFlowCallable c, int pos)
```
It is typical to use `pos = -1` for an implicit `this`-parameter.
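A sketch of these two classes, assuming hypothetical language-specific AST members `getAnArgument`, `getArgument`, `getQualifier`, and `getParameter`:
```ql
class ArgumentNode extends ExprNode {
  ArgumentNode() {
    exists(DataFlowCall call |
      this.asExpr() = call.getAnArgument() or
      this.asExpr() = call.getQualifier()
    )
  }

  predicate argumentOf(DataFlowCall call, int pos) {
    this.asExpr() = call.getArgument(pos)
    or
    // The implicit `this`-argument is placed at position -1 by convention.
    pos = -1 and this.asExpr() = call.getQualifier()
  }
}

class ParameterNode extends Node {
  predicate isParameterOf(DataFlowCallable c, int pos) {
    this.asParameter() = c.getParameter(pos)
  }
}
```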
For most languages return-flow is simpler and merely consists of matching up a
`ReturnNode` with the data-flow node corresponding to the value of the call,
represented as `OutNode`. For this use-case we would define a singleton type
`ReturnKind`, a trivial `ReturnNode::getKind()`, and `getAnOutNode` to relate
calls and `OutNode`s:
```ql
private newtype TReturnKind = TNormalReturnKind()

ReturnKind ReturnNode::getKind() { any() }

OutNode getAnOutNode(DataFlowCall call, ReturnKind kind) {
  result = call.getNode() and
  kind = TNormalReturnKind()
}
```
For more complex use-cases when a language allows a callable to return multiple
values, for example through `out` parameters in C#, the `ReturnKind` class can
be defined and used to match up different kinds of `ReturnNode`s with the
corresponding `OutNode`s.
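For example, a language with `out` parameters might sketch its return kinds as follows (the AST members `isOutParameter` and `getPosition` and the node class `NormalReturnNode` are hypothetical):
```ql
private newtype TReturnKind =
  TNormalReturnKind() or
  TOutReturnKind(int pos) {
    exists(Parameter p | p.isOutParameter() and pos = p.getPosition())
  }

ReturnKind ReturnNode::getKind() {
  // An ordinary `return` statement.
  this instanceof NormalReturnNode and result = TNormalReturnKind()
  or
  // The final value of the `out` parameter at position `pos`.
  exists(int pos | this = outParameterReturnNode(pos) and result = TOutReturnKind(pos))
}
```
A corresponding `getAnOutNode` would then relate each call to one `OutNode` per return kind, so that flow out of an `out` parameter reaches the matching argument at the call site.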
## Flow through global variables
Flow through global variables is modelled with jump steps, since such flow
steps essentially jump from one callable to another, completely discarding the
call context.
Adding support for this type of flow is done with the following predicate:
```ql
predicate jumpStep(Node node1, Node node2)
```
If global variables are common and some databases have many reads and writes
of the same global variable, then a direct step relation may have performance
problems, since the straightforward implementation is a Cartesian product of
the reads and writes of each global variable. In this case it can be
beneficial to eliminate the Cartesian product by introducing an intermediate
`Node` for the value of each global variable.
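This pattern might be sketched as follows, using a hypothetical `GlobalVariable` class and a hypothetical IPA constructor `TGlobalNode(GlobalVariable v)` for the intermediate node:
```ql
predicate jumpStep(Node node1, Node node2) {
  exists(GlobalVariable v |
    // Write -> intermediate node: linear in the number of writes.
    node1.asExpr() = v.getAnAssignedValue() and node2 = TGlobalNode(v)
    or
    // Intermediate node -> read: linear in the number of reads.
    node1 = TGlobalNode(v) and node2.asExpr() = v.getAnAccess()
  )
}
```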
Note that jump steps can also be used to implement other kinds of
cross-callable flow; Java, for example, uses this mechanism for variable
capture flow. But beware that jump steps discard the call context, so normal
inter-procedural flow should use argument-to-parameter and return-to-out-node
flow as described above.
## Field flow
The library supports tracking flow through field stores and reads. In order to
support this, a class `Content` and two predicates
`storeStep(Node node1, Content f, PostUpdateNode node2)` and
`readStep(Node node1, Content f, Node node2)` must be defined. Besides this,
certain nodes must have associated `PostUpdateNode`s. The node associated with
a `PostUpdateNode` should be defined by `PostUpdateNode::getPreUpdateNode()`.
`PostUpdateNode`s are generally used when we need two data-flow nodes for a
single AST element in order to distinguish the value before and after some
side-effect (typically a field store, but it may also be addition of taint
through an additional step targeting a `PostUpdateNode`).
It is recommended to introduce `PostUpdateNode`s for all `ArgumentNode`s (this
can be skipped for immutable arguments), and all field qualifiers for both
reads and stores.
Remember to define local flow for `PostUpdateNode`s as well in
`simpleLocalFlowStep`. In general, outgoing local flow from `PostUpdateNode`s
should be use-use flow, and there is generally no need for incoming local flow
edges to `PostUpdateNode`s.
We will illustrate how the shared library makes use of `PostUpdateNode`s
through a couple of examples.
### Example 1
Consider the following setter and its call:
```
setFoo(obj, x) {
sink1(obj.foo);
obj.foo = x;
}
setFoo(myobj, source);
sink2(myobj.foo);
```
Here `source` should flow to the argument of `sink2` but not the argument of
`sink1`. The shared library handles most of the complexity involved in this
flow path, but needs a little bit of help in terms of available nodes. In
particular it is important to be able to distinguish between the value of the
`myobj` argument to `setFoo` before the call and after the call, since without
this distinction it is hard to avoid also getting flow to `sink1`. The value
before the call should be the regular `ArgumentNode` (which will get flow into
the call), and the value after the call should be a `PostUpdateNode`. Thus a
`PostUpdateNode` should exist for the `myobj` argument with the `ArgumentNode`
as its pre-update node. In general `PostUpdateNode`s should exist for any
mutable `ArgumentNode`s to support flow returning through a side-effect
updating the argument.
This example also suggests how `simpleLocalFlowStep` should be implemented for
`PostUpdateNode`s: we need a local flow step between the `PostUpdateNode` for
the `myobj` argument and the following `myobj` in the qualifier of `myobj.foo`.
Inside `setFoo` the actual store should also target a
`PostUpdateNode` - in this case associated with the qualifier `obj` - as this
is the mechanism the shared library uses to identify side-effects that should
be reflected at call sites as setter-flow. The shared library uses the
following rule to identify setters: If the value of a parameter may flow to a
node that is the pre-update node of a `PostUpdateNode` that is reached by some
flow, then this represents an update to the parameter, which will be reflected
in flow continuing to the `PostUpdateNode` of the corresponding argument in
call sites.
### Example 2
In the following two lines we would like flow from `x` to reach the
`PostUpdateNode` of `a` through a sequence of two store steps, and this is
indeed handled automatically by the shared library.
```
a.b.c = x;
a.getB().c = x;
```
The only requirement for this to work is the existence of `PostUpdateNode`s.
For a specified read step (in `readStep(Node n1, Content f, Node n2)`) the
shared library will generate a store step in the reverse direction between the
corresponding `PostUpdateNode`s. A similar store-through-reverse-read will be
generated for calls that can be summarized by the shared library as getters.
This usage of `PostUpdateNode`s ensures that `x` will not flow into the `getB`
call after reaching `a`.
### Example 3
Consider a constructor and its call (for this example we will use Java, but the
idea should generalize):
```java
MyObj(Content content) {
this.content = content;
}
obj = new MyObj(source);
sink(obj.content);
```
We would like the constructor call to act in the same way as a setter, and
indeed this is quite simple to achieve. We can introduce a synthetic data-flow
node associated with the constructor call, let us call it `MallocNode`, and
make this an `ArgumentNode` with position `-1` such that it hooks up with the
implicit `this`-parameter of the constructor body. Then we can set the
corresponding `PostUpdateNode` of the `MallocNode` to be the constructor call
itself, as this represents the value of the object after construction, that
is, after the constructor has run. With this setup of `ArgumentNode`s and
`PostUpdateNode`s we achieve the desired flow from `source` to `sink`.
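A sketch of this setup, with hypothetical names `TMallocNode` and `ConstructorCall`:
```ql
class MallocNode extends ArgumentNode, TMallocNode {
  ConstructorCall call;

  MallocNode() { this = TMallocNode(call) }

  // Hook the synthetic node up with the implicit `this`-parameter.
  override predicate argumentOf(DataFlowCall c, int pos) { c = call and pos = -1 }
}

// The constructor call itself acts as the post-update node of the
// `MallocNode`, representing the object after the constructor has run.
class ConstructorCallNode extends PostUpdateNode, ExprNode {
  ConstructorCallNode() { this.asExpr() instanceof ConstructorCall }

  override Node getPreUpdateNode() { result = TMallocNode(this.asExpr()) }
}
```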
### Field flow barriers
Consider this field flow example:
```
obj.f = source;
obj.f = safeValue;
sink(obj.f);
```
or the similar case when field flow is used to model collection content:
```
obj.add(source);
obj.clear();
sink(obj.get(key));
```
Clearing a field or content like this should act as a barrier, and this can be
achieved by marking the relevant `Node, Content` pair as a clear operation in
the `clearsContent` predicate. A reasonable default implementation for fields
looks like this:
```ql
predicate clearsContent(Node n, Content c) {
n = any(PostUpdateNode pun | storeStep(_, c, pun)).getPreUpdateNode()
}
```
However, this relies on the local step relation using the smallest possible
use-use steps. If local flow is implemented using def-use steps, then
`clearsContent` might not be easy to use.
## Type pruning
The library supports pruning paths in which a sequence of value-preserving
steps originates in a node with one type but reaches a node with another,
incompatible type, thus making the path impossible.
The type system for this is specified with the class `DataFlowType` and the
compatibility relation `compatibleTypes(DataFlowType t1, DataFlowType t2)`.
Using a singleton type as `DataFlowType` means that this feature is effectively
disabled.
It can be useful to use a simpler type system for pruning than whatever type
system might come with the language, as collections of types that would
otherwise be equivalent with respect to compatibility can then be represented
as a single entity (this improves performance). As an example, Java uses erased
types for this purpose and a single equivalence class for all numeric types.
One also needs to define
```
Type Node::getType()
Type Node::getTypeBound()
DataFlowType getErasedRepr(Type t)
string ppReprType(DataFlowType t)
```
where `Type` can be a language-specific name for the types native to the
language. Of the member predicates `Node::getType()` and `Node::getTypeBound()`,
only the latter is used by the library, but the former is usually nice to have
if it makes sense for the language. The `getErasedRepr` predicate acts as the
translation between regular types and the type system used for pruning; the
shared library will use `getErasedRepr(node.getTypeBound())` to get the
`DataFlowType` for a node. The `ppReprType` predicate is used for printing a
type in the labels of `PathNode`s; it can be defined as `none()` if type
pruning is not used.
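A minimal sketch, collapsing all numeric types into a single equivalence class (`NumericType` is a hypothetical language-specific class, and `DataFlowType` is assumed to be a class wrapping the `TDataFlowType` IPA type):
```ql
private newtype TDataFlowType =
  TNumericType() or
  TOtherType(Type t) { not t instanceof NumericType }

DataFlowType getErasedRepr(Type t) {
  t instanceof NumericType and result = TNumericType()
  or
  result = TOtherType(t)
}

// With a type system this coarse, compatibility is just equality.
predicate compatibleTypes(DataFlowType t1, DataFlowType t2) { t1 = t2 }
```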
Finally, one must define `CastNode` as a subclass of `Node` as those nodes
where types should be checked. Usually this will be things like explicit casts.
The shared library will also check types at `ParameterNode`s and `OutNode`s
without needing to include these in `CastNode`. It is semantically perfectly
valid to include all nodes in `CastNode`, but this can hurt performance as it
will reduce the opportunity for the library to compact several local steps into
one.
## Virtual dispatch with call context
Consider a virtual call that may dispatch to multiple different targets. If we
know the call context of the call then this can sometimes be used to reduce the
set of possible dispatch targets and thus eliminate impossible call chains.
The library supports a one-level call context for improving virtual dispatch.
Conceptually, the following predicate should be implemented:
```ql
DataFlowCallable viableImplInCallContext(DataFlowCall call, DataFlowCall ctx) {
exists(DataFlowCallable enclosing |
result = viableCallable(call) and
enclosing = call.getEnclosingCallable() and
enclosing = viableCallable(ctx)
|
not ... <`result` is impossible target for `call` given `ctx`> ...
)
}
```
However, joining the virtual dispatch relation with itself in this way is
usually far too large to be feasible. Instead, the relation above should only
be defined for those values of `call` for which the set of resulting dispatch
targets might be reduced. To do this, define the set of calls that might
benefit from a call context with the following predicate (the `c` column
should equal `call.getEnclosingCallable()`):
```ql
predicate mayBenefitFromCallContext(DataFlowCall call, DataFlowCallable c)
```
And then define `DataFlowCallable viableImplInCallContext(DataFlowCall call,
DataFlowCall ctx)` as sketched above, but restricted to
`mayBenefitFromCallContext(call, _)`.
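For example, a simple heuristic is to consider only those virtual calls whose dispatch depends on a parameter of the enclosing callable (the `getQualifier` member is hypothetical):
```ql
predicate mayBenefitFromCallContext(DataFlowCall call, DataFlowCallable c) {
  c = call.getEnclosingCallable() and
  // The dispatch target depends on the runtime type of a parameter, so a
  // call context that narrows that type can narrow the set of targets.
  call.getQualifier() = c.getAParameter().getAnAccess()
}
```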
The shared implementation will then compare counts of virtual dispatch targets
using `viableCallable` and `viableImplInCallContext` for each `call` in
`mayBenefitFromCallContext(call, _)`, and it will track call contexts during
flow calculation whenever a difference in these counts shows improved
precision in further calls.
## Additional features
### Access path length limit
The maximum length of an access path is the maximum number of nested stores
that can be tracked. This is given by the following predicate:
```ql
int accessPathLimit() { result = 5 }
```
We have traditionally used 5 as a default value here, as we have yet to observe
the need for this much field nesting. Changing this value has a direct impact
on performance for large databases.
### Hidden nodes
Certain synthetic nodes can be hidden to exclude them from occurring in path
explanations. This is done through the following predicate:
```ql
predicate nodeIsHidden(Node n)
```
### Unreachable nodes
Consider:
```
foo(source1, false);
foo(source2, true);
foo(x, b) {
if (b)
sink(x);
}
```
Sometimes certain data-flow nodes can be unreachable based on the call context.
In the above example, only `source2` should be able to reach `sink`. This is
supported by the following predicate where one can specify unreachable nodes
given a call context.
```ql
predicate isUnreachableInCall(Node n, DataFlowCall callcontext) { .. }
```
Note that, while this is a simple interface, it can scale poorly if a large
number of unreachable nodes is combined with many call sites.
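For the example above, a sketch might mark nodes guarded by a boolean parameter as unreachable in calls that pass the opposite constant (the `Guard` class and all member predicates on the language side are hypothetical):
```ql
predicate isUnreachableInCall(Node n, DataFlowCall call) {
  exists(ParameterNode p, int pos, boolean b, Guard g |
    p.isParameterOf(call.getARuntimeTarget(), pos) and
    // The call passes a constant boolean for this parameter ...
    call.getArgument(pos).getBooleanValue() = b and
    // ... and `n` is only reached when the parameter has the opposite value.
    g = p.asParameter().getAnAccess() and
    g.controls(n, b.booleanNot())
  )
}
```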
### `BarrierGuard`s
The class `BarrierGuard` must be defined. See
https://github.com/github/codeql/pull/1718 for details.
### Consistency checks
The file `dataflow/internal/DataFlowImplConsistency.qll` contains a number of
consistency checks to verify that the language-specific parts satisfy the
invariants that are expected by the shared implementation. Run these queries to
check for inconsistencies.