The data flow library conflates pointers and their objects in some
places but not others. For example, a member function call `x.f()` will
cause flow from `x` of type `T` to `this` of type `T*` inside `f`. It
might be ideal to avoid that conflation, but that's not realistic
without using the IR.
We've had good experience in the taint tracking library with conflating
pointers and objects, and it improves results for field flow, so perhaps
it's time to try it out for all data flow.
The `toString` for IR data-flow nodes are now similar to AST data-flow
nodes. This should make it easier to use the IR as a drop-in replacement
in the future. There are still differences because the IR data flow
library takes conversions into account.
I did not attempt to align the new nodes we use for field flow. That can
come later, when we add field flow to IR data flow.
This string looked out of place compared to `ExplicitParameterNode`,
whose string is simply the name of the parameter and therefore
indistinguishable from an access to the parameter without looking at the
location also. This has not been a problem so far, and if we want to
distinguish more clearly between initial values and accesses at some
point, we should do it for `ExplicitParameterNode` and
`UninitializedNode` too.
The data flow library conflates pointers and objects enough for the
`definitionByReference` predicate to be too strict in some cases. It was
too permissive in other cases that are now (or will be) handled better
by field flow.
See also the change note entry.
To compensate for the lack of field flow, the taint tracking library has
previously considered taint to flow from fields to their containing
structs and back again from the structs to any of their fields. This
leads to false flow between unrelated fields and is not needed now that
we have proper flow through fields.
This allows a member initializer list to be seen as a sequence of field
assignments. For example, the constructor
C() : a(taint()) { }
now has data flow similar to
C() { this.a = taint(); }
This commit changes C++ `ConstructorCall` to behave like
`new`-expressions in Java: they are both `ExprNode`s and
`PostUpdateNodes`, and there's a "pre-update node" (here called
`PreConstructorCallNode`) to play the role of the qualifier argument
when calling a constructor.
This removes a lot of flow steps, but it all seems to be flow that was
present twice: both exiting a `PartialDefNode` and a
`DefinitionByReferenceNode`. All `DefinitionByReferenceNode`s are now
`PartialDefNode`s.
This commit changes how data flow works in the following code.
MyType x = source();
defineByReference(&x);
sink(x);
The question here is whether there should be flow from `source` to
`sink`. Such flow is desirable if `defineByReference` doesn't write to
all of `x`, but it's undesirable if `defineByReference` is a typical
init function in `C` that writes to every field or if
`defineByReference` is `memcpy` or `memset` on the full range.
Before 1.20.0, there would be flow from `source` to `sink` in case `x`
happened to be modeled with `BlockVar` but not in case `x` happened to
be modelled with SSA. The choice of modelling depends on an analysis of
how `x` is used elsewhere in the function, and it's supposed to be an
internal implementation detail that there are two ways to model
variables. In 1.20.0, I changed the `BlockVar` behavior so it worked the
same as SSA, never allowing that flow. It turns out that this change
broke a customer's query.
This commit reverts `BlockVar` to its old behavior of letting flow
propagate past the `defineByReference` call and then regains consistency
by changing all variables that are ever defined by reference to be
modelled with `BlockVar` instead of SSA. This means we now get too much
flow in certain cases, but that appears to be better overall than
getting too little flow. See also the discussion in CPP-336.