This query uses data flow for nullness analysis, which is always going
to be a large overapproximation. The overapproximation became too big
for one of the test cases after the recent change to make data flow go
across assignment by reference.
To make this query more conservative, it will now only report that the
`pDacl` argument can be null if there isn't also evidence that it can be
non-null.
This commit changes how data flow works in the following code.
MyType x = source();
defineByReference(&x);
sink(x);
The question here is whether there should be flow from `source` to
`sink`. Such flow is desirable if `defineByReference` doesn't write to
all of `x`, but it's undesirable if `defineByReference` is a typical
init function in `C` that writes to every field or if
`defineByReference` is `memcpy` or `memset` on the full range.
Before 1.20.0, there would be flow from `source` to `sink` in case `x`
happened to be modeled with `BlockVar` but not in case `x` happened to
be modelled with SSA. The choice of modelling depends on an analysis of
how `x` is used elsewhere in the function, and it's supposed to be an
internal implementation detail that there are two ways to model
variables. In 1.20.0, I changed the `BlockVar` behavior so it worked the
same as SSA, never allowing that flow. It turns out that this change
broke a customer's query.
This commit reverts `BlockVar` to its old behavior of letting flow
propagate past the `defineByReference` call and then regains consistency
by changing all variables that are ever defined by reference to be
modelled with `BlockVar` instead of SSA. This means we now get too much
flow in certain cases, but that appears to be better overall than
getting too little flow. See also the discussion in CPP-336.
My recent changes to suppress FPs in `ReturnStackAllocatedMemory.ql`
caused us to lose all results where there was a `Conversion` at the
initial address escape. We cannot handle conversions in general, but
this commit restores the good results for the trivial types of
conversion that we can handle.
Since we cannot track data flow from a fully-converted expression but
only the unconverted expression, we should check whether the address
initially escapes into the unconverted expression, not the
fully-converted one.
This fixes most of the false positives observed on lgtm.com.
Also added `PhiInstruction::getAnInputOperand()`, and renamed `PhiInstruction::getAnOperandDefinitionInstruction()` to `getAnInput()` for consistency with other `Instruction` classes.
The `automaticVariableAddressEscapes` predicate got join-ordered badly
in its `unaliased_ssa` version. These are the tuple counts on Wireshark,
where one pipeline step is seen to have 716 million tuples:
```
[2019-03-02 11:29:41] (42s) Starting to evaluate predicate AliasAnalysis::automaticVariableAddressEscapes#2#f
[2019-03-02 11:30:06] (67s) Tuple counts:
353419 ~0% {1} r1 = JOIN project#Instruction::VariableAddressInstruction#class#2#ff WITH AliasAnalysis::resultEscapesNonReturn#2#f ON project#Instruction::VariableAddressInstruction#class#2#ff.<0>=AliasAnalysis::resultEscapesNonReturn#2#f.<0> OUTPUT FIELDS {AliasAnalysis::resultEscapesNonReturn#2#f.<0>}
353419 ~0% {2} r2 = JOIN r1 WITH IRConstruction::Cached::getInstructionEnclosingFunctionIR#ff@staged_ext ON r1.<0>=IRConstruction::Cached::getInstructionEnclosingFunctionIR#ff@staged_ext.<0> OUTPUT FIELDS {IRConstruction::Cached::getInstructionEnclosingFunctionIR#ff@staged_ext.<1>,r1.<0>}
353419 ~0% {2} r3 = JOIN r2 WITH FunctionIR::FunctionIR::getFunction_dispred#3#ff ON r2.<0>=FunctionIR::FunctionIR::getFunction_dispred#3#ff.<0> OUTPUT FIELDS {FunctionIR::FunctionIR::getFunction_dispred#3#ff.<1>,r2.<1>}
716040298 ~0% {2} r4 = JOIN r3 WITH IRVariable::IRVariable#class#3#ff_10#join_rhs ON r3.<0>=IRVariable::IRVariable#class#3#ff_10#join_rhs.<0> OUTPUT FIELDS {IRVariable::IRVariable#class#3#ff_10#join_rhs.<1>,r3.<1>}
4480139 ~0% {2} r5 = JOIN r4 WITH IRVariable::IRAutomaticVariable#class#3#ff ON r4.<0>=IRVariable::IRAutomaticVariable#class#3#ff.<0> OUTPUT FIELDS {r4.<1>,r4.<0>}
66760 ~91% {1} r6 = JOIN r5 WITH Instruction::VariableInstruction::getVariable_dispred#2#ff ON r5.<0>=Instruction::VariableInstruction::getVariable_dispred#2#ff.<0> AND r5.<1>=Instruction::VariableInstruction::getVariable_dispred#2#ff.<1> OUTPUT FIELDS {r5.<1>}
return r6
[2019-03-02 11:30:06] (67s) >>> Relation AliasAnalysis::automaticVariableAddressEscapes#2#f: 35531 rows using 0 MB
```
The predicate contained a cyclic join, which is always hard to optimize.
I couldn't see a reason to join the `FunctionIR`, so I removed that
part. The predicate is now fast, and there are no changes in the tests.
This query is only appropriate for setuid programs. Since such programs
are at most 0.1% of all code we analyse, I would say this query has a
precision of at most 0.1%.