C++: Add more documentation to the 'cpp/invalid-pointer-deref' query.

This commit is contained in:
Mathias Vorreiter Pedersen
2023-07-19 14:42:20 +01:00
parent 0a0e9bb25b
commit 922f4d5496
3 changed files with 162 additions and 18 deletions

View File

@@ -1,6 +1,56 @@
/**
* This file provides the first phase of the `cpp/invalid-pointer-deref` query that identifies flow
* from an allocation to a pointer-arithmetic instruction that constructs a pointer that is out of bounds.
*
* Consider the following snippet:
* ```cpp
* 1. char* begin = (char*)malloc(size);
* 2. char* end = begin + size;
* 3. for(int *p = begin; p <= end; p++) {
* 4. use(*p);
* 5. }
* ```
* this file identifies the flow from `new int[size]` to `base + size`.
*
* This is done using the product-flow library. The configuration tracks flow from the pair `(allocation, size of allocation)`
* to a pair `(a, b)` where there exists a pointer-arithmetic instruction `pai` such that:
* 1. `a` is equal to the left-hand side of `pai`, and
* 2. `b` is a dataflow node that represents an operand that _non-strictly_ upper bounds the right-hand side of `pai`.
* See `pointerAddInstructionHasBounds` for the implementation of this.
*
* In the above example, the pair `(a, b)` is `(base, size)` from the expression `base + size` on line 2. However, it could
* also be something more complex like `(base, size)` where `base` is from line 3 and `size` is from line 2, and the
* pointer-arithmetic instruction is `base + n` on line 3 in the following example:
* ```cpp
* 1. int* base = new int[size];
* 2. if(n <= size) {
* 3. int* end = base + n;
* 4. for(int* p = base; p <= end; ++p) {
* 5. *p = 0; // BUG: Should have been bounded by `p < end`.
* 6. }
* 7. }
* ```
*
* Reducing the size of `pointerAddInstructionHasBounds`:
* The `pointerAddInstructionHasBounds` can be very large since the `sink2` column is defined as anything that non-strictly
* upper bounds the right-hand side of a pointer-arithmetic instruction. In order to reduce the size of this predicate we prune
* the set of pointer-arithmetic instructions to only those instructions where the left-hand side flows from an allocation.
*
* Handling false positives:
*
* Consider a snippet such as:
* ```cpp
* 1. int* base = new int[size];
* 2. int n = condition() ? size : 0;
* 3. if(n >= size) return;
* 4. int* end = base + n;
* 5. for(int* p = base; p <= end; ++p) {
* 6. *p = 0; // This is fine since `end < base + size`
* 7. }
* ```
* In order to remove this false positive we define a barrier (see `Barrier2::BarrierConfig2`) that finds the possible guards
* that compares a value to the size of the allocation. In the above example, that's the `(n >= size)` guard on line 3.
* `Barrier2::getABarrierNode` then defines any node that's guarded by such a guard as a barrier in the dataflow configuration.
*/
private import cpp
@@ -151,24 +201,8 @@ private module InterestingPointerAddInstruction {
}
/**
* A product-flow configuration for flow from an (allocation, size) pair to a
* pointer-arithmetic operation that is non-strictly upper-bounded by `allocation + size`.
*
* The goal of this query is to find patterns such as:
* ```cpp
* 1. char* begin = (char*)malloc(size);
* 2. char* end = begin + size;
* 3. for(int *p = begin; p <= end; p++) {
* 4. use(*p);
* 5. }
* ```
*
* We do this by splitting the task up into two configurations:
* 1. `AllocToInvalidPointerConfig` find flow from `malloc(size)` to `begin + size`, and
* 2. `InvalidPointerToDerefConfig` finds flow from `begin + size` to an `end` (on line 3).
*
* Finally, the range-analysis library will find a load from (or store to) an address that
* is non-strictly upper-bounded by `end` (which in this case is `*p`).
* A product-flow configuration for flow from an `(allocation, size)` pair to a pointer-
* arithmetic instruction that is non-strictly upper-bounded by `allocation + size`.
*/
private module Config implements ProductFlow::StateConfigSig {
class FlowState1 = Unit;

View File

@@ -2,6 +2,71 @@
* This file provides the second phase of the `cpp/invalid-pointer-deref` query that identifies flow
* from the out-of-bounds pointer identified by the `AllocationToInvalidPointer.qll` library to
* a dereference of the out-of-bounds pointer.
*
* Consider the following snippet:
* ```cpp
* 1. char* begin = (char*)malloc(size);
* 2. char* end = begin + size;
* 3. for(int *p = begin; p <= end; p++) {
* 4. use(*p);
* 5. }
* ```
* this file identifies the flow from `base + size` to `end`. We call `base + size` the "dereference source" and `end`
* the "dereference sink" (even though `end` is not actually dereferenced - it will be used to find the correct
* dereference eventually).
*
* Merely _constructing_ a pointer that's out-of-bounds is fine if the pointer is never dereferenced (in reality, the
* standard only guarentees that it's safe to move the pointer one element past the last element. But we ignore that
* here). So this step is about identifying which of those out-of-bounds pointers identified from step 1 that are
* actually being dereferenced. We do this using a regular dataflow configuration (see `InvalidPointerToDerefConfig`).
*
* This dataflow traversal defines the set of sources as any dataflow node that is non-strictly upper-bounded by the
* pointer-arithmetic instruction identified by `AllocationToInvalidPointer.qll`. (TOOD: I'm pretty sure this is incorrect,
* and we should define the set of sources as anything that is non-strictly _lower_ bounded by the pointer-arithmetic
* instruction). That is, the set of sources is any dataflow node `source` such that `source.asInstruction <= pai + delta1`
* for some `delta1 >= 0`.
*
* The set of sinks is defined to be any address operand `addr` that is non-strictly upper-bounded by the sink. That is,
* any dataflow node `n` such that `addr <= sink.asInstruction() + delta2` for some `delta2`. We call the instruction that
* consumes the address operand the "operation".
*
* For example, consider the flow from `begin + size` to `end` above. The sink is `end` on line 3 because that is a dataflow
* node whose underlying instruction non-strictly upper bounds the address operand `p` in `use(*p)`. The load attached to `*p`
* is the "operation". To ensure that the path makes intuitive sense, we only pick operations that are control-flow reachable
* from the dereference sink.
*
* To compute the amount of the dereference is away from the final entry of the allocation, we sum the two deltas `delta1` and
* `delta2`. This is done in the `operationIsOffBy` predicate (which is the only predicate exposed by this file).
*
* Handling false positives:
*
* Consider the following snippet:
* ```cpp
* 1. char *p = new char[size];
* 2. char *end = p + size; // $ alloc=L363
* 3. if (p < end) {
* 4. p += 1;
* 5. }
* 6. if (p < end) {
* 7. int val = *p; // GOOD
* 8. }
* ```
* this is safe because `p` is guarded to be strictly less than `end` on line 6 before the dereference on line 7. However, if we
* run the query on the above without further modifications we'd see an alert on line 7. This is because range analysis infers
* that `p <= end` after the increment on line 4, and thus the result of `p += 1` is seen as a valid dereference source. This
* node then flows to `p` on line 6 (which is a valid dereference sink since it non-strictly upper bounds an address operand), and
* range analysis then infers that the address operand of `*p` (i.e., `p`) is non-strictly upper bounded by `p`, and thus reports
* an alert on line 7.
*
* In order to handle this false positive, we define a barrier that identifies guards such as `p < end` that ensures that a value
* is less than the pointer-arithmetic instruction that computed the invalid pointer. This is done in the `InvalidPointerToDerefBarrier`
* module. Since the node we're tracking isn't necessarily _equal_ to the pointer-arithmetic instruction, but rather satisfies
* `node.asInstruction() <= pai + delta`, we need to account for the delta when checking if a guard is sufficiently strong to infer
* that a future dereference is safe. To do this, we check that the guard guarantees that a node `n` satisfies `n < node + d` where
* `node` is a node we know is equal to the value of the dereference source (i.e., it satisfies `node.asInstruction() <= pai + delta`)
* and `d <= delta`. Combining this we have `n < node + d <= node + delta <= pai + 2*delta` (TODO: Oops. This math doesn't quite work
* out. This is because we need to redefine the `BarrierConfig` to start flow at the pointer-arithmetic instruction instead of at the
* dereference source. When combined with TODO above it's easy to show that this guard ensures that the dereference is safe).
*/
private import cpp

View File

@@ -15,6 +15,51 @@
* external/cwe/cwe-787
*/
/*
* High-level description of the query:
*
* The goal of this query is to identify issues such as:
* ```cpp
* 1. int* base = new int[size];
* 2. int* end = base + size;
* 3. for(int* p = base; p <= end; ++p) {
* 4. *p = 0; // BUG: Should have been bounded by `p < end`.
* 5. }
* ```
* In order to do this, we split the problem into three subtasks:
* 1. First, we find flow from `new int[size]` to `base + size`.
* 2. Then, we find flow from `base + size` to `end` (on line 3).
* 3. Finally, we use range-analysis to find a write to (or read from) a pointer that may be equal to `end`.
*
* Step 1 is implemented in `AllocationToInvalidPointer.qll`, and step 2 is implemented by
* `InvalidPointerToDereference.qll`. See those files for the description of these.
*
* This file imports both libraries and define a final dataflow configuration that constructs the full path from
* the allocation to the dereference of the out-of-bounds pointer. This is done for several reasons:
* 1. It means the user is able to inspect the entire path from the allocation to the dereference, which can be useful
* to understand the problem highlighted.
* 2. It ensures that the call-contexts line up correctly when we transition from step 1 to step 2. See the
* `test_missing_call_context_1` and `test_missing_call_context_2` tests for how this may flag false positives
* without this final configuration.
*
* The source of the final path is an allocation that's:
* 1. identified as flowing to an invalid pointer (by `AllocationToInvalidPointer`), and
* 2. for which the invalid pointer flows to a dereference (as identified by `InvalidPointerToDereference`).
*
* The path can be described in 3 "chunks":
* 1. One path from the allocation to the construction of the invalid pointer
* 2. Another path from the construction of the invalid pointer to the final pointer that's about to be dereferenced.
* 3. Finally, there's a single step from the dataflow node that represents the final pointer to the dereference.
*
* Step 1 happens when the flow state is `TInitial`, and step 2 and 3 happens when the flow state is `TPointerArith(pai)`
* where the pointer-arithmetic instruction `pai` tracks the instruction that generated the out-of-bounds pointer. This
* instruction is used in the construction of the alert message.
*
* The set of pointer-arithmetic instructions that define the `TPointerArith` flow state is restricted to be the pointer-
* arithmetic instructions that both receive flow from the allocation (as identified by `AllocationToInvalidPointer.qll`),
* and further flows to a dereference (as identified by `InvalidPointerToDereference.qll`).
*/
import cpp
import semmle.code.cpp.dataflow.new.DataFlow
import semmle.code.cpp.ir.IR