From cee527e03a762ae39c316ea455d9d095df46758f Mon Sep 17 00:00:00 2001 From: Tom Hvitved Date: Wed, 6 Apr 2022 11:11:02 +0200 Subject: [PATCH] Document flow through arrays in `dataflow.md` --- docs/ql-libraries/dataflow/dataflow.md | 142 ++++++++++++++++++++++--- 1 file changed, 127 insertions(+), 15 deletions(-) diff --git a/docs/ql-libraries/dataflow/dataflow.md b/docs/ql-libraries/dataflow/dataflow.md index 31a8e12a5cc..2ce0e308a16 100644 --- a/docs/ql-libraries/dataflow/dataflow.md +++ b/docs/ql-libraries/dataflow/dataflow.md @@ -125,7 +125,7 @@ For further details about `PostUpdateNode` see [Field flow](#field-flow) below. Nodes corresponding to expressions and parameters are the most common for users to interact with so a couple of convenience predicates are generally included: -``` +```ql DataFlowExpr Node::asExpr() Parameter Node::asParameter() ExprNode exprNode(DataFlowExpr n) @@ -266,10 +266,23 @@ as described above. ## Field flow The library supports tracking flow through field stores and reads. In order to -support this, a class `Content` and two predicates -`storeStep(Node node1, Content f, Node node2)` and -`readStep(Node node1, Content f, Node node2)` must be defined. It generally -makes sense for stores to target `PostUpdateNode`s, but this is not a strict +support this, two classes `ContentSet` and `Content`, and two predicates +`storeStep(Node node1, ContentSet f, Node node2)` and +`readStep(Node node1, ContentSet f, Node node2)`, must be defined. The interaction +between `ContentSet` and `Content` is defined through + +```ql +Content ContentSet::getAStoreContent(); +Content ContentSet::getAReadContent(); +``` + +which means that a `storeStep(n1, cs, n2)` will be interpreted as storing into _any_ +of `cs.getAStoreContent()`, and dually that a `readStep(n1, cs, n2)` will be +interpreted as reading from _any_ of `cs.getAReadContent()`. In most cases, there +will be a simple bijection between `ContentSet` and `Content`, but when modelling +for example flow through arrays it can be more involved (see [Example 4](#example-4)). + +It generally makes sense for stores to target `PostUpdateNode`s, but this is not a strict requirement. Besides this, certain nodes must have associated `PostUpdateNode`s. The node associated with a `PostUpdateNode` should be defined by `PostUpdateNode::getPreUpdateNode()`. @@ -294,7 +307,7 @@ through a couple of examples. ### Example 1 Consider the following setter and its call: -``` +```java setFoo(obj, x) { sink1(obj.foo); obj.foo = x; @@ -335,12 +348,12 @@ call sites. In the following two lines we would like flow from `x` to reach the `PostUpdateNode` of `a` through a sequence of two store steps, and this is indeed handled automatically by the shared library. -``` +```java a.b.c = x; a.getB().c = x; ``` The only requirement for this to work is the existence of `PostUpdateNode`s. -For a specified read step (in `readStep(Node n1, Content f, Node n2)`) the +For a specified read step (in `readStep(Node n1, ContentSet c, Node n2)`) the shared library will generate a store step in the reverse direction between the corresponding `PostUpdateNode`s. A similar store-through-reverse-read will be generated for calls that can be summarized by the shared library as getters. @@ -370,26 +383,123 @@ itself as this represents the value of the object after construction, that is after the constructor has run. With this setup of `ArgumentNode`s and `PostUpdateNode`s we will achieve the desired flow from `source` to `sink` +### Example 4 + +Assume we want to track flow through arrays precisely: + +```rb +a[0] = tainted +a[1] = not_tainted +sink(a[0]) # bad +sink(a[1]) # good +sink(a[unknown]) # bad; unknown may be 0 + +b[unknown] = tainted +sink(b[0]) # bad; unknown may be 0 + +c[unknown][0] = tainted +c[unknown][1] = not_tainted +sink(c[0][0]) # bad; unknown may be 0 +sink(c[0][1]) # good +``` + +This can be done by defining + +```ql +newtype TContent = + TKnownArrayElementContent(int i) { i in [0 .. 10] } or + TUnknownArrayElementContent() + +class Content extends TContent { + ... +} + +newtype TContentSet = + TSingletonContent(Content c) or + TAnyArrayElementContent() + +class ContentSet extends TContentSet { + Content getAStoreContent() { + this = TSingletonContent(result) + or + // for reverse stores + this = TAnyArrayElementContent() and + result = TUnknownArrayElementContent() + } + + Content getAReadContent() { + this = TSingletonContent(result) + or + this = TAnyArrayElementContent() and + (result = TUnknownArrayElementContent() or result = TKnownArrayElementContent(_)) + } +} +``` + +and we will have the following store/read steps +```rb +# storeStep(tainted, TSingletonContent(TKnownArrayElementContent(0)), [post update] a) +a[0] = tainted + +# storeStep(not_tainted, TSingletonContent(TKnownArrayElementContent(1)), [post update] a) +a[1] = not_tainted + +# readStep(a, TSingletonContent(TKnownArrayElementContent(0)), a[0]) +# readStep(a, TSingletonContent(TUnknownArrayElementContent()), a[0]) +sink(a[0]) # bad + +# readStep(a, TSingletonContent(TKnownArrayElementContent(1)), a[1]) +# readStep(a, TSingletonContent(TUnknownArrayElementContent()), a[1]) +sink(a[1]) # good + +# readStep(a, TAnyArrayElementContent(), a[unknown]) +sink(a[unknown]) # bad; unknown may be 0 + +# storeStep(tainted, TSingletonContent(TUnknownArrayElementContent()), [post update] b) +b[unknown] = tainted + +# readStep(b, TSingletonContent(TKnownArrayElementContent(0)), b[0]) +# readStep(b, TSingletonContent(TUnknownArrayElementContent()), b[0]) +sink(b[0]) # bad; unknown may be 0 + +# storeStep(tainted, TSingletonContent(TKnownArrayElementContent(0)), [post update] c[unknown]) +# storeStep(not_tainted, TSingletonContent(TKnownArrayElementContent(1)), [post update] c[unknown]) +# readStep(c, TAnyArrayElementContent(), c[unknown]) +# storeStep([post update] c[unknown], TAnyArrayElementContent(), [post update] c) # auto-generated reverse store (see Example 2) +c[unknown][0] = tainted +c[unknown][1] = not_tainted + + +# readStep(c[0], TSingletonContent(TKnownArrayElementContent(0)), c[0][0]) +# readStep(c[0], TSingletonContent(TUnknownArrayElementContent()), c[0][0]) +# readStep(c[0], TSingletonContent(TKnownArrayElementContent(1)), c[0][1]) +# readStep(c[0], TSingletonContent(TUnknownArrayElementContent()), c[0][1]) +# readStep(c, TSingletonContent(TKnownArrayElementContent(0)), c[0]) +# readStep(c, TSingletonContent(TUnknownArrayElementContent()), c[0]) +sink(c[0][0]) # bad; unknown may be 0 +sink(c[0][1]) # good +``` + ### Field flow barriers Consider this field flow example: -``` +```java obj.f = source; obj.f = safeValue; sink(obj.f); ``` or the similar case when field flow is used to model collection content: -``` +```java obj.add(source); obj.clear(); sink(obj.get(key)); ``` Clearing a field or content like this should act as a barrier, and this can be -achieved by marking the relevant `Node, Content` pair as a clear operation in +achieved by marking the relevant `Node, ContentSet` pair as a clear operation in the `clearsContent` predicate. A reasonable default implementation for fields looks like this: ```ql -predicate clearsContent(Node n, Content c) { +predicate clearsContent(Node n, ContentSet c) { n = any(PostUpdateNode pun | storeStep(_, c, pun)).getPreUpdateNode() } ``` @@ -397,6 +507,8 @@ However, this relies on the local step relation using the smallest possible use-use steps. If local flow is implemented using def-use steps, then `clearsContent` might not be easy to use. +Note that `clearsContent(n, cs)` is interpreted using `cs.getAReadContent()`. + ## Type pruning The library supports pruning paths when a sequence of value-preserving steps @@ -415,13 +527,13 @@ as a single entity (this improves performance). As an example, Java uses erased types for this purpose and a single equivalence class for all numeric types. The type of a `Node` is given by the following predicate -``` +```ql DataFlowType getNodeType(Node n) ``` and every `Node` should have a type. One also needs to define the string representation of a `DataFlowType`: -``` +```ql string ppReprType(DataFlowType t) ``` The `ppReprType` predicate is used for printing a type in the labels of @@ -499,7 +611,7 @@ predicate nodeIsHidden(Node n) ### Unreachable nodes Consider: -``` +```java foo(source1, false); foo(source2, true);