Using the shared dataflow library

File organisation

The files currently live in semmle/code/python (whereas the existing implementation lives in semmle/python/dataflow).

That directory contains DataFlow.qll, DataFlow2.qll, etc., which refer to internal/DataFlowImpl, internal/DataFlowImpl2, etc., respectively. The DataFlowImplN files are identical copies, which avoids mutual recursion. Each starts off by including two files, internal/DataFlowImplCommon and internal/DataFlowImplSpecific. The former contains all the language-agnostic definitions, while the latter is where we describe our favorite language. DataFlowImplSpecific simply forwards to two other files, internal/DataFlowPrivate.qll and internal/DataFlowPublic.qll. Definitions in the former will be hidden behind a private modifier, while those in the latter can be referred to in data flow queries. For instance, the definition of DataFlow::Node should likely be in DataFlowPublic.qll.

Define the dataflow graph

To use the dataflow library, we need to define the dataflow graph, that is, its nodes and its edges.

Define the nodes

The nodes are defined in the type DataFlow::Node (found in DataFlowPublic.qll). This should likely be an IPA type, so we can extend it as needed.

Typical cases needed to construct the call graph include

  • argument node
  • parameter node
  • return node

Typical extensions include

  • post-update nodes
  • implicit this-nodes
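
As a rough picture of what an extensible IPA type for nodes looks like, the following toy model uses Python dataclasses rather than QL. Each class plays the role of one branch of the type, and adding a new kind of node means adding a new branch. All class and field names here are hypothetical and are not part of the shared library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ArgumentNode:
    call: str        # hypothetical identifier of the call site
    position: int


@dataclass(frozen=True)
class ParameterNode:
    function: str    # hypothetical identifier of the callable
    position: int


@dataclass(frozen=True)
class ReturnNode:
    function: str


@dataclass(frozen=True)
class PostUpdateNode:
    pre_update: "Node"   # the node whose state after an update this represents


# The "IPA type": a node is exactly one of the branches above.
Node = Union[ArgumentNode, ParameterNode, ReturnNode, PostUpdateNode]


def describe(node: Node) -> str:
    """Case analysis over the branches of the node type."""
    if isinstance(node, ArgumentNode):
        return f"argument {node.position} of {node.call}"
    if isinstance(node, ParameterNode):
        return f"parameter {node.position} of {node.function}"
    if isinstance(node, ReturnNode):
        return f"return value of {node.function}"
    return f"post-update of ({describe(node.pre_update)})"
```

Because the dataclasses are frozen, two nodes built from the same components compare equal, which loosely mirrors how values of an IPA type are identified by their branch and arguments.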

Define the edges

The edges split into local flow (within a function) and global flow (the call graph, between functions/procedures).

The local flow should be obtainable from an SSA computation. Extra flow, such as reading from and writing to global variables, can be captured in jumpStep.

The global flow should be obtainable from a PointsTo analysis. It is specified via viableCallable and getAnOutNode. Consider making ReturnKind a singleton IPA type, as in Java.

If complicated dispatch needs to be modelled, try using the [reduced|pruned]viable* predicates.
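
How the three kinds of edges combine can be sketched with a toy model in Python (not QL). Local steps, jump steps, and steps derived from a call graph are merged into a single reachability search. The node encoding, the tables, and the function names are assumptions made for this illustration. Unlike the shared library, the sketch does not match calls with returns, so it is less precise.

```python
from collections import deque

# Toy program being modelled (all names are hypothetical):
#   def identity(p): return p
#   x = source()
#   y = identity(x)      # call site "c1"
#   G = y                # write to a global variable
#   def reader(): sink(G)

# Local flow: steps within a single function body.
LOCAL_STEPS = {
    ("var", "x"): [("arg", "c1", 0)],
    ("param", "identity", 0): [("return", "identity")],
    ("out", "c1"): [("var", "y")],
    ("var", "y"): [("write", "G")],
    ("read", "G"): [("sink_arg",)],
}

# Jump steps: flow that follows neither local nor call edges.
JUMP_STEPS = {("write", "G"): [("read", "G")]}

# Call graph: the callables each call site may dispatch to.
VIABLE_CALLABLE = {"c1": ["identity"]}


def call_steps(node):
    """Argument-to-parameter and return-to-out edges derived from the call graph."""
    if node[0] == "arg":
        _, call, pos = node
        for callee in VIABLE_CALLABLE.get(call, []):
            yield ("param", callee, pos)
    elif node[0] == "return":
        _, callee = node
        for call, callees in VIABLE_CALLABLE.items():
            if callee in callees:
                yield ("out", call)


def reachable(source):
    """All nodes reachable from source over local, jump, and call edges."""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        steps = [*LOCAL_STEPS.get(node, []), *JUMP_STEPS.get(node, []), *call_steps(node)]
        for succ in steps:
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen
```

Starting from ("var", "x"), the search passes through the call to identity and the global G before it reaches ("sink_arg",). Removing the entry in JUMP_STEPS cuts the path at the write to G.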

Field flow

To track flow through fields we need to provide a model of fields, that is the Content class.

Field access is specified via read_step and store_step.

Work is being done to make field flow handle lists and dictionaries and the like.

PostUpdateNodes become important when field flow is used, as they track modifications to fields resulting from function calls.
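
The interplay of store and read steps can be pictured with a small Python sketch (not QL). A store pushes a content onto an access path, and a read only continues the flow if it pops the same content. The tables and the function name are hypothetical, and the sketch ignores post-update nodes by treating the object as a single node.

```python
# Steps of a hypothetical program:
#   obj.foo = tainted
#   a = obj.foo
#   b = obj.bar
STORE_STEPS = [("tainted", "foo", "obj")]                   # (value, content, object)
READ_STEPS = [("obj", "foo", "a"), ("obj", "bar", "b")]     # (object, content, result)


def field_flow(source):
    """Return the (node, access_path) pairs reachable from source.

    The access path is a tuple of contents; () means the value itself.
    """
    seen = {(source, ())}
    worklist = [(source, ())]
    while worklist:
        node, path = worklist.pop()
        succs = []
        for value, content, obj in STORE_STEPS:
            if value == node:
                succs.append((obj, (content,) + path))      # push the stored content
        for obj, content, result in READ_STEPS:
            if obj == node and path and path[0] == content:
                succs.append((result, path[1:]))            # pop on a matching read
        for succ in succs:
            if succ not in seen:
                seen.add(succ)
                worklist.append(succ)
    return seen
```

In this example the value reaches a with an empty access path, while b is never reached because the read of bar does not match the stored content foo.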

Type pruning

If type information is available, flows can be discarded on the grounds of type mismatch.

Tracked types are given by the class DataFlowType and the predicate getTypeBound, and compatibility is recorded in the predicate compatibleTypes.

Further, possible casts are given by the class CastNode.
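
The pruning idea can be illustrated with a toy model in Python (not QL). Each node has a type bound, and a candidate step is kept only if the bounds on both ends are compatible. The table, the treatment of unknown types, and the function names are assumptions for this sketch, not the library's definitions.

```python
# Hypothetical type bounds for some nodes; None means "unknown".
TYPE_BOUND = {"s": "str", "t": "str", "n": "int", "u": None}

# Candidate flow steps before pruning.
CANDIDATE_STEPS = [("s", "t"), ("s", "n"), ("s", "u")]


def compatible_types(t1, t2):
    """Unknown types are compatible with everything; known types must be equal."""
    return t1 is None or t2 is None or t1 == t2


def pruned_steps(steps):
    """Keep only the steps whose endpoints have compatible type bounds."""
    return [
        (a, b)
        for a, b in steps
        if compatible_types(TYPE_BOUND.get(a), TYPE_BOUND.get(b))
    ]
```

Here the step from s to n is discarded because str and int do not match, while the step into the untyped node u is kept.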


Plan

Stage I, data flow

Phase 0, setup

Define a minimal IPA type for DataFlow::Node.

Define all required predicates as empty (via none()), except compatibleTypes, which should be any().

Define ReturnKind, DataFlowType, and Content as singleton IPA types.

Phase 1, local flow

Implement simpleLocalFlowStep based on the existing SSA computation.

Phase 2, global flow

Implement viableCallable and getAnOutNode based on the existing predicate PointsTo.

Phase 3, field flow

Redefine Content and implement read_step and store_step.

Review use of post-update nodes.

Phase 4, type pruning

Use type trackers to obtain relevant type information and redefine DataFlowType to contain appropriate cases. Record the type information in getTypeBound.

Implement compatibleTypes (perhaps simply as the identity).

If necessary, re-implement getErasedRepr and ppReprType.

If necessary, redefine CastNode.

Phase 5, bonus

Review possible use of [reduced|pruned]viable* predicates.

Review need for more elaborate ReturnKind.

Review need for non-empty jumpStep.

Review need for non-empty isUnreachableInCall.

Stage II, taint tracking

Phase 0, setup

Define all required predicates as empty.

Phase 1, experiments

Try recovering an existing taint tracking query by implementing sources, sinks, sanitizers, and barriers.
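
What such a query computes can be sketched as a toy model in Python (not QL): the sinks that can be reached from a source along flow edges without passing through a sanitizer. The function name, the edge table, and the node names are hypothetical. Barriers are not modelled separately here.

```python
def tainted_sinks(edges, sources, sinks, sanitizers):
    """Sinks reachable from a source without passing through a sanitizer."""
    seen = set()
    stack = [s for s in sources if s not in sanitizers]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for succ in edges.get(node, ()):
            if succ not in sanitizers:      # flow stops at sanitizers
                stack.append(succ)
    return seen & set(sinks)


# Hypothetical flow graph: the request reaches one sink directly
# and another only through an escaping function.
EDGES = {
    "request": ["query", "escape"],
    "query": ["execute_1"],
    "escape": ["safe_query"],
    "safe_query": ["execute_2"],
}

RESULT = tainted_sinks(
    EDGES,
    sources={"request"},
    sinks={"execute_1", "execute_2"},
    sanitizers={"escape"},
)
```

With these inputs only execute_1 is reported, because the only path to execute_2 passes through the sanitizer escape.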