Using the shared dataflow library

File organisation

The files currently live in experimental (whereas the existing implementation lives in semmle\python\dataflow).

That directory contains DataFlow.qll, DataFlow2.qll etc., which refer to internal\DataFlowImpl, internal\DataFlowImpl2 etc. respectively. The DataFlowImplN files are all identical copies, which avoids mutual recursion. They start off by including two files, internal\DataFlowImplCommon and internal\DataFlowImplSpecific. The former contains all the language-agnostic definitions, while the latter is where we describe our favorite language. DataFlowImplSpecific simply forwards to two other files, internal\DataFlowPrivate.qll and internal\DataFlowPublic.qll. Definitions in the former will be hidden behind a private modifier, while those in the latter can be referred to in data flow queries. For instance, the definition of DataFlow::Node should likely be in DataFlowPublic.qll.

Define the dataflow graph

To use the dataflow library, we need to define the dataflow graph, that is, its nodes and its edges.

Define the nodes

The nodes are defined in the type DataFlow::Node (found in DataFlowPublic.qll). This should likely be an IPA type, so we can extend it as needed.

Typical cases needed to construct the call graph include

  • argument node
  • parameter node
  • return node

Typical extensions include

  • postupdate nodes
  • implicit this-nodes
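As a rough illustration of what an IPA (algebraic) type for nodes amounts to, the Python sketch below models a few node kinds as tagged records. It is a toy model, not QL and not the library's API: the names CfgNode, SsaNode, PostUpdateNode and describe are invented for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CfgNode:
    """A node wrapping a control flow node (hypothetical name)."""
    label: str


@dataclass(frozen=True)
class SsaNode:
    """A node wrapping an SSA variable (hypothetical name)."""
    variable: str
    version: int


@dataclass(frozen=True)
class PostUpdateNode:
    """The value of `pre` after a call or store may have modified it."""
    pre: CfgNode


# The node type is the disjoint union of its branches. Adding a branch
# later extends the type without touching the existing ones.
Node = Union[CfgNode, SsaNode, PostUpdateNode]


def describe(node: Node) -> str:
    """Render a node for display, dispatching on its branch."""
    if isinstance(node, CfgNode):
        return f"cfg:{node.label}"
    if isinstance(node, SsaNode):
        return f"ssa:{node.variable}_{node.version}"
    if isinstance(node, PostUpdateNode):
        return f"[post] {describe(node.pre)}"
    raise TypeError(f"not a node: {node!r}")


print(describe(PostUpdateNode(CfgNode("obj"))))  # prints "[post] cfg:obj"
```

Extensions such as post-update nodes then become one more branch of the union, which is the sense in which an IPA type can be extended as needed.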

Define the edges

The edges split into local flow (within a function) and global flow (the call graph, between functions/procedures).

Extra flow, such as reading from and writing to global variables, can be captured in jumpStep. The local flow should be obtainable from an SSA computation. Local flow nodes are generally either control flow nodes or SSA variables. Flow from control flow nodes to SSA variables comes from SSA variable definitions, while flow from SSA variables to control flow nodes comes from def-use pairs.
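The two kinds of local edges can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions: nodes are plain strings, the facts are written by hand rather than computed by an SSA library, and the helpers local_flow_steps and reachable are hypothetical names.

```python
# Toy facts for the program:   x = source(); y = x; sink(y)
# Each SSA definition pairs a defining control flow node with an SSA variable.
definitions = [("source()", "x_1"), ("x", "y_1")]
# Each def-use pair links an SSA variable to a control flow node that reads it.
uses = [("x_1", "x"), ("y_1", "y")]


def local_flow_steps(definitions, uses):
    """Return the set of local flow edges (hypothetical helper)."""
    steps = set()
    for cfg_node, ssa_var in definitions:
        steps.add((cfg_node, ssa_var))  # control flow node -> SSA variable
    for ssa_var, cfg_node in uses:
        steps.add((ssa_var, cfg_node))  # SSA variable -> control flow node
    return steps


def reachable(steps, start):
    """Nodes reachable from `start` by zero or more steps."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for src, dst in steps:
            if src == node and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen


steps = local_flow_steps(definitions, uses)
print("y" in reachable(steps, "source()"))  # prints "True"
```

The path found is source() to x_1 to x to y_1 to y, alternating between control flow nodes and SSA variables.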

The global flow should be obtainable from a PointsTo analysis. It is specified via viableCallable and getAnOutNode. Consider making ReturnKind a singleton IPA type, as in Java.

Global flow includes local flow within a consistent call context. Thus, for local flow to count as global flow, all relevant nodes should implement getEnclosingCallable.
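To make the idea of a consistent call context concrete, here is a small Python sketch. It is a toy model with hand-written facts, not the shared library's algorithm: the dictionaries only play the roles that viableCallable and getAnOutNode play in the library, and flows_through_calls with its context_sensitive flag is a hypothetical helper.

```python
# Toy program:
#   def identity(p): return p
#   a = identity(source())    # call1
#   b = identity("safe")      # call2

# Which callables a call may dispatch to (the role of viableCallable).
viable_callable = {"call1": ["identity"], "call2": ["identity"]}
# Argument node of each call; parameter and return node of each callable.
argument = {"call1": "source()", "call2": '"safe"'}
parameter = {"identity": "p"}
return_node = {"identity": "return p"}
# Node receiving the result of each call (the role of getAnOutNode).
out_node = {"call1": "a", "call2": "b"}
# Local flow inside identity.
local_steps = {("p", "return p")}


def flows_through_calls(source, context_sensitive=True):
    """Nodes reached from the argument `source` via one call and return."""
    reached = set()
    for call, arg in argument.items():
        if arg != source:
            continue
        for callee in viable_callable[call]:
            # Follow local flow from the parameter towards the return node.
            inside = {parameter[callee]}
            changed = True
            while changed:
                changed = False
                for src, dst in local_steps:
                    if src in inside and dst not in inside:
                        inside.add(dst)
                        changed = True
            if return_node[callee] not in inside:
                continue
            # Return only to the call we entered from, unless call
            # contexts are ignored.
            for ret_call, callees in viable_callable.items():
                if callee in callees and (ret_call == call or not context_sensitive):
                    reached.add(out_node[ret_call])
    return reached


print(sorted(flows_through_calls("source()")))         # prints "['a']"
print(sorted(flows_through_calls("source()", False)))  # prints "['a', 'b']"
```

Ignoring the call context makes the source appear to reach b as well, which is the kind of spurious result that tracking the enclosing callable and call site is meant to rule out.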

If complicated dispatch needs to be modelled, try using the [reduced|pruned]viable* predicates.

Field flow

To track flow through fields we need to provide a model of fields, that is, the Content class.

Field access is specified via read_step and store_step.

Work is being done to make field flow handle lists and dictionaries and the like.

PostUpdateNodes become important when field flow is used, as they track modifications to fields resulting from function calls.
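The interplay of store and read steps can be sketched as follows. This Python toy is an assumption-laden simplification: contents are plain strings, a single node stands for both an object and its post-update node, and the fact representation and the helper field_flow are invented for illustration.

```python
# Toy program:
#   obj.foo = source()    # store step into content "foo"
#   x = obj.foo           # read step from content "foo"
#   y = obj.bar           # read step from content "bar"

store_steps = [("source()", "foo", "obj")]               # (value, content, object)
read_steps = [("obj", "foo", "x"), ("obj", "bar", "y")]  # (object, content, result)


def field_flow(source):
    """Return the nodes that hold the value of `source` itself."""
    # A fact (node, path) means: the source value is stored inside `node`
    # under the access path `path`, a tuple of contents, outermost first.
    facts = {(source, ())}
    todo = [(source, ())]
    while todo:
        node, path = todo.pop()
        new = []
        for value, content, obj in store_steps:
            if value == node:
                new.append((obj, (content,) + path))  # push the content
        for obj, content, result in read_steps:
            if obj == node and path and path[0] == content:
                new.append((result, path[1:]))        # pop a matching content
        for fact in new:
            if fact not in facts:
                facts.add(fact)
                todo.append(fact)
    return {node for node, path in facts if path == ()}


print(sorted(field_flow("source()")))  # prints "['source()', 'x']"
```

The read of obj.bar does not match the stored content foo, so y is not reached.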

Type pruning

If type information is available, flows can be discarded on the grounds of type mismatch.

Tracked types are given by the class DataFlowType and the predicate getTypeBound, and compatibility is recorded in the predicate compatibleTypes. If type pruning is not used, compatibleTypes should be implemented as any; if it is implemented, say, as none, all flows will be pruned.

Further, possible casts are given by the class CastNode.
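The effect of the different choices for compatibleTypes can be illustrated with a toy filter in Python. The type bounds, the candidate steps and the function names below are all hypothetical; the sketch only shows how the choice of compatibility relation decides which steps survive.

```python
# Hypothetical type bounds for three nodes, and two candidate flow steps.
type_bound = {"s": "str", "n": "int", "t": "str"}
candidate_steps = [("s", "t"), ("s", "n")]


def compatible_any(t1, t2):
    return True      # no pruning: every pair of types is compatible


def compatible_none(t1, t2):
    return False     # every step is pruned


def compatible_identity(t1, t2):
    return t1 == t2  # only identical types are compatible


def prune(steps, compatible):
    """Keep the steps whose endpoint types are compatible."""
    return [(a, b) for a, b in steps if compatible(type_bound[a], type_bound[b])]


print(prune(candidate_steps, compatible_any))       # prints "[('s', 't'), ('s', 'n')]"
print(prune(candidate_steps, compatible_none))      # prints "[]"
print(prune(candidate_steps, compatible_identity))  # prints "[('s', 't')]"
```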


Plan

Stage I, data flow

Phase 0, setup

Define a minimal IPA type for DataFlow::Node. Define all required predicates as empty (via none()), except compatibleTypes, which should be any(). Define ReturnKind, DataFlowType, and Content as singleton IPA types.

Phase 1, local flow

Implement simpleLocalFlowStep based on the existing SSA computation.

Phase 2, global flow

Implement viableCallable and getAnOutNode based on the existing predicate PointsTo.

Phase 3, field flow

Redefine Content and implement read_step and store_step.

Review use of post-update nodes.

Phase 4, type pruning

Use type trackers to obtain relevant type information and redefine DataFlowType to contain appropriate cases. Record the type information in getTypeBound.

Implement compatibleTypes (perhaps simply as the identity).

If necessary, re-implement getErasedRepr and ppReprType.

If necessary, redefine CastNode.

Phase 5, bonus

Review possible use of [reduced|pruned]viable* predicates.

Review need for more elaborate ReturnKind.

Review need for non-empty jumpStep.

Review need for non-empty isUnreachableInCall.

Stage II, taint tracking

Phase 0, setup

Implement all predicates empty.

Phase 1, experiments

Try recovering an existing taint tracking query by implementing sources, sinks, sanitizers, and barriers.
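As a reminder of what such a configuration computes, the Python sketch below treats taint tracking as graph reachability from sources to sinks that never passes through a sanitizer. The step relation, the node names and the helper tainted_sinks are made up for illustration and do not correspond to any existing query.

```python
# Hypothetical flow steps:
#   name = request.args; query uses name directly;
#   safe_query uses escape(name).
steps = {
    ("request.args", "name"),
    ("name", "query"),
    ("name", "escape(name)"),
    ("escape(name)", "safe_query"),
}
sources = {"request.args"}
sinks = {"query", "safe_query"}
sanitizers = {"escape(name)"}


def tainted_sinks(steps, sources, sinks, sanitizers):
    """Sinks reachable from a source without visiting a sanitizer."""
    seen = set(sources) - sanitizers
    todo = list(seen)
    while todo:
        node = todo.pop()
        for src, dst in steps:
            if src == node and dst not in seen and dst not in sanitizers:
                seen.add(dst)
                todo.append(dst)
    return seen & sinks


print(sorted(tainted_sinks(steps, sources, sinks, sanitizers)))  # prints "['query']"
```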


Status

Achieved

  • Copy of shared library; implemented enough predicates to make it compile.
  • Simple flow into, out of, and through functions.
  • Some tests, in particular a skeleton for something comprehensive.

TODO

  • Implementation has largely been done by finding a plausible-sounding predicate in the Python library to refer to. We should review that we actually have the intended semantics in all places.
  • Comprehensive testing.
  • The regression tests track the value of guards in order to eliminate impossible data flow. We currently have regressions because of this. We cannot readily replicate the existing method, as it uses the interdefinedness of data flow and taint tracking (there is a boolean taint kind). C++ does something similar for eliminating impossible control flow, which we might be able to replicate (they infer values of "interesting" control flow nodes, which are those needed to determine values of guards).
  • Flow for some syntactic constructs is done via extra taint steps in the existing implementation. We should find a way to get data flow for them. Some of this should be covered by field flow.
  • A document is being written about proper use of the shared data flow library, and it should be adhered to. In particular, we should consider replacing def-use with def-to-first-use and use-to-next-use in local flow.
  • We seem to get duplicated results for global flow, as well as flow with and without type (so four times the "unique" results).
  • We currently consider control flow nodes like exit nodes for functions. We should probably filter down which ones are of interest.
  • We should probably override toString for a number of data flow nodes.
  • Test flow through classes, constructors and methods.
  • What happens with named arguments? What does C# do?
  • What should the enclosing callable for global variables be? C++ makes it the variable itself, C# seems to not have nodes for these but only for their reads and writes.
  • Is yield another return type? If not, how is it handled?
  • Should OutNode include magic function calls?
  • Consider creating an internal abstract class for nodes as C# does. Among other things, this can help the optimizer by stating that getEnclosingCallable is functional.
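Regarding the TODO item on def-to-first-use and use-to-next-use, the Python toy below contrasts the two edge shapes for a variable with two uses. It is a sketch under assumptions: the node names, the barrier on the first use and the helper reaches are all hypothetical, and the example only shows how the graph shape changes which paths exist.

```python
# Toy program:   x = source(); validate(x); sink(x)
# "def" is the definition of x, "use1" is the x in validate(x),
# and "use2" is the x in sink(x).

# Def-use: the definition flows directly to every use.
def_use_steps = {("def", "use1"), ("def", "use2")}
# Def-to-first-use followed by use-to-next-use: the uses form a chain.
use_use_steps = {("def", "use1"), ("use1", "use2")}


def reaches(steps, start, end, barriers=frozenset()):
    """True if `end` is reachable from `start` without entering a barrier."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for src, dst in steps:
            if src == node and dst not in seen and dst not in barriers:
                seen.add(dst)
                todo.append(dst)
    return end in seen


# Both shapes agree when there is no barrier.
print(reaches(def_use_steps, "def", "use2"))            # prints "True"
print(reaches(use_use_steps, "def", "use2"))            # prints "True"
# A barrier on the first use only cuts the chained shape.
print(reaches(def_use_steps, "def", "use2", {"use1"}))  # prints "True"
print(reaches(use_use_steps, "def", "use2", {"use1"}))  # prints "False"
```

In the chained shape, anything that blocks flow at one use also affects the uses that follow it, whereas with plain def-use edges each use is reached independently.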