/*
* ## Points-to analysis for Python
*
*
* The purpose of points-to analysis is to determine what values a variable might hold at runtime.
* This allows us to write useful queries to check for the misuse of those values.
* In the academic and technical literature, points-to analysis (also known as pointer analysis) attempts to determine which variables can refer to which heap-allocated objects.
* From the point of view of Python, we can treat all Python objects as "heap allocated objects".
*
*
* The output of the points-to analysis consists of a large set of relations which provide not only points-to information, but also call-graph, pruned flow-graph and exception-raising information.
*
* These relations are computed by a large set of mutually recursive predicates which infer the flow of values through the program.
* Our analysis is inter-procedural, using contexts to maintain the precision of an intra-procedural analysis.
*
* ### Precision
*
* In conventional points-to, the computed points-to set should be a super-set of the real points-to set (were it possible to determine such a thing).
* However, for our purposes we want the points-to set to be a sub-set of the real points-to set.
* This is simply because conventional points-to is used to drive compiler optimisations, so the points-to set needs to be a conservative over-estimate of what is possible.
* We have the opposite concern; we want to eliminate false positives where possible.
*
* This should be borne in mind when reading the literature about points-to analysis. In conventional points-to, a precise analysis produces as small a points-to set as possible.
* Our analysis is precise (or very close to it). Instead of seeking to maximise precision, we seek to maximise *recall* and produce as large a points-to set as possible (whilst remaining precise).
*
* When it comes to designing the inference, we always choose precision over recall.
* We want to minimise false positives so it is important to avoid making incorrect inferences, even if it means losing a lot of potential information.
* If a potential new points-to fact would increase the number of values we are able to infer, but decrease precision, then we omit it.
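*
* For example, this hypothetical snippet shows the trade-off (illustrative only, not library output):
*
* ```python
* def f(callback):
*     x = callback()   # which class x holds depends on the caller;
*     return x         # rather than guessing, the analysis records no
*                      # fact for x unless a specific callee can be
*                      # inferred: recall drops, but precision is kept
* ```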
*
* ### Objects
*
* In conventional points-to, an 'object' is generally considered to be any static instantiation, e.g. in Java this is simply anything looking like `new X(..)`.
* However, in Python, since there is no `new` expression, we cannot know what is a class merely from the syntax.
* Consequently, we must start with only the simplest objects and extend to instance creation as we infer classes.
*
* To perform points-to analysis we start with the set of built-in objects, all literal constants, and class and function definitions.
* From there we can propagate those values. Whenever we see a call `x()`, we add a new object if `x` refers to some class.
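*
* For instance (a sketch of what the analysis can infer, not library output):
*
* ```python
* class C:            # a class definition: `C` is a class object
*     pass
*
* def make():         # a function definition: `make` is a function object
*     return C
*
* factory = make()    # `factory` is inferred to refer to the class `C`
* obj = factory()     # the callee refers to a class, so a new
*                     # instance object of `C` is added here
* ```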
*
* In the `PointsTo::points_to` relation, the second argument, `Object value`, is the "value" referred to by the ControlFlowNode (which will correspond to an rvalue in the source code).
* The set of "values" used will change as the library continues to improve, but currently includes the following (see the illustration after the list):
*
* * Classes (both in the source and builtin)
* * Functions (both in the source and builtin)
* * Literal constants defined in the source (string and numbers)
* * Constant objects defined in compiled libraries and the interpreter (None, boolean, strings and numbers)
* * A few other constants such as small integers.
* * Instances of classes
* * Bound methods, static- and class-methods, and properties.
* * Instances of `super`.
* * Missing modules, where no concrete module is found for an import.
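*
* As a rough illustration (the inline comments are hypothetical, not library output):
*
* ```python
* try:
*     import no_such_module  # a "missing module" object: no concrete module found
* except ImportError:
*     pass
*
* class C:                   # a class object
*     def m(self): pass      # a function object
*
* c = C()                    # an instance of C
* bound = c.m                # a bound method
* s = super(C, c)            # an instance of `super`
* text = "hello"             # a literal string constant
* flag = True                # a constant defined by the interpreter
* ```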
*
* A number of constructs that might create a new object, such as binary operations, are omitted if there is no useful information that can be attached to them and they would just increase the size of the database.
*
* ### Contexts
*
* In order to better handle value tracking in functions, we introduce context to the points-to relation.
* There is one `default` context, equivalent to having no context, a `main` context for scripts, and any number of call-site contexts.
*
* Adding context to a conventional points-to analysis can significantly improve its precision, whereas for our points-to analysis adding context significantly improves recall.
* The consensus in the academic literature is that "object sensitivity" is superior to "call-site sensitivity".
* However, since we are seeking to maximise, not minimise, our points-to set, it is entirely possible that the reverse is true for us.
* We use "call-site sensitivity" at the moment, although the exact set of contexts used will change.
*
* ### Points-to analysis over the ESSA dataflow graph
*
* In order to perform points-to analysis on the dataflow graph, we
* need to understand the many implicit "definitions" that occur within Python code.
*
* These are:
*
* 1. Implicit definition as "undefined" for any local or global variable at the start of its scope.
* Many of these will be dead and will be eliminated during construction of the dataflow graph.
* 2. Implicit definition of `__name__`, `__package__` and `__module__` at the start of the relevant scopes.
* 3. Implicit definition of all submodules as global variables at the start of an `__init__` module (sketched below).
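*
* As a sketch of items 2 and 3 (illustrative only; `package` and `submodule` are hypothetical names):
*
* ```python
* # package/__init__.py -- the analysis acts as if the module began with:
* __name__ = "package"       # implicit definition of __name__
* __package__ = "package"    # implicit definition of __package__
* # plus one implicit global per submodule, roughly:
* # submodule = <module object for package.submodule>
* ```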
*
* In addition, there are the "artificial", data-flow definitions:
*
* 1. Phi functions
* 2. Pi (guard, or filter) functions.
* 3. "Refinements" of a variable. These are not definitions of the variable, but may modify the object referred to by the variable,
* possibly changing some inferred facts about the object.
* 4. Definition of any variable that escapes the scope, at entry, exit and at all call-sites.
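*
* For instance, a guard introduces a pi function, and a mutating call is a refinement (the SSA numbering below is illustrative):
*
* ```python
* def use(x):                # x_0: the parameter's initial definition
*     if isinstance(x, list):
*         # pi (guard) function: x_1 = pi(x_0, isinstance(x_0, list))
*         x.append(1)        # refinement: may change facts about the
*                            # object x refers to, but does not rebind x
* ```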
*
* As an example, consider:
* ```python
* if a:
*     float = "global"
* #float can now be either the class 'float' or the string "global"
*
* class C2:
*     if b:
*         float = "local"
*     float
*
* float #Cannot be "local"
* ```
*
* Ignoring `__name__` and `__package__`, the data-flow graph looks something like this, noting that there are two variables named "float"
* in the scope `C2`, the local and the global; the indented definitions occur within the scope of `C2`.
*
* ```
* a_0 = undefined
* b_0 = undefined
* float_0 = undefined
* float_1 = "global"
* float_2 = phi(float_0, float_1)
* float_3 = float_2 (definition on entry to C2 for the global variable)
*     float_4 = undefined (definition on entry to C2 for the local variable)
*     float_5 = "local"
*     float_6 = phi(float_4, float_5)
* float_7 = float_3 (transfer the values in the global 'float', but not the local, back to module scope)
* ```
*
* ### Implementation
*
* **This section is for information purposes only. Any or all details may change without notice.**
*
* QL, being based on Datalog, has fixed-point semantics, which makes it impossible to make negative statements that are recursive.
* To work around this we need to define many predicates over boolean values. Suppose we have a predicate which determines whether a test can be true or false at runtime.
* We might naively implement this as `predicate test_is_true(ControlFlowNode test, Context ctx)`, but this would lead to negative recursion if we want to know when the test can be false.
* Instead we implement it as `boolean test_result(ControlFlowNode test, Context ctx)`, where the absence of a value indicates merely that we do not (yet) know what value the test may have.
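*
* As a rough Python analogy of this three-valued pattern (the names are hypothetical, not the library's code):
*
* ```python
* from typing import Optional
*
* def test_result(test: object, ctx: object) -> Optional[bool]:
*     """True/False means the test has been decided either way; None means
*     no value has (yet) been inferred. "The test can be false" is then the
*     positive fact `test_result(t, c) is False`, rather than the negation
*     of a `test_is_true` predicate, which recursion cannot express."""
*     return None  # placeholder: a real analysis would compute a value
* ```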
*/