From 8f259d4bb6a774a088b7468493cfafd6660c41dc Mon Sep 17 00:00:00 2001 From: Asger F Date: Mon, 30 May 2022 14:10:34 +0200 Subject: [PATCH] Python: port API graph doc comment --- python/ql/lib/semmle/python/ApiGraphs.qll | 77 ++++++++++++++++++++++- 1 file changed, 74 insertions(+), 3 deletions(-) diff --git a/python/ql/lib/semmle/python/ApiGraphs.qll b/python/ql/lib/semmle/python/ApiGraphs.qll index fcb89e5f866..16fa18adea0 100644 --- a/python/ql/lib/semmle/python/ApiGraphs.qll +++ b/python/ql/lib/semmle/python/ApiGraphs.qll @@ -12,12 +12,83 @@ import semmle.python.dataflow.new.DataFlow private import semmle.python.internal.CachedStages /** - * Provides classes and predicates for working with APIs used in a database. + * Provides classes and predicates for working with the API boundary between the current + * codebase and external libraries. + * + * See `API::Node` for more in-depth documentation. */ module API { /** - * An abstract representation of a definition or use of an API component such as a function - * exported by a Python package, or its result. + * A node in the API graph, representing a value that has crossed the boundary between this + * codebase and an external library (or in general, any external codebase). + * + * ### Basic usage + * + * API graphs are typically used to identify "API calls", that is, calls to an external function + * whose implementation is not necessarily part of the current codebase. + * + * The most basic use of API graphs is typically as follows: + * 1. Start with `API::moduleImport` for the relevant library. + * 2. Follow up with a chain of accessors such as `getMember` describing how to get to the relevant API function. + * 3. Map the resulting API graph nodes to data-flow nodes, using `asSource` or `asSink`. + * + * For example, a simplified way to get arguments to `json.dumps` would be + * ```ql + * API::moduleImport("json").getMember("dumps").getParameter(0).asSink() + * ``` + * + * The most commonly used accessors are `getMember`, `getParameter`, and `getReturn`. + * + * ### API graph nodes + * + * There are two kinds of nodes in the API graphs, distinguished by who is "holding" the value: + * - **Use-nodes** represent values held by the current codebase, which came from an external library. + * (The current codebase is "using" a value that came from the library). + * - **Def-nodes** represent values held by the external library, which came from this codebase. + * (The current codebase "defines" the value seen by the library). + * + * API graph nodes are associated with data-flow nodes in the current codebase. + * (Since external libraries are not part of the database, there is no way to associate with concrete + * data-flow nodes from the external library). + * - **Use-nodes** are associated with data-flow nodes where a value enters the current codebase, + * such as the return value of a call to an external function. + * - **Def-nodes** are associated with data-flow nodes where a value leaves the current codebase, + * such as an argument passed in a call to an external function. + * + * + * ### Access paths and edge labels + * + * Nodes in the API graph are associated with a set of access paths, describing a series of operations + * that may be performed to obtain that value. + * + * For example, the access path `API::moduleImport("json").getMember("dumps")` represents the action of + * importing `json` and then accessing the member `dumps` on the resulting object. + * + * Each edge in the graph is labelled by such an "operation". For an edge `A->B`, the type of the `A` node + * determines who is performing the operation, and the type of the `B` node determines who ends up holding + * the result: + * - An edge starting from a use-node describes what the current codebase is doing to a value that + * came from a library. + * - An edge starting from a def-node describes what the external library might do to a value that + * came from the current codebase. + * - An edge ending in a use-node means the result ends up in the current codebase (at its associated data-flow node). + * - An edge ending in a def-node means the result ends up in external code (its associated data-flow node is + * the place where it was "last seen" in the current codebase before flowing out) + * + * Because the implementation of the external library is not visible, it is not known exactly what operations + * it will perform on values that flow there. Instead, the edges starting from a def-node are operations that would + * lead to an observable effect within the current codebase; without knowing for certain if the library will actually perform + * those operations. (When constructing these edges, we assume the library is somewhat well-behaved). + * + * For example, given this snippet: + * ```python + * import foo + * foo.bar(lambda x: doSomething(x)) + * ``` + * A callback is passed to the external function `foo.bar`. We can't know if `foo.bar` will actually invoke this callback. + * But _if_ the library should decide to invoke the callback, then a value will flow into the current codebase via the `x` parameter. + * For that reason, an edge is generated representing the argument-passing operation that might be performed by `foo.bar`. + * This edge is going from the def-node associated with the callback to the use-node associated with the parameter `x`. */ class Node extends Impl::TApiNode { /**