mirror of
https://github.com/github/codeql.git
synced 2026-04-26 09:15:12 +02:00
docs: rename ql-documentation > language
9
docs/language/learn-ql/python/control-flow-graph.rst
Normal file
@@ -0,0 +1,9 @@
|
||||
Python control flow graph
|
||||
=========================
|
||||
|
||||
:doc:`Back to tutorial: control flow analysis <control-flow>`
|
||||
|
||||
|Python control flow graph|
|
||||
|
||||
.. |Python control flow graph| image:: ../../images/python-flow-graph.png
|
||||
|
||||
107
docs/language/learn-ql/python/control-flow.rst
Normal file
@@ -0,0 +1,107 @@
|
||||
Tutorial: Control flow analysis
|
||||
===============================
|
||||
|
||||
To analyze the `control flow graph <http://en.wikipedia.org/wiki/Control_flow_graph>`__ of a ``Scope``, we can use the two QL classes ``ControlFlowNode`` and ``BasicBlock``. These classes allow you to ask questions such as "Can you reach point A from point B?" or "Is it possible to reach point B *without* going through point A?". To report results we use the class ``AstNode``, which represents a syntactic element and corresponds to the source code, so the results of the query are easier to understand.
|
||||
|
||||
The ``ControlFlowNode`` class
|
||||
-----------------------------
|
||||
|
||||
The ``ControlFlowNode`` class represents nodes in the control flow graph. There is a one-to-many relation between AST nodes and control flow nodes. Each syntactic element, the ``AstNode``, maps to zero, one or many ``ControlFlowNode`` instances, but each ``ControlFlowNode`` maps to exactly one ``AstNode``.
|
||||
|
||||
To show why this complex relation is required, consider the following Python code, which is assumed to appear inside a loop:
|
||||
|
||||
.. code-block:: python

   try:
       might_raise()
       if cond:
           break
   finally:
       close_resource()
|
||||
|
||||
There are many paths through the above code. In particular, there are three different paths through the call to ``close_resource()``: one normal path, one path that breaks out of the loop, and one path where an exception is raised by ``might_raise()``. (An annotated flow graph can be seen :doc:`here <control-flow-graph>`.)
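The three paths can be observed directly in plain Python. The sketch below is illustrative only: the ``trace`` helper, its ``mode`` argument, and the enclosing ``for`` loop are assumptions added so that the fragment can run, and none of them are part of the QL library.

```python
def trace(mode):
    """Run the try/finally fragment once and record what happened."""
    events = []

    def might_raise():
        # Hypothetical stand-in: fails only when asked to.
        if mode == "raise":
            raise RuntimeError("simulated failure")

    def close_resource():
        events.append("close_resource")

    cond = mode == "break"
    try:
        for _ in range(1):  # assumed enclosing loop, needed for `break`
            try:
                might_raise()
                if cond:
                    break
            finally:
                close_resource()
            events.append("end of loop body")
    except RuntimeError:
        events.append("exception propagated")
    return events


normal_path = trace("normal")
break_path = trace("break")
exception_path = trace("raise")
```

In every mode ``close_resource`` runs exactly once, but what happens next differs, which is why a single syntactic call needs more than one control flow node.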
|
||||
|
||||
The simplest use of the ``ControlFlowNode`` and ``AstNode`` classes is to find unreachable code. There is one ``ControlFlowNode`` per path through any ``AstNode`` and any ``AstNode`` that is unreachable has no paths flowing through it; therefore any ``AstNode`` without a corresponding ``ControlFlowNode`` is unreachable.
|
||||
|
||||
**Unreachable AST nodes**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from AstNode node
|
||||
where not exists(node.getAFlowNode())
|
||||
select node
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/669220024/>`__. The demo projects on LGTM.com all have some code that has no control flow node, and is therefore unreachable. However, since the ``Module`` class is also a subclass of the ``AstNode`` class, the query also finds any modules implemented in C or with no source code. Therefore, it is better to find all unreachable statements:
|
||||
|
||||
**Unreachable statements**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Stmt s
|
||||
where not exists(s.getAFlowNode())
|
||||
select s
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/670720181/>`__. This query gives fewer results, but most of the projects have some unreachable nodes. These are also highlighted by the standard query: `Unreachable code <https://lgtm.com/rules/3980095>`__.
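To get a feel for what such a query reports, here is a rough Python analogue built on the standard ``ast`` module. It is only a syntactic approximation (it flags statements that follow a ``return``, ``raise``, ``break`` or ``continue`` in the same block) and does not use a control flow graph the way the QL library does. The names ``unreachable_lines`` and ``SAMPLE`` are hypothetical.

```python
import ast

# Statements after which control never continues in the same block.
TERMINATORS = (ast.Return, ast.Raise, ast.Break, ast.Continue)


def unreachable_lines(source):
    """Return line numbers of statements that follow a terminator."""
    found = []
    for node in ast.walk(ast.parse(source)):
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if not isinstance(block, list):
                continue  # e.g. the expression body of a lambda
            for i, stmt in enumerate(block):
                if isinstance(stmt, TERMINATORS):
                    found.extend(s.lineno for s in block[i + 1:])
                    break
    return sorted(found)


SAMPLE = """\
def f(x):
    if x:
        return 1
        print("never")
    return 2
    x = 3
"""

lines = unreachable_lines(SAMPLE)
```

Here ``lines`` holds the line numbers of the ``print`` call and the final assignment, both of which can never execute.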
|
||||
|
||||
The ``BasicBlock`` class
|
||||
------------------------
|
||||
|
||||
The ``BasicBlock`` class represents a `basic block <http://en.wikipedia.org/wiki/Basic_block>`__ of control flow nodes. The ``BasicBlock`` class is not that useful for writing queries directly, but is very useful for building complex analyses, such as data flow. The reason it is useful is that it shares many of the interesting properties of control flow nodes, such as what can reach what and what `dominates <http://en.wikipedia.org/wiki/Dominator_%28graph_theory%29>`__ what, but there are fewer basic blocks than control flow nodes - resulting in queries that are faster and use less memory.
|
||||
|
||||
Example: Finding mutually exclusive basic blocks
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Suppose we have the following Python code:
|
||||
|
||||
.. code-block:: python

   if condition():
       return 0
   pass
|
||||
|
||||
Can we determine that it is impossible to reach both the ``return 0`` statement and the ``pass`` statement in a single execution of this code? For two basic blocks to be mutually exclusive it must be impossible to reach either of them from the other. We can write:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from BasicBlock b1, BasicBlock b2
|
||||
where b1 != b2 and not b1.strictlyReaches(b2) and not b2.strictlyReaches(b1)
|
||||
select b1, b2
|
||||
|
||||
However, by that definition, two basic blocks are mutually exclusive if they are in different scopes. To make the results more useful, we require that both basic blocks can be reached from the same function entry point:
|
||||
|
||||
.. code-block:: ql

   exists(Function shared, BasicBlock entry |
       entry.contains(shared.getEntryNode()) and
       entry.strictlyReaches(b1) and entry.strictlyReaches(b2)
   )
|
||||
|
||||
Combining these conditions we get:
|
||||
|
||||
**Mutually exclusive blocks within the same function**
|
||||
|
||||
.. code-block:: ql

   import python

   from BasicBlock b1, BasicBlock b2
   where b1 != b2 and not b1.strictlyReaches(b2) and not b2.strictlyReaches(b1) and
   exists(Function shared, BasicBlock entry |
       entry.contains(shared.getEntryNode()) and
       entry.strictlyReaches(b1) and entry.strictlyReaches(b2)
   )
   select b1, b2
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/671000028/>`__. This typically gives a very large number of results, because it is a common occurrence in normal control flow. It is, however, an example of the sort of control-flow analysis that is possible. Control-flow analyses such as this are an important aid to data flow analysis which is covered in the next tutorial.
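The same reasoning can be tried out on a hand-written graph. The sketch below models the ``if condition(): return 0`` example as a small dictionary of basic blocks; the block names, the ``CFG`` layout, and both helper functions are assumptions made for illustration and do not reflect how the QL library stores control flow.

```python
from itertools import combinations


def reachable_from(graph, start):
    """Return every block reachable from `start` in one or more steps."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def mutually_exclusive(graph, entry):
    """Pairs of blocks reachable from `entry` where neither reaches the other."""
    reach = {node: reachable_from(graph, node) for node in graph}
    in_scope = sorted(reach[entry])
    return [
        (b1, b2)
        for b1, b2 in combinations(in_scope, 2)
        if b2 not in reach[b1] and b1 not in reach[b2]
    ]


# Hypothetical basic blocks for the example function body.
CFG = {
    "entry": ["return_0", "pass"],  # the `if condition():` test
    "return_0": ["exit"],
    "pass": ["exit"],
    "exit": [],
}

pairs = mutually_exclusive(CFG, "entry")
```

Unlike the QL query, which reports each pair in both orders, this sketch lists every unordered pair once.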
|
||||
|
||||
What next?
|
||||
----------
|
||||
|
||||
- Experiment with the worked examples in the QL for Python tutorial topic: :doc:`Taint tracking and data flow analysis in Python <taint-tracking>`.
|
||||
- Find out more about QL in the `QL language handbook <https://help.semmle.com/QL/ql-handbook/index.html>`__ and `QL language specification <https://help.semmle.com/QL/QLLanguageSpecification.html>`__.
|
||||
79
docs/language/learn-ql/python/functions.rst
Normal file
@@ -0,0 +1,79 @@
|
||||
Tutorial: Functions
|
||||
===================
|
||||
|
||||
This example uses the standard QL class ``Function`` (see :doc:`Introducing the Python libraries <introduce-libraries-python>`).
|
||||
|
||||
Finding all functions called "get..."
|
||||
-------------------------------------
|
||||
|
||||
In this example we look for all the "getters" in a program. Programmers moving to Python from Java are often tempted to write lots of getter and setter methods, rather than use properties. We might want to find those methods.
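For readers unfamiliar with the distinction, the following sketch contrasts the two styles. The class names ``JavaStyle`` and ``Pythonic`` are made up for this example.

```python
class JavaStyle:
    """Exposes the value through an explicit getter method."""

    def __init__(self, name):
        self._name = name

    def get_name(self):
        return self._name


class Pythonic:
    """Exposes the same value through a read-only property."""

    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return self._name


via_getter = JavaStyle("widget").get_name()
via_property = Pythonic("widget").name
```

Both classes expose the same data; the property version reads like plain attribute access at the call site.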
|
||||
|
||||
Using the member predicate ``Function.getName()``, we can list all of the getter functions in a snapshot:
|
||||
|
||||
Tip
|
||||
|
||||
Instead of copying this query, try typing the code. As you start to write a name that matches a library class, a pop-up is displayed making it easy for you to select the class that you want.
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Function f
|
||||
where f.getName().matches("get%")
|
||||
select f, "This is a function called get..."
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/669220031/>`__. This query typically finds a large number of results. Usually, many of these results are for functions (rather than methods) which we are not interested in.
|
||||
|
||||
Finding all methods called "get..."
|
||||
-----------------------------------
|
||||
|
||||
You can modify the query above to return more interesting results. As we are only interested in methods, we can use the ``Function.isMethod()`` predicate to refine the query.
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Function f
|
||||
where f.getName().matches("get%") and f.isMethod()
|
||||
select f, "This is a method called get..."
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/690010035/>`__. This finds methods whose name starts with ``"get"``, but many of those are not the sort of simple getters we are interested in.
|
||||
|
||||
Finding one line methods called "get..."
|
||||
----------------------------------------
|
||||
|
||||
We can modify the query further to include only methods whose body consists of a single statement. We do this by counting the number of statements in each method.
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Function f
|
||||
where f.getName().matches("get%") and f.isMethod()
|
||||
and count(f.getAStmt()) = 1
|
||||
select f, "This function is (probably) a getter."
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/667290044/>`__. This query returns fewer results, but if you examine the results you can see that there are still refinements to be made. This is refined further in :doc:`Tutorial: Statements and expressions <statements-expressions>`.
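As a rough cross-check of what the final query selects, the same three conditions can be approximated with Python's own ``ast`` module. This is a sketch under assumptions: it treats "method" as "function defined directly in a class body" and counts only top-level statements in the method body. ``probable_getters`` and ``SAMPLE`` are hypothetical names.

```python
import ast


def probable_getters(source):
    """Names of one-statement methods whose name starts with 'get'."""
    found = []
    for cls in ast.walk(ast.parse(source)):
        if not isinstance(cls, ast.ClassDef):
            continue
        for item in cls.body:
            if (isinstance(item, ast.FunctionDef)
                    and item.name.startswith("get")
                    and len(item.body) == 1):
                found.append(cls.name + "." + item.name)
    return sorted(found)


SAMPLE = """\
class Account:
    def get_balance(self):
        return self._balance

    def get_history(self):
        rows = self._load()
        return sorted(rows)

    def deposit(self, amount):
        self._balance += amount

def get_config():
    return {}
"""

getters = probable_getters(SAMPLE)
```

Only ``Account.get_balance`` passes all three filters: ``get_history`` has two statements, ``deposit`` has the wrong name, and ``get_config`` is a plain function rather than a method.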
|
||||
|
||||
Finding a call to a specific function
|
||||
-------------------------------------
|
||||
|
||||
This query uses ``Call`` and ``Name`` to find calls to the function ``input``, which might be a security hazard (in Python 2).
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Call call, Name name
|
||||
where call.getFunc() = name and name.getId() = "input"
|
||||
select call, "call to 'input'."
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/686330029/>`__. Some of the demo projects on LGTM.com use this function.
|
||||
|
||||
The ``Call`` class represents calls in Python. The ``Call.getFunc()`` predicate gets the expression being called. ``Name.getId()`` gets the identifier (as a string) of the ``Name`` expression. Due to the dynamic nature of Python, this query will select any call of the form ``input(...)`` regardless of whether it is a call to the built-in function ``input`` or not. In a later tutorial we will see how to use the type-inference library to find calls to the built-in function ``input`` regardless of the name of the variable called.
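The limitation of matching by name can be reproduced with a small sketch based on Python's ``ast`` module, which performs the same kind of purely syntactic match. ``calls_named`` and ``SAMPLE`` are hypothetical names used only here.

```python
import ast


def calls_named(source, name):
    """Line numbers of calls whose callee is the bare name `name`."""
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
    )


SAMPLE = """\
def input(prompt):
    return "canned"

value = input("> ")
other = obj.input()
"""

hits = calls_named(SAMPLE, "input")
```

The call on line 4 is reported even though ``input`` has been redefined in the module and is no longer the built-in, while ``obj.input()`` is skipped because its callee is an attribute rather than a name.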
|
||||
|
||||
What next?
|
||||
----------
|
||||
|
||||
- Experiment with the worked examples in the QL for Python tutorial topics: :doc:`Statements and expressions <statements-expressions>`, :doc:`Control flow <control-flow>`, :doc:`Points-to analysis and type inference <pointsto-type-infer>`.
|
||||
- Find out more about QL in the `QL language handbook <https://help.semmle.com/QL/ql-handbook/index.html>`__ and `QL language specification <https://help.semmle.com/QL/QLLanguageSpecification.html>`__.
|
||||
329
docs/language/learn-ql/python/introduce-libraries-python.rst
Normal file
@@ -0,0 +1,329 @@
|
||||
Introducing the QL libraries for Python
|
||||
=======================================
|
||||
|
||||
These libraries have been created to help you analyze Python code, providing an object-oriented layer on top of the raw data in the snapshot database. They are written in standard QL.
|
||||
|
||||
The QL libraries all have a ``.qll`` extension, to signify that they contain QL library code but no actual queries. Each file contains a QL class or hierarchy of classes.
|
||||
|
||||
You can include all of the standard libraries by beginning each query with this statement:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
The rest of this tutorial summarizes the contents of the standard QL libraries. We recommend that you read this and then work through the practical examples in the Python tutorials shown at the end of the page.
|
||||
|
||||
Overview of the library
|
||||
-----------------------
|
||||
|
||||
The QL Python library incorporates a large number of classes, each class corresponding either to one kind of entity in Python source code or to an entity that can be derived from the source code using static analysis. These classes can be divided into four categories:
|
||||
|
||||
- **Syntactic** - classes that represent entities in the Python source code.
|
||||
- **Control flow** - classes that represent entities from the control flow graphs.
|
||||
- **Data flow** - classes that assist in performing data flow analyses on Python source code.
|
||||
- **Type inference** - classes that represent the inferred types of entities in the Python source code.
|
||||
|
||||
Syntactic classes
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
This part of the library represents the Python source code. The ``Module``, ``Class`` and ``Function`` classes correspond to Python modules, classes and functions respectively; collectively, these are known as ``Scope`` classes. Each ``Scope`` contains a list of statements, each of which is represented by a subclass of the class ``Stmt``. Statements themselves can contain other statements or expressions, which are represented by subclasses of ``Expr``. Finally, there are a few additional classes for the parts of more complex expressions such as list comprehensions. Collectively these classes are subclasses of ``AstNode`` and form an `Abstract syntax tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`__ (AST). The root of each AST is a ``Module``.
|
||||
|
||||
`Symbolic information <http://en.wikipedia.org/wiki/Symbol_table>`__ is attached to the AST in the form of variables (represented by the class ``Variable``).
|
||||
|
||||
Scope
|
||||
^^^^^
|
||||
|
||||
A Python program is a group of modules. Technically a module is just a list of statements, but we often think of it as composed of classes and functions. These top-level entities, the module, class and function, are represented by the three classes `Module <https://help.semmle.com/qldoc/python/semmle/python/Module.qll/type.Module$Module.html>`__, `Class <https://help.semmle.com/qldoc/python/semmle/python/Class.qll/type.Class$Class.html>`__ and `Function <https://help.semmle.com/qldoc/python/semmle/python/Function.qll/type.Function$Function.html>`__, which are all subclasses of ``Scope``.
|
||||
|
||||
- ``Scope``
|
||||
|
||||
- ``Module``
|
||||
- ``Class``
|
||||
- ``Function``
|
||||
|
||||
All scopes are basically a list of statements, although ``Scope`` classes have additional attributes such as names. For example, the following query finds all functions whose scope (the scope in which they are declared) is also a function:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Function f
|
||||
where f.getScope() instanceof Function
|
||||
select f
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/665620040/>`__. Many projects have nested functions.
|
||||
|
||||
Statement
|
||||
^^^^^^^^^
|
||||
|
||||
A statement is represented by the `Stmt <https://help.semmle.com/qldoc/python/semmle/python/Stmts.qll/type.Stmts$Stmt.html>`__ class which has about 20 subclasses representing the various kinds of statements, such as the ``Pass`` statement, the ``Return`` statement or the ``For`` statement. Statements are usually made up of parts. The most common of these is the expression, represented by the ``Expr`` class. For example, take the following Python ``for`` statement:
|
||||
|
||||
.. code-block:: python

   for var in seq:
       pass
   else:
       return 0
|
||||
|
||||
The QL `For <https://help.semmle.com/qldoc/python/semmle/python/Stmts.qll/type.Stmts$For.html>`__ class representing the ``for`` statement has a number of member predicates to access its parts:
|
||||
|
||||
- ``getTarget()`` returns the ``Expr`` representing the variable ``var``.
|
||||
- ``getIter()`` returns the ``Expr`` representing the variable ``seq``.
|
||||
- ``getBody()`` returns the statement list body.
|
||||
- ``getStmt(0)`` returns the pass ``Stmt``.
|
||||
- ``getOrElse()`` returns the ``StmtList`` containing the return statement.
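Python's own ``ast`` module splits a ``for`` statement into similarly named parts, which can help when building intuition for the member predicates listed above. The sketch below uses only the standard library; the wrapping function ``f`` is an assumption added so that the ``return`` is legal, and the correspondence with the QL predicates is an informal analogy rather than a documented mapping.

```python
import ast

SOURCE = """\
def f(seq):
    for var in seq:
        pass
    else:
        return 0
"""

# Locate the single `for` statement in the parsed module.
loop = next(n for n in ast.walk(ast.parse(SOURCE)) if isinstance(n, ast.For))

parts = {
    "target": loop.target.id,                           # compare getTarget()
    "iter": loop.iter.id,                               # compare getIter()
    "body": [type(s).__name__ for s in loop.body],      # compare getBody()
    "orelse": [type(s).__name__ for s in loop.orelse],  # compare getOrElse()
}
```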
|
||||
|
||||
Expression
|
||||
^^^^^^^^^^
|
||||
|
||||
Most statements are made up of expressions. The `Expr <https://help.semmle.com/qldoc/python/semmle/python/Exprs.qll/type.Exprs$Expr.html>`__ class is the superclass of all expression classes, of which there are about 30 including calls, comprehensions, tuples, lists and arithmetic operations. For example, the Python expression ``a+2`` is represented by the ``BinaryExpr`` class:
|
||||
|
||||
- ``getLeft()`` returns the ``Expr`` representing the ``a``.
|
||||
- ``getRight()`` returns the ``Expr`` representing the ``2``.
|
||||
|
||||
As an example, to find expressions of the form ``a+2`` where the left is a simple name and the right is a numeric constant we can use the following query:
|
||||
|
||||
**Finding expressions of the form "a+2"**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from BinaryExpr bin
|
||||
where bin.getLeft() instanceof Name and bin.getRight() instanceof Num
|
||||
select bin
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/669950026/>`__. Many projects include examples of this pattern.
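A comparable check can be sketched with Python's ``ast`` module, where the corresponding node types are ``BinOp``, ``Name`` and ``Constant``. Like the query, it accepts any binary operator, not only ``+``. The function name ``name_op_number`` and the ``SAMPLE`` string are hypothetical.

```python
import ast


def name_op_number(source):
    """Find binary operations with a name on the left and a number on the right."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.BinOp)
                and isinstance(node.left, ast.Name)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.right.value, (int, float, complex))
                and not isinstance(node.right.value, bool)):
            matches.append((node.lineno, node.left.id, node.right.value))
    return sorted(matches)


SAMPLE = "x = a + 2\ny = a + b\nz = 3 * c\nw = d - 4.5\n"

found = name_op_number(SAMPLE)
```

Lines 2 and 3 are skipped: ``a + b`` has a name on the right, and ``3 * c`` has the number on the left.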
|
||||
|
||||
Variable
|
||||
^^^^^^^^
|
||||
|
||||
Variables are represented by the `Variable <https://help.semmle.com/qldoc/python/semmle/python/Variables.qll/type.Variables$Variable.html>`__ class in the Python QL library. There are two subclasses, ``LocalVariable`` for function-level and class-level variables and ``GlobalVariable`` for module-level variables.
|
||||
|
||||
Other source code elements
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Although the meaning of the program is encoded by the syntactic elements, ``Scope``, ``Stmt`` and ``Expr`` there are some parts of the source code not covered by the abstract syntax tree. The most useful of these is the `Comment <https://help.semmle.com/qldoc/python/semmle/python/Comment.qll/type.Comment$Comment.html>`__ class which describes comments in the source code.
|
||||
|
||||
Examples
|
||||
^^^^^^^^
|
||||
|
||||
Each syntactic element in Python source is recorded in the snapshot. These can be queried via the corresponding class. Let us start with a couple of simple examples.
|
||||
|
||||
1. Finding all finally blocks
|
||||
'''''''''''''''''''''''''''''
|
||||
|
||||
For our first example, we can find all ``finally`` blocks by using the ``Try`` class:
|
||||
|
||||
**Find all ``finally`` blocks**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Try t
|
||||
select t.getFinalbody()
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/659662193/>`__. Many projects include examples of this pattern.
|
||||
|
||||
2. Finding 'except' blocks that do nothing
|
||||
''''''''''''''''''''''''''''''''''''''''''
|
||||
|
||||
For our second example, we can use a simplified version of a query from the standard query set. We look for all ``except`` blocks that do nothing.
|
||||
|
||||
A block that does nothing is one that contains no statements except ``pass`` statements. We can encode this as:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
not exists(Stmt s | s = ex.getAStmt() | not s instanceof Pass)
|
||||
|
||||
where ``ex`` is an ``ExceptStmt`` and ``Pass`` is the class representing ``pass`` statements. Instead of using the double negative, **"no**\ *statements that are*\ **not**\ *pass statements"*, this can also be expressed positively, "all statements must be pass statements." The positive form is expressed in QL using the ``forall`` quantifier:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
forall(Stmt s | s = ex.getAStmt() | s instanceof Pass)
|
||||
|
||||
Both forms are equivalent. Using the positive QL expression, the whole query looks like this:
|
||||
|
||||
**Find pass-only ``except`` blocks**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from ExceptStmt ex
|
||||
where forall(Stmt s | s = ex.getAStmt() | s instanceof Pass)
|
||||
select ex
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/690010036/>`__. Many projects include pass-only ``except`` blocks.
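The equivalence between the double-negative form and the ``forall`` form mirrors the relationship between ``any`` and ``all`` in Python. The sketch below checks it over a few hand-picked statement lists; representing statements as strings is an assumption made to keep the example small, and the empty list is included only to show that both forms are vacuously true when there is nothing to check.

```python
def all_pass(stmts):
    """Positive form: every statement is a pass statement."""
    return all(s == "pass" for s in stmts)


def no_non_pass(stmts):
    """Double-negative form: no statement exists that is not a pass."""
    return not any(s != "pass" for s in stmts)


CASES = [
    [],
    ["pass"],
    ["pass", "pass"],
    ["pass", "log()"],
    ["log()"],
]

# The two formulations agree on every case.
agree = all(all_pass(case) == no_non_pass(case) for case in CASES)
pass_only = [case for case in CASES if all_pass(case)]
```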
|
||||
|
||||
Summary
|
||||
^^^^^^^
|
||||
|
||||
The most commonly used standard QL library classes in the syntactic part of the library are organized as follows:
|
||||
|
||||
``Module``, ``Class``, ``Function``, ``Stmt`` and ``Expr`` - they are all subclasses of `AstNode <https://help.semmle.com/qldoc/python/semmle/python/AST.qll/type.AST$AstNode.html>`__.
|
||||
|
||||
Abstract syntax tree
|
||||
''''''''''''''''''''
|
||||
|
||||
- ``AstNode``
|
||||
|
||||
- ``Module`` – A Python module
|
||||
- ``Class`` – The body of a class definition
|
||||
- ``Function`` – The body of a function definition
|
||||
- ``Stmt`` – A statement
|
||||
|
||||
- ``Assert`` – An ``assert`` statement
|
||||
- ``Assign`` – An assignment
|
||||
|
||||
- ``AssignStmt`` – An assignment statement, ``x = y``
|
||||
- ``ClassDef`` – A class definition statement
|
||||
- ``FunctionDef`` – A function definition statement
|
||||
|
||||
- ``AugAssign`` – An augmented assignment, ``x += y``
|
||||
- ``Break`` – A ``break`` statement
|
||||
- ``Continue`` – A ``continue`` statement
|
||||
- ``Delete`` – A ``del`` statement
|
||||
- ``ExceptStmt`` – The ``except`` part of a ``try`` statement
|
||||
- ``Exec`` – An ``exec`` statement
|
||||
- ``For`` – A ``for`` statement
|
||||
- ``If`` – An ``if`` statement
|
||||
- ``Pass`` – A ``pass`` statement
|
||||
- ``Print`` – A ``print`` statement (Python 2 only)
|
||||
- ``Raise`` – A ``raise`` statement
|
||||
- ``Return`` – A ``return`` statement
|
||||
- ``Try`` – A ``try`` statement
|
||||
- ``While`` – A ``while`` statement
|
||||
- ``With`` – A ``with`` statement
|
||||
|
||||
- ``Expr`` – An expression
|
||||
|
||||
- ``Attribute`` – An attribute, ``obj.attr``
|
||||
- ``Call`` – A function call, ``f(arg)``
|
||||
- ``IfExp`` – A conditional expression, ``x if cond else y``
|
||||
- ``Lambda`` – A lambda expression
|
||||
- ``Yield`` – A ``yield`` expression
|
||||
- ``Bytes`` – A bytes literal, ``b"x"`` or (in Python 2) ``"x"``
|
||||
- ``Unicode`` – A unicode literal, ``u"x"`` or (in Python 3) ``"x"``
|
||||
- ``Num`` – A numeric literal, ``3`` or ``4.2``
|
||||
|
||||
- ``IntegerLiteral``
|
||||
- ``FloatLiteral``
|
||||
- ``ImaginaryLiteral``
|
||||
|
||||
- ``Dict`` – A dictionary literal, ``{'a': 2}``
|
||||
- ``Set`` – A set literal, ``{'a', 'b'}``
|
||||
- ``List`` – A list literal, ``['a', 'b']``
|
||||
- ``Tuple`` – A tuple literal, ``('a', 'b')``
|
||||
- ``DictComp`` – A dictionary comprehension, ``{k: v for ...}``
|
||||
- ``SetComp`` – A set comprehension, ``{x for ...}``
|
||||
- ``ListComp`` – A list comprehension, ``[x for ...]``
|
||||
- ``GenExpr`` – A generator expression, ``(x for ...)``
|
||||
- ``Subscript`` – A subscript operation, ``seq[index]``
|
||||
- ``Name`` – A reference to a variable, ``var``
|
||||
- ``UnaryExpr`` – A unary operation, ``-x``
|
||||
- ``BinaryExpr`` – A binary operation, ``x+y``
|
||||
- ``Compare`` – A comparison operation, ``0 < x < 10``
|
||||
- ``BoolExpr`` – Short circuit logical operations, ``x and y``, ``x or y``
|
||||
|
||||
Variables
|
||||
'''''''''
|
||||
|
||||
- ``Variable`` – A variable
|
||||
|
||||
- ``LocalVariable`` – A variable local to a function or a class
|
||||
- ``GlobalVariable`` – A module level variable
|
||||
|
||||
Other
|
||||
'''''
|
||||
|
||||
- ``Comment`` – A comment
|
||||
|
||||
Control flow classes
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This part of the library represents the control flow graph of each ``Scope`` (classes, functions, and modules). Each ``Scope`` contains a graph of ``ControlFlowNode`` elements. Each scope has a single entry point and at least one (potentially many) exit points. To speed up control and data flow analysis, control flow nodes are grouped into `basic blocks <http://en.wikipedia.org/wiki/Basic_block>`__.
|
||||
|
||||
As an example, we might want to find the longest sequence of code without any branches. A ``BasicBlock`` is, by definition, a sequence of code without any branches, so we just need to find the longest ``BasicBlock``.
|
||||
|
||||
First of all we introduce a simple predicate ``bb_length()`` which relates ``BasicBlock``\ s to their length.
|
||||
|
||||
.. code-block:: ql

   int bb_length(BasicBlock b) {
       result = max(int i | exists(b.getNode(i)) | i) + 1
   }
|
||||
|
||||
Each ``ControlFlowNode`` within a ``BasicBlock`` is numbered consecutively, starting from zero. The length of a ``BasicBlock`` is therefore one more than the largest index within that ``BasicBlock``.
|
||||
|
||||
Using this predicate we can select the longest ``BasicBlock`` by selecting the ``BasicBlock`` whose length is equal to the maximum length of any ``BasicBlock``:
|
||||
|
||||
**Find the longest sequence of code without branches**
|
||||
|
||||
.. code-block:: ql

   import python

   int bb_length(BasicBlock b) {
       result = max(int i | exists(b.getNode(i)) | i) + 1
   }

   from BasicBlock b
   where bb_length(b) = max(bb_length(_))
   select b
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/666730036/>`__. When we ran it on the LGTM.com demo projects, the *openstack/nova* and *ytdl-org/youtube-dl* projects both contained source code results for this query.
|
||||
|
||||
.. pull-quote::
|
||||
|
||||
Note
|
||||
|
||||
The special underscore variable ``_`` means any value; so ``bb_length(_)`` is the length of any block.
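The arithmetic behind ``bb_length()`` and the "longest block" selection can be checked with ordinary Python. In this sketch each basic block is modelled as the list of indices of its nodes; the ``BLOCKS`` data and the helper name ``block_length`` are invented for illustration.

```python
def block_length(indices):
    """Length of a block whose nodes are numbered consecutively from zero."""
    return max(indices) + 1


# Hypothetical blocks, each given as the indices of its control flow nodes.
BLOCKS = {
    "b0": [0, 1, 2],
    "b1": [0],
    "b2": [0, 1, 2, 3, 4],
}

lengths = {name: block_length(ix) for name, ix in BLOCKS.items()}

# Mirrors `bb_length(b) = max(bb_length(_))`: keep blocks of maximal length.
longest = [name for name, n in lengths.items() if n == max(lengths.values())]
```

If several blocks share the maximal length, all of them are kept, just as the QL query would select every block that satisfies the condition.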
|
||||
|
||||
Summary
|
||||
^^^^^^^
|
||||
|
||||
The classes in the control-flow part of the library are:
|
||||
|
||||
- `ControlFlowNode <https://help.semmle.com/qldoc/python/semmle/python/Flow.qll/type.Flow$ControlFlowNode.html>`__ – A control-flow node. There is a one-to-many relation between AST nodes and control-flow nodes.
|
||||
- `BasicBlock <https://help.semmle.com/qldoc/python/semmle/python/Flow.qll/type.Flow$BasicBlock.html>`__ – A non-branching list of control-flow nodes.
|
||||
|
||||
Data flow
|
||||
~~~~~~~~~
|
||||
|
||||
The ``SsaVariable`` class represents `static single assignment form <http://en.wikipedia.org/wiki/Static_single_assignment_form>`__ variables (SSA variables). There is a one-to-many relation between variables and SSA variables. The ``SsaVariable`` class provides an accurate and fast means of tracking data flow from definition to use; the ``SsaVariable`` class is an important element for building data flow analyses, including type inference.
|
||||
|
||||
Type-inference classes
|
||||
----------------------
|
||||
|
||||
The QL library for Python also supplies some classes for accessing the inferred types of values. The classes ``Object`` and ``ClassObject`` allow you to query the possible classes that an expression may have at runtime. For example, which ``ClassObjects`` are iterable can be determined using the query:
|
||||
|
||||
**Find iterable ``ClassObjects``**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from ClassObject cls
|
||||
where cls.hasAttribute("__iter__")
|
||||
select cls
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/688180005/>`__. This query returns a list of classes for the projects analyzed. If you want to include the results for `builtin classes <http://docs.python.org/library/stdtypes.html>`__, which do not have any Python source code, show the non-source results.
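At runtime, the closest Python counterpart to this attribute test is ``hasattr``. The sketch below applies it to a mix of built-in and user-defined classes; ``Countdown``, ``Plain`` and ``iterable_classes`` are hypothetical names, and the result reflects Python 3 built-ins.

```python
def iterable_classes(classes):
    """Names of the classes that define or inherit __iter__."""
    return [cls.__name__ for cls in classes if hasattr(cls, "__iter__")]


class Countdown:
    def __iter__(self):
        return iter(())


class Plain:
    pass


names = iterable_classes([list, dict, str, int, Countdown, Plain])
```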
|
||||
|
||||
Summary
|
||||
~~~~~~~
|
||||
|
||||
- `Object <https://help.semmle.com/qldoc/python/semmle/python/types/Object.qll/type.Object$Object.html>`__
|
||||
|
||||
- ``ClassObject``
|
||||
- ``FunctionObject``
|
||||
- ``ModuleObject``
|
||||
|
||||
These classes are explained in more detail in :doc:`Tutorial: Points-to analysis and type inference <pointsto-type-infer>`.
|
||||
|
||||
What next?
|
||||
----------
|
||||
|
||||
- Experiment with the worked examples in the QL for Python tutorial topics: :doc:`Functions <functions>`, :doc:`Statements and expressions <statements-expressions>`, :doc:`Control flow <control-flow>` and :doc:`Points-to analysis and type inference <pointsto-type-infer>`.
|
||||
- Find out more about QL in the `QL language handbook <https://help.semmle.com/QL/ql-handbook/index.html>`__ and `QL language specification <https://help.semmle.com/QL/QLLanguageSpecification.html>`__.
|
||||
230
docs/language/learn-ql/python/pointsto-type-infer.rst
Normal file
@@ -0,0 +1,230 @@
|
||||
Tutorial: Points-to analysis and type inference
|
||||
===============================================
|
||||
|
||||
This topic contains worked examples of how to write queries using the standard QL library classes for Python type inference.
|
||||
|
||||
The ``Object`` class
|
||||
--------------------
|
||||
|
||||
The ``Object`` class and its subclasses ``FunctionObject``, ``ClassObject`` and ``ModuleObject`` represent the values an expression may hold at runtime.
|
||||
|
||||
Summary
|
||||
~~~~~~~
|
||||
|
||||
Class hierarchy for ``Object``:
|
||||
|
||||
- `Object <https://help.semmle.com/qldoc/python/semmle/python/types/Object.qll/type.Object$Object.html>`__
|
||||
|
||||
- ``ClassObject``
|
||||
- ``FunctionObject``
|
||||
- ``ModuleObject``
|
||||
|
||||
Points-to analysis and type inference
|
||||
-------------------------------------
|
||||
|
||||
Points-to analysis, sometimes known as `pointer analysis <http://en.wikipedia.org/wiki/Pointer_analysis>`__, allows us to determine which objects an expression may "point to" at runtime.
|
||||
|
||||
`Type inference <http://en.wikipedia.org/wiki/Type_inference>`__ allows us to infer what the types (classes) of an expression may be at runtime.
|
||||
|
||||
The predicate ``ControlFlowNode.refersTo(...)`` shows which object a control flow node may "refer to" at runtime.
|
||||
|
||||
``ControlFlowNode.refersTo(...)`` has three variants:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
predicate refersTo(Object object)
|
||||
predicate refersTo(Object object, ControlFlowNode origin)
|
||||
predicate refersTo(Object object, ClassObject cls, ControlFlowNode origin)
|
||||
|
||||
``object`` is an object that the control flow node refers to, ``origin`` is where the object comes from, which is useful for displaying meaningful results, and ``cls`` is the inferred class of the ``object``.
|
||||
|
||||
.. pull-quote::
|
||||
|
||||
Note
|
||||
|
||||
``ControlFlowNode.refersTo()`` cannot find all objects that a control flow node might point to, as it is impossible to be accurate and find all possible values. We prefer precision (no incorrect values) over recall (finding as many values as possible). We do this so that queries based on points-to analysis have fewer false positives and are thus more useful.
|
||||
|
||||
For complex data flow analyses, involving multiple stages, the ``ControlFlowNode`` version is more precise, but for simple use cases the ``Expr`` based version is easier to use. For convenience, the ``Expr`` class provides the same predicate, ``Expr.refersTo(...)``, which also has three variants:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
predicate refersTo(Object object)
|
||||
predicate refersTo(Object object, AstNode origin)
|
||||
predicate refersTo(Object object, ClassObject cls, AstNode origin)
|
||||
|
||||
Using points-to analysis
|
||||
------------------------
|
||||
|
||||
In this example we use points-to analysis to build a more complex query. This query is included in the standard query set.
|
||||
|
||||
We want to find ``except`` blocks in a ``try`` statement that are in the wrong order. That is, where a more general exception type precedes a more specific one, which is a problem as the second ``except`` handler will never be executed.
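As a minimal sketch of the problem in plain Python, the hypothetical function ``handle`` below places a general handler before a more specific one. Python accepts this code, but the second handler can never run.

```python
def handle(exc):
    """Raise `exc` and report which handler caught it."""
    try:
        raise exc
    except Exception:
        return "general handler"
    except ValueError:
        # Unreachable: ValueError is a subclass of Exception,
        # so the handler above always catches it first.
        return "specific handler"


result = handle(ValueError("bad value"))
```

Swapping the two ``except`` clauses would make both handlers reachable.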
|
||||
|
||||
First we can write a query to find ordered pairs of ``except`` blocks for a ``try`` statement.
|
||||
|
||||
**Ordered except blocks in same ``try`` statement**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Try t, ExceptStmt ex1, ExceptStmt ex2
|
||||
where
|
||||
exists(int i, int j |
|
||||
ex1 = t.getHandler(i) and ex2 = t.getHandler(j) and i < j
|
||||
)
|
||||
select t, ex1, ex2
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/672320024/>`__. Many projects contain ordered ``except`` blocks in a ``try`` statement.
|
||||
|
||||
Here ``ex1`` and ``ex2`` are both ``except`` handlers in the ``try`` statement ``t``. By using the indices ``i`` and ``j`` we can also ensure that ``ex1`` precedes ``ex2``.
|
||||
|
||||
The results of this query need to be filtered to return only results where ``ex1`` is more general than ``ex2``. We can use the fact that an ``except`` block is more general than another block if the class it handles is a superclass of the other.
|
||||
|
||||
**More general ``except`` block**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
exists(ClassObject cls1, ClassObject cls2 |
|
||||
ex1.getType().refersTo(cls1) and
|
||||
ex2.getType().refersTo(cls2) |
|
||||
cls1 = cls2.getASuperType()
|
||||
)
|
||||
|
||||
The line:
|
||||
|
||||
::
|
||||
|
||||
ex1.getType().refersTo(cls1)
|
||||
|
||||
ensures that ``cls1`` is a ``ClassObject`` that the ``except`` block would handle.
|
||||
|
||||
Combining the parts of the query we get this:
|
||||
|
||||
**More general ``except`` block precedes more specific**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Try t, ExceptStmt ex1, ExceptStmt ex2
|
||||
where
|
||||
exists(int i, int j |
|
||||
ex1 = t.getHandler(i) and ex2 = t.getHandler(j) and i < j
|
||||
)
|
||||
and
|
||||
exists(ClassObject cls1, ClassObject cls2 |
|
||||
ex1.getType().refersTo(cls1) and
|
||||
ex2.getType().refersTo(cls2) |
|
||||
cls1 = cls2.getASuperType()
|
||||
)
|
||||
select t, ex1, ex2
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/669950027/>`__. This query finds only one result in the demo projects on LGTM.com (`youtube-dl <https://lgtm.com/projects/g/ytdl-org/youtube-dl/rev/39e9d524e5fe289936160d4c599a77f10f6e9061/files/devscripts/buildserver.py?sort=name&dir=ASC&mode=heatmap#L413>`__). The result is also highlighted by the standard query: `Unreachable 'except' block <https://lgtm.com/rules/7900089>`__.
|
||||
|
||||
.. pull-quote::
|
||||
|
||||
Note
|
||||
|
||||
If you want to submit a query for use in LGTM, then the format must be of the form ``select`` ``element`` ``message``. For example, you might replace the ``select`` statement with: ``select t, "Incorrect order of except blocks; more general precedes more specific"``
|
||||
|
||||
Using type inference
|
||||
--------------------
|
||||
|
||||
In this example we use type inference to determine when an object is used as a sequence in a ``for`` statement even though that object might not be iterable.
|
||||
|
||||
First, find which object is used as the iterator in the ``for`` loop:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
from For loop, Object iter
|
||||
where loop.getIter().refersTo(iter)
|
||||
select loop, iter
|
||||
|
||||
Then we need to determine if the class of that object is iterable. ``ClassObject`` provides the predicate ``isIterable()``, which we can combine with the longer form of ``Expr.refersTo()`` to get the class of the loop iterator, giving us this:
|
||||
|
||||
**Find non-iterable object used as a loop iterator**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from For loop, Object iter, ClassObject cls
|
||||
where loop.getIter().refersTo(iter, cls, _)
|
||||
and not cls.isIterable()
|
||||
select loop, cls
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/670720182/>`__. Many projects use a non-iterable as a loop iterator.
|
||||
|
||||
Many of the results shown will have ``cls`` as ``NoneType``. It is more informative to show where these ``None`` values may come from. To do this we use the final argument of ``refersTo``, ``origin``, as follows:
|
||||
|
||||
**Find non-iterable object used as a loop iterator 2**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from For loop, Object iter, ClassObject cls, AstNode origin
|
||||
where loop.getIter().refersTo(iter, cls, origin)
|
||||
and not cls.isIterable()
|
||||
select loop, cls, origin
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/672230046/>`__. This reports the same results, but with a third column showing the source of the ``None`` values.
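The following plain Python sketch shows the kind of code these queries are aimed at. The helper names ``lookup`` and ``count_items`` are hypothetical. Whether the analysis reports this exact snippet is an assumption, but the runtime failure it illustrates is real.

```python
def lookup(table, key):
    """Hypothetical helper that implicitly returns None when the key is missing."""
    if key in table:
        return table[key]


def count_items(table, key):
    total = 0
    for _ in lookup(table, key):  # raises TypeError when lookup() returns None
        total += 1
    return total


def fails_for_missing_key():
    """Return True if iterating over the missing-key result raises TypeError."""
    try:
        count_items({"a": [1, 2]}, "b")
    except TypeError:
        return True
    return False
```

The implicit ``return None`` at the end of ``lookup`` is the sort of location that the ``origin`` column is intended to point at.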
|
||||
|
||||
Finding calls to functions using call-graph analysis
|
||||
----------------------------------------------------
|
||||
|
||||
The ``FunctionObject`` class is a subclass of ``Object`` and corresponds to function objects in Python, in much the same way as the ``ClassObject`` class corresponds to class objects in Python.
|
||||
|
||||
The ``FunctionObject`` class has a method ``getACall()`` which allows us to find calls to a particular function (including builtin functions).
|
||||
|
||||
Returning to an example from :doc:`Tutorial: Functions <functions>`, we wish to find calls to the ``input`` function.
|
||||
|
||||
The original query looked like this:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Call call, Name name
|
||||
where call.getFunc() = name and name.getId() = "input"
|
||||
select call, "call to 'input'."
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/690010037/>`__. Two of the demo projects on LGTM.com have calls that match this pattern.
|
||||
|
||||
There are two problems with this query:
|
||||
|
||||
- It assumes that any call to something named "input" is a call to the builtin ``input`` function, which may result in some false positive results.
|
||||
- It assumes that ``input`` cannot be referred to by any other name, which may result in some false negative results.
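Both problems can be sketched in plain Python. The module-level function and the alias ``ask`` below are hypothetical. The builtin is only aliased, never called, so the snippet does not read from standard input.

```python
import builtins


def input(prompt):
    """A hypothetical module-level function that shadows the builtin name."""
    return "canned answer"


# Same name, different function: a name-based query reports this call (false positive).
answer = input("question? ")

# Different name, same function: a name-based query misses calls made through `ask`
# (false negative).
ask = builtins.input
```

A query that identifies the function object itself, rather than the name used at the call site, avoids both problems.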
|
||||
|
||||
We can get much more accurate results using call-graph analysis. First, we can precisely identify the ``FunctionObject`` for the ``input`` function, by using the ``builtin_object`` QL predicate as follows:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from FunctionObject input
|
||||
where input = builtin_object("input")
|
||||
select input
|
||||
|
||||
Then we can use ``FunctionObject.getACall()`` to identify calls to the ``input`` function, as follows:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from ControlFlowNode call, FunctionObject input
|
||||
where input = builtin_object("input") and
|
||||
call = input.getACall()
|
||||
select call, "call to 'input'."
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/670490037/>`__. This accurately identifies calls to the builtin ``input`` function even when they are referred to using an alternative name. Any false positive results with calls to other ``input`` functions, reported by the original query, have been eliminated. It finds one result in files referenced by the *saltstack/salt* project.
|
||||
|
||||
What next?
|
||||
----------
|
||||
|
||||
For more information on writing QL, see:
|
||||
|
||||
- `QL language handbook <https://help.semmle.com/QL/ql-handbook/index.html>`__ - an introduction to the concepts of QL.
|
||||
- :doc:`Learning QL <../../index>` - an overview of the resources for learning how to write your own QL queries.
|
||||
- `Database generation <https://lgtm.com/help/lgtm/generate-database>`__ - an overview of the process that creates a snapshot from source code.
|
||||
- :doc:`What's in a snapshot? <../snapshot>` - a description of the snapshot database.
|
||||
37
docs/language/learn-ql/python/ql-for-python.rst
Normal file
@@ -0,0 +1,37 @@
|
||||
QL for Python
|
||||
=============
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
:hidden:
|
||||
|
||||
introduce-libraries-python
|
||||
functions
|
||||
statements-expressions
|
||||
control-flow
|
||||
control-flow-graph
|
||||
taint-tracking
|
||||
pointsto-type-infer
|
||||
|
||||
The following tutorials and worked examples are designed to help you learn how to write effective and efficient QL queries for Python projects. You should work through these topics in the order displayed.
|
||||
|
||||
- `Basic Python QL query <https://lgtm.com/help/lgtm/console/ql-python-basic-example>`__ describes how to write and run queries using LGTM.
|
||||
|
||||
- :doc:`Introducing the QL libraries for Python <introduce-libraries-python>` - an introduction to the standard QL libraries used to write queries for Python code.
|
||||
|
||||
- :doc:`Tutorial: Functions <functions>` - worked examples of how to write queries using the standard QL library classes for Python functions.
|
||||
|
||||
- :doc:`Tutorial: Statements and expressions <statements-expressions>` - worked examples of how to write queries using the standard QL library classes for Python statements and expressions.
|
||||
|
||||
- :doc:`Tutorial: Control flow <control-flow>` - worked examples of how to write queries using the standard QL library classes for Python control flow.
|
||||
|
||||
- :doc:`Tutorial: Points-to analysis and type inference <pointsto-type-infer>` - worked examples of how to write queries using the standard QL library classes for Python type inference.
|
||||
|
||||
- :doc:`Taint tracking and data flow analysis in Python <taint-tracking>` - worked examples of how to write queries using the standard taint tracking and data flow QL libraries for Python.
|
||||
|
||||
Other resources
|
||||
---------------
|
||||
|
||||
- For examples of how to query common Python elements, see the `Python QL cookbook <https://help.semmle.com/wiki/display/CBPYTHON>`__.
|
||||
- For the queries used in LGTM, display a `Python query <https://lgtm.com/search?q=language%3Apython&t=rules>`__ and click **Open in query console** to see the QL code used to find alerts.
|
||||
- For more information about the Python QL library see the `QL library for Python <https://help.semmle.com/qldoc/python>`__.
|
||||
286
docs/language/learn-ql/python/statements-expressions.rst
Normal file
@@ -0,0 +1,286 @@
|
||||
Tutorial: Statements and expressions
|
||||
====================================
|
||||
|
||||
Statements
|
||||
----------
|
||||
|
||||
The bulk of Python code takes the form of statements. Each different type of statement in Python is represented by a separate class in QL.
|
||||
|
||||
Here is the full class hierarchy:
|
||||
|
||||
- ``Stmt`` – A statement
|
||||
|
||||
- ``Assert`` – An ``assert`` statement
|
||||
- ``Assign``
|
||||
|
||||
- ``AssignStmt`` – An assignment statement, ``x = y``
|
||||
- ``ClassDef`` – A class definition statement
|
||||
- ``FunctionDef`` – A function definition statement
|
||||
|
||||
- ``AugAssign`` – An augmented assignment, ``x += y``
|
||||
- ``Break`` – A ``break`` statement
|
||||
- ``Continue`` – A ``continue`` statement
|
||||
- ``Delete`` – A ``del`` statement
|
||||
- ``ExceptStmt`` – The ``except`` part of a ``try`` statement
|
||||
- ``Exec`` – An ``exec`` statement
|
||||
- ``For`` – A ``for`` statement
|
||||
- ``Global`` – A ``global`` statement
|
||||
- ``If`` – An ``if`` statement
|
||||
- ``ImportStar`` – A ``from xxx import *`` statement
|
||||
- ``Import`` – Any other ``import`` statement
|
||||
- ``Nonlocal`` – A ``nonlocal`` statement
|
||||
- ``Pass`` – A ``pass`` statement
|
||||
- ``Print`` – A ``print`` statement (Python 2 only)
|
||||
- ``Raise`` – A ``raise`` statement
|
||||
- ``Return`` – A ``return`` statement
|
||||
- ``Try`` – A ``try`` statement
|
||||
- ``While`` – A ``while`` statement
|
||||
- ``With`` – A ``with`` statement
|
||||
|
||||
Example: Finding redundant 'global' statements
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``global`` statement in Python declares a variable with a global (module-level) scope, when it would otherwise be local. Using the ``global`` statement outside a class or function is redundant as the variable is already global.
|
||||
|
||||
**Finding redundant global statements**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Global g
|
||||
where g.getScope() instanceof Module
|
||||
select g
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/686330052/>`__. None of the demo projects on LGTM.com has a global statement that matches this pattern.
|
||||
|
||||
The line: ``g.getScope() instanceof Module`` ensures that the ``Scope`` of ``Global g`` is a ``Module``, rather than a class or function.
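The difference between a redundant and a necessary ``global`` statement can be sketched in plain Python. The names ``counter`` and ``bump`` are hypothetical.

```python
global counter  # redundant: this statement is already at module level
counter = 0


def bump():
    global counter  # required: without it, `counter += 1` would target a local variable
    counter += 1
    return counter


first = bump()
second = bump()
```

Only the first ``global`` statement has a ``Module`` as its scope, so it is the only one the query above is meant to report.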
|
||||
|
||||
Example: Finding 'if' statements with redundant branches
|
||||
--------------------------------------------------------
|
||||
|
||||
An ``if`` statement where one branch is composed of just ``pass`` statements could be simplified by negating the condition and dropping the ``else`` clause.
|
||||
|
||||
**An 'if' statement that could be simplified**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    if cond():
        pass
    else:
        do_something()
|
||||
|
||||
To find statements like this we can run the following query:
|
||||
|
||||
**Find ``if`` statements with empty branches**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from If i, StmtList l
|
||||
where (l = i.getBody() or l = i.getOrelse())
|
||||
and forall(Stmt p | p = l.getAnItem() | p instanceof Pass)
|
||||
select i
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/672230053/>`__. Many projects have some ``if`` statements that match this pattern.
|
||||
|
||||
The line: ``(l = i.getBody() or l = i.getOrelse())`` restricts the ``StmtList l`` to branches of the ``if`` statement.
|
||||
|
||||
The line: ``forall(Stmt p | p = l.getAnItem() | p instanceof Pass)`` ensures that all statements in ``l`` are ``pass`` statements.
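The suggested simplification can be checked with a small plain Python sketch. Both function names are hypothetical; the two functions behave identically for every input.

```python
def with_empty_branch(ready, log):
    # One branch consists only of `pass`, so it does nothing.
    if ready:
        pass
    else:
        log.append("waiting")
    return log


def simplified(ready, log):
    # Negating the condition removes the need for the empty branch.
    if not ready:
        log.append("waiting")
    return log
```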
|
||||
|
||||
Expressions
|
||||
-----------
|
||||
|
||||
Each kind of Python expression has its own class. Here is the full class hierarchy:
|
||||
|
||||
- ``Expr`` – An expression
|
||||
|
||||
- ``Attribute`` – An attribute, ``obj.attr``
|
||||
- ``BinaryExpr`` – A binary operation, ``x+y``
|
||||
- ``BoolExpr`` – Short circuit logical operations, ``x and y``, ``x or y``
|
||||
- ``Bytes`` – A bytes literal, ``b"x"`` or (in Python 2) ``"x"``
|
||||
- ``Call`` – A function call, ``f(arg)``
|
||||
- ``ClassExpr`` – An artificial expression representing the right hand side of a ``ClassDef`` assignment
|
||||
- ``Compare`` – A comparison operation, ``0 < x < 10``
|
||||
- ``Dict`` – A dictionary literal, ``{'a': 2}``
|
||||
- ``DictComp`` – A dictionary comprehension, ``{k: v for ...}``
|
||||
- ``Ellipsis`` – An ellipsis expression, ``...``
|
||||
- ``FunctionExpr`` – An artificial expression representing the right hand side of a ``FunctionDef`` assignment
|
||||
- ``GeneratorExp`` – A generator expression
|
||||
- ``IfExp`` – A conditional expression, ``x if cond else y``
|
||||
- ``ImportExpr`` – An artificial expression representing the module imported
|
||||
- ``ImportMember`` – An artificial expression representing importing a value from a module (part of a ``from xxx import *`` statement)
|
||||
- ``Lambda`` – A ``lambda`` expression
|
||||
- ``List`` – A list literal, ``['a', 'b']``
|
||||
- ``ListComp`` – A list comprehension, ``[x for ...]``
|
||||
- ``Name`` – A reference to a variable, ``var``
|
||||
- ``Num`` – A numeric literal, ``3`` or ``4.2``
|
||||
|
||||
- ``FloatLiteral``
|
||||
- ``ImaginaryLiteral``
|
||||
- ``IntegerLiteral``
|
||||
|
||||
- ``Repr`` – A backticks expression, ``x`` (Python 2 only)
|
||||
- ``Set`` – A set literal, ``{'a', 'b'}``
|
||||
- ``SetComp`` – A set comprehension, ``{x for ...}``
|
||||
- ``Slice`` – A slice; the ``0:1`` in the expression ``seq[0:1]``
|
||||
- ``Starred`` – A starred expression, ``*x`` in the context of a multiple assignment: ``y, *x = 1,2,3`` (Python 3 only)
|
||||
- ``StrConst`` – A string literal. In Python 2 either bytes or unicode. In Python 3 only unicode.
|
||||
- ``Subscript`` – A subscript operation, ``seq[index]``
|
||||
- ``UnaryExpr`` – A unary operation, ``-x``
|
||||
- ``Unicode`` – A unicode literal, ``u"x"`` or (in Python 3) ``"x"``
|
||||
- ``Yield`` – A ``yield`` expression
|
||||
- ``YieldFrom`` – A ``yield from`` expression (Python 3.3+)
|
||||
|
||||
Example: Finding comparisons to integer or string literals using 'is'
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Python implementations commonly cache small integers and single character strings, so comparisons such as the following often work as intended. This behavior is not guaranteed, however, so we might want to check for such comparisons.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
x is 10
|
||||
x is "A"
|
||||
|
||||
We can check for these as follows:
|
||||
|
||||
**Find comparisons to integer or string literals using ``is``**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Compare cmp, Expr literal
|
||||
where (literal instanceof StrConst or literal instanceof Num)
|
||||
and cmp.getOp(0) instanceof Is and cmp.getComparator(0) = literal
|
||||
select cmp
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/688180010/>`__. Two of the demo projects on LGTM.com use this pattern: *saltstack/salt* and *openstack/nova*.
|
||||
|
||||
The clause ``cmp.getOp(0) instanceof Is and cmp.getComparator(0) = literal`` checks that the first comparison operator is "is" and that the first comparator is a literal.
|
||||
|
||||
Tip
|
||||
|
||||
We have to use ``cmp.getOp(0)`` and ``cmp.getComparator(0)`` as there is no ``cmp.getOp()`` or ``cmp.getComparator()``. The reason for this is that a ``Compare`` expression can have multiple operators. For example, the expression ``3 < x < 7`` has two operators and two comparators; the leftmost operand, ``3``, is not counted as a comparator. You use ``cmp.getComparator(0)`` to get the first comparator (in this example the ``x``) and ``cmp.getComparator(1)`` to get the second comparator (in this example the ``7``).
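Python's own ``ast`` module uses a similar layout for comparisons, which can help when reasoning about indices. The sketch below inspects Python's standard library AST only; the QL library is a separate implementation, so treat the correspondence as an analogy rather than a specification.

```python
import ast

# Parse two comparison expressions without executing them.
identity = ast.parse("x is 10", mode="eval").body
chained = ast.parse("3 < x < 7", mode="eval").body

# `x is 10`: one operator (Is) and one comparator (the literal 10).
first_op_is_identity = isinstance(identity.ops[0], ast.Is)
literal_value = ast.literal_eval(identity.comparators[0])

# `3 < x < 7`: two operators and two comparators; the leftmost operand is stored separately.
operator_count = len(chained.ops)
comparator_count = len(chained.comparators)
leftmost_value = ast.literal_eval(chained.left)
first_comparator_name = chained.comparators[0].id
```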
|
||||
|
||||
Example: Duplicates in dictionary literals
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If there are duplicate keys in a Python dictionary, then the second key will overwrite the first, which is almost certainly a mistake. We can find these duplicates in QL, but the query is more complex than previous examples and will require us to write a ``predicate`` as a helper.
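As a quick illustration of the problem in plain Python (the dictionary name and keys are made up for this sketch), the literal below ends up with two entries rather than three, and no warning is raised:

```python
settings = {
    "timeout": 10,
    "retries": 3,
    "timeout": 30,  # duplicate key: silently replaces the value 10
}

entry_count = len(settings)
```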
|
||||
|
||||
Here is the query:
|
||||
|
||||
**Find duplicate dictionary keys**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
predicate same_key(Expr k1, Expr k2) {
|
||||
k1.(Num).getN() = k2.(Num).getN()
|
||||
or
|
||||
k1.(StrConst).getText() = k2.(StrConst).getText()
|
||||
}
|
||||
|
||||
from Dict d, Expr k1, Expr k2
|
||||
where k1 = d.getAKey() and k2 = d.getAKey()
|
||||
and k1 != k2 and same_key(k1, k2)
|
||||
select k1, "Duplicate key in dict literal"
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/663330305/>`__. When we ran this query on LGTM.com, the source code of the *saltstack/salt* project contained an example of duplicate dictionary keys. The results were also highlighted as alerts by the standard `Duplicate key in dict literal <https://lgtm.com/rules/3980087>`__ query. Two of the other demo projects on LGTM.com refer to duplicate dictionary keys in library files.
|
||||
|
||||
The supporting predicate ``same_key`` checks that two keys have the same value: either the same number or the same string text. Separating this part of the logic into a supporting predicate, instead of directly including it in the query, makes it easier to understand the query as a whole. The casts defined in the predicate restrict the expression to the type specified and allow predicates to be called on the type that is cast to. For example:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
x = k1.(Num).getN()
|
||||
|
||||
is equivalent to
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
exists(Num num | num = k1 | x = num.getN())
|
||||
|
||||
The short version is usually used as this is easier to read.
|
||||
|
||||
Example: Finding Java-style getters
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Returning to the example from :doc:`Tutorial: Functions <functions>`, the query identified all methods with a single line of code and a name starting with ``get``:
|
||||
|
||||
**Basic: Find Java-style getters**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Function f
|
||||
where f.getName().matches("get%") and f.isMethod()
|
||||
and count(f.getAStmt()) = 1
|
||||
select f, "This function is (probably) a getter."
|
||||
|
||||
This basic query can be improved by checking that the one line of code is of the form ``return self.attr``.
|
||||
|
||||
**Improved: Find Java-style getters**
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
import python
|
||||
|
||||
from Function f, Return ret, Attribute attr, Name self
|
||||
where f.getName().matches("get%") and f.isMethod()
|
||||
and ret = f.getStmt(0) and ret.getValue() = attr
|
||||
and attr.getObject() = self and self.getId() = "self"
|
||||
select f, "This function is a Java-style getter."
|
||||
|
||||
➤ `See this in the query console <https://lgtm.com/query/669220054/>`__. Of the demo projects on LGTM.com, only the *openstack/nova* project has examples of functions that appear to be Java-style getters.
|
||||
|
||||
In this query, the condition:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
ret = f.getStmt(0) and ret.getValue() = attr
|
||||
|
||||
checks that the first line in the method is a return statement and that the expression returned (``ret.getValue()``) is an ``Attribute`` expression. Note that the equality ``ret.getValue() = attr`` means that ``ret.getValue()`` is restricted to ``Attribute``\ s, since ``attr`` is an ``Attribute``.
|
||||
|
||||
The condition:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
attr.getObject() = self and self.getId() = "self"
|
||||
|
||||
checks that the value of the attribute (the expression to the left of the dot in ``value.attr``) is an access to a variable called ``"self"``.
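To see what separates the basic and improved queries, consider this plain Python sketch. The class ``Account`` and its methods are hypothetical. Both methods have a name starting with ``get`` and a single statement, but only the first has the shape ``return self.attr``.

```python
class Account:
    """Hypothetical class used only to illustrate the pattern."""

    def __init__(self, owner):
        self._owner = owner

    def get_owner(self):
        # The only statement returns an attribute of `self`: a Java-style getter.
        return self._owner

    def get_summary(self):
        # The returned value is a concatenation, not a plain attribute of `self`.
        return "Account of " + self._owner


account = Account("ann")
```

Under the conditions described above, the basic query would be expected to match both methods, while the improved query should match only ``get_owner``.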
|
||||
|
||||
Class and function definitions
|
||||
------------------------------
|
||||
|
||||
As Python is a dynamically typed language, class and function definitions are executable statements. This means that a class statement is both a statement and a scope containing statements. To represent this cleanly the class definition is broken into a number of parts. At runtime, when a class definition is executed, a class object is created and then assigned to a variable of the same name in the scope enclosing the class. This class is created from a code object representing the source code for the body of the class. To represent this, the ``ClassDef`` class (which represents a ``class`` statement) subclasses ``Assign``. The right hand side of the ``ClassDef`` is a ``ClassExpr`` representing the creation of the class. The ``Class`` class, which represents the body of the class, can be accessed via ``ClassExpr.getInnerScope()``.
|
||||
|
||||
``FunctionDef``, ``FunctionExpr`` and ``Function`` are handled similarly.
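The fact that definitions are executable statements can be observed directly in plain Python. In this sketch (all names are hypothetical) the ``class`` statement that runs is chosen at runtime, and each execution creates a new class object that is bound to the name ``Shape``.

```python
def build(shape):
    # Which `class` statement runs is decided at runtime.
    if shape == "circle":
        class Shape:
            sides = 0
    else:
        class Shape:
            sides = 4

    # A `def` statement likewise creates a function object and binds it to a name.
    def describe():
        return "{} sides".format(Shape.sides)

    return Shape, describe


circle_class, describe_circle = build("circle")
square_class, describe_square = build("square")
```

The two returned classes share the name ``Shape`` but are distinct objects, which is why the library models the statement, the creating expression, and the scope separately.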
|
||||
|
||||
Here is the relevant part of the class hierarchy:
|
||||
|
||||
- ``Stmt``
|
||||
|
||||
- ``Assign``
|
||||
|
||||
- ``ClassDef``
|
||||
- ``FunctionDef``
|
||||
|
||||
- ``Expr``
|
||||
|
||||
- ``ClassExpr``
|
||||
|
||||
- ``FunctionExpr``
|
||||
|
||||
- ``Scope``
|
||||
|
||||
- ``Class``
|
||||
- ``Function``
|
||||
|
||||
What next?
|
||||
----------
|
||||
|
||||
- Experiment with the worked examples in the QL for Python tutorial topics: :doc:`Control flow <control-flow>`, :doc:`Points-to analysis and type inference <pointsto-type-infer>`.
|
||||
- Find out more about QL in the `QL language handbook <https://help.semmle.com/QL/ql-handbook/index.html>`__ and `QL language specification <https://help.semmle.com/QL/QLLanguageSpecification.html>`__.
|
||||
258
docs/language/learn-ql/python/taint-tracking.rst
Normal file
@@ -0,0 +1,258 @@
|
||||
Taint tracking and data flow analysis in Python
|
||||
===============================================
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Taint tracking is used to analyze how potentially insecure, or 'tainted', data flows throughout a program at runtime.
|
||||
You can use taint tracking to find out whether user-controlled input can be used in a malicious way,
|
||||
whether dangerous arguments are passed to vulnerable functions, and whether confidential or sensitive data can leak.
|
||||
You can also use it to track invalid, insecure, or untrusted data in other analyses.
|
||||
|
||||
Taint tracking differs from basic data flow in that it considers non-value-preserving steps in addition to 'normal' data flow steps.
|
||||
For example, in the assignment ``dir = path + "/"``, if ``path`` is tainted then ``dir`` is also tainted,
|
||||
even though there is no data flow from ``path`` to ``path + "/"``.
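The distinction can be modeled with a toy in plain Python. The ``Tainted`` class below is a hypothetical marker type invented for this sketch; it is not part of any QL library and only mimics, at runtime, what the analysis reasons about statically. The variable is called ``directory`` rather than ``dir`` to avoid shadowing the builtin.

```python
class Tainted(str):
    """Toy marker type for this sketch: a string that is known to be tainted."""

    def __add__(self, other):
        # Concatenation builds a new string, so the taint is propagated explicitly.
        return Tainted(str.__add__(self, other))


path = Tainted("/tmp/upload")
alias = path            # value-preserving step: the very same object flows on
directory = path + "/"  # non-value-preserving step: a new value derived from tainted data
```

``alias`` is reached by ordinary data flow, whereas ``directory`` is a different value that is nevertheless tainted.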
|
||||
|
||||
Separate QL libraries have been written to handle 'normal' data flow and taint tracking in :doc:`C/C++ <../cpp/dataflow>`, :doc:`C# <../csharp/dataflow>`, :doc:`Java <../java/dataflow>`, and :doc:`JavaScript <../javascript/dataflow>`. You can access the appropriate classes and predicates that reason about these different modes of data flow by importing the appropriate QL library in your query.
|
||||
In Python analysis, we can use the same taint tracking library to model both 'normal' data flow and taint flow, but we are still able to make the distinction between steps that preserve value and those that don't by defining additional data flow properties.
|
||||
|
||||
For further information on data flow and taint tracking in QL, see :doc:`Introduction to data flow <../intro-to-data-flow>`.
|
||||
|
||||
Fundamentals of taint tracking and data flow analysis
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The taint tracking library is in the `TaintTracking <https://help.semmle.com/qldoc/python/semmle/python/security/TaintTracking.qll/module.TaintTracking.html>`__ module.
|
||||
Any taint tracking or data flow analysis query has three explicit components, one of which is optional, and an implicit component.
|
||||
The explicit components are:
|
||||
|
||||
1. One or more ``sources`` of potentially insecure or unsafe data, represented by the `TaintTracking::Source <https://help.semmle.com/qldoc/python/semmle/python/security/TaintTracking.qll/type.TaintTracking$TaintSource.html>`__ class.
|
||||
2. One or more ``sinks``, to where the data or taint may flow, represented by the `TaintTracking::Sink <https://help.semmle.com/qldoc/python/semmle/python/security/TaintTracking.qll/type.TaintTracking$TaintSink.html>`__ class.
|
||||
3. Zero or more ``sanitizers``, represented by the `Sanitizer <https://help.semmle.com/qldoc/python/semmle/python/security/TaintTracking.qll/type.TaintTracking$Sanitizer.html>`__ class.
|
||||
|
||||
A taint tracking or data flow query gives results when data flows from a source to a sink without being blocked by a sanitizer.
|
||||
|
||||
These three components are bound together using a `TaintTracking::Configuration <https://help.semmle.com/qldoc/python/semmle/python/security/TaintTracking.qll/type.TaintTracking$TaintTracking$Configuration.html>`__.
|
||||
The purpose of the configuration is to specify exactly which sources and sinks are relevant to the specific query.
|
||||
|
||||
The final, implicit component is the "kind" of taint, represented by the `TaintKind <https://help.semmle.com/qldoc/python/semmle/python/security/TaintTracking.qll/type.TaintTracking$TaintKind.html>`__ class.
|
||||
The kind of taint determines which non-value-preserving steps are possible, in addition to value-preserving steps that are built into the analysis.
|
||||
In the above example ``dir = path + "/"``, taint flows from ``path`` to ``dir`` if the taint represents a string, but not if the taint is ``None``.
|
||||
|
||||
Limitations
|
||||
~~~~~~~~~~~
|
||||
|
||||
Although taint tracking is a powerful technique, it is worth noting that it depends on the underlying data flow graphs.
|
||||
Creating a data flow graph that is both accurate and covers a large enough part of a program is a challenge,
|
||||
especially for a dynamic language like Python. The call graph might be incomplete, the reachability of code is an approximation,
|
||||
and certain constructs, like ``eval``, are just too dynamic to analyze.
|
||||
|
||||
|
||||
Using taint-tracking for Python
|
||||
-------------------------------
|
||||
|
||||
A simple taint tracking query has the basic form:
|
||||
|
||||
.. code-block:: ql
|
||||
|
||||
/**
|
||||
* @name ...
|
||||
* @description ...
|
||||
* @kind problem
|
||||
*/
|
||||
|
||||
import semmle.python.security.TaintTracking
|
||||
|
||||
class MyConfiguration extends TaintTracking::Configuration {
|
||||
|
||||
MyConfiguration() { this = "My example configuration" }
|
||||
|
||||
override predicate isSource(TaintTracking::Source src) { ... }
|
||||
|
||||
override predicate isSink(TaintTracking::Sink sink) { ... }
|
||||
|
||||
/* optionally */
|
||||
override predicate isExtension(Extension extension) { ... }
|
||||
|
||||
}
|
||||
|
||||
from MyConfiguration config, TaintTracking::Source src, TaintTracking::Sink sink
|
||||
where config.hasFlow(src, sink)
|
||||
select sink, "Alert message, including reference to $@.", src, "string describing the source"
|
||||
|
||||
As a contrived example, here is a query that looks for flow from an HTTP request to a function called ``"unsafe"``.
The sources are predefined and accessed by importing the library ``semmle.python.web.HttpRequest``.
|
||||
The sink is defined by using a custom ``TaintTracking::Sink`` class.
|
||||
|
||||
.. code-block:: ql

    /* Import the string taint kind needed by our custom sink */
    import semmle.python.security.strings.Untrusted

    /* Sources */
    import semmle.python.web.HttpRequest

    /* Sink */
    /** A class representing any argument in a call to a function called "unsafe" */
    class UnsafeSink extends TaintTracking::Sink {

        UnsafeSink() {
            exists(FunctionObject unsafe |
                unsafe.getName() = "unsafe" and
                unsafe.getACall().(CallNode).getAnArg() = this
            )
        }

        override predicate sinks(TaintKind kind) {
            kind instanceof StringKind
        }

    }

    class HttpToUnsafeConfiguration extends TaintTracking::Configuration {

        HttpToUnsafeConfiguration() {
            this = "Example config finding flow from http request to 'unsafe' function"
        }

        override predicate isSource(TaintTracking::Source src) { src instanceof HttpRequestTaintSource }

        override predicate isSink(TaintTracking::Sink sink) { sink instanceof UnsafeSink }

    }

    from HttpToUnsafeConfiguration config, TaintTracking::Source src, TaintTracking::Sink sink
    where config.hasFlow(src, sink)
    select sink, "This argument to 'unsafe' depends on $@.", src, "a user-provided value"

Implementing path queries
~~~~~~~~~~~~~~~~~~~~~~~~~

Although the taint tracking query above tells us which sources flow to which sinks, it doesn't tell us how.
For that we need a path query.

A standard taint tracking query can be converted to a path query by changing ``@kind problem`` to ``@kind path-problem``,
adding an import, and changing the format of the query clauses.
The import is simply:

.. code-block:: ql

    import semmle.python.security.Paths

And the format of the query becomes:

.. code-block:: ql

    from Configuration config, TaintedPathSource src, TaintedPathSink sink
    where config.hasFlowPath(src, sink)
    select sink.getSink(), src, sink, "Alert message, including reference to $@.", src.getSource(), "string describing the source"

Thus, our example query becomes:

.. code-block:: ql

    /**
     * ...
     * @kind path-problem
     * ...
     */

    /* This computes the paths */
    import semmle.python.security.Paths

    /* Expose the string taint kinds needed by our custom sink */
    import semmle.python.security.strings.Untrusted

    /* Sources */
    import semmle.python.web.HttpRequest

    /* Sink */
    /** A class representing any argument in a call to a function called "unsafe" */
    class UnsafeSink extends TaintTracking::Sink {

        UnsafeSink() {
            exists(FunctionObject unsafe |
                unsafe.getName() = "unsafe" and
                unsafe.getACall().(CallNode).getAnArg() = this
            )
        }

        override predicate sinks(TaintKind kind) {
            kind instanceof StringKind
        }

    }

    class HttpToUnsafeConfiguration extends TaintTracking::Configuration {

        HttpToUnsafeConfiguration() {
            this = "Example config finding flow from http request to 'unsafe' function"
        }

        override predicate isSource(TaintTracking::Source src) { src instanceof HttpRequestTaintSource }

        override predicate isSink(TaintTracking::Sink sink) { sink instanceof UnsafeSink }

    }

    from HttpToUnsafeConfiguration config, TaintedPathSource src, TaintedPathSink sink
    where config.hasFlowPath(src, sink)
    select sink.getSink(), src, sink, "This argument to 'unsafe' depends on $@.", src.getSource(), "a user-provided value"

Custom taint kinds and flows
----------------------------

In the above examples, we have assumed the existence of a suitable ``TaintKind``,
but sometimes it is necessary to model the flow of other objects, such as database connections, or ``None``.

The ``TaintTracking::Source`` and ``TaintTracking::Sink`` classes have predicates that determine which kind of taint the source and sink model, respectively.

.. code-block:: ql

    abstract class Source {
        abstract predicate isSourceOf(TaintKind kind);
        ...
    }

    abstract class Sink {
        abstract predicate sinks(TaintKind taint);
        ...
    }

The ``TaintKind`` itself is just a string (a QL string, not a QL entity representing a Python string),
which provides methods to extend flow and allow the kind of taint to change along the path.
The ``TaintKind`` class has many predicates allowing flow to be modified.
The simplest ``TaintKind`` does not override any predicates, meaning that it only flows as opaque data.
An example of this is the `Hard-coded credentials query <https://lgtm.com/query/rule:1506421276400/lang:python/>`_,
which defines the simplest possible taint kind class, ``HardcodedValue``, and custom source and sink classes.

.. code-block:: ql

    class HardcodedValue extends TaintKind {
        HardcodedValue() {
            this = "hard coded value"
        }
    }

    class HardcodedValueSource extends TaintTracking::Source {
        ...
        override predicate isSourceOf(TaintKind kind) {
            kind instanceof HardcodedValue
        }
    }

    class CredentialSink extends TaintTracking::Sink {
        ...
        override predicate sinks(TaintKind kind) {
            kind instanceof HardcodedValue
        }
    }

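By contrast, a taint kind that lets the kind change along the path could look roughly like the sketch below.
It is only an illustration, not code taken from the standard library: the class names ``ExampleDbConnection`` and ``ExampleDbCursor`` and their string values are hypothetical,
and it assumes that ``TaintKind`` provides an overridable ``getTaintOfMethodResult(string name)`` predicate, which you should check against the version of the library you are using.

.. code-block:: ql

    import python
    import semmle.python.security.TaintTracking

    /** Hypothetical kind for a cursor obtained from a tracked connection. */
    class ExampleDbCursor extends TaintKind {
        ExampleDbCursor() {
            this = "example db cursor"
        }
    }

    /** Hypothetical kind for a database connection object. */
    class ExampleDbConnection extends TaintKind {
        ExampleDbConnection() {
            this = "example db connection"
        }

        /* Assumption: calling `cursor()` on a tracked connection yields a tracked cursor. */
        override TaintKind getTaintOfMethodResult(string name) {
            name = "cursor" and
            result instanceof ExampleDbCursor
        }
    }

With kinds like these, a source would report ``ExampleDbConnection`` from ``isSourceOf`` and a sink would accept ``ExampleDbCursor`` in ``sinks``, in the same way that ``HardcodedValueSource`` and ``CredentialSink`` both use ``HardcodedValue``.
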
What next?
----------

- Experiment with the worked examples in the QL for Python tutorial topics: :doc:`Control flow <control-flow>` and :doc:`Points-to analysis and type inference <pointsto-type-infer>`.
- Find out more about QL in the `QL language handbook <https://help.semmle.com/QL/ql-handbook/index.html>`__ and `QL language specification <https://help.semmle.com/QL/QLLanguageSpecification.html>`__.