Merge pull request #5357 from yoff/python-rework-documentation

Python: rework documentation
This commit is contained in:
Rasmus Wriedt Larsen
2021-03-17 14:25:23 +01:00
committed by GitHub
10 changed files with 422 additions and 637 deletions

View File

@@ -3,7 +3,7 @@
About data flow analysis
########################
Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.
Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.
Overview
********
@@ -20,13 +20,13 @@ See the following tutorials for more information about analyzing data flow in sp
- ":ref:`Analyzing data flow in C# <analyzing-data-flow-in-csharp>`"
- ":ref:`Analyzing data flow in Java <analyzing-data-flow-in-java>`"
- ":ref:`Analyzing data flow in JavaScript/TypeScript <analyzing-data-flow-in-javascript-and-typescript>`"
- ":ref:`Analyzing data flow and tracking tainted data in Python <analyzing-data-flow-and-tracking-tainted-data-in-python>`"
- ":ref:`Analyzing data flow in Python <analyzing-data-flow-in-python>`"
.. pull-quote::
Note
Data flow analysis is used extensively in path queries. To learn more about path queries, see ":doc:`Creating path queries <creating-path-queries>`."
Data flow analysis is used extensively in path queries. To learn more about path queries, see ":doc:`Creating path queries <creating-path-queries>`."
.. _data-flow-graph:
@@ -78,11 +78,10 @@ The normal data flow libraries are used to analyze the information flow in which
For example, if you are tracking an insecure object ``x`` (which might be some untrusted or potentially malicious data), a step in the program may 'change' its value. So, in a simple process such as ``y = x + 1``, a normal data flow analysis will highlight the use of ``x``, but not ``y``.
However, since ``y`` is derived from ``x``, it is influenced by the untrusted or 'tainted' information, and therefore it is also tainted. Analyzing the flow of the taint from ``x`` to ``y`` is known as taint tracking.
In QL, taint tracking extends data flow analysis by including steps in which the data values are not necessarily preserved, but the potentially insecure object is still propagated.
In QL, taint tracking extends data flow analysis by including steps in which the data values are not necessarily preserved, but the potentially insecure object is still propagated.
These flow steps are modeled in the taint-tracking library using predicates that hold if taint is propagated between nodes.
Further reading
***************
- ":ref:`Exploring data flow with path queries <exploring-data-flow-with-path-queries>`"

View File

@@ -8,11 +8,11 @@ You can create path queries to visualize the flow of information through a codeb
Overview
========
Security researchers are particularly interested in the way that information flows in a program. Many vulnerabilities are caused by seemingly benign data flowing to unexpected locations, and being used in a malicious way.
Security researchers are particularly interested in the way that information flows in a program. Many vulnerabilities are caused by seemingly benign data flowing to unexpected locations, and being used in a malicious way.
Path queries written with CodeQL are particularly useful for analyzing data flow as they can be used to track the path taken by a variable from its possible starting points (``source``) to its possible end points (``sink``).
To model paths, your query must provide information about the ``source`` and the ``sink``, as well as the data flow steps that link them.
This topic provides information on how to structure a path query file so you can explore the paths associated with the results of data flow analysis.
This topic provides information on how to structure a path query file so you can explore the paths associated with the results of data flow analysis.
.. pull-quote::
@@ -26,30 +26,30 @@ For more language-specific information on analyzing data flow, see:
- ":ref:`Analyzing data flow in C/C++ <analyzing-data-flow-in-cpp>`"
- ":ref:`Analyzing data flow in C# <analyzing-data-flow-in-csharp>`"
- ":ref:`Analyzing data flow in Java <analyzing-data-flow-in-java>`"
- ":ref:`Analyzing data flow in Java <analyzing-data-flow-in-java>`"
- ":ref:`Analyzing data flow in JavaScript/TypeScript <analyzing-data-flow-in-javascript-and-typescript>`"
- ":ref:`Analyzing data flow and tracking tainted data in Python <analyzing-data-flow-and-tracking-tainted-data-in-python>`"
- ":ref:`Analyzing data flow in Python <analyzing-data-flow-in-python>`"
Path query examples
*******************
The easiest way to get started writing your own path query is to modify one of the existing queries. For more information, see the `CodeQL query help <https://codeql.github.com/codeql-query-help>`__.
The Security Lab researchers have used path queries to find security vulnerabilities in various open source projects. To see articles describing how these queries were written, as well as other posts describing other aspects of security research such as exploiting vulnerabilities, see the `GitHub Security Lab website <https://securitylab.github.com/research>`__.
Constructing a path query
=========================
Path queries require certain metadata, query predicates, and ``select`` statement structures.
Path queries require certain metadata, query predicates, and ``select`` statement structures.
Many of the built-in path queries included in CodeQL follow a simple structure, which depends on how the language you are analyzing is modeled with CodeQL.
For C/C++, C#, Java, and JavaScript you should use the following template:
You should use the following template:
.. code-block:: ql
/**
* ...
* ...
* @kind path-problem
* ...
*/
@@ -66,33 +66,10 @@ Where:
- ``DataFlow::Pathgraph`` is the path graph module you need to import from the standard CodeQL libraries.
- ``source`` and ``sink`` are nodes on the `path graph <https://en.wikipedia.org/wiki/Path_graph>`__, and ``DataFlow::PathNode`` is their type.
- ``Configuration`` is a class containing the predicates which define how data may flow between the ``source`` and the ``sink``.
For Python you should use a slightly different template:
.. code-block:: ql
/**
* ...
* @kind path-problem
* ...
*/
import python
import semmle.python.security.Paths
...
from TaintedPathSource source, TaintedPathSink sink
where source.flowsTo(sink)
select sink.getNode(), source, sink, "<message>"
Where:
- ``semmle.python.security.Paths`` is the path graph module imported from the standard CodeQL libraries.
- ``source`` and ``sink`` are nodes on the path graph, ``TaintedPathSource source`` and ``TaintedPathSink`` are their respective types. Note, you do not need to declare a configuration class to define the data flow from the ``source`` to the ``sink`` in a Python path query.
- ``Configuration`` is a class containing the predicates which define how data may flow between the ``source`` and the ``sink``.
The following sections describe the main requirements for a valid path query.
The following sections describe the main requirements for a valid path query.
Path query metadata
*******************
@@ -105,22 +82,14 @@ Generating path explanations
In order to generate path explanations, your query needs to compute a `path graph <https://en.wikipedia.org/wiki/Path_graph>`__.
To do this you need to define a :ref:`query predicate <query-predicates>` called ``edges`` in your query.
This predicate defines the edge relations of the graph you are computing, and it is used to compute the paths related to each result that your query generates.
You can import a predefined ``edges`` predicate from a path graph module in one of the standard data flow libraries. In addition to the path graph module, the data flow libraries contain the other ``classes``, ``predicates``, and ``modules`` that are commonly used in data flow analysis. The import statement to use depends on the language that you are analyzing.
For C/C++, C#, Java, and JavaScript you would use:
This predicate defines the edge relations of the graph you are computing, and it is used to compute the paths related to each result that your query generates.
You can import a predefined ``edges`` predicate from a path graph module in one of the standard data flow libraries. In addition to the path graph module, the data flow libraries contain the other ``classes``, ``predicates``, and ``modules`` that are commonly used in data flow analysis.
.. code-block:: ql
import DataFlow::PathGraph
This statement imports the ``PathGraph`` module from the data flow library (``DataFlow.qll``), in which ``edges`` is defined.
For Python, the ``Paths`` module contains the ``edges`` predicate:
.. code-block:: ql
import semmle.python.security.Paths
This statement imports the ``PathGraph`` module from the data flow library (``DataFlow.qll``), in which ``edges`` is defined.
You can also import libraries specifically designed to implement data flow analysis in various common frameworks and environments, and many additional libraries are included with CodeQL. To see examples of the different libraries used in data flow analysis, see the links to the built-in queries above or browse the `standard libraries <https://codeql.github.com/codeql-standard-libraries>`__.
@@ -150,44 +119,31 @@ For example:
.. code-block:: ql
from Configuration config, DataFlow::PathNode source, DataFlow::PathNode sink
from Configuration config, DataFlow::PathNode source, DataFlow::PathNode sink
The configuration class is accessed by importing the data flow library. This class contains the predicates which define how data flow is treated in the query:
- ``isSource()`` defines where data may flow from.
- ``isSink()`` defines where data may flow to.
For more information on using the configuration class in your analysis see the sections on global data flow in ":ref:`Analyzing data flow in C/C++ <analyzing-data-flow-in-cpp>`" and ":ref:`Analyzing data flow in C# <analyzing-data-flow-in-csharp>`."
For more information on using the configuration class in your analysis see the sections on global data flow in ":ref:`Analyzing data flow in C/C++ <analyzing-data-flow-in-cpp>`," ":ref:`Analyzing data flow in C# <analyzing-data-flow-in-csharp>`," and ":ref:`Analyzing data flow in Python <analyzing-data-flow-in-python>`."
You can also create a configuration for different frameworks and environments by extending the ``Configuration`` class. For more information, see ":ref:`Types <defining-a-class>`" in the QL language reference.
If you are querying Python code (and you have used ``import semmle.python.security.Paths`` in your query) you should declare ``TaintedPathSource source, TaintedPathSink sink`` in your ``from`` statement. You do not need to declare a ``Configuration`` class as the definitions of the ``TaintedPathSource`` and ``TaintedPathSink`` contain all of the type information that is required:
.. code-block:: ql
from TaintedPathSource source, TaintedPathSink sink
You can extend your query by adding different sources and sinks by either defining them in the query, or by importing predefined sources and sinks for specific frameworks and libraries. For more information, see the `CodeQL query help for Python <https://codeql.github.com/codeql-query-help/python>`__.
Defining flow conditions
************************
The ``where`` clause defines the logical conditions to apply to the variables declared in the ``from`` clause to generate your results.
This clause can use :ref:`aggregations <aggregations>`, :ref:`predicates <predicates>`, and logical :ref:`formulas <formulas>` to limit the variables of interest to a smaller set which meet the defined conditions.
The ``where`` clause defines the logical conditions to apply to the variables declared in the ``from`` clause to generate your results.
This clause can use :ref:`aggregations <aggregations>`, :ref:`predicates <predicates>`, and logical :ref:`formulas <formulas>` to limit the variables of interest to a smaller set which meet the defined conditions.
When writing a path queries, you would typically include a predicate that holds only if data flows from the ``source`` to the ``sink``.
When writing a path queries, you would typically include a predicate that holds only if data flows from the ``source`` to the ``sink``.
For C/C++, C#, Java or JavaScript, you would use the ``hasFlowPath`` predicate to define flow from the ``source`` to the ``sink`` for a given ``Configuration``:
You can use the ``hasFlowPath`` predicate to specify flow from the ``source`` to the ``sink`` for a given ``Configuration``:
.. code-block:: ql
where config.hasFlowPath(source, sink)
For Python, you would simply use the ``flowsTo`` predicate to define flow from the ``source`` to the ``sink``:
.. code-block:: ql
where source.flowsTo(sink)
Select clause
*************
@@ -196,12 +152,12 @@ Select clauses for path queries consist of four 'columns', with the following st
select element, source, sink, string
The ``element`` and ``string`` columns represent the location of the alert and the alert message respectively, as explained in ":doc:`About CodeQL queries <about-codeql-queries>`." The second and third columns, ``source`` and ``sink``, are nodes on the path graph selected by the query.
The ``element`` and ``string`` columns represent the location of the alert and the alert message respectively, as explained in ":doc:`About CodeQL queries <about-codeql-queries>`." The second and third columns, ``source`` and ``sink``, are nodes on the path graph selected by the query.
Each result generated by your query is displayed at a single location in the same way as an alert query. Additionally, each result also has an associated path, which can be viewed in LGTM or in the :ref:`CodeQL extension for VS Code <codeql-for-visual-studio-code>`.
The ``element`` that you select in the first column depends on the purpose of the query and the type of issue that it is designed to find. This is particularly important for security issues. For example, if you believe the ``source`` value to be globally invalid or malicious it may be best to display the alert at the ``source``. In contrast, you should consider displaying the alert at the ``sink`` if you believe it is the element that requires sanitization.
The alert message defined in the final column in the ``select`` statement can be developed to give more detail about the alert or path found by the query using links and placeholders. For more information, see ":doc:`Defining the results of a query <defining-the-results-of-a-query>`."
The alert message defined in the final column in the ``select`` statement can be developed to give more detail about the alert or path found by the query using links and placeholders. For more information, see ":doc:`Defining the results of a query <defining-the-results-of-a-query>`."
Further reading
***************