Summary-based information flow analysis
=======================================

Overview
--------

This document presents an approach for running information flow analyses (such as the standard
security queries) on an application that depends on one or more npm packages. Instead of
installing the npm packages during the snapshot build and analyzing them together with application
code, we analyze each package in isolation and compute *flow summaries* that record information
about any sources, sinks and flow steps contributed by the package's API. These flow summaries
are then imported when building a snapshot of the application (usually in the form of CSV files
added as external data), and are picked up by the standard security queries, allowing them to reason
about flow into, out of and through the npm packages as though they had been included as part of the
build.

Note that flow summaries are an experimental technology, and not ready to be used in production
queries or libraries. Also note that flow summaries do not currently work with CodeQL, but require
the legacy Semmle Core toolchain.

Motivating example
------------------

Let us take the `mkdirp <https://www.npmjs.com/package/mkdirp>`_ package as an example. It exports
a function that takes as its first argument a file system path, and creates a folder with that
path, as well as any parent folders that do not exist yet. As further arguments, the function
accepts an optional configuration object and a callback to invoke once the folder has been
created.

An application might use this package as follows:

.. code-block:: js

  const mkdirp = require('mkdirp');
  // ...
  mkdirp(p, opts, function cb(err) {
    // ...
  });

If the value of ``p`` can be controlled by an untrusted user, this would allow them to create arbitrary
folders, which may not be desirable.

By analyzing the application code base together with the source code for the ``mkdirp`` package,
the default path injection analysis would be able to track taint through the call to ``mkdirp`` into its
implementation, which ultimately uses built-in Node.js file system APIs to create the folder. Since
the path injection analysis has built-in models of these APIs, it would then be able to spot and flag this
vulnerability.

However, analyzing ``mkdirp`` from scratch for every client application is wasteful. Moreover, it would
in this case be undesirable to flag the location inside ``mkdirp`` where the folder is actually created
as part of the alert: the developer of the client application did not write that code and hence will
have a hard time understanding why it is being flagged.

Both of these concerns can be addressed by treating the first argument to ``mkdirp`` as a path injection
sink in its own right: the analysis no longer needs to track flow into the implementation of ``mkdirp``,
so we would no longer need to include its source code in the analysis, and the alert would flag the call
to ``mkdirp`` in application code, not its implementation in library code.

The information that the first parameter of ``mkdirp`` is interpreted as a file system path and hence should
be considered a path injection sink is an example of a *flow summary*, or more precisely a *sink summary*.
Besides sink summaries, we also consider *source summaries* and *flow-step summaries*.

In general, a sink summary states that some API interface point (such as a function parameter) should
be considered a sink for a certain analysis, so if data from a known source reaches this point without
undergoing appropriate sanitization, it should be flagged with an alert. A sink summary may also
specify which taint kind the data needs to have in order for the sink to be problematic.

Conversely, a source summary identifies some API interface point (such as the return value of a
function) as a source of tainted data for a certain analysis, again optionally specifying a taint kind.

Finally, a flow-step summary records the fact that data that flows into the package at some point
may propagate to another point (for example, from a function parameter to its return value).
In this case, there are two relevant taint kinds: one describing the taint of the data that
enters the package, and one describing the taint of the data that emerges. In general, flow steps
(like sources and sinks) are analysis-specific, since we need to know about sanitizers.

In what follows we will first discuss how summaries are generated from a snapshot of an npm package,
and then how they are imported when analyzing client code. Finally, we will discuss the format in which
flow summaries are stored.

Note that flow summaries are considered an experimental feature at this point. Using them involves
some manual configuration, and we make no guarantee that the API will remain stable.

Generating summaries
--------------------

Flow summaries of an npm package can be generated by running special summary extraction queries
either on a snapshot of the package itself, or on a snapshot of a hand-written model of the
package. (Note that this requires a working installation of Semmle Core.)

There are three default summary extraction queries:

- Extract flow step summaries (``js/step-summary-extraction``,
  ``experimental/Summaries/ExtractFlowStepSummaries.ql``)
- Extract sink summaries (``js/sink-summary-extraction``,
  ``experimental/Summaries/ExtractSinkSummaries.ql``)
- Extract source summaries (``js/source-summary-extraction``,
  ``experimental/Summaries/ExtractSourceSummaries.ql``)

Using ``odasa runQuery``, you can run these queries individually against a snapshot of the npm
package for which you want to create flow summaries, and store the output as CSV files named
``additional-steps.csv``, ``additional-sinks.csv`` and ``additional-sources.csv``, respectively.

For example, assuming that folder ``mkdirp-snapshot`` contains a snapshot of the ``mkdirp``
project, we can extract sink summaries using the command:

.. code-block:: bash

  odasa runQuery \
    --query $SEMMLE_DIST/queries/semmlecode-javascript-queries/experimental/Summaries/ExtractSinkSummaries.ql \
    --output-file additional-sinks.csv --snapshot mkdirp-snapshot

Instead of generating summaries directly from the package source code, you can also generate
them from a hand-written model of the package. The model should contain a ``package.json`` file
giving the correct package name, and models for the relevant API entry points. The models are
plain JavaScript with special comments annotating certain expressions as sources or sinks.

For example, a model of ``mkdirp`` might look like this:

.. code-block:: js

  module.exports = function mkdirp(path) {
    path /* Semmle: sink: taint, TaintedPath */
  };

Annotation comments start with ``Semmle:``, and contain ``source`` and ``sink`` specifications.
Each such specification lists a flow label (in this case, ``taint``) and a configuration to which
the specification applies (in this case, ``TaintedPath``).

A source specification annotates an expression as being a source of flow with the given label
for the purposes of the given configuration, and similarly for sinks. Annotation comments apply to
any expression (and more generally any data flow node) whose source location ends on the line
where the comment starts.

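To make the shape of an annotation comment concrete, here is a minimal sketch that splits one into its
parts. The helper ``parseAnnotation`` is hypothetical and not part of the Semmle toolchain, and it assumes
that a comment carries exactly one specification of the form shown in the ``mkdirp`` model above.

```javascript
// Hypothetical helper for illustration only; the real extractor may accept
// a richer syntax. Assumption: one specification per annotation comment,
// written as "<source|sink>: <flow label>, <configuration>".
function parseAnnotation(line) {
  const pattern =
    /\/\*\s*Semmle:\s*(source|sink)\s*:\s*([^,*]+?)\s*,\s*([^,*]+?)\s*\*\//;
  const match = pattern.exec(line);
  if (match === null) {
    return null; // no annotation comment on this line
  }
  const [, role, label, configuration] = match;
  return { role, label, configuration };
}

// The annotated line from the mkdirp model above.
console.log(parseAnnotation('path /* Semmle: sink: taint, TaintedPath */'));
```

For the line from the model, the result records the role ``sink``, the flow label ``taint`` and the
configuration ``TaintedPath``.
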
Using summaries
---------------

Once you have created summaries using the approach outlined above, you have two options for
including them in the analysis of a client application.

External data
:::::::::::::

Firstly, you can include the CSV files generated by running the extraction queries as external
data when building a snapshot of the client application by copying them into the
``$snapshot/external/data`` folder. This is typically done by including a command like this
in your ``project`` file:

.. code-block:: xml

  <build>cp /path/to/additional-sinks.csv ${snapshot}/external/data</build>

If you want to include summaries for multiple libraries, you have to concatenate the
corresponding CSV files before copying them into the external data folder.

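The concatenation step can be scripted. The following Node.js sketch appends several per-library files
into one. The function name ``concatenateSummaries`` and the file paths in the usage comment are
hypothetical, and the sketch assumes the CSV files have no header row, so that rows can simply be
appended.

```javascript
const fs = require('fs');

// Concatenate per-library summary files into a single CSV file.
// Assumption: the input files have no header row.
function concatenateSummaries(inputPaths, outputPath) {
  const parts = inputPaths.map((inputPath) => {
    const text = fs.readFileSync(inputPath, 'utf8');
    // Terminate the last row of each file before the next file is appended.
    return text === '' || text.endsWith('\n') ? text : text + '\n';
  });
  fs.writeFileSync(outputPath, parts.join(''));
}

// Example usage with hypothetical paths:
// concatenateSummaries(
//   ['mkdirp/additional-sinks.csv', 'bluebird/additional-sinks.csv'],
//   'additional-sinks.csv'
// );
```

Sinks, sources and steps are kept in separate files, so the script would be run once per file kind.
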
Additionally, you need to import the library ``Security.Summaries.ImportFromCsv`` in your
``javascript.qll``, which will pick up the summaries from external data and interpret them
as additional sources, sinks and flow steps:

.. code-block:: ql

  import Security.Summaries.ImportFromCsv

After these preparatory steps, you can run your analysis without any further changes.

External predicates
:::::::::::::::::::

The second method for including flow summaries is to import the
``Security.Summaries.ImportFromExternalPredicates`` library in your analysis. It declares
three external predicates ``additionalSteps``, ``additionalSinks`` and ``additionalSources`` that
need to be instantiated with the flow summary CSV data.

This is most easily done in QL for Eclipse, which will prompt you for CSV files to populate
the three predicates.

This approach has the advantage that you do not need to include the CSV files during the
snapshot build, so you can use an existing snapshot, for example as downloaded from LGTM.com.

Summary format
--------------

Source and sink summaries are specified as tuples of the form ``(portal, kind, configuration)``,
where ``portal`` is a description of the API element being marked as a source or sink, ``kind``
is a flow label (also known as "taint kind") describing the kind of information being generated
or consumed, and ``configuration`` specifies which flow configuration the summary applies to.

If ``kind`` is empty, it defaults to ``data`` for sources and either ``data`` or ``taint`` for sinks.
If ``configuration`` is empty, the specification applies to all configurations.
The default extraction queries never produce empty ``kind`` or ``configuration`` columns.

Similarly, step summaries are tuples of the form
``(inPortal, inKind, outPortal, outKind, configuration)``, stating that information with label
``inKind`` that flows into ``inPortal`` resurfaces from ``outPortal``, now having kind ``outKind``.
As before, ``configuration`` specifies which configuration this information applies to.

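The tuple shapes and the defaulting rules for empty columns can be sketched as follows. The field order
follows the tuples described above; the object layout and the function names ``sourceOrSinkSummary`` and
``stepSummary`` are illustrative assumptions, not the representation used by the import libraries.

```javascript
// Illustrative only: turn a (portal, kind, configuration) tuple into an
// object, applying the defaults for empty columns described above.
function sourceOrSinkSummary(role, [portal, kind, configuration]) {
  // An empty kind means "data" for sources and "data" or "taint" for sinks.
  const defaultKinds = role === 'source' ? ['data'] : ['data', 'taint'];
  return {
    role,
    portal,
    kinds: kind === '' ? defaultKinds : [kind],
    // null stands for "applies to all configurations".
    configuration: configuration === '' ? null : configuration,
  };
}

// Step summaries carry an input and an output portal, each with its own kind.
function stepSummary([inPortal, inKind, outPortal, outKind, configuration]) {
  return { inPortal, inKind, outPortal, outKind, configuration };
}

const sink = sourceOrSinkSummary('sink', [
  '(parameter 0 (member default (root https://www.npmjs.com/package/mkdirp)))',
  'taint',
  'TaintedPath',
]);
console.log(sink.kinds, sink.configuration);
```
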
In all of the above, ``portal`` is an S-expression that abstractly describes a *portal*, that is,
an API interface point by which data may enter or leave the npm package being analyzed.

Currently, we model five kinds of portals:

- ``(root <uri>)``, representing the ``module`` object of the main module of the npm package
  described by ``<uri>``, which is a URL of the form ``https://www.npmjs.com/package/<pkg>``;
- ``(member <name> <base>)``, representing property ``<name>`` of an object described by
  portal ``<base>``;
- ``(instance <base>)``, representing an instance of a (constructor) function or class
  described by portal ``<base>``;
- ``(parameter <i> <base>)``, representing the parameter with zero-based index ``<i>`` of a
  function described by portal ``<base>``;
- ``(return <base>)``, representing the return value of a function described by portal ``<base>``.

In our example above, the first parameter of the default export of package ``mkdirp`` is
described by the portal

.. code-block:: lisp

  (parameter 0 (member default (root https://www.npmjs.com/package/mkdirp)))

As a more complicated example,

.. code-block:: lisp

  (parameter 0 (parameter 1 (member then (instance (member Promise (root https://www.npmjs.com/package/bluebird))))))

describes the first parameter of a function passed as the second argument to the ``then`` method of
an instance of the ``Promise`` constructor exported by package ``bluebird``.

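Since portals nest, it can help to build them compositionally. The following sketch defines one small
function per portal kind and reconstructs the two examples above. The function names are hypothetical
and only produce the textual S-expression; they are not part of any Semmle library.

```javascript
// One hypothetical builder per portal kind; each returns the S-expression text.
const root = (pkg) => `(root https://www.npmjs.com/package/${pkg})`;
const member = (name, base) => `(member ${name} ${base})`;
const instance = (base) => `(instance ${base})`;
const parameter = (index, base) => `(parameter ${index} ${base})`;
const returnValue = (base) => `(return ${base})`;

// First parameter of the default export of mkdirp.
const mkdirpPath = parameter(0, member('default', root('mkdirp')));

// First parameter of the function passed as the second argument to the
// `then` method of an instance of bluebird's Promise constructor.
const thenCallbackParameter = parameter(
  0,
  parameter(1, member('then', instance(member('Promise', root('bluebird')))))
);

console.log(mkdirpPath);
console.log(thenCallbackParameter);
```

Reading a portal from the inside out mirrors how the builders are nested: start at the package root,
then follow members, instances, parameters and return values outwards.
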