codeql/javascript/documentation/flow-summaries.rst

Summary-based information flow analysis
=======================================

Overview
--------

This document presents an approach for running information flow analyses (such as the standard
security queries) on an application that depends on one or more npm packages. Instead of
installing the npm packages during the snapshot build and analyzing them together with application
code, we analyze each package in isolation and compute *flow summaries* that record information
about any sources, sinks and flow steps contributed by the package's API. These flow summaries
are then imported when building a snapshot of the application (usually in the form of CSV files
added as external data), and are picked up by the standard security queries, allowing them to reason
about flow into, out of and through the npm packages as though they had been included as part of the
build.

Note that flow summaries are an experimental technology, and not ready to be used in production
queries or libraries. Also note that flow summaries do not currently work with CodeQL, but require
the legacy Semmle Core toolchain.

Motivating example
------------------

Let us take the `mkdirp <https://www.npmjs.com/package/mkdirp>`_ package as an example. It exports
a function that takes as its first argument a file system path, and creates a folder with that
path, as well as any parent folders that do not exist yet. As further arguments, the function
accepts an optional configuration object and a callback to invoke once the folder has been
created.

An application might use this package as follows:

.. code-block:: js

  const mkdirp = require('mkdirp');
  // ...
  mkdirp(p, opts, function cb(err) {
    // ...
  });

If the value of ``p`` can be controlled by an untrusted user, this would allow them to create arbitrary
folders, which may not be desirable.

By analyzing the application code base together with the source code for the ``mkdirp`` package,
the default path injection analysis would be able to track taint through the call to ``mkdirp`` into its
implementation, which ultimately uses built-in Node.js file system APIs to create the folder. Since
the path injection analysis has built-in models of these APIs it would then be able to spot and flag this
vulnerability.

However, analyzing ``mkdirp`` from scratch for every client application is wasteful. Moreover, it would
in this case be undesirable to flag the location inside ``mkdirp`` where the folder is actually created
as part of the alert: the developer of the client application did not write that code and hence will
have a hard time understanding why it is being flagged.

Both of these concerns can be addressed by treating the first argument to ``mkdirp`` as a path injection
sink in its own right: the analysis no longer needs to track flow into the implementation of ``mkdirp``,
so we would no longer need to include its source code in the analysis, and the alert would flag the call
to ``mkdirp`` in application code, not its implementation in library code.

The information that the first parameter of ``mkdirp`` is interpreted as a file system path and hence should
be considered a path injection sink is an example of a *flow summary*, or more precisely a *sink summary*.
Besides sink summaries, we also consider *source summaries* and *flow-step summaries*.

In general, a sink summary states that some API interface point (such as a function parameter) should
be considered a sink for a certain analysis, so if data from a known source reaches this point without
undergoing appropriate sanitization, it should be flagged with an alert. A sink summary may also
specify which taint kind the data needs to have in order for the sink to be problematic.

Conversely, a source summary identifies some API (such as the return value of a function) as a source
of tainted data for a certain analysis, again optionally specifying a taint kind.

Finally, a flow-step summary records the fact that data that flows into the package at some point
may propagate to another point (for example, from a function parameter to its return value).
In this case, there are two relevant taint kinds, one describing the kind of taint data has that
enters, and one describing the taint of the data that emerges. In general, flow steps (like sources
and sinks) are analysis-specific, since we need to know about sanitizers.

In what follows we will first discuss how summaries are generated from a snapshot of an npm package,
and then how they are imported when analyzing client code. Finally, we will discuss the format in which
flow summaries are stored.

Note that flow summaries are considered an experimental feature at this point. Using them involves
some manual configuration, and we make no guarantee that the API will remain stable.

Generating summaries
--------------------

Flow summaries of an npm package can be generated by running special summary extraction queries
either on a snapshot of the package itself, or on a snapshot of a hand-written model of the
package. (Note that this requires a working installation of Semmle Core.)

There are three default summary extraction queries:

- Extract flow step summaries (``js/step-summary-extraction``,
  ``experimental/Summaries/ExtractSourceSummaries.ql``)
- Extract sink summaries (``js/sink-summary-extraction``,
  ``experimental/Summaries/ExtractSinkSummaries.ql``)
- Extract source summaries (``js/source-summary-extraction``,
  ``experimental/Summaries/ExtractSourceSummaries.ql``)

You can run these queries individually against a snapshot of the npm package you want to create
flow summaries for using ``odasa runQuery``, and store the output as CSV files named
``additional-steps.csv``, ``additional-sinks.csv`` and ``additional-sources.csv``, respectively.

For example, assuming that folder ``mkdirp-snapshot`` contains a snapshot of the ``mkdirp``
project, we can extract sink summaries using the command

.. code-block:: bash

  odasa runQuery \
        --query $SEMMLE_DIST/queries/semmlecode-javascript-queries/experimental/Summaries/ExtractSinkSummaries.ql \
        --output-file additional-sinks.csv --snapshot mkdirp-snapshot


Instead of generating summaries directly from the package source code, you can also generate
them from a hand-written model of the package. The model should contain a ``package.json`` file
giving the correct package name, and models for the relevant API entry points. The models are
plain JavaScript with special comments annotating certain expressions as sources or sinks.

For example, a model of ``mkdirp`` might look like this:

.. code-block:: js

  module.exports = function mkdirp(path) {
    path /* Semmle: sink: taint, TaintedPath */
  };

Annotation comments start with ``Semmle:``, and contain ``source`` and ``sink`` specifications.
Each such specification lists a flow label (in this case, ``taint``) and a configuration to which
the specification applies (in this case, ``TaintedPath``).

A source specification annotates an expression as being a source of flow with the given label
for the purposes of the given configuration, and similar for sinks. Annotation comments apply to
any expression (and more generally any data flow node) whose source location ends on the line
where the comment starts.

Using summaries
---------------

Once you have created summaries using the approach outlined above, you have two options for
including them in the analysis of a client application.

External data
:::::::::::::

Firstly, you can include the CSV files generated by running the extraction queries as external
data when building a snapshot of the client application by copying them into the
``$snapshot/external/data`` folder. This is typically done by including a command like this
in your ``project`` file:

.. code-block:: xml

  <build>cp /path/to/additional-sinks.csv ${snapshot}/external/data</build>

If you want to include summaries for multiple libraries, you have to concatenate the
corresponding CSV files before copying them into the external data folder.

Additionally, you need to import the library ``Security.Summaries.ImportFromCsv`` in your
``javascript.qll``, which will pick up the summaries from external data and interpret them
as additional sources, sinks and flow steps:

.. code-block:: ql

  import Security.Summaries.ImportFromCsv

After these preparatory steps, you can run your analysis without any further changes.

External predicates
:::::::::::::::::::

The second method for including flow summaries is by including the
``Security.Summaries.ImportFromExternalPredicates`` library in your analysis, which declares
three external predicates ``additionalSteps``, ``additionalSinks`` and ``additionalSources`` that
need to be instantiated with the flow summary CSV data.

This is most easily done in QL for Eclipse, which will prompt you for CSV files to populate
the three predicates.

This approach has the advantage that you do not need to include the CSV files during the
snapshot build, so you can use an existing snapshot, for example as downloaded from LGTM.com.

Summary format
--------------

Source and sink summaries are specified as tuples of the form ``(portal, kind, configuration)``,
where ``portal`` is a description of the API element being marked as a source or sink, ``kind``
is a flow label (also known as "taint kind") describing the kind of information being generated
or consumed, and ``configuration`` specifies which flow configuration the summary applies to.

If ``kind`` is empty, it defaults to ``data`` for sources and either ``data`` or ``taint`` for sinks.
If ``configuration`` is empty, the specification applies to all configurations.
The default extraction queries never produce empty ``kind`` or ``configuration`` columns.

Similarly, step summaries are tuples of the form
``(inPortal, inKind, outPortal, outKind, configuration)``, stating that information with label
``inKind`` that flows into ``inPortal`` resurfaces from ``outPortal``, now having kind ``outKind``.
As before, ``configuration`` specifies which configuration this information applies to.

In all of the above, ``portal`` is an S-expression that abstractly describes a *portal*, that is,
an API interface point by which data may enter or leave the npm package being analyzed.

Currently, we model five kinds of portals:

- ``(root <uri>)``, representing the ``module`` object of the main module of the npm package
  described by ``<uri>``, which is a URL of the form ``https://www.npmjs.com/package/<pkg>``;
- ``(member <name> <base>)``, representing property ``<name>`` of an object described by
  portal ``<base>``;
- ``(instance <base>)``, representing an instance of a (constructor) function or class
  described by portal ``base``;
- ``(parameter <i> <base>)``, representing the ``i`` th parameter of a function described by
  portal ``base``;
- ``(return <base>)``, representing the return value of a function described by portal ``base``.

In our example above, the first parameter of the default export of package ``mkdirp`` is
described by the portal

.. code-block:: lisp

  (parameter 0 (member default (root https://www.npmjs.com/package/mkdirp))

As a more complicated example,

.. code-block:: lisp

  (parameter 0 (parameter 1 (member then (instance (member Promise (root https://www.npmjs.com/package/bluebird))))))

describes the first parameter of a function passed as second argument to the ``then`` method of
the ``Promise`` constructor exported by package ``bluebird``.