Summary-based information flow analysis
=======================================

Overview
--------

This document presents an approach for running information flow analyses (such as the standard Semmle security queries) on an application that depends on one or more npm packages. Instead of installing the npm packages during the snapshot build and analyzing them together with application code, we analyze each package in isolation and compute *flow summaries* that record information about any sources, sinks and flow steps contributed by the package's API.

These flow summaries are then imported when building a snapshot of the application (usually in the form of CSV files added as external data), and are picked up by the standard security queries, allowing them to reason about flow into, out of and through the npm packages as though they had been included as part of the build.

Motivating example
------------------

Let us take the `mkdirp <https://www.npmjs.com/package/mkdirp>`_ package as an example. It exports a function that takes as its first argument a file system path, and creates a folder with that path, as well as any parent folders that do not exist yet. As further arguments, the function accepts an optional configuration object and a callback to invoke once the folder has been created.

An application might use this package as follows:

.. code-block:: js

  const mkdirp = require('mkdirp');
  // ...
  mkdirp(p, opts, function cb(err) {
    // ...
  });

If the value of ``p`` can be controlled by an untrusted user, this would allow them to create arbitrary folders, which may not be desirable.

By analyzing the application code base together with the source code for the ``mkdirp`` package, Semmle's default path injection analysis would be able to track taint through the call to ``mkdirp`` into its implementation, which ultimately uses built-in Node.js file system APIs to create the folder. Since the path injection analysis has built-in models of these APIs, it would then be able to spot and flag this vulnerability.

However, analyzing ``mkdirp`` from scratch for every client application is wasteful. Moreover, it would in this case be undesirable to flag the location inside ``mkdirp`` where the folder is actually created as part of the alert: the developer of the client application did not write that code and hence will have a hard time understanding why it is being flagged.

Both of these concerns can be addressed by treating the first argument to ``mkdirp`` as a path injection sink in its own right: the analysis no longer needs to track flow into the implementation of ``mkdirp``, so we no longer need to include its source code in the analysis, and the alert flags the call to ``mkdirp`` in application code, not its implementation in library code.

The information that the first parameter of ``mkdirp`` is interpreted as a file system path and hence should be considered a path injection sink is an example of a *flow summary*, or more precisely a *sink summary*. Besides sink summaries, we also consider *source summaries* and *flow-step summaries*.

In general, a sink summary states that some API interface point (such as a function parameter) should be considered a sink for a certain analysis, so if data from a known source reaches this point without undergoing appropriate sanitization, it should be flagged with an alert. A sink summary may also specify which taint kind the data needs to have in order for the sink to be problematic.

Conversely, a source summary identifies some API interface point (such as the return value of a function) as a source of tainted data for a certain analysis, again optionally specifying a taint kind.

Finally, a flow-step summary records the fact that data that flows into the package at one point may propagate to another point (for example, from a function parameter to its return value). In this case, there are two relevant taint kinds: one describing the taint of the data that enters, and one describing the taint of the data that emerges.
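
To make the three kinds of summary concrete, the following sketch writes one of each as a plain JavaScript object. The field names mirror the tuple components listed in the "Summary format" section, and the sink entry is the ``mkdirp`` example from the text. The package ``example-lib``, its source and step entries, and the helper ``isStepSummary`` are hypothetical and exist only for illustration; the objects are not the format in which summaries are actually stored.

.. code-block:: js

  // Illustrative objects only; real summaries are CSV rows, not JavaScript.
  const MKDIRP = 'https://www.npmjs.com/package/mkdirp';
  // Hypothetical package, invented for this example.
  const EXAMPLE = 'https://www.npmjs.com/package/example-lib';

  // Sink summary from the text: data reaching the first parameter of
  // mkdirp's export is interpreted as a file system path.
  const sinkSummary = {
    portal: `(parameter 0 (member default (root ${MKDIRP})))`,
    kind: 'taint',
    configuration: 'TaintedPath',
  };

  // Hypothetical source summary: the return value of example-lib's
  // export is assumed to produce tainted data.
  const sourceSummary = {
    portal: `(return (member default (root ${EXAMPLE})))`,
    kind: 'taint',
    configuration: 'TaintedPath',
  };

  // Hypothetical flow-step summary: taint entering through the first
  // parameter is assumed to re-emerge from the return value.
  const stepSummary = {
    inPortal: `(parameter 0 (member default (root ${EXAMPLE})))`,
    inKind: 'taint',
    outPortal: `(return (member default (root ${EXAMPLE})))`,
    outKind: 'taint',
    configuration: 'TaintedPath',
  };

  // A step summary names an entry and an exit portal; source and sink
  // summaries name a single portal.
  function isStepSummary(summary) {
    return 'inPortal' in summary && 'outPortal' in summary;
  }

  console.log(isStepSummary(sinkSummary)); // false
  console.log(isStepSummary(stepSummary)); // true

Note that the step summary carries two kinds, ``inKind`` and ``outKind``, while source and sink summaries carry only one.
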
In general, flow steps (like sources and sinks) are analysis-specific, since we need to know about sanitizers.

In what follows we will first discuss how summaries are generated from a snapshot of an npm package, and then how they are imported when analyzing client code. Finally, we will discuss the format in which flow summaries are stored.

Note that flow summaries are considered an experimental feature at this point. Using them involves some manual configuration, and we make no guarantee that the API will remain stable.

Generating summaries
--------------------

Flow summaries of an npm package can be generated by running special summary extraction queries either on a snapshot of the package itself, or on a snapshot of a hand-written model of the package. (Note that this requires a working installation of Semmle Core.)

There are three default summary extraction queries:

- Extract flow step summaries (``js/step-summary-extraction``, ``Security/Summaries/ExtractFlowStepSummaries.ql``)
- Extract sink summaries (``js/sink-summary-extraction``, ``Security/Summaries/ExtractSinkSummaries.ql``)
- Extract source summaries (``js/source-summary-extraction``, ``Security/Summaries/ExtractSourceSummaries.ql``)

You can run these queries individually, using ``odasa runQuery``, against a snapshot of the npm package you want to create flow summaries for, and store the output as CSV files named ``additional-steps.csv``, ``additional-sinks.csv`` and ``additional-sources.csv``, respectively.

For example, assuming that folder ``mkdirp-snapshot`` contains a snapshot of the ``mkdirp`` project, we can extract sink summaries using the command

.. code-block:: bash

  odasa runQuery \
    --query $SEMMLE_DIST/queries/semmlecode-javascript-queries/Security/Summaries/ExtractSinkSummaries.ql \
    --output-file additional-sinks.csv --snapshot mkdirp-snapshot

Instead of generating summaries directly from the package source code, you can also generate them from a hand-written model of the package.
The model should contain a ``package.json`` file giving the correct package name, and models for the relevant API entry points. The models are plain JavaScript with special comments annotating certain expressions as sources or sinks. For example, a model of ``mkdirp`` might look like this:

.. code-block:: js

  module.exports = function mkdirp(path) {
    path /* Semmle: sink: taint, TaintedPath */
  };

Annotation comments start with ``Semmle:``, and contain ``source`` and ``sink`` specifications. Each such specification lists a flow label (in this case, ``taint``) and a configuration to which the specification applies (in this case, ``TaintedPath``). A source specification annotates an expression as being a source of flow with the given label for the purposes of the given configuration, and similarly for sinks.

Annotation comments apply to any expression (and more generally any data flow node) whose source location ends on the line where the comment starts.

Using summaries
---------------

Once you have created summaries using the approach outlined above, you have two options for including them in the analysis of a client application.

External data
:::::::::::::

Firstly, you can include the CSV files generated by running the extraction queries as external data when building a snapshot of the client application by copying them into the ``$snapshot/external/data`` folder. This is typically done by including a command like this in your ``project`` file:

.. code-block:: xml

  <build>cp /path/to/additional-sinks.csv ${snapshot}/external/data</build>

If you want to include summaries for multiple libraries, you have to concatenate the corresponding CSV files before copying them into the external data folder.

Additionally, you need to import the library ``Security.Summaries.ImportFromCsv`` in your ``javascript.qll``, which will pick up the summaries from external data and interpret them as additional sources, sinks and flow steps:

.. code-block:: ql

  import Security.Summaries.ImportFromCsv

After these preparatory steps, you can run your analysis without any further changes.

External predicates
:::::::::::::::::::

The second method for including flow summaries is to import the ``Security.Summaries.ImportFromExternalPredicates`` library in your analysis. It declares three external predicates, ``additionalSteps``, ``additionalSinks`` and ``additionalSources``, that need to be instantiated with the flow summary CSV data. This is most easily done in QL for Eclipse, which will prompt you for CSV files to populate the three predicates.

This approach has the advantage that you do not need to include the CSV files during the snapshot build, so you can use an existing snapshot, for example as downloaded from LGTM.com.

Summary format
--------------

Source and sink summaries are specified as tuples of the form ``(portal, kind, configuration)``, where ``portal`` is a description of the API element being marked as a source or sink, ``kind`` is a flow label (also known as "taint kind") describing the kind of information being generated or consumed, and ``configuration`` specifies which flow configuration the summary applies to.

If ``kind`` is empty, it defaults to ``data`` for sources and either ``data`` or ``taint`` for sinks. If ``configuration`` is empty, the specification applies to all configurations. The default extraction queries never produce empty ``kind`` or ``configuration`` columns.

Similarly, step summaries are tuples of the form ``(inPortal, inKind, outPortal, outKind, configuration)``, stating that information with label ``inKind`` that flows into ``inPortal`` resurfaces from ``outPortal``, now having kind ``outKind``. As before, ``configuration`` specifies which configuration this information applies to.

In all of the above, ``portal`` is an S-expression that abstractly describes a *portal*, that is, an API interface point by which data may enter or leave the npm package being analyzed.
Currently, we model five kinds of portals:

- ``(root <uri>)``, representing the ``module`` object of the main module of the npm package described by ``<uri>``, which is a URL of the form ``https://www.npmjs.com/package/<pkg>``;
- ``(member <name> <base>)``, representing property ``<name>`` of an object described by portal ``<base>``;
- ``(instance <base>)``, representing an instance of a (constructor) function or class described by portal ``<base>``;
- ``(parameter <i> <base>)``, representing the ``<i>``\ th parameter (counting from zero) of a function described by portal ``<base>``;
- ``(return <base>)``, representing the return value of a function described by portal ``<base>``.

In our example above, the first parameter of the default export of package ``mkdirp`` is described by the portal

.. code-block:: lisp

  (parameter 0 (member default (root https://www.npmjs.com/package/mkdirp)))

As a more complicated example,

.. code-block:: lisp

  (parameter 0 (parameter 1 (member then (instance (member Promise (root https://www.npmjs.com/package/bluebird))))))

describes the first parameter of a function passed as second argument to the ``then`` method of an instance of the ``Promise`` constructor exported by package ``bluebird``.
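
Because portals nest, it can help to build them programmatically rather than by hand, which avoids unbalanced parentheses. The following is a minimal sketch, not part of the Semmle libraries: the helper names ``root``, ``member``, ``instance``, ``parameter`` and ``ret`` are hypothetical and simply mirror the five portal kinds listed above, and ``root`` assumes a plain, unscoped package name.

.. code-block:: js

  // Hypothetical helpers, one per portal kind; each returns the portal
  // as S-expression text.
  const root = (pkg) => `(root https://www.npmjs.com/package/${pkg})`;
  const member = (name, base) => `(member ${name} ${base})`;
  const instance = (base) => `(instance ${base})`;
  const parameter = (i, base) => `(parameter ${i} ${base})`;
  const ret = (base) => `(return ${base})`;

  // First parameter of the default export of mkdirp.
  const mkdirpPath = parameter(0, member('default', root('mkdirp')));

  // First parameter of a function passed as second argument to the
  // `then` method of an instance of bluebird's `Promise`.
  const thenCallbackArg = parameter(
    0,
    parameter(1, member('then', instance(member('Promise', root('bluebird')))))
  );

  console.log(mkdirpPath);
  // (parameter 0 (member default (root https://www.npmjs.com/package/mkdirp)))
  console.log(thenCallbackArg);
  // (parameter 0 (parameter 1 (member then (instance (member Promise (root https://www.npmjs.com/package/bluebird))))))

The two printed strings match the portal examples given above.
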