Python: Copy Python extractor to codeql repo

Taus
2024-02-28 15:15:21 +00:00
parent 297a17975d
commit 6dec323cfc
369 changed files with 165346 additions and 0 deletions


@@ -0,0 +1,49 @@
load("//:dist.bzl", "pack_zip")
py_binary(
name = "make-zips-py",
srcs = [
"make_zips.py",
"python_tracer.py",
"unparse.py",
],
data = [
"LICENSE-PSF.md",
"__main__.py",
"imp.py",
] + glob([
"blib2to3/**",
"buildtools/**",
"lark/**",
"semmle/**",
]),
# On @criemen's machine, without this, make-zips.py can't find its imports from
# python_tracer. The problem didn't show for some reason on Windows CI machines, though.
imports = ["."],
main = "make_zips.py",
)
genrule(
name = "python3src",
outs = [
"python3src.zip",
],
cmd = "PYTHON_INSTALLER_OUTPUT=\"$(RULEDIR)\" $(location :make-zips-py)",
tools = [":make-zips-py"],
)
pack_zip(
name = "extractor-python",
srcs = [
"LICENSE-PSF.md", # because we distribute imp.py
"convert_setup.py",
"get_venv_lib.py",
"imp.py",
"index.py",
"python_tracer.py",
"setup.py",
":python3src",
] + glob(["data/**"]),
prefix = "tools",
visibility = ["//visibility:public"],
)


@@ -0,0 +1,257 @@
Parts of the Python extractor are derived from code in the CPython
distribution. Its license is reproduced below.
A. HISTORY OF THE SOFTWARE
==========================
Python was created in the early 1990s by Guido van Rossum at Stichting
Mathematisch Centrum (CWI, see http://www.cwi.nl) in the Netherlands
as a successor of a language called ABC. Guido remains Python's
principal author, although it includes many contributions from others.
In 1995, Guido continued his work on Python at the Corporation for
National Research Initiatives (CNRI, see http://www.cnri.reston.va.us)
in Reston, Virginia where he released several versions of the
software.
In May 2000, Guido and the Python core development team moved to
BeOpen.com to form the BeOpen PythonLabs team. In October of the same
year, the PythonLabs team moved to Digital Creations, which became
Zope Corporation. In 2001, the Python Software Foundation (PSF, see
https://www.python.org/psf/) was formed, a non-profit organization
created specifically to own Python-related Intellectual Property.
Zope Corporation was a sponsoring member of the PSF.
All Python releases are Open Source (see http://www.opensource.org for
the Open Source Definition). Historically, most, but not all, Python
releases have also been GPL-compatible; the table below summarizes
the various releases.
    Release         Derived     Year        Owner       GPL-
                    from                                compatible? (1)

    0.9.0 thru 1.2              1991-1995   CWI         yes
    1.3 thru 1.5.2  1.2         1995-1999   CNRI        yes
    1.6             1.5.2       2000        CNRI        no
    2.0             1.6         2000        BeOpen.com  no
    1.6.1           1.6         2001        CNRI        yes (2)
    2.1             2.0+1.6.1   2001        PSF         no
    2.0.1           2.0+1.6.1   2001        PSF         yes
    2.1.1           2.1+2.0.1   2001        PSF         yes
    2.1.2           2.1.1       2002        PSF         yes
    2.1.3           2.1.2       2002        PSF         yes
    2.2 and above   2.1.1       2001-now    PSF         yes
Footnotes:
(1) GPL-compatible doesn't mean that we're distributing Python under
the GPL. All Python licenses, unlike the GPL, let you distribute
a modified version without making your changes open source. The
GPL-compatible licenses make it possible to combine Python with
other software that is released under the GPL; the others don't.
(2) According to Richard Stallman, 1.6.1 is not GPL-compatible,
because its license has a choice of law clause. According to
CNRI, however, Stallman's lawyer has told CNRI's lawyer that 1.6.1
is "not incompatible" with the GPL.
Thanks to the many outside volunteers who have worked under Guido's
direction to make these releases possible.
B. TERMS AND CONDITIONS FOR ACCESSING OR OTHERWISE USING PYTHON
===============================================================
PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
--------------------------------------------
1. This LICENSE AGREEMENT is between the Python Software Foundation
("PSF"), and the Individual or Organization ("Licensee") accessing and
otherwise using this software ("Python") in source or binary form and
its associated documentation.
2. Subject to the terms and conditions of this License Agreement, PSF hereby
grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce,
analyze, test, perform and/or display publicly, prepare derivative works,
distribute, and otherwise use Python alone or in any derivative version,
provided, however, that PSF's License Agreement and PSF's notice of copyright,
i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019 Python Software Foundation;
All Rights Reserved" are retained in Python alone or in any derivative version
prepared by Licensee.
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python.
4. PSF is making Python available to Licensee on an "AS IS"
basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. Nothing in this License Agreement shall be deemed to create any
relationship of agency, partnership, or joint venture between PSF and
Licensee. This License Agreement does not grant permission to use PSF
trademarks or trade name in a trademark sense to endorse or promote
products or services of Licensee, or any third party.
8. By copying, installing or otherwise using Python, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
BEOPEN.COM LICENSE AGREEMENT FOR PYTHON 2.0
-------------------------------------------
BEOPEN PYTHON OPEN SOURCE LICENSE AGREEMENT VERSION 1
1. This LICENSE AGREEMENT is between BeOpen.com ("BeOpen"), having an
office at 160 Saratoga Avenue, Santa Clara, CA 95051, and the
Individual or Organization ("Licensee") accessing and otherwise using
this software in source or binary form and its associated
documentation ("the Software").
2. Subject to the terms and conditions of this BeOpen Python License
Agreement, BeOpen hereby grants Licensee a non-exclusive,
royalty-free, world-wide license to reproduce, analyze, test, perform
and/or display publicly, prepare derivative works, distribute, and
otherwise use the Software alone or in any derivative version,
provided, however, that the BeOpen Python License is retained in the
Software, alone or in any derivative version prepared by Licensee.
3. BeOpen is making the Software available to Licensee on an "AS IS"
basis. BEOPEN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, BEOPEN MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
4. BEOPEN SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE
SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS
AS A RESULT OF USING, MODIFYING OR DISTRIBUTING THE SOFTWARE, OR ANY
DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
5. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
6. This License Agreement shall be governed by and interpreted in all
respects by the law of the State of California, excluding conflict of
law provisions. Nothing in this License Agreement shall be deemed to
create any relationship of agency, partnership, or joint venture
between BeOpen and Licensee. This License Agreement does not grant
permission to use BeOpen trademarks or trade names in a trademark
sense to endorse or promote products or services of Licensee, or any
third party. As an exception, the "BeOpen Python" logos available at
http://www.pythonlabs.com/logos.html may be used according to the
permissions granted on that web page.
7. By copying, installing or otherwise using the software, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
CNRI LICENSE AGREEMENT FOR PYTHON 1.6.1
---------------------------------------
1. This LICENSE AGREEMENT is between the Corporation for National
Research Initiatives, having an office at 1895 Preston White Drive,
Reston, VA 20191 ("CNRI"), and the Individual or Organization
("Licensee") accessing and otherwise using Python 1.6.1 software in
source or binary form and its associated documentation.
2. Subject to the terms and conditions of this License Agreement, CNRI
hereby grants Licensee a nonexclusive, royalty-free, world-wide
license to reproduce, analyze, test, perform and/or display publicly,
prepare derivative works, distribute, and otherwise use Python 1.6.1
alone or in any derivative version, provided, however, that CNRI's
License Agreement and CNRI's notice of copyright, i.e., "Copyright (c)
1995-2001 Corporation for National Research Initiatives; All Rights
Reserved" are retained in Python 1.6.1 alone or in any derivative
version prepared by Licensee. Alternately, in lieu of CNRI's License
Agreement, Licensee may substitute the following text (omitting the
quotes): "Python 1.6.1 is made available subject to the terms and
conditions in CNRI's License Agreement. This Agreement together with
Python 1.6.1 may be located on the Internet using the following
unique, persistent identifier (known as a handle): 1895.22/1013. This
Agreement may also be obtained from a proxy server on the Internet
using the following URL: http://hdl.handle.net/1895.22/1013".
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python 1.6.1 or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python 1.6.1.
4. CNRI is making Python 1.6.1 available to Licensee on an "AS IS"
basis. CNRI MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, CNRI MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON 1.6.1 WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. CNRI SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
1.6.1 FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON 1.6.1,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. This License Agreement shall be governed by the federal
intellectual property law of the United States, including without
limitation the federal copyright law, and, to the extent such
U.S. federal law does not apply, by the law of the Commonwealth of
Virginia, excluding Virginia's conflict of law provisions.
Notwithstanding the foregoing, with regard to derivative works based
on Python 1.6.1 that incorporate non-separable material that was
previously distributed under the GNU General Public License (GPL), the
law of the Commonwealth of Virginia shall govern this License
Agreement only as to issues arising under or with respect to
Paragraphs 4, 5, and 7 of this License Agreement. Nothing in this
License Agreement shall be deemed to create any relationship of
agency, partnership, or joint venture between CNRI and Licensee. This
License Agreement does not grant permission to use CNRI trademarks or
trade name in a trademark sense to endorse or promote products or
services of Licensee, or any third party.
8. By clicking on the "ACCEPT" button where indicated, or by copying,
installing or otherwise using Python 1.6.1, Licensee agrees to be
bound by the terms and conditions of this License Agreement.
ACCEPT
CWI LICENSE AGREEMENT FOR PYTHON 0.9.0 THROUGH 1.2
--------------------------------------------------
Copyright (c) 1991 - 1995, Stichting Mathematisch Centrum Amsterdam,
The Netherlands. All rights reserved.
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Stichting Mathematisch
Centrum or CWI not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

python/extractor/Makefile

@@ -0,0 +1,61 @@
.PHONY: all
.DEFAULT: all
all:

OS = $(shell uname)
GIT_ROOT = $(shell git rev-parse --show-toplevel)

TOKENIZER_FILE = semmle/python/parser/tokenizer.py
TOKENIZER_DEPS = tokenizer_generator/state_transition.txt tokenizer_generator/tokenizer_template.py
# Must use the same Python version as on Jenkins, since the output differs per version.
# However, the output is unstable on Python 3.5 (which Jenkins uses).
TOKENIZER_CMD = python3 -m tokenizer_generator.gen_state_machine $(TOKENIZER_DEPS)

.PHONY: tokenizer
tokenizer: $(TOKENIZER_FILE)

$(TOKENIZER_FILE): $(TOKENIZER_DEPS)
	$(TOKENIZER_CMD) > $@

MASTER_FILE = semmle/python/master.py
DBSCHEME_FILE = $(GIT_ROOT)/ql/python/ql/lib/semmlecode.python.dbscheme

.PHONY: dbscheme
dbscheme: $(MASTER_FILE)
	python3 -m semmle.dbscheme_gen $(DBSCHEME_FILE)

AST_GENERATED_DIR = $(GIT_ROOT)/ql/python/ql/lib/semmle/python/
AST_GENERATED_FILE = $(AST_GENERATED_DIR)AstGenerated.qll

.PHONY: ast
ast: $(MASTER_FILE)
	python3 -m semmle.query_gen $(AST_GENERATED_DIR)
	$(GIT_ROOT)/target/intree/codeql/codeql query format --in-place $(AST_GENERATED_FILE)

################################################################################
# Tests
################################################################################

.PHONY: test-all
test-all: test-3

.PHONY: test-3
test-3: pytest-3 test-tokenizer

.PHONY: test-tokenizer
test-tokenizer: SHELL:=/bin/bash
test-tokenizer:
	@echo Not running test-tokenizer as Jenkins uses Python 3.5
# TODO: Enable again once we run Python > 3.5 on Jenkins
#	diff -u $(TOKENIZER_FILE) <($(TOKENIZER_CMD))

.PHONY: pytest-3
pytest-3:
	poetry run pytest

.PHONY: pytest-3-deprecation-error
pytest-3-deprecation-error:
	PYTHONWARNINGS='error::DeprecationWarning' poetry run pytest

python/extractor/README.md

@@ -0,0 +1,211 @@
# Python extraction
Python extraction happens in two phases:
1. [Setup](#1-Setup-Phase)
   - determine which Python version to analyze the project as
   - create a virtual environment (only LGTM.com)
   - determine the Python import path
   - invoke the actual Python extractor
2. [The actual Python extractor](#2-The-actual-Python-extractor)
   - walks files and folders, and performs extraction

The `pack_zip` rule `extractor-python` in the Bazel `build` file defines which files are included in a distribution and in the CodeQL CLI. After building the CodeQL CLI locally, the files are in `target/intree/codeql/python/tools`.
## Local development
This project uses
- [poetry](https://python-poetry.org/) as the package manager
- [tox](https://tox.wiki/en) together with [pytest](https://docs.pytest.org/en/) to run tests across multiple versions
You can install both tools with [`pipx`](https://pypa.github.io/pipx/), like so:
```sh
pipx install poetry
pipx inject poetry virtualenv-pyenv # to allow poetry to find python versions from pyenv
pipx install tox
pipx inject tox virtualenv-pyenv # to allow tox to find python versions from pyenv
```
Once you've installed poetry, you can do this:
```sh
# install required packages
$ poetry install
# to run tests against the Python version used by poetry
$ poetry run pytest
# or
$ poetry shell # activate poetry environment
$ pytest # so now pytest is available
# to run tests against all supported Python versions
$ tox
# to run against a specific version (Python 3.9)
$ tox -e py39
```
To install multiple python versions locally, we recommend you use [`pyenv`](https://github.com/pyenv/pyenv)
_(don't try to use `tox run-parallel`; our tests are not set up for this to work 😅)_
### Zip files
Currently we distribute our code in an obfuscated way, by packaging the code from the subfolders into a zip file that is imported at run time (by the Python files at the top level of this directory).
The one exception is the `data` directory (used for stubs), which is included directly in the `tools` folder.
The zip creation is managed by [`make_zips.py`](./make_zips.py). Currently we produce one zip file for Python 2 (which is byte-compiled) and one for Python 3 (which contains source files, stripped of comments and docstrings).
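For a rough idea of what the Python 3 stripping step involves, here is a minimal sketch (not the actual `make_zips.py` logic; the file names are made up, and `ast.unparse` requires Python 3.9+):
```python
import ast
import zipfile

def strip_source(source: str) -> str:
    """Drop docstrings by re-parsing and unparsing the module
    (comments are lost as a side effect of parsing)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                node.body = body[1:] or [ast.Pass()]
    return ast.unparse(tree)

with zipfile.ZipFile("python3src.zip", "w") as zf:
    with open("semmle/example.py") as f:  # hypothetical input file
        zf.writestr("semmle/example.py", strip_source(f.read()))
```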
### A note about Python versions
We expect to be able to run our tools (setup phase) with either Python 2 or Python 3, and after determining which version to analyze the code as, we run the extractor with that version. So we must support:

- Setup tools run using Python 2:
  - Extracting code using Python 2
  - Extracting code using Python 3
- Setup tools run using Python 3:
  - Extracting code using Python 2
  - Extracting code using Python 3
# 1. Setup phase
**For extraction with the CodeQL CLI locally** (`codeql database create --language python`)
- Runs [`language-packs/python/tools/autobuild.sh`](/language-packs/python/tools/autobuild.sh), which in turn runs [`index.py`](./index.py)
### Overview of control flow for [`setup.py`](./setup.py)
The representation of the code in the figure below has in some cases been altered slightly, but is accurate as of 2020-03-20.
<details open>
<!-- This file can be opened with diagrams.net directly -->
![python extraction overview](./docs/extractor-python-setup.svg)
</details>
### Overview of control flow for [`index.py`](./index.py)
The representation of the code in the figure below has in some cases been altered slightly, but is accurate as of 2020-03-20.
<details open>
<!-- This file can be opened with diagrams.net directly -->
![python extraction overview](./docs/extractor-python-index.svg)
</details>
# 2. The actual Python extractor
## Overview
The entrypoint of the actual Python extractor is [`python_tracer.py`](./python_tracer.py).
The usual way to invoke the extractor is to pass a directory of Python files to the launcher. The extractor extracts code from those files and their dependencies, producing TRAP files, and copies the source code to a source archive.
Alternatively, for highly distributed systems, it is possible to pass a single file per extractor invocation, invoking the extractor many times.
The extractor recognizes Python source code files and Thrift IDL files.
Other types of files can be added to the database by passing the `--filter` option to the extractor, but they will be stored as text blobs.
The extractor expects the `CODEQL_EXTRACTOR_PYTHON_TRAP_DIR` and
`CODEQL_EXTRACTOR_PYTHON_SOURCE_ARCHIVE_DIR` environment variables to be set (which determine,
respectively, where it puts TRAP files and the source archive). However, the location of the TRAP
folder and source archive can be specified on the command-line instead.
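For example, a direct invocation might look like the following (a sketch only; normally the CodeQL CLI sets these variables, and the paths and arguments here are illustrative):
```python
import os
import subprocess

env = dict(
    os.environ,
    CODEQL_EXTRACTOR_PYTHON_TRAP_DIR="/tmp/db/trap",
    CODEQL_EXTRACTOR_PYTHON_SOURCE_ARCHIVE_DIR="/tmp/db/src",
)
# Pass the directory of Python files to extract to the launcher.
subprocess.run(["python3", "python_tracer.py", "path/to/project"],
               env=env, check=True)
```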
The extractor outputs the following information as TRAP files:
- A file containing per-interpreter data, such as version information and the contents of the `builtins` module.
- One file per extractor process containing the file and folder information for all processed files and all enclosing folders.
- Per Python or template file:
- The AST.
- Scopes and variables, attached to the AST.
- The control-flow graph, selectively split when repeated tests are seen.
## How it works
### Overall Architecture
Once started, the extractor consists of three sets of communicating processes.
1. The front-end: a single process that walks the files and folders specified on the command line, enqueuing those files plus any additional modules requested by the extractor processes.
2. The extractors: typically one process per CPU. They take file and module descriptions from the queue, producing TRAP files and copies of the source.
3. The logging process: to avoid message interleaving and deadlock, all log messages are queued up and sent to a single logging process, which formats and prints them.
The front-end -> worker message queue has quite limited capacity (2 per process) to ensure rapid shutdown when interrupted. The capacity of the worker -> front-end message queue must be at least twice that size to prevent deadlock, and is in fact much larger to prevent workers from being blocked on the queue.
Experiments suggest that the extractor scales almost linearly to at least 20 processes (on Linux).
The component that walks the file system is known as the "traverser" and is designed to be pluggable.
Its interface is simply an iterable of file descriptions. See `semmle/traverser.py`.
### Lifetime of the extractor
1. Parse the command-line options and read environment variables.
2. The main process creates:
1. the logging queue and process,
2. the message queues, and
3. the extractor processes.
3. The main process, now the front-end, starts traversing the file system, by iterating over the traverser.
4. Until it has exhausted the traverser, it concurrently:
- Adds module descriptions from the traverser to the message queue
- Reads the reply queue and for any `"IMPORT"` message received adds the module to the message queue if that module has not been seen before.
5. Until a `"SUCCESS"` message has been received on the reply queue for each module description that has been enqueued:
- Reads the reply queue and adds those module descriptions it hasn't seen before to the message queue.
6. Add one `None` message to the message queue for each extractor.
7. Wait for all extractors to halt.
8. Stop the logging process and halt.
### Lifetime of an extractor process
1. Read messages from the message queue until a `None` message is received. For each message:
1. Parse the file or module.
2. Send an "IMPORT" message for all modules imported by the module being processed.
3. Write out TRAP and source archive for the file.
4. Send a "SUCCESS" message for the file.
2. Emit file and folder TRAP for all files and modules processed.
3. Halt.
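The following is a minimal, self-contained sketch of this protocol (all names are hypothetical; the real implementation in `semmle/` reads replies concurrently while traversing and differs in many details):
```python
import multiprocessing as mp

def parse_and_extract(module):
    """Hypothetical stand-in for per-module extraction (writes TRAP and
    the source archive); returns the modules imported by `module`."""
    return []

def worker(tasks, replies):
    while True:
        module = tasks.get()
        if module is None:                     # shutdown sentinel
            break
        for imported in parse_and_extract(module):
            replies.put(("IMPORT", imported))  # possibly new work
        replies.put(("SUCCESS", module))

def front_end(roots, n_workers=4):
    tasks = mp.Queue(maxsize=2 * n_workers)    # small: rapid shutdown
    replies = mp.Queue()                       # larger, to avoid deadlock
    procs = [mp.Process(target=worker, args=(tasks, replies))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    seen, pending = set(roots), 0
    for module in roots:                       # drain the traverser
        tasks.put(module)
        pending += 1
    while pending:                             # one SUCCESS per enqueued task
        kind, payload = replies.get()
        if kind == "IMPORT" and payload not in seen:
            seen.add(payload)
            tasks.put(payload)
            pending += 1
        elif kind == "SUCCESS":
            pending -= 1
    for _ in procs:                            # one sentinel per extractor
        tasks.put(None)
    for p in procs:
        p.join()

if __name__ == "__main__":
    front_end(["a.py", "b.py"])
```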
### TRAP caching
An important consequence of local extraction is that, except for the file path information, the contents of the TRAP file are functionally determined by:
- The contents of the file.
- Some command-line options (those determining name hashing and CFG splitting).
- The extractor version.
Caching of TRAP files can reduce the time to extract a large project with few changes by an order of magnitude.
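A sketch of what such a cache key could look like (hypothetical; the actual caching scheme is not part of this file):
```python
import hashlib

def trap_cache_key(file_bytes, relevant_options, extractor_version):
    """The TRAP output is (file paths aside) a pure function of these
    inputs, so together they can key a content-addressed TRAP cache."""
    h = hashlib.sha256()
    for part in (file_bytes,
                 relevant_options.encode("utf8"),
                 extractor_version.encode("utf8")):
        h.update(len(part).to_bytes(8, "big"))  # length-prefix each part
        h.update(part)
    return h.hexdigest()
```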
### Extraction
Each extractor process runs a loop which extracts files or modules from the queue, one at a time.
Each file or module description is passed, in turn, to one of the extractor objects, which will either extract it or reject it, leaving it for the next extractor object to try.
Currently the default extractors are:
- Builtin module extractor: Extracts built-in modules like `sys`.
- Thrift extractor: Extracts Thrift IDL files.
- Python extractor: Extracts Python source code files.
- Package extractor: Extracts minimal information for package folders.
- General file extractor: Any files rejected by the above passes are added to the database as a text blob.
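Schematically, this dispatch is a chain of responsibility, along these lines (the names and the rejection mechanism are made up for illustration):
```python
class RejectFile(Exception):
    """Raised by an extractor that does not handle a given file."""

def extract_one(description, extractors):
    for extractor in extractors:
        try:
            return extractor.extract(description)  # handled: TRAP written
        except RejectFile:
            continue  # leave the file for the next extractor to try
    # The general file extractor at the end of the chain accepts anything,
    # so in practice this point is never reached.
    raise AssertionError(description)
```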
#### Python extraction
The Python extractor is the most interesting of the extractors listed above.
The Python extractor takes a path to a Python file. It emits TRAP to the specified folder and a UTF-8 encoded version of the source to the source archive.
It consists of the following passes:
1. Ingestion and decoding: Read the contents of the file as bytes, determine its encoding, and decode it to text.
2. Tokenizing: Tokenize the source text, including whitespace and comment tokens.
3. Parsing: Create a concrete parse tree from the list of tokens.
4. Rewriting: Rewrite the concrete parse tree to an AST, annotated with scope, variable information, and locations.
5. Write out lexical and AST information as TRAP.
6. Generate and emit TRAP for control-flow graphs. This is done one scope at a time to minimize memory consumption.
7. Emit ancillary information, like TRAP for comments.
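Pass 1 can be illustrated with the standard library, which implements the PEP 263 encoding detection this step needs (a sketch, not the extractor's own code):
```python
import tokenize

def read_source(path):
    """Ingestion and decoding: sniff the declared encoding from the
    first two lines, then decode the whole file with it."""
    with open(path, "rb") as f:
        encoding, _ = tokenize.detect_encoding(f.readline)
        f.seek(0)
        return f.read().decode(encoding)
```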
#### Template file extraction
Most Python template languages work either by translating the template into Python or by fairly closely mimicking the behavior of Python. This means that we can extract template files by converting them to the same AST used internally by the Python extractor, and then passing that AST to the backend of the Python extractor to determine imports and generate TRAP files, including control-flow information.


@@ -0,0 +1,4 @@
import semmle.populator

if __name__ == "__main__":
    semmle.populator.main()


@@ -0,0 +1,224 @@
# Grammar for 2to3. This grammar supports Python 2.x and 3.x.
# NOTE WELL: You should also follow all the steps listed at
# https://devguide.python.org/grammar/
# Start symbols for the grammar:
# file_input is a module or sequence of commands read from an input file;
# single_input is a single interactive statement;
# eval_input is the input for the eval() and input() functions.
# NB: compound_stmt in single_input is followed by extra NEWLINE!
file_input: (NEWLINE | stmt)* ENDMARKER
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
eval_input: testlist NEWLINE* ENDMARKER
decorator: '@' namedexpr_test NEWLINE
decorators: decorator+
decorated: decorators (classdef | funcdef | async_funcdef)
async_funcdef: 'async' funcdef
funcdef: 'def' NAME parameters ['->' test] ':' suite
parameters: '(' [typedargslist] ')'
# The following definition for typedarglist is equivalent to this set of rules:
#
# arguments = argument (',' argument)*
# argument = tfpdef ['=' test]
# kwargs = '**' tname [',']
# args = '*' [tname]
# kwonly_kwargs = (',' argument)* [',' [kwargs]]
# args_kwonly_kwargs = args kwonly_kwargs | kwargs
# poskeyword_args_kwonly_kwargs = arguments [',' [args_kwonly_kwargs]]
# typedargslist_no_posonly = poskeyword_args_kwonly_kwargs | args_kwonly_kwargs
# typedarglist = arguments ',' '/' [',' [typedargslist_no_posonly]])|(typedargslist_no_posonly)"
#
# It needs to be fully expanded to allow our LL(1) parser to work on it.
typedargslist: tfpdef ['=' test] (',' tfpdef ['=' test])* ',' '/' [
',' [((tfpdef ['=' test] ',')* ('*' [tname] (',' tname ['=' test])*
[',' ['**' tname [',']]] | '**' tname [','])
| tfpdef ['=' test] (',' tfpdef ['=' test])* [','])]
] | ((tfpdef ['=' test] ',')* ('*' [tname] (',' tname ['=' test])*
[',' ['**' tname [',']]] | '**' tname [','])
| tfpdef ['=' test] (',' tfpdef ['=' test])* [','])
tname: NAME [':' test]
tfpdef: tname | '(' tfplist ')'
tfplist: tfpdef (',' tfpdef)* [',']
# The following definition for varargslist is equivalent to this set of rules:
#
# arguments = argument (',' argument )*
# argument = vfpdef ['=' test]
# kwargs = '**' vname [',']
# args = '*' [vname]
# kwonly_kwargs = (',' argument )* [',' [kwargs]]
# args_kwonly_kwargs = args kwonly_kwargs | kwargs
# poskeyword_args_kwonly_kwargs = arguments [',' [args_kwonly_kwargs]]
# vararglist_no_posonly = poskeyword_args_kwonly_kwargs | args_kwonly_kwargs
# varargslist = arguments ',' '/' [','[(vararglist_no_posonly)]] | (vararglist_no_posonly)
#
# It needs to be fully expanded to allow our LL(1) parser to work on it.
varargslist: vfpdef ['=' test ](',' vfpdef ['=' test])* ',' '/' [',' [
((vfpdef ['=' test] ',')* ('*' [vname] (',' vname ['=' test])*
[',' ['**' vname [',']]] | '**' vname [','])
| vfpdef ['=' test] (',' vfpdef ['=' test])* [','])
]] | ((vfpdef ['=' test] ',')*
('*' [vname] (',' vname ['=' test])* [',' ['**' vname [',']]]| '**' vname [','])
| vfpdef ['=' test] (',' vfpdef ['=' test])* [','])
vname: NAME
vfpdef: vname | '(' vfplist ')'
vfplist: vfpdef (',' vfpdef)* [',']
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | print_stmt | del_stmt | pass_stmt | flow_stmt |
import_stmt | global_stmt | exec_stmt | assert_stmt)
expr_stmt: testlist_star_expr (annassign | augassign (yield_expr|testlist) |
('=' (yield_expr|testlist_star_expr))*)
annassign: ':' test ['=' test]
testlist_star_expr: (test|star_expr) (',' (test|star_expr))* [',']
augassign: ('+=' | '-=' | '*=' | '@=' | '/=' | '%=' | '&=' | '|=' | '^=' |
'<<=' | '>>=' | '**=' | '//=')
# For normal and annotated assignments, additional restrictions enforced by the interpreter
print_stmt: 'print' ( [ test (',' test)* [','] ] |
'>>' test [ (',' test)+ [','] ] )
del_stmt: 'del' del_list
del_list: (expr|star_expr) (',' (expr|star_expr))* [',']
pass_stmt: 'pass'
flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
break_stmt: 'break'
continue_stmt: 'continue'
return_stmt: 'return' [testlist_star_expr]
yield_stmt: yield_expr
raise_stmt: 'raise' [test ['from' test | ',' test [',' test]]]
import_stmt: import_name | import_from
import_name: 'import' dotted_as_names
import_from: ('from' ('.'* dotted_name | '.'+)
'import' ('*' | '(' import_as_names ')' | import_as_names))
import_as_name: NAME ['as' NAME]
dotted_as_name: dotted_name ['as' NAME]
import_as_names: import_as_name (',' import_as_name)* [',']
dotted_as_names: dotted_as_name (',' dotted_as_name)*
dotted_name: NAME ('.' NAME)*
global_stmt: ('global' | 'nonlocal') NAME (',' NAME)*
exec_stmt: 'exec' expr ['in' test [',' test]]
assert_stmt: 'assert' test [',' test]
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
async_stmt: 'async' (funcdef | with_stmt | for_stmt)
if_stmt: 'if' namedexpr_test ':' suite ('elif' namedexpr_test ':' suite)* ['else' ':' suite]
while_stmt: 'while' namedexpr_test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
try_stmt: ('try' ':' suite
((except_clause ':' suite)+
['else' ':' suite]
['finally' ':' suite] |
'finally' ':' suite))
with_stmt: 'with' with_item (',' with_item)* ':' suite
with_item: test ['as' expr]
with_var: 'as' expr
# NB compile.c makes sure that the default except clause is last
except_clause: 'except' [test [(',' | 'as') test]]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
# Backward compatibility cruft to support:
# [ x for x in lambda: True, lambda: False if x() ]
# even while also allowing:
# lambda x: 5 if x else 2
# (But not a mix of the two)
testlist_safe: old_test [(',' old_test)+ [',']]
old_test: or_test | old_lambdef
old_lambdef: 'lambda' [varargslist] ':' old_test
namedexpr_test: test [':=' test]
test: or_test ['if' or_test 'else' test] | lambdef
or_test: and_test ('or' and_test)*
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison
comparison: expr (comp_op expr)*
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
star_expr: '*' expr
expr: xor_expr ('|' xor_expr)*
xor_expr: and_expr ('^' and_expr)*
and_expr: shift_expr ('&' shift_expr)*
shift_expr: arith_expr (('<<'|'>>') arith_expr)*
arith_expr: term (('+'|'-') term)*
term: factor (('*'|'@'|'/'|'%'|'//') factor)*
factor: ('+'|'-'|'~') factor | power
power: ['await'] atom trailer* ['**' factor]
atom: ('(' [yield_expr|testlist_gexp] ')' |
'[' [listmaker] ']' |
'{' [dictsetmaker] '}' |
'`' testlist1 '`' |
NAME | NUMBER | string | '.' '.' '.' | special_operation
)
string: (fstring_part | STRING)+
fstring_part: FSTRING_START testlist ['='] [ CONVERSION ] [ format_specifier ] (FSTRING_MID testlist ['='] [ CONVERSION ] [ format_specifier ] )* FSTRING_END
format_specifier: ':' (FSTRING_SPEC test [ CONVERSION ] [ format_specifier ] )* FSTRING_SPEC
listmaker: (namedexpr_test|star_expr) ( old_comp_for | (',' (namedexpr_test|star_expr))* [','] )
testlist_gexp: (namedexpr_test|star_expr) ( old_comp_for | (',' (namedexpr_test|star_expr))* [','] )
lambdef: 'lambda' [varargslist] ':' test
trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
subscriptlist: subscript (',' subscript)* [',']
subscript: test | [test] ':' [test] [ ':' [test] ]
exprlist: (expr|star_expr) (',' (expr|star_expr))* [',']
testlist: test (',' test)* [',']
dictsetmaker: ( ((test ':' test | '**' expr)
(comp_for | (',' (test ':' test | '**' expr))* [','])) |
((test | star_expr)
(comp_for | (',' (test | star_expr))* [','])) )
classdef: 'class' NAME ['(' [arglist] ')'] ':' suite
arglist: argument (',' argument)* [',']
# "test '=' test" is really "keyword '=' test", but we have no such token.
# These need to be in a single rule to avoid grammar that is ambiguous
# to our LL(1) parser. Even though 'test' includes '*expr' in star_expr,
# we explicitly match '*' here, too, to give it proper precedence.
# Illegal combinations and orderings are blocked in ast.c:
# multiple (test comp_for) arguments are blocked; keyword unpackings
# that precede iterable unpackings are blocked; etc.
argument: ( test [comp_for] |
test ':=' test |
test '=' test |
'**' test |
'*' test )
comp_iter: comp_for | comp_if
comp_for: ['async'] 'for' exprlist 'in' or_test [comp_iter]
comp_if: 'if' old_test [comp_iter]
# As noted above, testlist_safe extends the syntax allowed in list
# comprehensions and generators. We can't use it indiscriminately in all
# derivations using a comp_for-like pattern because the testlist_safe derivation
# contains comma which clashes with trailing comma in arglist.
#
# This was an issue because the parser would not follow the correct derivation
# when parsing syntactically valid Python code. Since testlist_safe was created
# specifically to handle list comprehensions and generator expressions enclosed
# with parentheses, it's safe to only use it in those. That avoids the issue; we
# can parse code like set(x for x in [],).
#
# The syntax supported by this set of rules is not a valid Python 3 syntax,
# hence the prefix "old".
#
# See https://bugs.python.org/issue27494
old_comp_iter: old_comp_for | old_comp_if
old_comp_for: ['async'] 'for' exprlist 'in' testlist_safe [old_comp_iter]
old_comp_if: 'if' old_test [old_comp_iter]
testlist1: test (',' test)*
# not used in grammar, but may appear in "node" passed from Parser to Compiler
encoding_decl: NAME
yield_expr: 'yield' [yield_arg]
yield_arg: 'from' test | testlist_star_expr
special_operation: DOLLARNAME '(' [testlist] ')'

@@ -0,0 +1,254 @@
(PSF license text; a near-verbatim duplicate of the LICENSE-PSF.md reproduced above.)


@@ -0,0 +1,20 @@
This code is derived from the black code formatter,
which itself was derived from the lib2to3 package in the Python standard library.
We (Semmle) have modified this further to ease conversion to our multi-version AST.
Original README from black:
A subset of lib2to3 taken from Python 3.7.0b2.
Commit hash: 9c17e3a1987004b8bcfbe423953aad84493a7984
Reasons for forking:
- consistent handling of f-strings for users of Python < 3.6.2
- backport of BPO-33064 that fixes parsing files with trailing commas after
*args and **kwargs
- backport of GH-6143 that restores the ability to reformat legacy usage of
`async`
- support all types of string literals
- better ability to debug (better reprs)
- INDENT and DEDENT don't hold whitespace and comment prefixes
- ability to Cythonize


@@ -0,0 +1,67 @@
# Building Concrete Parse Trees using the Python grammar
This grammar mostly reuses existing code:
- `lib2to3` is a part of the `2to3` utility (included in the CPython
distribution) aimed at automatically converting Python 2 code to equivalent
Python 3 code. Because it needs to be idempotent when applied to Python 3
code, this grammar must be capable of parsing both Python 2 and 3 (with
certain restrictions).
- `blib2to3` is part of the `black` formatter for Python. It adds a few
extensions on top of `lib2to3`.
- Finally, we extend this grammar even further, in order to support things like
f-strings even when the extractor is run using Python 2. (In this respect,
`blib2to3` "cheats" by requiring Python 3 if you want to parse Python 3 code.
We do not have this luxury.)
The grammar of Python is described in `Grammar.txt` in the style of an EBNF:
- Rules have the form `nonterminal_name: production` (where traditionally, one
would use `::=` instead of `:`)
- Productions can contain
- Literal strings, enclosed in single quotes.
- Alternation, indicated by an infix `|`.
- Repetition, indicated by a postfixed `*` for "zero or more" and `+` for
"one or more".
- Optional parts, indicated by these being surrounded by square brackets.
- Parentheses to indicate grouping, and to allow productions to span several lines.
>Note: You may wonder: How is `Grammar.txt` parsed? The answer to this is that
>it is used to parse itself. In particular, it uses the same tokenizer as that
>for Python, and hence every symbol appearing in the grammar must be a valid
>Python token. This is why rules use `:` instead of `::=`. This also explains
>why parentheses must be used when a production spans multiple lines, as the
>presence of parentheses affects the tokenization.
The concrete parse tree built based on these rules has a simple form: Each node
has a `name` attribute, equal to that of the corresponding nonterminal, and a
`children` attribute, which contains a list of all of the children of the node.
These come directly from the production on the right hand side of the rule for
the given nonterminal. Thus, something like
```
testlist: test (',' test)* [',']
```
will result in a node with name `testlist`, and its attribute `children` will be
a list where the first element is a `test` node, the second (if any) is a node
for `','`, etc. Note in particular that _every_ part of the production is
included in the children, even parts that are just static tokens.
The leaves of the concrete parse tree (corresponding to the terminals of the
grammar) will have an associated `value` attribute. This contains the underlying
string for this token (in particular, for a `NAME` token, its value will be the
underlying identifier).
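For example, parsing `a, b` against the `testlist` rule could yield a tree shaped like this (a schematic illustration; the class names are made up):
```python
class Node:
    def __init__(self, name, children):
        self.name, self.children = name, children

class Leaf:
    def __init__(self, name, value):
        self.name, self.value = name, value

# testlist: test (',' test)* [','] applied to "a, b"
tree = Node("testlist", [
    Leaf("NAME", "a"),   # first `test`, collapsed to its single token
    Leaf("COMMA", ","),  # the static ',' token is a child too
    Leaf("NAME", "b"),   # second `test`
])
```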
## From Concrete to Abstract
To turn the concrete parse tree into an abstract parse tree, we _walk_ the tree
using the visitor pattern. Thus, for every nonterminal (e.g. `testlist`) we have
a method (in this case `visit_testlist`) that takes care of visiting nodes of
this type in the concrete parse tree. In doing so, we build up the abstract
parse tree, eliding any nodes that are not relevant in terms of the abstract
syntax.
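Schematically, the visitor dispatch looks like this (reusing the node shape sketched above; all names are illustrative rather than the extractor's actual ones):
```python
class AstBuilder:
    def visit(self, node):
        # Dispatch on the node's name: visit_testlist, visit_NAME, ...
        method = getattr(self, "visit_" + node.name, self.generic_visit)
        return method(node)

    def visit_testlist(self, node):
        # Elide the static ',' tokens; keep only the `test` children.
        return ("Tuple", [self.visit(c) for c in node.children
                          if c.name != "COMMA"])

    def visit_NAME(self, leaf):
        return ("Name", leaf.value)

    def generic_visit(self, node):
        return [self.visit(c) for c in getattr(node, "children", [])]
```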
>TO DO:
>- Why we parse everything four times (`async` et al.)


@@ -0,0 +1 @@
#empty


@@ -0,0 +1,37 @@
# Copyright 2004-2005 Elemental Security, Inc. All Rights Reserved.
# Licensed to PSF under a Contributor Agreement.
# Modifications:
# Copyright 2006 Google, Inc. All Rights Reserved.
# Licensed to PSF under a Contributor Agreement.
"""Parser driver.
This provides a high-level interface to parse a file into a syntax tree.
"""
__author__ = "Guido van Rossum <guido@python.org>"
__all__ = ["load_grammar"]
# Python imports
import os
import logging
import pkgutil
import sys
# Pgen imports
from . import grammar, pgen
if sys.version < "3":
from cStringIO import StringIO
else:
from io import StringIO
def load_grammar(package, grammar):
"""Load the grammar (maybe from a pickle)."""
data = pkgutil.get_data(package, grammar)
stream = StringIO(data.decode("utf8"))
g = pgen.generate_grammar(grammar, stream)
return g
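
# Hypothetical usage sketch (the package and resource names below are
# assumptions for illustration):
#
#     g = load_grammar("blib2to3", "Grammar.txt")
#     # `g` then supplies the parse tables consumed by parse.Parser.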


@@ -0,0 +1,188 @@
# Copyright 2004-2005 Elemental Security, Inc. All Rights Reserved.
# Licensed to PSF under a Contributor Agreement.
"""This module defines the data structures used to represent a grammar.
These are a bit arcane because they are derived from the data
structures used by Python's 'pgen' parser generator.
There's also a table here mapping operators to their names in the
token module; the Python tokenize module reports all operators as the
fallback token code OP, but the parser needs the actual token code.
"""
# Python imports
import pickle
# Local imports
from . import token
class Grammar(object):
"""Pgen parsing tables conversion class.
Once initialized, this class supplies the grammar tables for the
parsing engine implemented by parse.py. The parsing engine
accesses the instance variables directly. The class here does not
provide initialization of the tables; several subclasses exist to
do this (see the conv and pgen modules).
The load() method reads the tables from a pickle file, which is
much faster than the other ways offered by subclasses. The pickle
file is written by calling dump() (after loading the grammar
tables using a subclass). The report() method prints a readable
representation of the tables to stdout, for debugging.
The instance variables are as follows:
symbol2number -- a dict mapping symbol names to numbers. Symbol
numbers are always 256 or higher, to distinguish
them from token numbers, which are between 0 and
255 (inclusive).
number2symbol -- a dict mapping numbers to symbol names;
these two are each other's inverse.
states -- a list of DFAs, where each DFA is a list of
states, each state is a list of arcs, and each
arc is a (i, j) pair where i is a label and j is
a state number. The DFA number is the index into
this list. (This name is slightly confusing.)
Final states are represented by a special arc of
the form (0, j) where j is its own state number.
dfas -- a dict mapping symbol numbers to (DFA, first)
pairs, where DFA is an item from the states list
above, and first is a set of tokens that can
begin this grammar rule (represented by a dict
whose values are always 1).
labels -- a list of (x, y) pairs where x is either a token
number or a symbol number, and y is either None
or a string; the strings are keywords. The label
number is the index in this list; label numbers
are used to mark state transitions (arcs) in the
DFAs.
start -- the number of the grammar's start symbol.
keywords -- a dict mapping keyword strings to arc labels.
tokens -- a dict mapping token numbers to arc labels.
"""
def __init__(self):
self.symbol2number = {}
self.number2symbol = {}
self.states = []
self.dfas = {}
self.labels = [(0, "EMPTY")]
self.keywords = {}
self.tokens = {}
self.symbol2label = {}
self.start = 256
def dump(self, filename):
"""Dump the grammar tables to a pickle file."""
with open(filename, "wb") as f:
pickle.dump(self.__dict__, f, pickle.HIGHEST_PROTOCOL)
def load(self, filename):
"""Load the grammar tables from a pickle file."""
with open(filename, "rb") as f:
d = pickle.load(f)
self.__dict__.update(d)
def loads(self, pkl):
"""Load the grammar tables from a pickle bytes object."""
self.__dict__.update(pickle.loads(pkl))
def copy(self):
"""
Copy the grammar.
"""
new = self.__class__()
for dict_attr in ("symbol2number", "number2symbol", "dfas", "keywords",
"tokens", "symbol2label"):
setattr(new, dict_attr, getattr(self, dict_attr).copy())
new.labels = self.labels[:]
new.states = self.states[:]
new.start = self.start
return new
def report(self):
"""Dump the grammar tables to standard output, for debugging."""
from pprint import pprint
print("s2n")
pprint(self.symbol2number)
print("n2s")
pprint(self.number2symbol)
print("states")
pprint(self.states)
print("dfas")
pprint(self.dfas)
print("labels")
pprint(self.labels)
print("start", self.start)
# Map from operator to number (since tokenize doesn't do this)
opmap_raw = """
( LPAR
) RPAR
[ LSQB
] RSQB
: COLON
, COMMA
; SEMI
+ PLUS
- MINUS
* STAR
/ SLASH
| VBAR
& AMPER
< LESS
> GREATER
= EQUAL
. DOT
% PERCENT
` BACKQUOTE
{ LBRACE
} RBRACE
@ AT
@= ATEQUAL
== EQEQUAL
!= NOTEQUAL
<> NOTEQUAL
<= LESSEQUAL
>= GREATEREQUAL
~ TILDE
^ CIRCUMFLEX
<< LEFTSHIFT
>> RIGHTSHIFT
** DOUBLESTAR
+= PLUSEQUAL
-= MINEQUAL
*= STAREQUAL
/= SLASHEQUAL
%= PERCENTEQUAL
&= AMPEREQUAL
|= VBAREQUAL
^= CIRCUMFLEXEQUAL
<<= LEFTSHIFTEQUAL
>>= RIGHTSHIFTEQUAL
**= DOUBLESTAREQUAL
// DOUBLESLASH
//= DOUBLESLASHEQUAL
-> RARROW
:= COLONEQUAL
"""
opmap = {}
for line in opmap_raw.splitlines():
if line:
op, name = line.split()
opmap[op] = getattr(token, name)
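
# Example: the tokenizer reports every operator with the generic OP token
# code; the parser then uses opmap to recover the specific label, e.g.
#
#     opmap["->"] == token.RARROW
#     opmap[":="] == token.COLONEQUAL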

View File

@@ -0,0 +1,201 @@
# Copyright 2004-2005 Elemental Security, Inc. All Rights Reserved.
# Licensed to PSF under a Contributor Agreement.
"""Parser engine for the grammar tables generated by pgen.
The grammar table must be loaded first.
See Parser/parser.c in the Python distribution for additional info on
how this parsing engine works.
"""
# Local imports
from . import token
class ParseError(Exception):
"""Exception to signal the parser is stuck."""
def __init__(self, msg, type, value, context):
Exception.__init__(self, "%s: type=%r, value=%r, context=%r" %
(msg, type, value, context))
self.msg = msg
self.type = type
self.value = value
self.context = context
class Parser(object):
"""Parser engine.
The proper usage sequence is:
p = Parser(grammar, [converter]) # create instance
p.setup([start]) # prepare for parsing
<for each input token>:
if p.addtoken(...): # parse a token; may raise ParseError
break
root = p.rootnode # root of abstract syntax tree
A Parser instance may be reused by calling setup() repeatedly.
A Parser instance contains state pertaining to the current token
sequence, and should not be used concurrently by different threads
to parse separate token sequences.
See driver.py for how to get input tokens by tokenizing a file or
string.
Parsing is complete when addtoken() returns True; the root of the
abstract syntax tree can then be retrieved from the rootnode
instance variable. When a syntax error occurs, addtoken() raises
the ParseError exception. There is no error recovery; the parser
cannot be used after a syntax error was reported (but it can be
reinitialized by calling setup()).
"""
def __init__(self, grammar, convert=None):
"""Constructor.
The grammar argument is a grammar.Grammar instance; see the
grammar module for more information.
The parser is not ready yet for parsing; you must call the
setup() method to get it started.
The optional convert argument is a function mapping concrete
syntax tree nodes to abstract syntax tree nodes. If not
given, no conversion is done and the syntax tree produced is
the concrete syntax tree. If given, it must be a function of
two arguments, the first being the grammar (a grammar.Grammar
instance), and the second being the concrete syntax tree node
to be converted. The syntax tree is converted from the bottom
up.
A concrete syntax tree node is a (type, value, context, nodes)
tuple, where type is the node type (a token or symbol number),
value is None for symbols and a string for tokens, context is
None or an opaque value used for error reporting (typically a
(lineno, offset) pair), and nodes is a list of children for
symbols, and None for tokens.
An abstract syntax tree node may be anything; this is entirely
up to the converter function.
"""
self.grammar = grammar
self.convert = convert or (lambda grammar, node: node)
def setup(self, start=None):
"""Prepare for parsing.
This *must* be called before starting to parse.
The optional argument is an alternative start symbol; it
defaults to the grammar's start symbol.
You can use a Parser instance to parse any number of programs;
each time you call setup() the parser is reset to an initial
state determined by the (implicit or explicit) start symbol.
"""
if start is None:
start = self.grammar.start
# Each stack entry is a tuple: (dfa, state, node).
# A node is a tuple: (type, value, context, children),
# where children is a list of nodes or None, and context may be None.
newnode = (start, None, None, [])
stackentry = (self.grammar.dfas[start], 0, newnode)
self.stack = [stackentry]
self.rootnode = None
self.used_names = set() # Aliased to self.rootnode.used_names in pop()
def addtoken(self, type, value, context):
"""Add a token; return True iff this is the end of the program."""
# Map from token to label
ilabel = self.classify(type, value, context)
# Loop until the token is shifted; may raise exceptions
while True:
dfa, state, node = self.stack[-1]
states, first = dfa
arcs = states[state]
# Look for a state with this label
for i, newstate in arcs:
t, v = self.grammar.labels[i]
if ilabel == i:
# Look it up in the list of labels
assert t < 256
# Shift a token; we're done with it
self.shift(type, value, newstate, context)
# Pop while we are in an accept-only state
state = newstate
while states[state] == [(0, state)]:
self.pop()
if not self.stack:
# Done parsing!
return True
dfa, state, node = self.stack[-1]
states, first = dfa
# Done with this token
return False
elif t >= 256:
# See if it's a symbol and if we're in its first set
itsdfa = self.grammar.dfas[t]
itsstates, itsfirst = itsdfa
if ilabel in itsfirst:
# Push a symbol
self.push(t, self.grammar.dfas[t], newstate, context)
break # To continue the outer while loop
else:
if (0, state) in arcs:
# An accepting state, pop it and try something else
self.pop()
if not self.stack:
# Done parsing, but another token is input
raise ParseError("too much input",
type, value, context)
else:
# No success finding a transition
raise ParseError("bad input", type, value, context)
def classify(self, type, value, context):
"""Turn a token into a label. (Internal)"""
if type == token.NAME:
# Keep a listing of all used names
self.used_names.add(value)
# Check for reserved words
ilabel = self.grammar.keywords.get(value)
if ilabel is not None:
return ilabel
ilabel = self.grammar.tokens.get(type)
if ilabel is None:
raise ParseError("bad token", type, value, context)
return ilabel
def shift(self, type, value, newstate, context):
"""Shift a token. (Internal)"""
dfa, state, node = self.stack[-1]
newnode = (type, value, context, None)
newnode = self.convert(self.grammar, newnode)
if newnode is not None:
node[-1].append(newnode)
self.stack[-1] = (dfa, newstate, node)
def push(self, type, newdfa, newstate, context):
"""Push a nonterminal. (Internal)"""
dfa, state, node = self.stack[-1]
newnode = (type, None, context, [])
self.stack[-1] = (dfa, newstate, node)
self.stack.append((newdfa, 0, newnode))
def pop(self):
"""Pop a nonterminal. (Internal)"""
popdfa, popstate, popnode = self.stack.pop()
newnode = self.convert(self.grammar, popnode)
if newnode is not None:
if self.stack:
dfa, state, node = self.stack[-1]
node[-1].append(newnode)
else:
self.rootnode = newnode
self.rootnode.used_names = self.used_names
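def _example_parse_tokens(grammar, tokens):
    # A minimal sketch (added for illustration; not part of upstream
    # lib2to3) of the usage sequence documented on the Parser class above.
    # `grammar` is assumed to be a grammar.Grammar instance and `tokens` an
    # iterable of (type, value, context) triples from the tokenizer.
    p = Parser(grammar)            # create instance
    p.setup()                      # prepare for parsing
    for type, value, context in tokens:
        if p.addtoken(type, value, context):
            break                  # True: end of program reached
    else:
        raise ParseError("incomplete input", None, None, None)
    return p.rootnode              # root of the (concrete) syntax tree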

View File

@@ -0,0 +1,386 @@
# Copyright 2004-2005 Elemental Security, Inc. All Rights Reserved.
# Licensed to PSF under a Contributor Agreement.
# Pgen imports
from . import grammar, token, tokenize
class PgenGrammar(grammar.Grammar):
pass
class ParserGenerator(object):
def __init__(self, filename, stream=None):
close_stream = None
if stream is None:
stream = open(filename)
close_stream = stream.close
self.filename = filename
self.stream = stream
self.generator = tokenize.generate_tokens(stream.readline)
self.gettoken() # Initialize lookahead
self.dfas, self.startsymbol = self.parse()
if close_stream is not None:
close_stream()
self.first = {} # map from symbol name to set of tokens
self.addfirstsets()
def make_grammar(self):
c = PgenGrammar()
names = list(self.dfas.keys())
names.sort()
names.remove(self.startsymbol)
names.insert(0, self.startsymbol)
for name in names:
i = 256 + len(c.symbol2number)
c.symbol2number[name] = i
c.number2symbol[i] = name
for name in names:
dfa = self.dfas[name]
states = []
for state in dfa:
arcs = []
for label, next in sorted(state.arcs.items()):
arcs.append((self.make_label(c, label), dfa.index(next)))
if state.isfinal:
arcs.append((0, dfa.index(state)))
states.append(arcs)
c.states.append(states)
c.dfas[c.symbol2number[name]] = (states, self.make_first(c, name))
c.start = c.symbol2number[self.startsymbol]
return c
def make_first(self, c, name):
rawfirst = self.first[name]
first = {}
for label in sorted(rawfirst):
ilabel = self.make_label(c, label)
##assert ilabel not in first # XXX failed on <> ... !=
first[ilabel] = 1
return first
def make_label(self, c, label):
# XXX Maybe this should be a method on a subclass of converter?
ilabel = len(c.labels)
if label[0].isalpha():
# Either a symbol name or a named token
if label in c.symbol2number:
# A symbol name (a non-terminal)
if label in c.symbol2label:
return c.symbol2label[label]
else:
c.labels.append((c.symbol2number[label], None))
c.symbol2label[label] = ilabel
return ilabel
else:
# A named token (NAME, NUMBER, STRING)
itoken = getattr(token, label, None)
assert isinstance(itoken, int), label
assert itoken in token.tok_name, label
if itoken in c.tokens:
return c.tokens[itoken]
else:
c.labels.append((itoken, None))
c.tokens[itoken] = ilabel
return ilabel
else:
# Either a keyword or an operator
assert label[0] in ('"', "'"), label
value = eval(label)
if value[0].isalpha():
# A keyword
if value in c.keywords:
return c.keywords[value]
else:
c.labels.append((token.NAME, value))
c.keywords[value] = ilabel
return ilabel
else:
# An operator (any non-numeric token)
itoken = grammar.opmap[value] # Fails if unknown token
if itoken in c.tokens:
return c.tokens[itoken]
else:
c.labels.append((itoken, None))
c.tokens[itoken] = ilabel
return ilabel
def addfirstsets(self):
names = list(self.dfas.keys())
names.sort()
for name in names:
if name not in self.first:
self.calcfirst(name)
#print name, self.first[name].keys()
def calcfirst(self, name):
dfa = self.dfas[name]
self.first[name] = None # dummy to detect left recursion
state = dfa[0]
totalset = {}
overlapcheck = {}
for label, next in state.arcs.items():
if label in self.dfas:
if label in self.first:
fset = self.first[label]
if fset is None:
raise ValueError("recursion for rule %r" % name)
else:
self.calcfirst(label)
fset = self.first[label]
totalset.update(fset)
overlapcheck[label] = fset
else:
totalset[label] = 1
overlapcheck[label] = {label: 1}
inverse = {}
for label, itsfirst in overlapcheck.items():
for symbol in itsfirst:
if symbol in inverse:
raise ValueError("rule %s is ambiguous; %s is in the"
" first sets of %s as well as %s" %
(name, symbol, label, inverse[symbol]))
inverse[symbol] = label
self.first[name] = totalset
def parse(self):
dfas = {}
startsymbol = None
# MSTART: (NEWLINE | RULE)* ENDMARKER
while self.type != token.ENDMARKER:
while self.type == token.NEWLINE:
self.gettoken()
# RULE: NAME ':' RHS NEWLINE
name = self.expect(token.NAME)
self.expect(token.OP, ":")
a, z = self.parse_rhs()
self.expect(token.NEWLINE)
#self.dump_nfa(name, a, z)
dfa = self.make_dfa(a, z)
#self.dump_dfa(name, dfa)
oldlen = len(dfa)
self.simplify_dfa(dfa)
newlen = len(dfa)
dfas[name] = dfa
#print name, oldlen, newlen
if startsymbol is None:
startsymbol = name
return dfas, startsymbol
def make_dfa(self, start, finish):
# To turn an NFA into a DFA, we define the states of the DFA
# to correspond to *sets* of states of the NFA. Then do some
# state reduction. Let's represent sets as dicts with 1 for
# values.
assert isinstance(start, NFAState)
assert isinstance(finish, NFAState)
def closure(state):
base = {}
addclosure(state, base)
return base
def addclosure(state, base):
assert isinstance(state, NFAState)
if state in base:
return
base[state] = 1
for label, next in state.arcs:
if label is None:
addclosure(next, base)
states = [DFAState(closure(start), finish)]
for state in states: # NB states grows while we're iterating
arcs = {}
for nfastate in state.nfaset:
for label, next in nfastate.arcs:
if label is not None:
addclosure(next, arcs.setdefault(label, {}))
for label, nfaset in sorted(arcs.items()):
for st in states:
if st.nfaset == nfaset:
break
else:
st = DFAState(nfaset, finish)
states.append(st)
state.addarc(st, label)
return states # List of DFAState instances; first one is start
def dump_nfa(self, name, start, finish):
print("Dump of NFA for", name)
todo = [start]
for i, state in enumerate(todo):
print(" State", i, state is finish and "(final)" or "")
for label, next in state.arcs:
if next in todo:
j = todo.index(next)
else:
j = len(todo)
todo.append(next)
if label is None:
print(" -> %d" % j)
else:
print(" %s -> %d" % (label, j))
def dump_dfa(self, name, dfa):
print("Dump of DFA for", name)
for i, state in enumerate(dfa):
print(" State", i, state.isfinal and "(final)" or "")
for label, next in sorted(state.arcs.items()):
print(" %s -> %d" % (label, dfa.index(next)))
def simplify_dfa(self, dfa):
# This is not theoretically optimal, but works well enough.
# Algorithm: repeatedly look for two states that have the same
# set of arcs (same labels pointing to the same nodes) and
# unify them, until things stop changing.
# dfa is a list of DFAState instances
changes = True
while changes:
changes = False
for i, state_i in enumerate(dfa):
for j in range(i+1, len(dfa)):
state_j = dfa[j]
if state_i == state_j:
#print " unify", i, j
del dfa[j]
for state in dfa:
state.unifystate(state_j, state_i)
changes = True
break
def parse_rhs(self):
# RHS: ALT ('|' ALT)*
a, z = self.parse_alt()
if self.value != "|":
return a, z
else:
aa = NFAState()
zz = NFAState()
aa.addarc(a)
z.addarc(zz)
while self.value == "|":
self.gettoken()
a, z = self.parse_alt()
aa.addarc(a)
z.addarc(zz)
return aa, zz
def parse_alt(self):
# ALT: ITEM+
a, b = self.parse_item()
while (self.value in ("(", "[") or
self.type in (token.NAME, token.STRING)):
c, d = self.parse_item()
b.addarc(c)
b = d
return a, b
def parse_item(self):
# ITEM: '[' RHS ']' | ATOM ['+' | '*']
if self.value == "[":
self.gettoken()
a, z = self.parse_rhs()
self.expect(token.OP, "]")
a.addarc(z)
return a, z
else:
a, z = self.parse_atom()
value = self.value
if value not in ("+", "*"):
return a, z
self.gettoken()
z.addarc(a)
if value == "+":
return a, z
else:
return a, a
def parse_atom(self):
# ATOM: '(' RHS ')' | NAME | STRING
if self.value == "(":
self.gettoken()
a, z = self.parse_rhs()
self.expect(token.OP, ")")
return a, z
elif self.type in (token.NAME, token.STRING):
a = NFAState()
z = NFAState()
a.addarc(z, self.value)
self.gettoken()
return a, z
else:
self.raise_error("expected (...) or NAME or STRING, got %s/%s",
self.type, self.value)
def expect(self, type, value=None):
if self.type != type or (value is not None and self.value != value):
self.raise_error("expected %s/%s, got %s/%s",
type, value, self.type, self.value)
value = self.value
self.gettoken()
return value
def gettoken(self):
tup = next(self.generator)
while tup[0] in (tokenize.COMMENT, tokenize.NL):
tup = next(self.generator)
self.type, self.value, self.begin, self.end, self.line = tup
#print token.tok_name[self.type], repr(self.value)
def raise_error(self, msg, *args):
if args:
try:
msg = msg % args
except:
msg = " ".join([msg] + list(map(str, args)))
raise SyntaxError(msg, (self.filename, self.end[0],
self.end[1], self.line))
class NFAState(object):
def __init__(self):
self.arcs = [] # list of (label, NFAState) pairs
def addarc(self, next, label=None):
assert label is None or isinstance(label, str)
assert isinstance(next, NFAState)
self.arcs.append((label, next))
class DFAState(object):
def __init__(self, nfaset, final):
assert isinstance(nfaset, dict)
assert isinstance(next(iter(nfaset)), NFAState)
assert isinstance(final, NFAState)
self.nfaset = nfaset
self.isfinal = final in nfaset
self.arcs = {} # map from label to DFAState
def addarc(self, next, label):
assert isinstance(label, str)
assert label not in self.arcs
assert isinstance(next, DFAState)
self.arcs[label] = next
def unifystate(self, old, new):
for label, next in self.arcs.items():
if next is old:
self.arcs[label] = new
def __eq__(self, other):
# Equality test -- ignore the nfaset instance variable
assert isinstance(other, DFAState)
if self.isfinal != other.isfinal:
return False
# Can't just return self.arcs == other.arcs, because that
# would invoke this method recursively, with cycles...
if len(self.arcs) != len(other.arcs):
return False
for label, next in self.arcs.items():
if next is not other.arcs.get(label):
return False
return True
__hash__ = None # For Py3 compatibility.
def generate_grammar(filename, stream=None):
p = ParserGenerator(filename, stream)
return p.make_grammar()
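if __name__ == "__main__":
    # A minimal sketch (added for illustration): build parse tables from a
    # grammar file and pickle them, pairing generate_grammar() with
    # Grammar.dump(). Run as a module (python -m ...) so the relative
    # imports above resolve; the default file name is a placeholder.
    import sys
    source = sys.argv[1] if len(sys.argv) > 1 else "Grammar.txt"
    g = generate_grammar(source)
    g.dump(source + ".pickle")
    print("wrote", source + ".pickle")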

View File

@@ -0,0 +1,91 @@
"""Token constants (from "token.h")."""
# Taken from Python (r53757) and modified to include some tokens
# originally monkeypatched in by pgen2.tokenize
#--start constants--
ENDMARKER = 0
NAME = 1
NUMBER = 2
STRING = 3
NEWLINE = 4
INDENT = 5
DEDENT = 6
LPAR = 7
RPAR = 8
LSQB = 9
RSQB = 10
COLON = 11
COMMA = 12
SEMI = 13
PLUS = 14
MINUS = 15
STAR = 16
SLASH = 17
VBAR = 18
AMPER = 19
LESS = 20
GREATER = 21
EQUAL = 22
DOT = 23
PERCENT = 24
BACKQUOTE = 25
LBRACE = 26
RBRACE = 27
EQEQUAL = 28
NOTEQUAL = 29
LESSEQUAL = 30
GREATEREQUAL = 31
TILDE = 32
CIRCUMFLEX = 33
LEFTSHIFT = 34
RIGHTSHIFT = 35
DOUBLESTAR = 36
PLUSEQUAL = 37
MINEQUAL = 38
STAREQUAL = 39
SLASHEQUAL = 40
PERCENTEQUAL = 41
AMPEREQUAL = 42
VBAREQUAL = 43
CIRCUMFLEXEQUAL = 44
LEFTSHIFTEQUAL = 45
RIGHTSHIFTEQUAL = 46
DOUBLESTAREQUAL = 47
DOUBLESLASH = 48
DOUBLESLASHEQUAL = 49
AT = 50
ATEQUAL = 51
OP = 52
COMMENT = 53
NL = 54
RARROW = 55
AWAIT = 56
ASYNC = 57
DOLLARNAME = 58
FSTRING_START = 59
FSTRING_MID = 60
FSTRING_END = 61
CONVERSION = 62
COLONEQUAL = 63
FSTRING_SPEC = 64
ILLEGALINDENT = 65
ERRORTOKEN = 66
N_TOKENS = 67
NT_OFFSET = 256
#--end constants--
tok_name = {}
for _name, _value in list(globals().items()):
if type(_value) is type(0):
tok_name[_value] = _name
def ISTERMINAL(x):
return x < NT_OFFSET
def ISNONTERMINAL(x):
return x >= NT_OFFSET
def ISEOF(x):
return x == ENDMARKER
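if __name__ == "__main__":
    # Tiny illustration (added; not in the original file) of the helpers
    # above: token numbers below NT_OFFSET are terminals, grammar symbol
    # numbers (>= 256) are not.
    assert ISTERMINAL(NAME) and not ISNONTERMINAL(NAME)
    assert ISNONTERMINAL(NT_OFFSET) and not ISTERMINAL(NT_OFFSET)
    assert ISEOF(ENDMARKER)
    print(tok_name[NAME], tok_name[OP])  # -> NAME OP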

View File

@@ -0,0 +1,509 @@
# Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006 Python Software Foundation.
# All rights reserved.
"""Tokenization help for Python programs.
generate_tokens(readline) is a generator that breaks a stream of
text into Python tokens. It accepts a readline-like method which is called
repeatedly to get the next line of input (or "" for EOF). It generates
5-tuples with these members:
the token type (see token.py)
the token (a string)
the starting (row, column) indices of the token (a 2-tuple of ints)
the ending (row, column) indices of the token (a 2-tuple of ints)
the original line (string)
It is designed to match the working of the Python tokenizer exactly, except
that it produces COMMENT tokens for comments and gives type OP for all
operators. (A short usage sketch appears after generate_tokens() below.)
Older entry points
tokenize_loop(readline, tokeneater)
tokenize(readline, tokeneater=printtoken)
are the same, except instead of generating tokens, tokeneater is a callback
function to which the 5 fields described above are passed as 5 arguments,
each time a new token is found."""
__author__ = 'Ka-Ping Yee <ping@lfw.org>'
__credits__ = \
'GvR, ESR, Tim Peters, Thomas Wouters, Fred Drake, Skip Montanaro'
import re
from codecs import BOM_UTF8, lookup
from blib2to3.pgen2.token import *
import sys
from . import token
__all__ = [x for x in dir(token) if x[0] != '_'] + ["tokenize",
"generate_tokens", "untokenize"]
del token
try:
bytes
except NameError:
# Support bytes type in Python <= 2.5, so 2to3 turns itself into
# valid Python 3 code.
bytes = str
def group(*choices): return '(' + '|'.join(choices) + ')'
def any(*choices): return group(*choices) + '*'
def maybe(*choices): return group(*choices) + '?'
def _combinations(*l):
return set(
x + y for x in l for y in l + ("",) if x.lower() != y.lower()
)
Whitespace = r'[ \f\t]*'
Comment = r'#[^\r\n]*'
Ignore = Whitespace + any(r'\\\r?\n' + Whitespace) + maybe(Comment)
Name = r'\w+' # this is invalid but it's fine because Name comes after Number in all groups
DollarName = r'\$\w+'
Binnumber = r'0[bB]_?[01]+(?:_[01]+)*'
Hexnumber = r'0[xX]_?[\da-fA-F]+(?:_[\da-fA-F]+)*[lL]?'
Octnumber = r'0[oO]?_?[0-7]+(?:_[0-7]+)*[lL]?'
Decnumber = group(r'[1-9]\d*(?:_\d+)*[lL]?', '0[lL]?')
Intnumber = group(Binnumber, Hexnumber, Octnumber, Decnumber)
Exponent = r'[eE][-+]?\d+(?:_\d+)*'
Pointfloat = group(r'\d+(?:_\d+)*\.(?:\d+(?:_\d+)*)?', r'\.\d+(?:_\d+)*') + maybe(Exponent)
Expfloat = r'\d+(?:_\d+)*' + Exponent
Floatnumber = group(Pointfloat, Expfloat)
Imagnumber = group(r'\d+(?:_\d+)*[jJ]', Floatnumber + r'[jJ]')
Number = group(Imagnumber, Floatnumber, Intnumber)
# Tail end of ' string.
Single = r"[^'\\]*(?:\\.[^'\\]*)*'"
# Tail end of " string.
Double = r'[^"\\]*(?:\\.[^"\\]*)*"'
# Tail end of ''' string.
Single3 = r"[^'\\]*(?:(?:\\.|'(?!''))[^'\\]*)*'''"
# Tail end of """ string.
Double3 = r'[^"\\]*(?:(?:\\.|"(?!""))[^"\\]*)*"""'
_litprefix = r"(?:[uUrRbBfF]|[rR][fFbB]|[fFbBuU][rR])?"
Triple = group(_litprefix + "'''", _litprefix + '"""')
# Single-line ' or " string.
String = group(_litprefix + r"'[^\n'\\]*(?:\\.[^\n'\\]*)*'",
_litprefix + r'"[^\n"\\]*(?:\\.[^\n"\\]*)*"')
# Because of leftmost-then-longest match semantics, be sure to put the
# longest operators first (e.g., if = came before ==, == would get
# recognized as two instances of =).
Operator = group(r"\*\*=?", r">>=?", r"<<=?", r"<>", r"!=",
r"//=?", r"->",
r"[+\-*/%&@|^=<>]=?",
r"~")
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`@]')
Funny = group(Operator, Bracket, Special)
PlainToken = group(Number, Funny, String, Name, DollarName)
Token = Ignore + PlainToken
# First (or only) line of ' or " string.
ContStr = group(_litprefix + r"'[^\n'\\]*(?:\\.[^\n'\\]*)*" +
group("'", r'\\\r?\n'),
_litprefix + r'"[^\n"\\]*(?:\\.[^\n"\\]*)*' +
group('"', r'\\\r?\n'))
PseudoExtras = group(r'\\\r?\n', Comment, Triple)
PseudoToken = Whitespace + group(PseudoExtras, Number, Funny, ContStr, Name, DollarName)
tokenprog = re.compile(Token, re.UNICODE)
pseudoprog = re.compile(PseudoToken, re.UNICODE)
single3prog = re.compile(Single3)
double3prog = re.compile(Double3)
_strprefixes = (
_combinations('r', 'R', 'f', 'F') |
_combinations('r', 'R', 'b', 'B') |
{'u', 'U', 'ur', 'uR', 'Ur', 'UR'}
)
endprogs = {"'": re.compile(Single), '"': re.compile(Double),
"'''": single3prog, '"""': double3prog,
}
endprogs.update({prefix+"'''": single3prog for prefix in _strprefixes})
endprogs.update({prefix+'"""': double3prog for prefix in _strprefixes})
endprogs.update({prefix: None for prefix in _strprefixes})
triple_quoted = (
{"'''", '"""'} |
{prefix+"'''" for prefix in _strprefixes} |
{prefix+'"""' for prefix in _strprefixes}
)
single_quoted = (
{"'", '"'} |
{prefix+"'" for prefix in _strprefixes} |
{prefix+'"' for prefix in _strprefixes}
)
tabsize = 8
class TokenError(Exception): pass
class StopTokenizing(Exception): pass
def printtoken(type, token, xxx_todo_changeme, xxx_todo_changeme1, line): # for testing
(srow, scol) = xxx_todo_changeme
(erow, ecol) = xxx_todo_changeme1
print("%d,%d-%d,%d:\t%s\t%s" % \
(srow, scol, erow, ecol, tok_name[type], repr(token)))
def tokenize(readline, tokeneater=printtoken):
"""
The tokenize() function accepts two parameters: one representing the
input stream, and one providing an output mechanism for tokenize().
The first parameter, readline, must be a callable object which provides
the same interface as the readline() method of built-in file objects.
Each call to the function should return one line of input as a string.
The second parameter, tokeneater, must also be a callable object. It is
called once for each token, with five arguments, corresponding to the
tuples generated by generate_tokens().
"""
try:
tokenize_loop(readline, tokeneater)
except StopTokenizing:
pass
# backwards compatible interface
def tokenize_loop(readline, tokeneater):
for token_info in generate_tokens(readline):
tokeneater(*token_info)
if sys.version_info > (3,):
isidentifier = str.isidentifier
else:
IDENTIFIER_RE = re.compile(r"^[^\d\W]\w*$", re.UNICODE)
def isidentifier(s):
return bool(IDENTIFIER_RE.match(s))
ASCII = re.ASCII if sys.version_info > (3,) else 0
cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)', ASCII)
blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)', ASCII)
def _get_normal_name(orig_enc):
"""Imitates get_normal_name in tokenizer.c."""
# Only care about the first 12 characters.
enc = orig_enc[:12].lower().replace("_", "-")
if enc == "utf-8" or enc.startswith("utf-8-"):
return "utf-8"
if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \
enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")):
return "iso-8859-1"
return orig_enc
def detect_encoding(readline):
"""
The detect_encoding() function is used to detect the encoding that should
be used to decode a Python source file. It requires one argument, readline,
in the same way as the tokenize() generator.
It will call readline a maximum of twice, and return the encoding used
(as a string) and a list of any lines (left as bytes) it has read
in.
It detects the encoding from the presence of a utf-8 bom or an encoding
cookie as specified in pep-0263. If both a bom and a cookie are present, but
disagree, a SyntaxError will be raised. If the encoding cookie is an invalid
charset, raise a SyntaxError. Note that if a utf-8 bom is found,
'utf-8-sig' is returned.
If no encoding is specified, then the default of 'utf-8' will be returned.
"""
bom_found = False
encoding = None
default = 'utf-8'
def read_or_stop():
try:
return readline()
except StopIteration:
return bytes()
def find_cookie(line):
try:
line_string = line.decode('ascii')
except UnicodeDecodeError:
return None
match = cookie_re.match(line_string)
if not match:
return None
encoding = _get_normal_name(match.group(1))
try:
codec = lookup(encoding)
except LookupError:
# This behaviour mimics the Python interpreter
raise SyntaxError("unknown encoding: " + encoding)
if bom_found:
if codec.name != 'utf-8':
# This behaviour mimics the Python interpreter
raise SyntaxError('encoding problem: utf-8')
encoding += '-sig'
return encoding
first = read_or_stop()
if first.startswith(BOM_UTF8):
bom_found = True
first = first[3:]
default = 'utf-8-sig'
if not first:
return default, []
encoding = find_cookie(first)
if encoding:
return encoding, [first]
if not blank_re.match(first):
return default, [first]
second = read_or_stop()
if not second:
return default, [first]
encoding = find_cookie(second)
if encoding:
return encoding, [first, second]
return default, [first, second]
def generate_tokens(readline):
"""
The generate_tokens() generator requires one argument, readline, which
must be a callable object which provides the same interface as the
readline() method of built-in file objects. Each call to the function
should return one line of input as a string. Alternately, readline
can be a callable function terminating with StopIteration:
readline = open(myfile).next # Example of alternate readline
The generator produces 5-tuples with these members: the token type; the
token string; a 2-tuple (srow, scol) of ints specifying the row and
column where the token begins in the source; a 2-tuple (erow, ecol) of
ints specifying the row and column where the token ends in the source;
and the line on which the token was found. The line passed is the
logical line; continuation lines are included.
"""
lnum = parenlev = continued = 0
numchars = '0123456789'
contstr, needcont = '', 0
contline = None
indents = [0]
# 'stashed' and 'async_*' are used for async/await parsing
stashed = None
async_def = False
async_def_indent = 0
async_def_nl = False
while 1: # loop over lines in stream
try:
line = readline()
except StopIteration:
line = ''
lnum = lnum + 1
pos, max = 0, len(line)
if contstr: # continued string
if not line:
raise TokenError("EOF in multi-line string", strstart)
endmatch = endprog.match(line)
if endmatch:
pos = end = endmatch.end(0)
yield (STRING, contstr + line[:end],
strstart, (lnum, end), contline + line)
contstr, needcont = '', 0
contline = None
elif needcont and line[-2:] != '\\\n' and line[-3:] != '\\\r\n':
yield (ERRORTOKEN, contstr + line,
strstart, (lnum, len(line)), contline)
contstr = ''
contline = None
continue
else:
contstr = contstr + line
contline = contline + line
continue
elif parenlev == 0 and not continued: # new statement
if not line: break
column = 0
while pos < max: # measure leading whitespace
if line[pos] == ' ': column = column + 1
elif line[pos] == '\t': column = (column//tabsize + 1)*tabsize
elif line[pos] == '\f': column = 0
else: break
pos = pos + 1
if pos == max: break
if stashed:
yield stashed
stashed = None
if line[pos] in '\r\n': # skip blank lines
yield (NL, line[pos:], (lnum, pos), (lnum, len(line)), line)
continue
if line[pos] == '#': # skip comments
comment_token = line[pos:].rstrip('\r\n')
nl_pos = pos + len(comment_token)
yield (COMMENT, comment_token,
(lnum, pos), (lnum, pos + len(comment_token)), line)
yield (NL, line[nl_pos:],
(lnum, nl_pos), (lnum, len(line)), line)
continue
if column > indents[-1]: # count indents
indents.append(column)
yield (INDENT, line[:pos], (lnum, 0), (lnum, pos), line)
while column < indents[-1]: # count dedents
if column not in indents:
raise IndentationError(
"unindent does not match any outer indentation level",
("<tokenize>", lnum, pos, line))
indents = indents[:-1]
if async_def and async_def_indent >= indents[-1]:
async_def = False
async_def_nl = False
async_def_indent = 0
yield (DEDENT, '', (lnum, pos), (lnum, pos), line)
if async_def and async_def_nl and async_def_indent >= indents[-1]:
async_def = False
async_def_nl = False
async_def_indent = 0
else: # continued statement
if not line:
raise TokenError("EOF in multi-line statement", (lnum, 0))
continued = 0
while pos < max:
pseudomatch = pseudoprog.match(line, pos)
if pseudomatch: # scan for tokens
start, end = pseudomatch.span(1)
spos, epos, pos = (lnum, start), (lnum, end), end
token, initial = line[start:end], line[start]
if initial in numchars or \
(initial == '.' and token != '.'): # ordinary number
yield (NUMBER, token, spos, epos, line)
elif initial in '\r\n':
newline = NEWLINE
if parenlev > 0:
newline = NL
elif async_def:
async_def_nl = True
if stashed:
yield stashed
stashed = None
yield (newline, token, spos, epos, line)
elif initial == '#':
assert not token.endswith("\n")
if stashed:
yield stashed
stashed = None
yield (COMMENT, token, spos, epos, line)
elif token in triple_quoted:
endprog = endprogs[token]
endmatch = endprog.match(line, pos)
if endmatch: # all on one line
pos = endmatch.end(0)
token = line[start:pos]
if stashed:
yield stashed
stashed = None
yield (STRING, token, spos, (lnum, pos), line)
else:
strstart = (lnum, start) # multiple lines
contstr = line[start:]
contline = line
break
elif initial in single_quoted or \
token[:2] in single_quoted or \
token[:3] in single_quoted:
if token[-1] == '\n': # continued string
strstart = (lnum, start)
endprog = (endprogs[initial] or endprogs[token[1]] or
endprogs[token[2]])
contstr, needcont = line[start:], 1
contline = line
break
else: # ordinary string
if stashed:
yield stashed
stashed = None
yield (STRING, token, spos, epos, line)
elif isidentifier(initial): # ordinary name
if token in ('async', 'await'):
if async_def:
yield (ASYNC if token == 'async' else AWAIT,
token, spos, epos, line)
continue
tok = (NAME, token, spos, epos, line)
if token == 'async' and not stashed:
stashed = tok
continue
if token in ('def', 'for'):
if (stashed
and stashed[0] == NAME
and stashed[1] == 'async'):
if token == 'def':
async_def = True
async_def_indent = indents[-1]
yield (ASYNC, stashed[1],
stashed[2], stashed[3],
stashed[4])
stashed = None
if stashed:
yield stashed
stashed = None
yield tok
elif initial == '\\': # continued stmt
# This yield is new; needed for better idempotency:
if stashed:
yield stashed
stashed = None
yield (NL, token, spos, (lnum, pos), line)
continued = 1
elif initial == '$':
if stashed:
yield stashed
stashed = None
yield (DOLLARNAME, token, spos, epos, line)
else:
if initial in '([{': parenlev = parenlev + 1
elif initial in ')]}': parenlev = parenlev - 1
if stashed:
yield stashed
stashed = None
yield (OP, token, spos, epos, line)
else:
yield (ERRORTOKEN, line[pos],
(lnum, pos), (lnum, pos+1), line)
pos = pos + 1
if stashed:
yield stashed
stashed = None
for indent in indents[1:]: # pop remaining indent levels
yield (DEDENT, '', (lnum, 0), (lnum, 0), '')
yield (ENDMARKER, '', (lnum, 0), (lnum, 0), '')
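def _example_generate_tokens():
    # A minimal sketch (added for illustration) of the interface documented
    # at the top of this file: feed generate_tokens() a readline callable
    # and it yields (type, string, (srow, scol), (erow, ecol), line)
    # 5-tuples.
    from io import StringIO
    source = StringIO("x = 1 + 2\n")
    for tok_type, tok_str, start, end, logical_line in generate_tokens(source.readline):
        print(tok_name[tok_type], repr(tok_str), start, end)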
if __name__ == '__main__': # testing
import sys
if len(sys.argv) > 1: tokenize(open(sys.argv[1]).readline)
else: tokenize(sys.stdin.readline)

View File

@@ -0,0 +1,56 @@
# Copyright 2006 Google, Inc. All Rights Reserved.
# Licensed to PSF under a Contributor Agreement.
"""Export the Python grammar and symbols."""
# Python imports
import os
# Local imports
from .pgen2 import token
from .pgen2 import driver
# The grammar file
_GRAMMAR_FILE = "Grammar.txt"
class Symbols(object):
def __init__(self, grammar):
"""Initializer.
Creates an attribute for each grammar symbol (nonterminal),
whose value is the symbol's type (an int >= 256).
"""
for name, symbol in grammar.symbol2number.items():
setattr(self, name, symbol)
def initialize(cache_dir=None):
global python2_grammar
global python2_grammar_no_print_statement
global python3_grammar
global python3_grammar_no_async
global python_symbols
python_grammar = driver.load_grammar("blib2to3", _GRAMMAR_FILE)
python_symbols = Symbols(python_grammar)
# Python 2
python2_grammar = python_grammar.copy()
del python2_grammar.keywords["async"]
del python2_grammar.keywords["await"]
# Python 2 + from __future__ import print_function
python2_grammar_no_print_statement = python2_grammar.copy()
del python2_grammar_no_print_statement.keywords["print"]
# Python 3
python3_grammar = python_grammar
del python3_grammar.keywords["print"]
del python3_grammar.keywords["exec"]
# Python 3 without async or await
python3_grammar_no_async = python3_grammar.copy()
del python3_grammar_no_async.keywords["async"]
del python3_grammar_no_async.keywords["await"]

View File

@@ -0,0 +1,29 @@
# Copyright 2006 Google, Inc. All Rights Reserved.
# Licensed to PSF under a Contributor Agreement.
"""
Python parse tree definitions.
This is a very concrete parse tree; we need to keep every token and
even the comments and whitespace between tokens.
There's also a pattern matching implementation here.
"""
__author__ = "Guido van Rossum <guido@python.org>"
import sys
from io import StringIO
HUGE = 0x7FFFFFFF # maximum repeat count, default max
_type_reprs = {}
def type_repr(type_num):
global _type_reprs
if not _type_reprs:
from .pygram import python_symbols
# printing tokens is possible but not as useful
# from .pgen2 import token // token.__dict__.items():
for name, val in python_symbols.__dict__.items():
if type(val) == int: _type_reprs[val] = name
return _type_reprs.setdefault(type_num, type_num)

View File

View File

@@ -0,0 +1,106 @@
#!/usr/bin/python3
import sys
import logging
import os
import os.path
import re
from packaging.specifiers import SpecifierSet
from packaging.version import Version
import buildtools.semmle.requirements as requirements
logging.basicConfig(level=logging.WARNING)
def pip_install(req, venv, dependencies=True, wheel=True):
venv.upgrade_pip()
tmp = requirements.save_to_file([req])
# Install the requirements using the venv python
args = [ "install", "-r", tmp]
if dependencies:
print("Installing %s with dependencies." % req)
elif wheel:
print("Installing %s without dependencies." % req)
args += [ "--no-deps"]
else:
print("Installing %s without dependencies or wheel." % req)
args += [ "--no-deps", "--no-binary", ":all:"]
print("Calling " + " ".join(args))
venv.pip(args)
os.remove(tmp)
def restrict_django(reqs):
for req in reqs:
if sys.version_info[0] < 3 and req.name.lower() == "django":
if Version("2") in req.specifier:
req.specifier = SpecifierSet("<2")
return reqs
ignored_packages = [
"pyobjc-.*",
"pypiwin32",
"frida",
"pyopenssl", # Installed by pip. Don't mess with its version.
"wxpython", # Takes forever to compile all the C code.
"cryptography", #Installed by pyOpenSSL and thus by pip. Don't mess with its version.
"psycopg2", #psycopg2 version 2.6 fails to install.
]
if os.name != "nt":
ignored_packages.append("pywin32") #Only works on Windows
ignored_package_regex = re.compile("|".join(ignored_packages))
def non_ignored(reqs):
filtered_reqs = []
for req in reqs:
if ignored_package_regex.match(req.name.lower()) is not None:
logging.info("Package %s is ignored. Skipping." % req.name)
else:
filtered_reqs += [req]
return filtered_reqs
def try_install_with_deps(req, venv):
try:
pip_install(req, venv, dependencies=True)
except Exception as ex:
logging.warn("Failed to install all dependencies for " + req.name)
logging.info(ex)
try:
pip_install(req, venv, dependencies = False)
except Exception:
pip_install(req, venv, dependencies = False, wheel = False)
def install(reqs, venv):
'''Attempt to install a sufficient and stable set of dependencies from the requirements.txt file.
First of all we 'clean' the requirements, removing contradictory version numbers.
Then we attempt to install the restricted version of each dependency and, should that fail,
we install the unrestricted version. If that also fails, the whole installation fails.
Once the immediate dependencies are installed, we then (attempt to) install their dependencies.
Returns True if installation was successful. False otherwise.
`reqs` should be a string containing all requirements separated by newlines or a list of
strings with each string being a requirement.
'''
if isinstance(reqs, str):
reqs = reqs.split("\n")
reqs = requirements.parse(reqs)
reqs = restrict_django(reqs)
reqs = non_ignored(reqs)
cleaned = requirements.clean(reqs)
restricted = requirements.restrict(reqs)
for i, req in enumerate(restricted):
try:
try_install_with_deps(req, venv)
except Exception as ex1:
try:
try_install_with_deps(cleaned[i], venv)
except Exception as ex2:
logging.error("Failed to install " + req.name)
logging.warning(ex2)
return False
logging.info("Failed to install restricted form of " + req.name)
logging.info(ex1)
return True
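if __name__ == "__main__":
    # Hypothetical driver (added for illustration): install the requirements
    # from a requirements.txt in the working directory into a fresh venv.
    # Venv and venv_path come from buildtools.install in this tree, and
    # venv_path() assumes the LGTM_WORKSPACE environment variable is set.
    from buildtools.install import Venv, venv_path
    with open("requirements.txt") as f:
        reqs = f.read()
    venv = Venv(venv_path(), 3)
    venv.create()
    sys.exit(0 if install(reqs, venv) else 1)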

View File

@@ -0,0 +1,65 @@
import sys
import os
from buildtools import version
DEFAULT_VERSION = 3
def get_relative_root(root_identifiers):
if any([os.path.exists(identifier) for identifier in root_identifiers]):
print("Source root appears to be the real root.")
return "."
found = set()
for directory in next(os.walk("."))[1]:
if any([os.path.exists(os.path.join(directory, identifier)) for identifier in root_identifiers]):
found.add(directory)
if not found:
print("No directories containing root identifiers were found. Returning working directory as root.")
return "."
if len(found) > 1:
print("Multiple possible root directories found. Returning working directory as root.")
return "."
root = found.pop()
print("'%s' appears to be the root." % root)
return root
def get_root(*root_identifiers):
return os.path.abspath(get_relative_root(root_identifiers))
REQUIREMENTS_TAG = "LGTM_PYTHON_SETUP_REQUIREMENTS_FILES"
def find_requirements(dir):
if REQUIREMENTS_TAG in os.environ:
val = os.environ[REQUIREMENTS_TAG]
if val == "false":
return []
paths = [ os.path.join(dir, line.strip()) for line in val.splitlines() ]
for p in paths:
if not os.path.exists(p):
raise IOError(p + " not found")
return paths
candidates = ["requirements.txt", "test-requirements.txt"]
return [ path if os.path.exists(path) else "" for path in [ os.path.join(dir, file) for file in candidates] ]
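# Illustration (added): with LGTM_PYTHON_SETUP_REQUIREMENTS_FILES set to
# "reqs/base.txt\nreqs/dev.txt", find_requirements() returns those two
# paths joined onto `dir` (raising IOError if either is missing); setting
# it to "false" yields []. Without the variable it falls back to
# requirements.txt and test-requirements.txt in `dir`, with "" standing in
# for any candidate that does not exist.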
def discover(default_version=DEFAULT_VERSION):
"""Discover things about the Python checkout and return a version, root, requirement-files triple."""
root = get_root("requirements.txt", "setup.py")
v = version.best_version(root, default_version)
# Unify the requirements or just get path to requirements...
requirement_files = find_requirements(root)
return v, root, requirement_files
def get_version(default_version=DEFAULT_VERSION):
root = get_root("requirements.txt", "setup.py")
return version.best_version(root, default_version)
def main():
if len(sys.argv) > 1:
print(discover(int(sys.argv[1])))
else:
print(discover())
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,16 @@
import os
import traceback
import re
SCRIPTDIR = os.path.split(os.path.dirname(__file__))[1]
def print_exception_indented(opt=None):
exc_text = traceback.format_exc()
for line in exc_text.splitlines():
# remove path information that might be sensitive
# for example, in the .pyc files for Python 2, a traceback would contain
# /home/rasmus/code/target/thirdparty/python/build/extractor-python/buildtools/install.py
line = re.sub(r'File \".*' + SCRIPTDIR + r'(.*)\",', r'File <'+ SCRIPTDIR + r'\1>', line)
print(' ' + line)
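if __name__ == "__main__":
    # Tiny illustration (added): print a caught traceback indented, with
    # extractor paths redacted by the regex above.
    try:
        raise ValueError("boom")
    except ValueError:
        print_exception_indented()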

View File

@@ -0,0 +1,429 @@
import sys
import os
import subprocess
import csv
if sys.version_info < (3,):
from urlparse import urlparse
from urllib import url2pathname
else:
from urllib.parse import urlparse
from urllib.request import url2pathname
from buildtools import discover
from buildtools import install
from buildtools.version import executable, extractor_executable
INCLUDE_TAG = "LGTM_INDEX_INCLUDE"
EXCLUDE_TAG = "LGTM_INDEX_EXCLUDE"
FILTER_TAG = "LGTM_INDEX_FILTERS"
PATH_TAG = "LGTM_INDEX_IMPORT_PATH"
REPO_FOLDERS_TAG = "LGTM_REPOSITORY_FOLDERS_CSV"
REPO_EXCLUDE_KINDS = "metadata", "external"
# These are the levels that the CodeQL CLI supports, in order of increasing verbosity.
CLI_LOGGING_LEVELS = ['off', 'errors', 'warnings', 'progress', 'progress+', 'progress++', 'progress+++']
# These are the verbosity levels used internally in the extractor. The indices of these levels
# should match up with the corresponding constants in the semmle.logging module.
EXTRACTOR_LOGGING_LEVELS = ['off', 'errors', 'warnings', 'info', 'debug', 'trace']
def trap_cache():
return os.path.join(os.environ["LGTM_WORKSPACE"], "trap_cache")
def split_into_options(lines, opt):
opts = []
for line in lines.split("\n"):
line = line.strip()
if line:
opts.append(opt)
opts.append(line)
return opts
def get_include_options():
if INCLUDE_TAG in os.environ:
return split_into_options(os.environ[INCLUDE_TAG], "-R")
else:
src = os.environ["LGTM_SRC"]
return [ "-R", src]
def get_exclude_options():
options = []
if EXCLUDE_TAG in os.environ:
options.extend(split_into_options(os.environ[EXCLUDE_TAG], "-Y"))
if REPO_FOLDERS_TAG not in os.environ:
return options
with open(os.environ[REPO_FOLDERS_TAG]) as csv_file:
csv_reader = csv.reader(csv_file)
next(csv_reader) # discard header
for kind, url in csv_reader:
if kind not in REPO_EXCLUDE_KINDS:
continue
try:
path = url2pathname(urlparse(url).path)
except:
print("Unable to parse '" + url + "' as file url.")
else:
options.append("-Y")
options.append(path)
return options
def get_filter_options():
if FILTER_TAG in os.environ:
return split_into_options(os.environ[FILTER_TAG], "--filter")
else:
return []
def get_path_options(version):
# We want to stop extracting libraries, and only extract the code that is in the
# repo. While in the transition period for stopping to install dependencies in the
# codeql-action, we will need to be able to support both old and new behavior.
#
# Like PYTHONUNBUFFERED for Python, we treat any non-empty string as meaning the
# flag is enabled.
# https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUNBUFFERED
if os.environ.get("CODEQL_EXTRACTOR_PYTHON_DISABLE_LIBRARY_EXTRACTION"):
return []
# Not extracting dependencies will be default in CodeQL CLI release 2.16.0. Until
# 2.17.0, we provide an escape hatch to get the old behavior.
force_enable_envvar_name = "CODEQL_EXTRACTOR_PYTHON_FORCE_ENABLE_LIBRARY_EXTRACTION_UNTIL_2_17_0"
if os.environ.get(force_enable_envvar_name):
print("WARNING: We plan to remove the availability of the {} option in CodeQL CLI release 2.17.0 and beyond. Please let us know by submitting an issue to https://github.com/github/codeql why you needed to re-enable dependency extraction.".format(force_enable_envvar_name))
path_option = [ "-p", install.get_library(version)]
if PATH_TAG in os.environ:
path_option = split_into_options(os.environ[PATH_TAG], "-p") + path_option
return path_option
else:
print("INFO: The Python extractor has recently (from 2.16.0 CodeQL CLI release) stopped extracting dependencies by default, and therefore stopped analyzing the source code of dependencies by default. We plan to remove this entirely in CodeQL CLI release 2.17.0. If you encounter problems, please let us know by submitting an issue to https://github.com/github/codeql, so we can consider adjusting our plans. It is possible to re-enable dependency extraction by exporting '{}=1'.".format(force_enable_envvar_name))
return []
def get_stdlib():
return os.path.dirname(os.__file__)
def exclude_pip_21_3_build_dir_options():
"""
Handle build/ dir from `pip install .` (new in pip 21.3)
Starting with pip 21.3, in-tree builds are now the default (see
https://pip.pypa.io/en/stable/news/#v21-3). This means that pip commands that build
the package (like `pip install .` or `pip wheel .`), will leave a copy of all the
package source code in `build/lib/<package-name>/`.
If that is done before invoking the extractor, we will end up extracting that copy
as well, which is very bad (especially for points-to performance). So with this
function we try to find such folders, so they can be excluded from extraction.
The only reliable sign is that inside the `build` folder, there must be a `lib`
subfolder, and there must not be any ordinary files.
When the `wheel` package is installed there will also be a `bdist.linux-x86_64`
subfolder. Although most people have the `wheel` package installed, it's not
required, so we don't use that in the logic.
"""
# As a failsafe, we include logic to disable this functionality based on an
# environment variable.
#
# Like PYTHONUNBUFFERED for Python, we treat any non-empty string as meaning the
# flag is enabled.
# https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUNBUFFERED
if os.environ.get("CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_PIP_BUILD_DIR_EXCLUDE"):
return []
include_dirs = set(get_include_options()[1::2])
# For the purpose of exclusion, we normalize paths to their absolute path, just like
# we do in the actual traverser.
exclude_dirs = set(os.path.abspath(path) for path in get_exclude_options()[1::2])
to_exclude = list()
def walk_dir(dirpath):
if os.path.abspath(dirpath) in exclude_dirs:
return
contents = os.listdir(dirpath)
paths = [os.path.join(dirpath, c) for c in contents]
dirs = [path for path in paths if os.path.isdir(path)]
dirnames = [os.path.basename(path) for path in dirs]
# Allow Python package such as `mypkg.build.lib`, so if we see an `__init__.py`
# file in the current dir don't walk the tree further.
if "__init__.py" in contents:
return
# note that we don't require that there be a `setup.py` present beside the
# `build/` dir, since that is not required to build a package -- see
# https://pgjones.dev/blog/packaging-without-setup-py-2020
#
# Although I didn't observe `pip install .` with a package that uses `poetry` as
# the build-system leave behind a `build/` directory, that doesn't mean it
# couldn't happen.
if os.path.basename(dirpath) == "build" and "lib" in dirnames and dirs == paths:
to_exclude.append(dirpath)
return # no need to walk the sub directories
for dir in dirs:
# We ignore symlinks, as these can present infinite loops, and any folders
# they can point to will be handled on their own anyway.
if not os.path.islink(dir):
walk_dir(dir)
for top in include_dirs:
walk_dir(top)
options = []
if to_exclude:
print(
"Excluding the following directories from extraction, since they look like "
"in-tree build directories generated by pip: {}".format(to_exclude)
)
print(
"You can disable this behavior by setting the environment variable "
"CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_PIP_BUILD_DIR_EXCLUDE=1"
)
for dirpath in to_exclude:
options.append("-Y") # `-Y` is the same as `--exclude-file`
options.append(dirpath)
return options
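# Illustration (added): a layout the walk above flags. `build/` contains
# only directories, one of which is `lib/`, so "-Y <path>/build" is emitted:
#
#   myproject/
#       build/
#           lib/mypkg/...
#           bdist.linux-x86_64/     (present when the `wheel` package is installed)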
def exclude_venvs_options():
"""
If there are virtual environments (venv) present within the directory that is being
extracted, we don't want to recurse into all of these and extract all the Python
source code.
This function tries to find such venvs, and produce the right options to ignore
them.
"""
# As a failsafe, we include logic to disable this functionality based on an
# environment variable.
#
# Like PYTHONUNBUFFERED for Python, we treat any non-empty string as meaning the
# flag is enabled.
# https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUNBUFFERED
if os.environ.get("CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_VENV_EXCLUDE"):
return []
include_dirs = set(get_include_options()[1::2])
# For the purpose of exclusion, we normalize paths to their absolute path, just like
# we do in the actual traverser.
exclude_dirs = set(os.path.abspath(path) for path in get_exclude_options()[1::2])
to_exclude = []
def walk_dir(dirpath):
if os.path.abspath(dirpath) in exclude_dirs:
return
paths = [os.path.join(dirpath, c) for c in os.listdir(dirpath)]
dirs = [path for path in paths if os.path.isdir(path)]
dirnames = [os.path.basename(path) for path in dirs]
# we look for `<venv>/Lib/site-packages` (Windows) or
# `<venv>/lib/python*/site-packages` (unix) without requiring any other files to
# be present.
#
# Initially we had implemented some more advanced logic to only ignore venvs
# that had a `pyvenv.cfg` or a suitable activate script. But reality turned out
# to be less reliable, so now we just ignore any venv that has a proper
# `site-packages` as a subfolder.
#
# This logic for detecting a virtual environment was based on the CPython implementation, see:
# - https://github.com/python/cpython/blob/4575c01b750cd26377e803247c38d65dad15e26a/Lib/venv/__init__.py#L122-L131
# - https://github.com/python/cpython/blob/4575c01b750cd26377e803247c38d65dad15e26a/Lib/venv/__init__.py#L170
#
# Some interesting examples:
# - windows without `activate`: https://github.com/NTUST/106-team4/tree/7f902fec29f68ca44d4f4385f2d7714c2078c937/finalPage/finalVENV/Scripts
# - windows with `activate`: https://github.com/Lynchie/KCM/tree/ea9eeed07e0c9eec41f9fc7480ce90390ee09876/VENV/Scripts
# - without `pyvenv.cfg`: https://github.com/FiacreT/M-moire/tree/4089755191ffc848614247e98bbb641c1933450d/osintplatform/testNeo/venv
# - without `pyvenv.cfg`: https://github.com/Lynchie/KCM/tree/ea9eeed07e0c9eec41f9fc7480ce90390ee09876/VENV
# - without `pyvenv.cfg`: https://github.com/mignonjia/NetworkingProject/tree/a89fe12ffbf384095766aadfe6454a4c0062d1e7/crud/venv
#
# I'm quite sure I saw some project on LGTM that had neither `pyvenv.cfg` nor an activate script, but I could not find the reference again.
if "Lib" in dirnames:
has_site_packages_folder = os.path.exists(os.path.join(dirpath, "Lib", "site-packages"))
elif "lib" in dirnames:
lib_path = os.path.join(dirpath, "lib")
python_folders = [dirname for dirname in os.listdir(lib_path) if dirname.startswith("python")]
has_site_packages_folder = bool(python_folders) and any(
os.path.exists(os.path.join(dirpath, "lib", python_folder, "site-packages")) for python_folder in python_folders
)
else:
has_site_packages_folder = False
if has_site_packages_folder:
to_exclude.append(dirpath)
return # no need to walk the sub directories
for dir in dirs:
# We ignore symlinks, as these can present infinite loops, and any folders
# they can point to will be handled on their own anyway.
if not os.path.islink(dir):
walk_dir(dir)
for top in include_dirs:
walk_dir(top)
options = []
if to_exclude:
print(
"Excluding the following directories from extraction, since they look like "
"virtual environments: {}".format(to_exclude)
)
print(
"You can disable this behavior by setting the environment variable "
"CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_VENV_EXCLUDE=1"
)
for dirpath in to_exclude:
options.append("-Y") # `-Y` is the same as `--exclude-file`
options.append(dirpath)
return options
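# Illustration (added): directory shapes recognised as virtual environments
# by the walk above and excluded via "-Y":
#
#   venv/Lib/site-packages/...               (Windows)
#   venv/lib/python3.11/site-packages/...    (unix; any "python*" folder)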
def get_extractor_logging_level(s: str):
"""Returns a integer value corresponding to the logging level specified by the string s, or `None` if s is invalid."""
try:
return EXTRACTOR_LOGGING_LEVELS.index(s)
except ValueError:
return None
def get_cli_logging_level(s: str):
"""Returns a integer value corresponding to the logging level specified by the string s, or `None` if s is invalid."""
try:
return CLI_LOGGING_LEVELS.index(s)
except ValueError:
return None
def get_logging_options():
# First look for the extractor-specific option
verbosity_level = os.environ.get("CODEQL_EXTRACTOR_PYTHON_OPTION_LOGGING_VERBOSITY", None)
if verbosity_level is not None:
level = get_extractor_logging_level(verbosity_level)
if level is None:
level = get_cli_logging_level(verbosity_level)
if level is None:
# This is unlikely to be reached in practice, as the level should be validated by the CLI.
raise ValueError(
"Invalid verbosity level: {}. Valid values are: {}".format(
verbosity_level, ", ".join(set(EXTRACTOR_LOGGING_LEVELS + CLI_LOGGING_LEVELS))
)
)
return ["--verbosity", str(level)]
# Then look for the CLI-wide option
cli_verbosity_level = os.environ.get("CODEQL_VERBOSITY", None)
if cli_verbosity_level is not None:
level = get_cli_logging_level(cli_verbosity_level)
if level is None:
# This is unlikely to be reached in practice, as the level should be validated by the CLI.
raise ValueError(
"Invalid verbosity level: {}. Valid values are: {}".format(
cli_verbosity_level, ", ".join(CLI_LOGGING_LEVELS)
)
)
return ["--verbosity", str(level)]
# Default behaviour: turn on verbose mode:
return ["-v"]
def extractor_options(version):
options = []
options += get_logging_options()
# use maximum number of processes
options += ["-z", "all"]
# cache trap files
options += ["-c", trap_cache()]
options += get_path_options(version)
options += get_include_options()
options += get_exclude_options()
options += get_filter_options()
options += exclude_pip_21_3_build_dir_options()
options += exclude_venvs_options()
return options
def site_flag(version):
#
# Disabling site with -S (which we do by default) has been observed to cause
# problems at some customers. We're not entirely sure enabling this by default is
# going to be 100% ok, so for now we just want to disable this flag if running with
# it turns out to be a problem (which we check for).
#
# see https://docs.python.org/3/library/site.html
#
# I don't see any reason for running with -S when invoking the tracer in this
# scenario. If we were using the executable from a virtual environment after
# installing PyPI packages, running without -S would allow one of those packages to
# influence the behavior of the extractor, as was the problem for CVE-2020-5252
# (described in https://github.com/akoumjian/python-safety-vuln). But since this is
# not the case, I don't think there is any advantage to running with -S.
# Although we have an automatic check that should detect when we should not be
# running with -S, we're not 100% certain that it is impossible to create _other_
# strange Python installations where `gzip` is available while the rest of the
# standard library is not. Therefore we keep this environment variable, just to
# make sure there is an easy fall-back in those cases.
#
# Like PYTHONUNBUFFERED for Python, we treat any non-empty string as meaning the
# flag is enabled.
# https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUNBUFFERED
if os.environ.get("CODEQL_EXTRACTOR_PYTHON_ENABLE_SITE"):
return []
try:
# In the cases where customers had problems, `gzip` was the first module
# encountered that could not be loaded, so that's the one we check for. Note
# that this has nothing to do with it being problematic to add GZIP support to
# Python :)
args = executable(version) + ["-S", "-c", "import gzip"]
subprocess.check_call(args)
return ["-S"]
except Exception:  # includes subprocess.CalledProcessError
print("Running without -S")
return []
def get_analysis_version(major_version):
"""Gets the version of Python that we _analyze_ the code as being written for.
The return value is a string, e.g. "3.11" or "2.7.18". Populating the `major_version`,
`minor_version` and `micro_version` predicates is done inside the CodeQL libraries.
"""
# If the version is already specified, simply reuse it.
if "CODEQL_EXTRACTOR_PYTHON_ANALYSIS_VERSION" in os.environ:
return os.environ["CODEQL_EXTRACTOR_PYTHON_ANALYSIS_VERSION"]
elif major_version == 2:
return "2.7.18" # Last officially supported version
else:
return "3.12" # This should always be the latest supported version
def main():
version = discover.get_version()
tracer = os.path.join(os.environ["SEMMLE_DIST"], "tools", "python_tracer.py")
args = extractor_executable() + site_flag(3) + [tracer] + extractor_options(version)
print("Calling " + " ".join(args))
sys.stdout.flush()
sys.stderr.flush()
env = os.environ.copy()
env["CODEQL_EXTRACTOR_PYTHON_ANALYSIS_VERSION"] = get_analysis_version(version)
subprocess.check_call(args, env=env)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,123 @@
import sys
import os
import subprocess
import re
import ast
import tempfile
from buildtools import unify_requirements
from buildtools.version import executable
from buildtools.version import WIN
from buildtools.helper import print_exception_indented
def call(args, cwd=None):
print("Calling " + " ".join(args))
sys.stdout.flush()
sys.stderr.flush()
subprocess.check_call(args, cwd=cwd)
class Venv(object):
def __init__(self, path, version):
self.environ = {}
self.path = path
exe_ext = [ "Scripts", "python.exe" ] if WIN else [ "bin", "python" ]
self.venv_executable = os.path.join(self.path, *exe_ext)
self._lib = None
self.pip_upgraded = False
self.empty_folder = tempfile.mkdtemp(prefix="empty", dir=os.environ["LGTM_WORKSPACE"])
self.version = version
def create(self):
if self.version < 3:
venv = ["-m", "virtualenv", "--never-download"]
else:
venv = ["-m", "venv"]
call(executable(self.version) + venv + [self.path], cwd=self.empty_folder)
def upgrade_pip(self):
'Make sure that pip has been upgraded to the latest version'
if self.pip_upgraded:
return
self.pip([ "install", "--upgrade", "pip"])
self.pip_upgraded = True
def pip(self, args):
call([self.venv_executable, "-m", "pip"] + args, cwd=self.empty_folder)
@property
def lib(self):
if self._lib is None:
try:
tools = os.path.join(os.environ['SEMMLE_DIST'], "tools")
get_venv_lib = os.path.join(tools, "get_venv_lib.py")
if os.path.exists(self.venv_executable):
python_executable = [self.venv_executable]
else:
python_executable = executable(self.version)
args = python_executable + [get_venv_lib]
print("Calling " + " ".join(args))
sys.stdout.flush()
sys.stderr.flush()
self._lib = subprocess.check_output(args)
if sys.version_info >= (3,):
self._lib = str(self._lib, sys.getfilesystemencoding())
self._lib = self._lib.rstrip("\r\n")
except Exception:
lib_ext = ["Lib"] if WIN else [ "lib" ]
self._lib = os.path.join(self.path, *lib_ext)
print('Error trying to run get_venv_lib (this is Python {})'.format(sys.version[:5]))
print_exception_indented()
return self._lib
def venv_path():
return os.path.join(os.environ["LGTM_WORKSPACE"], "venv")
def system_packages(version):
output = subprocess.check_output(executable(version) + [ "-c", "import sys; print(sys.path)"])
if sys.version_info >= (3,):
output = str(output, sys.getfilesystemencoding())
paths = ast.literal_eval(output.strip())
return [ path for path in paths if ("dist-packages" in path or "site-packages" in path) ]
REQUIREMENTS_TAG = "LGTM_PYTHON_SETUP_REQUIREMENTS"
EXCLUDE_REQUIREMENTS_TAG = "LGTM_PYTHON_SETUP_EXCLUDE_REQUIREMENTS"
def main(version, root, requirement_files):
# We import `auto_install` here, as it has a dependency on the `packaging`
# module. For the CodeQL CLI (where we do not install any packages) we never
# run the `main` function, and so there is no need to always import this
# dependency.
from buildtools import auto_install
print("version, root, requirement_files", version, root, requirement_files)
venv = Venv(venv_path(), version)
venv.create()
if REQUIREMENTS_TAG in os.environ:
if not auto_install.install(os.environ[REQUIREMENTS_TAG], venv):
sys.exit(1)
requirements_from_setup = os.path.join(os.environ["LGTM_WORKSPACE"], "setup_requirements.txt")
args = [ venv.venv_executable, os.path.join(os.environ["SEMMLE_DIST"], "tools", "convert_setup.py"), root, requirements_from_setup] + system_packages(version)
print("Calling " + " ".join(args))
sys.stdout.flush()
sys.stderr.flush()
#We don't care if this fails, we only care if `requirements_from_setup` was created.
subprocess.call(args)
if os.path.exists(requirements_from_setup):
requirement_files = [ requirements_from_setup ] + requirement_files[1:]
print("Requirement files: " + str(requirement_files))
requirements = unify_requirements.gather(requirement_files)
if EXCLUDE_REQUIREMENTS_TAG in os.environ:
excludes = os.environ[EXCLUDE_REQUIREMENTS_TAG].splitlines()
print("Excluding ", excludes)
regex = re.compile("|".join(exclude + r'\b' for exclude in excludes))
requirements = [ req for req in requirements if not regex.match(req) ]
err = 0 if auto_install.install(requirements, venv) else 1
sys.exit(err)
def get_library(version):
return Venv(venv_path(), version).lib
if __name__ == "__main__":
version, root, requirement_files = sys.argv[1], sys.argv[2], sys.argv[3:]
version = int(version)
main(version, root, requirement_files)

View File
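As a side note on the exclusion handling in `main` above, here is a small, self-contained sketch (with made-up requirement names) of how the `LGTM_PYTHON_SETUP_EXCLUDE_REQUIREMENTS` regex behaves; the `\b` boundary means an exclude also matches version-pinned forms:

```python
# Sketch with assumed inputs: each exclude is matched as a prefix up to a word
# boundary, so "flask" drops "flask" and "flask==2.0" but not "flasktools".
import re

excludes = ["flask", "six"]
regex = re.compile("|".join(exclude + r"\b" for exclude in excludes))

requirements = ["flask==2.0", "flasktools", "six", "requests"]
print([req for req in requirements if not regex.match(req)])
# ['flasktools', 'requests']
```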

@@ -0,0 +1,136 @@
import copy
import tempfile
import re
from packaging.requirements import Requirement
from packaging.version import Version
from packaging.specifiers import SpecifierSet
IGNORED_REQUIREMENTS = re.compile("^(-e\\s+)?(git|svn|hg)(?:\\+.*)?://.*$")
def parse(lines):
'Parse a list of requirement strings into a list of `Requirement`s'
res = []
#Strip comments, skip empty lines and VCS URLs, then parse each requirement
for line in lines:
if '#' in line:
line, _ = line.split('#', 1)
if not line:
continue
if IGNORED_REQUIREMENTS.match(line):
continue
try:
req = Requirement(line)
except Exception:
print("Cannot parse requirements line '%s'" % line)
else:
res.append(req)
return res
def parse_file(filename):
with open(filename, 'r') as fd:
return parse(fd.read().splitlines())
def save_to_file(reqs):
'Takes a list of requirements, saves them to a temporary file and returns the filename'
with tempfile.NamedTemporaryFile(prefix="semmle-requirements", suffix=".txt", mode="w", delete=False) as fd:
for req in reqs:
if req.url is None:
fd.write(str(req))
else:
fd.write(req.url)
fd.write("\n")
return fd.name
def clean(reqs):
'Look for self-contradictory specifier groups and remove the necessary specifier parts to make them consistent'
result = []
for req in reqs:
specs = req.specifier
cleaned_specs = _clean_specs(specs)
req.specifier = cleaned_specs
result.append(Requirement(str(req)))
req.specifier = specs
return result
def _clean_specs(specs):
ok = SpecifierSet()
#Choose a deterministic order such that >= comes before <=.
for spec in sorted(iter(specs), key=str, reverse=True):
for ok_spec in ok:
if not _compatible_specifier(ok_spec, spec):
break
else:
ok &= SpecifierSet(str(spec))
return ok
def restrict(reqs):
'''Restrict versions to "compatible" versions.
For example restrict >=1.2 to all versions >= 1.2 that have 1 as the major version number.
>=N... becomes >=N...,==N.* and >N... requirements becomes >N..,==N.*
'''
#First of all clean the requirements
reqs = clean(reqs)
result = []
for req in reqs:
specs = req.specifier
req.specifier = _restrict_specs(specs)
result.append(Requirement(str(req)))
req.specifier = specs
return result
def _restrict_specs(specs):
restricted = copy.deepcopy(specs)
#Iteration order doesn't really matter here so we choose the
#same as for clean, just to be consistent
for spec in sorted(iter(specs), key=str, reverse=True):
if spec.operator in ('>', '>='):
base_version = spec.version.split(".", 1)[0]
restricted &= SpecifierSet('==' + base_version + '.*')
return restricted
def _compatible_specifier(s1, s2):
overlaps = 0
overlaps += _min_version(s1) in s2
overlaps += _max_version(s1) in s2
overlaps += _min_version(s2) in s1
overlaps += _max_version(s2) in s1
if overlaps > 1:
return True
if overlaps == 1:
#One overlap -- Generally compatible, but not for <x, >=x
return not _is_strict(s1) and not _is_strict(s2)
#overlaps == 0:
return False
MIN_VERSION = Version('0.0a0')
MAX_VERSION = Version('1000000')
def _min_version(s):
if s.operator in ('>', '>='):
return s.version
elif s.operator in ('<', '<=', '!='):
return MIN_VERSION
elif s.operator == '==':
v = s.version
if v[-1] == '*':
return v[:-1] + '0'
else:
return s.version
else:
# '~='
return s.version
def _max_version(s):
if s.operator in ('<', '<='):
return s.version
elif s.operator in ('>', '>=', '!='):
return MAX_VERSION
elif s.operator in ('~=', '=='):
v = s.version
if v[-1] == '*' or s.operator == '~=':
return v[:-1] + '1000000'
else:
return s.version
def _is_strict(s):
return s.operator in ('>', '<')

View File
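The effect of `restrict` above can be checked directly with `packaging`; a small worked example (with assumed version numbers):

```python
# ">=1.2" is narrowed to ">=1.2,==1.*": later 1.x releases stay in range, but a
# major-version bump falls out, as described in restrict()'s docstring.
from packaging.specifiers import SpecifierSet

restricted = SpecifierSet(">=1.2") & SpecifierSet("==1.*")
print("1.5" in restricted)  # True  -- a compatible 1.x release
print("2.0" in restricted)  # False -- excluded by ==1.*
print("1.1" in restricted)  # False -- below the lower bound
```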

@@ -0,0 +1,14 @@
# this is a setup file for `tox`, which allows us to run tests locally against multiple
# Python versions. Simply run `tox` in the directory of this file!
#
# install tox with `pipx install tox` or whatever your preferred way is :)
[tox]
envlist = py27,py3
skipsdist=True
[testenv]
# install <deps> in the virtualenv where commands will be executed
deps = pytest
commands =
pytest

View File

@@ -0,0 +1,36 @@
#!/usr/bin/env python
import os
import re
def get_requirements(file_path):
if not file_path:
return []
with open(file_path, "r") as requirements_file:
lines = requirements_file.read().splitlines()
for line_no, line in enumerate(lines):
match = re.search("^\\s*-r\\s+([^#]+)", line)
if match:
include_file_path = os.path.join(os.path.dirname(file_path), match.group(1).strip())
include_requirements = get_requirements(include_file_path)
lines[line_no:line_no+1] = include_requirements
return lines
def deduplicate(requirements):
result = []
seen = set()
for req in requirements:
if req in seen:
continue
result.append(req)
seen.add(req)
return result
def gather(requirement_files):
requirements = []
for file in requirement_files:
requirements += get_requirements(file)
requirements = deduplicate(requirements)
print("Requirements:")
for r in requirements:
print(" {}".format(r))
return requirements

View File
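To illustrate the `-r` include handling in `get_requirements` above, here is an in-memory sketch (with made-up requirements) of the two steps performed by `gather`: splice included lines in place of the `-r` line, then drop duplicates while keeping order:

```python
# In-memory sketch: the same list-slice splice used by get_requirements(),
# followed by deduplicate()'s first-occurrence-wins filtering.
lines = ["-r base.txt", "flask", "requests"]
included = ["requests"]  # pretend this is the content of base.txt

lines[0:1] = included    # splice the include in place of the "-r" line
print(lines)             # ['requests', 'flask', 'requests']

seen, result = set(), []
for req in lines:        # deduplicate(): keep only the first of each
    if req not in seen:
        result.append(req)
        seen.add(req)
print(result)            # ['requests', 'flask']
```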

@@ -0,0 +1,223 @@
import sys
import os
import subprocess
import tokenize
import re
from buildtools.helper import print_exception_indented
TROVE = re.compile(r"Programming Language\s+::\s+Python\s+::\s+(\d)")
if sys.version_info > (3,):
import collections.abc as collections
file_open = tokenize.open
else:
import collections
file_open = open
WIN = sys.platform == "win32"
if WIN:
# installing `py` launcher is optional when installing Python on windows, so it's
# possible that the user did not install it, see
# https://github.com/github/codeql-cli-binaries/issues/125#issuecomment-1157429430
# so we check whether it has been installed. Newer versions have a `--list` option,
# but that has only been mentioned in the docs since 3.9, so to not risk it not
# working on potential older versions, we'll just use `py --version` which forwards
# the `--version` argument to the default python executable.
try:
subprocess.check_call(["py", "--version"])
except Exception:  # also covers subprocess.CalledProcessError
sys.stderr.write("The `py` launcher is required for CodeQL to work on Windows.\n")
sys.stderr.write("Please include it when installing Python for Windows.\n")
sys.stderr.write("See https://docs.python.org/3/using/windows.html#python-launcher-for-windows\n")
sys.stderr.flush()
sys.exit(4) # 4 was a unique exit code at the time of writing
AVAILABLE_VERSIONS = []
def set_available_versions():
"""Sets the global `AVAILABLE_VERSIONS` to a list of available (major) Python versions."""
global AVAILABLE_VERSIONS
if AVAILABLE_VERSIONS:
return # already set
for version in [3, 2]:
try:
subprocess.check_call(" ".join(executable_name(version) + ["-c", "pass"]), shell=True)
AVAILABLE_VERSIONS.append(version)
except Exception:
pass # If not available, we simply don't add it to the list
if not AVAILABLE_VERSIONS:
# If neither 'python3' nor 'python2' is available, we'll just try 'python' and hope for the best
AVAILABLE_VERSIONS = ['']
def executable(version):
"""Returns the executable to use for the given Python version."""
global AVAILABLE_VERSIONS
set_available_versions()
if version not in AVAILABLE_VERSIONS:
available_version = AVAILABLE_VERSIONS[0]
print("Wanted to run Python %s, but it is not available. Using Python %s instead" % (version, available_version))
version = available_version
return executable_name(version)
def executable_name(version):
if WIN:
return ["py", "-%s" % version]
else:
return ["python%s" % version]
PREFERRED_PYTHON_VERSION = None
def extractor_executable():
'''
Returns the executable to use for the extractor.
If a Python executable name is specified using the extractor option, returns that name.
In the absence of a user-specified executable name, returns the executable name for
Python 3 if it is available, and Python 2 if not.
'''
executable_name = os.environ.get("CODEQL_EXTRACTOR_PYTHON_OPTION_PYTHON_EXECUTABLE_NAME", None)
if executable_name is not None:
print("Using Python executable name provided via the python_executable_name extractor option: {}"
.format(executable_name)
)
return [executable_name]
# Call machine_version() to ensure we've set PREFERRED_PYTHON_VERSION
if PREFERRED_PYTHON_VERSION is None:
machine_version()
return executable(PREFERRED_PYTHON_VERSION)
def machine_version():
"""If only Python 2 or Python 3 is installed, will return that version"""
global PREFERRED_PYTHON_VERSION
print("Trying to guess Python version based on installed versions")
if sys.version_info > (3,):
this, other = 3, 2
else:
this, other = 2, 3
try:
exe = executable(other)
# We need `shell=True` here in order for the test framework to function correctly. For
# whatever reason, the `PATH` variable is ignored if `shell=False`.
# Also, this in turn forces us to give the whole command as a string, rather than a list.
# Otherwise, the effect is that the Python interpreter is invoked _as a REPL_, rather than
# with the given piece of code.
subprocess.check_call(" ".join(exe + [ "-c", "pass" ]), shell=True)
print("This script is running Python {}, but Python {} is also available (as '{}')"
.format(this, other, ' '.join(exe))
)
# If both versions are available, our preferred version is Python 3
PREFERRED_PYTHON_VERSION = 3
return None
except Exception:
print("Only Python {} installed -- will use that version".format(this))
PREFERRED_PYTHON_VERSION = this
return this
def trove_version(root):
print("Trying to guess Python version based on Trove classifiers in setup.py")
try:
full_path = os.path.join(root, "setup.py")
if not os.path.exists(full_path):
print("Did not find setup.py (expected it to be at {})".format(full_path))
return None
versions = set()
with file_open(full_path) as fd:
contents = fd.read()
for match in TROVE.finditer(contents):
versions.add(int(match.group(1)))
if 2 in versions and 3 in versions:
print("Found Trove classifiers for both Python 2 and Python 3 in setup.py -- will use Python 3")
return 3
elif len(versions) == 1:
result = versions.pop()
print("Found Trove classifier for Python {} in setup.py -- will use that version".format(result))
return result
else:
print("Found no Trove classifiers for Python in setup.py")
except Exception:
print("Skipping due to exception:")
print_exception_indented()
return None
def wrap_with_list(x):
if isinstance(x, collections.Iterable) and not isinstance(x, str):
return x
else:
return [x]
def travis_version(root):
print("Trying to guess Python version based on travis file")
try:
full_paths = [os.path.join(root, filename) for filename in [".travis.yml", "travis.yml"]]
travis_file_paths = [path for path in full_paths if os.path.exists(path)]
if not travis_file_paths:
print("Did not find any travis files (expected them at either {})".format(full_paths))
return None
try:
import yaml
except ImportError:
print("Found a travis file, but yaml library not available")
return None
with open(travis_file_paths[0]) as travis_file:
travis_yaml = yaml.safe_load(travis_file)
if "python" in travis_yaml:
versions = wrap_with_list(travis_yaml["python"])
else:
versions = []
# 'matrix' is an alias for 'jobs' now (https://github.com/travis-ci/docs-travis-ci-com/issues/1500)
# If both are defined, only the last defined will be used.
if "matrix" in travis_yaml and "jobs" in travis_yaml:
print("Ignoring 'matrix' and 'jobs' in Travis file, since they are both defined (only one of them should be).")
else:
matrix = travis_yaml.get("matrix") or travis_yaml.get("jobs") or dict()
includes = matrix.get("include") or []
for include in includes:
if "python" in include:
versions.extend(wrap_with_list(include["python"]))
found = set()
for version in versions:
# Yaml may convert version strings to numbers, convert them back.
version = str(version)
if version.startswith("2"):
found.add(2)
if version.startswith("3"):
found.add(3)
if len(found) == 1:
result = found.pop()
print("Only found Python {} in travis file -- will use that version".format(result))
return result
elif len(found) == 2:
print("Found both Python 2 and Python 3 being used in travis file -- ignoring")
else:
print("Found no Python being used in travis file")
except Exception:
print("Skipping due to exception:")
print_exception_indented()
return None
VERSION_TAG = "LGTM_PYTHON_SETUP_VERSION"
def best_version(root, default):
if VERSION_TAG in os.environ:
try:
return int(os.environ[VERSION_TAG])
except ValueError:
raise SyntaxError("Illegal value for " + VERSION_TAG)
print("Will try to guess Python version, as it was not specified in `lgtm.yml`")
version = trove_version(root) or travis_version(root) or machine_version()
if version is None:
version = default
print("Could not guess Python version, will use default: Python {}".format(version))
return version

View File
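A quick way to see what `trove_version` above extracts: run the `TROVE` regex against a snippet of classifier text (the `setup.py` contents here are made up):

```python
# Only the major version digit is captured, so "3.8" contributes 3.
import re

TROVE = re.compile(r"Programming Language\s+::\s+Python\s+::\s+(\d)")
contents = '''
classifiers=[
    "Programming Language :: Python :: 2",
    "Programming Language :: Python :: 3.8",
],
'''
print({int(m.group(1)) for m in TROVE.finditer(contents)})  # {2, 3}
```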

@@ -0,0 +1,5 @@
*/db/
*/dbs/
*/venv/
**/*.egg-info/
*/.cache

View File

@@ -0,0 +1,21 @@
# Extractor Python CodeQL CLI integration tests
These tests ensure that the Python extractor and the CodeQL CLI work together as intended, and provide an easy way to set up realistic test cases.
### Adding a new test case
Add a new folder and place a file called `test.sh` in it that starts with the code below. The script should exit with a non-zero exit code to fail the test.
```bash
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
```

View File

@@ -0,0 +1 @@
select 1

View File

@@ -0,0 +1 @@
print(42)

View File

@@ -0,0 +1,15 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
rm -rf db
$CODEQL database create db --language python --source-root repo_dir/
$CODEQL query run --database db query.ql

View File

@@ -0,0 +1,3 @@
import pip
print(42)

View File

@@ -0,0 +1,44 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
# start on clean slate
rm -rf dbs
mkdir dbs
cd "$SCRIPTDIR"
# In 2.16.0 we will not extract libraries by default, so there is no difference in what
# is extracted by setting this environment variable. We should remove this test when
# 2.17.0 is released.
export CODEQL_EXTRACTOR_PYTHON_DISABLE_LIBRARY_EXTRACTION=
$CODEQL database create dbs/normal --language python --source-root repo_dir/
export CODEQL_EXTRACTOR_PYTHON_DISABLE_LIBRARY_EXTRACTION=1
$CODEQL database create dbs/no-lib-extraction --language python --source-root repo_dir/
# ---
set +x
EXTRACTED_NORMAL=$(unzip -l dbs/normal/src.zip | wc -l)
EXTRACTED_NO_LIB_EXTRACTION=$(unzip -l dbs/no-lib-extraction/src.zip | wc -l)
exitcode=0
echo "EXTRACTED_NORMAL=$EXTRACTED_NORMAL"
echo "EXTRACTED_NO_LIB_EXTRACTION=$EXTRACTED_NO_LIB_EXTRACTION"
if [[ $EXTRACTED_NO_LIB_EXTRACTION -lt $EXTRACTED_NORMAL ]]; then
echo "ERROR: EXTRACTED_NO_LIB_EXTRACTION smaller than EXTRACTED_NORMAL"
exitcode=1
fi
exit $exitcode

View File

@@ -0,0 +1,18 @@
import python
import semmle.python.types.Builtins
predicate named_entity(string name, string kind) {
exists(Builtin::special(name)) and kind = "special"
or
exists(Builtin::builtin(name)) and kind = "builtin"
or
exists(Module m | m.getName() = name) and kind = "module"
or
exists(File f | f.getShortName() = name + ".py") and kind = "file"
}
from string name, string kind
where
name in ["foo", "baz", "main", "os", "sys", "re"] and
named_entity(name, kind)
select name, kind order by name, kind

View File

@@ -0,0 +1,12 @@
| name | kind |
+------+---------+
| baz | file |
| baz | module |
| foo | file |
| foo | module |
| main | file |
| os | file |
| os | module |
| re | file |
| re | module |
| sys | special |

View File

@@ -0,0 +1,8 @@
| name | kind |
+------+---------+
| baz | file |
| baz | module |
| foo | file |
| foo | module |
| main | file |
| sys | special |

View File

@@ -0,0 +1 @@
quux = 4

View File

@@ -0,0 +1,4 @@
import baz
import re
bar = 5 + baz.quux
re.compile("hello")

View File

@@ -0,0 +1,6 @@
import sys
import os
print(os.path)
print(sys.path)
import foo
print(foo.bar)

View File

@@ -0,0 +1,22 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
rm -rf dbs
mkdir dbs
CODEQL_EXTRACTOR_PYTHON_DONT_EXTRACT_STDLIB=True $CODEQL database create dbs/without-stdlib --language python --source-root repo_dir/
$CODEQL query run --database dbs/without-stdlib query.ql > query.without-stdlib.actual
diff query.without-stdlib.expected query.without-stdlib.actual
LGTM_INDEX_EXCLUDE="/usr/lib/**" $CODEQL database create dbs/with-stdlib --language python --source-root repo_dir/
$CODEQL query run --database dbs/with-stdlib query.ql > query.with-stdlib.actual
diff query.with-stdlib.expected query.with-stdlib.actual

View File

@@ -0,0 +1,3 @@
import pip
print(42)

View File

@@ -0,0 +1,41 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
# start on clean slate
rm -rf dbs
mkdir dbs
cd "$SCRIPTDIR"
export CODEQL_EXTRACTOR_PYTHON_FORCE_ENABLE_LIBRARY_EXTRACTION_UNTIL_2_17_0=
$CODEQL database create dbs/normal --language python --source-root repo_dir/
export CODEQL_EXTRACTOR_PYTHON_FORCE_ENABLE_LIBRARY_EXTRACTION_UNTIL_2_17_0=1
$CODEQL database create dbs/with-lib-extraction --language python --source-root repo_dir/
# ---
set +x
EXTRACTED_NORMAL=$(unzip -l dbs/normal/src.zip | wc -l)
EXTRACTED_WITH_LIB_EXTRACTION=$(unzip -l dbs/with-lib-extraction/src.zip | wc -l)
exitcode=0
echo "EXTRACTED_NORMAL=$EXTRACTED_NORMAL"
echo "EXTRACTED_WITH_LIB_EXTRACTION=$EXTRACTED_WITH_LIB_EXTRACTION"
if [[ ! $EXTRACTED_WITH_LIB_EXTRACTION -gt $EXTRACTED_NORMAL ]]; then
echo "ERROR: EXTRACTED_WITH_LIB_EXTRACTION not greater than EXTRACTED_NORMAL"
exitcode=1
fi
exit $exitcode

View File

@@ -0,0 +1,2 @@
venv/
venv2/

View File

@@ -0,0 +1,3 @@
import flask
print(42)

View File

@@ -0,0 +1,79 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
# start on clean slate
rm -rf dbs repo_dir/venv*
mkdir dbs
# set up venvs
cd repo_dir
python3 -m venv venv
venv/bin/pip install flask
python3 -m venv venv2
cd "$SCRIPTDIR"
# In 2.16.0 we stop extracting libraries by default, so to test this functionality we
# need to force enable it. Once we release 2.17.0 and turn off library extraction for
# good, we can remove the part of this test ensuring that dependencies in an active
# venv are still extracted (since that will no longer be the case).
export CODEQL_EXTRACTOR_PYTHON_FORCE_ENABLE_LIBRARY_EXTRACTION_UNTIL_2_17_0=1
# Create DBs with venv2 active (that does not have flask installed)
source repo_dir/venv2/bin/activate
export CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_VENV_EXCLUDE=
$CODEQL database create dbs/normal --language python --source-root repo_dir/
export CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_VENV_EXCLUDE=1
$CODEQL database create dbs/no-venv-ignore --language python --source-root repo_dir/
# Create DB with venv active that has flask installed. We want to ensure that we're
# still able to resolve imports to flask, but don't want to extract EVERYTHING from
# within the venv. An important note is that the test file in repo_dir actually
# imports flask :D
source repo_dir/venv/bin/activate
export CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_VENV_EXCLUDE=
$CODEQL database create dbs/normal-with-flask-venv --language python --source-root repo_dir/
# ---
set +x
EXTRACTED_NORMAL=$(unzip -l dbs/normal/src.zip | wc -l)
EXTRACTED_NO_VENV_IGNORE=$(unzip -l dbs/no-venv-ignore/src.zip | wc -l)
EXTRACTED_ACTIVE_FLASK=$(unzip -l dbs/normal-with-flask-venv/src.zip | wc -l)
exitcode=0
echo "EXTRACTED_NORMAL=$EXTRACTED_NORMAL"
echo "EXTRACTED_NO_VENV_IGNORE=$EXTRACTED_NO_VENV_IGNORE"
echo "EXTRACTED_ACTIVE_FLASK=$EXTRACTED_ACTIVE_FLASK"
if [[ ! $EXTRACTED_NORMAL -lt $EXTRACTED_NO_VENV_IGNORE ]]; then
echo "ERROR: EXTRACTED_NORMAL not smaller EXTRACTED_NO_VENV_IGNORE"
exitcode=1
fi
if [[ ! $EXTRACTED_NORMAL -lt $EXTRACTED_ACTIVE_FLASK ]]; then
echo "ERROR: EXTRACTED_NORMAL not smaller EXTRACTED_ACTIVE_FLASK"
exitcode=1
fi
if [[ ! $EXTRACTED_ACTIVE_FLASK -lt $EXTRACTED_NO_VENV_IGNORE ]]; then
echo "ERROR: EXTRACTED_ACTIVE_FLASK not smaller EXTRACTED_NO_VENV_IGNORE"
exitcode=1
fi
exit $exitcode

View File

@@ -0,0 +1,2 @@
repo_dir/build/
dbs/

View File

@@ -0,0 +1,12 @@
from setuptools import find_packages, setup
# using src/ folder as recommended in: https://blog.ionelmc.ro/2014/05/25/python-packaging/
setup(
name="example_pkg",
version="0.0.1",
description="example",
packages=find_packages("src"),
package_dir={"": "src"},
install_requires=[],
)

View File

@@ -0,0 +1,45 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
NUM_PYTHON_FILES_IN_REPO=$(find repo_dir/src/ -name '*.py' | wc -l)
rm -rf venv dbs
mkdir dbs
python3 -m venv venv
source venv/bin/activate
pip install --upgrade 'pip>=21.3'
cd repo_dir
pip install .
cd "$SCRIPTDIR"
export CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_PIP_BUILD_DIR_EXCLUDE=
$CODEQL database create dbs/normal --language python --source-root repo_dir/
export CODEQL_EXTRACTOR_PYTHON_DISABLE_AUTOMATIC_PIP_BUILD_DIR_EXCLUDE=1
$CODEQL database create dbs/with-build-dir --language python --source-root repo_dir/
EXTRACTED_NORMAL=$(unzip -l dbs/normal/src.zip | wc -l)
EXTRACTED_WITH_BUILD=$(unzip -l dbs/with-build-dir/src.zip | wc -l)
if [[ $((EXTRACTED_NORMAL + NUM_PYTHON_FILES_IN_REPO)) == $EXTRACTED_WITH_BUILD ]]; then
echo "Numbers add up"
else
echo "Numbers did not add up"
echo "NUM_PYTHON_FILES_IN_REPO=$NUM_PYTHON_FILES_IN_REPO"
echo "EXTRACTED_NORMAL=$EXTRACTED_NORMAL"
echo "EXTRACTED_WITH_BUILD=$EXTRACTED_WITH_BUILD"
exit 1
fi

View File

@@ -0,0 +1,5 @@
| name |
+----------+
| dircache |
| stat |
| test |

View File

@@ -0,0 +1,5 @@
| name |
+----------+
| dircache |
| stat |
| test |

View File

@@ -0,0 +1,18 @@
import python
import semmle.python.types.Builtins
predicate named_entity(string name, string kind) {
exists(Builtin::special(name)) and kind = "special"
or
exists(Builtin::builtin(name)) and kind = "builtin"
or
exists(Module m | m.getName() = name) and kind = "module"
or
exists(File f | f.getShortName() = name + ".py") and kind = "file"
}
from string name
where
name in ["dircache", "test", "stat"] and
named_entity(name, "file")
select name order by name

View File

@@ -0,0 +1,4 @@
| name |
+------+
| stat |
| test |

View File

@@ -0,0 +1 @@
"Programming Language :: Python :: 2"

View File

@@ -0,0 +1,5 @@
# `dircache` was removed in Python 3, and so is a good test of which standard library we're
# extracting.
import dircache
# A module that's present in both Python 2 and 3
import stat

View File

@@ -0,0 +1,35 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
rm -rf dbs
rm -f *.actual
mkdir dbs
# NB: on our Linux CI infrastructure, `python` is aliased to `python3`.
WITHOUT_PYTHON2=$(pwd)/without-python2
WITHOUT_PYTHON3=$(pwd)/without-python3
echo "Test 1: Only Python 2 is available. Should fail."
# Note the negation at the start of the command.
! PATH="$WITHOUT_PYTHON3:$PATH" $CODEQL database create dbs/only-python2-no-flag --language python --source-root repo_dir/
echo "Test 2: Only Python 3 is available. Should extract using Python 3 and use the Python 3 standard library."
PATH="$WITHOUT_PYTHON2:$PATH" $CODEQL database create dbs/without-python2 --language python --source-root repo_dir/
$CODEQL query run --database dbs/without-python2 query.ql > query.without-python2.actual
diff query.without-python2.expected query.without-python2.actual
echo "Test 3: Python 2 and 3 are both available. Should extract using Python 3, but use the Python 2 standard library."
$CODEQL database create dbs/python2-using-python3 --language python --source-root repo_dir/
$CODEQL query run --database dbs/python2-using-python3 query.ql > query.python2-using-python3.actual
diff query.python2-using-python3.expected query.python2-using-python3.actual
rm -f *.actual

View File

@@ -0,0 +1,4 @@
echo "Attempted to run:"
echo " python2 $@"
echo "Failing instead."
exit 127

View File

@@ -0,0 +1,6 @@
#!/bin/bash -p
case $1 in
python2) exit 1;;
*) command /usr/bin/which -- "$1";;
esac

View File

@@ -0,0 +1,4 @@
echo "Attempted to run:"
echo " python $@"
echo "Failing instead."
exit 127

View File

@@ -0,0 +1,4 @@
echo "Attempted to run:"
echo " python3 $@"
echo "Failing instead."
exit 127

View File

@@ -0,0 +1,9 @@
#!/bin/bash -p
echo "Fake which called with arguments: $@"
case $1 in
python) exit 1;;
python3) exit 1;;
*) command /usr/bin/which -- "$1";;
esac

View File

@@ -0,0 +1,28 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
failures=()
for f in */test.sh; do
echo "Running $f:"
if ! bash "$f"; then
echo "ERROR: $f failed"
failures+=("$f")
fi
echo "---"
done
if [ -z "${failures[*]}" ]; then
echo "All integration tests passed!"
exit 0
else
echo "ERROR: Some integration test failed! Failures:"
for failure in "${failures[@]}"
do
echo "- ${failure}"
done
exit 1
fi

View File

@@ -0,0 +1,18 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
rm -rf db
# even with a default encoding that doesn't support utf-8 (as on windows) we want to
# ensure that we can properly log that we've extracted files whose filenames contain
# utf-8 chars
export PYTHONIOENCODING="ascii"
$CODEQL database create db --language python --source-root repo_dir/

View File

@@ -0,0 +1,2 @@
repo_dir/subdir
repo_dir/symlink_to_top

View File

@@ -0,0 +1 @@
select 1

View File

@@ -0,0 +1 @@
print(42)

View File

@@ -0,0 +1,27 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
rm -rf db
# create two symlink loops, so that
# - repo_dir/subdir/symlink_to_top -> repo_dir
# - repo_dir/symlink_to_top -> repo_dir
# such a setup was seen in https://github.com/PowerDNS/weakforced
rm -rf repo_dir/subdir
mkdir repo_dir/subdir
ln -s .. repo_dir/subdir/symlink_to_top
rm -f repo_dir/symlink_to_top
ln -s . repo_dir/symlink_to_top
timeout --verbose 15s $CODEQL database create db --language python --source-root repo_dir/
$CODEQL query run --database db query.ql

View File

@@ -0,0 +1,163 @@
{
"attributes": {
"args": [
"Syntax Error"
],
"traceback": [
"\"semmle/python/modules.py\", line 108, in py_ast",
"\"semmle/python/modules.py\", line 102, in old_py_ast",
"\"semmle/python/parser/__init__.py\", line 100, in parse",
"\"semmleFile \"<string>\", line 1",
"\"semmle/python/extractor.py\", line 84, in process_source_module",
"\"semmle/python/modules.py\", line 92, in ast",
"\"semmle/python/modules.py\", line 120, in py_ast",
"\"semmle/python/modules.py\", line 117, in py_ast",
"\"semmle/python/parser/tsg_parser.py\", line 221, in parse",
"\"semmleFile \"<string>\", line 1"
]
},
"location": {
"file": "<test-root-directory>/repo_dir/syntaxerror3.py",
"startColumn": 0,
"endColumn": 0,
"startLine": 1,
"endLine": 1
},
"markdownMessage": "A parse error occurred while processing `<test-root-directory>/repo_dir/syntaxerror3.py`, and as a result this file could not be analyzed. Check the syntax of the file using the `python -m py_compile` command and correct any invalid syntax.",
"severity": "warning",
"source": {
"extractorName": "python",
"id": "py/diagnostics/syntax-error",
"name": "Could not process some files due to syntax errors"
},
"timestamp": "2023-03-13T15:03:48.177832",
"visibility": {
"cliSummaryTable": true,
"statusPage": true,
"telemetry": true
}
}
{
"attributes": {
"args": [
"Syntax Error"
],
"traceback": [
"\"semmle/python/modules.py\", line 108, in py_ast",
"\"semmle/python/modules.py\", line 102, in old_py_ast",
"\"semmle/python/parser/__init__.py\", line 100, in parse",
"\"semmleFile \"<string>\", line 3",
"\"semmle/python/extractor.py\", line 84, in process_source_module",
"\"semmle/python/modules.py\", line 92, in ast",
"\"semmle/python/modules.py\", line 120, in py_ast",
"\"semmle/python/modules.py\", line 117, in py_ast",
"\"semmle/python/parser/tsg_parser.py\", line 221, in parse",
"\"semmleFile \"<string>\", line 3"
]
},
"location": {
"file": "<test-root-directory>/repo_dir/syntaxerror1.py",
"startColumn": 0,
"endColumn": 0,
"startLine": 3,
"endLine": 3
},
"markdownMessage": "A parse error occurred while processing `<test-root-directory>/repo_dir/syntaxerror1.py`, and as a result this file could not be analyzed. Check the syntax of the file using the `python -m py_compile` command and correct any invalid syntax.",
"severity": "warning",
"source": {
"extractorName": "python",
"id": "py/diagnostics/syntax-error",
"name": "Could not process some files due to syntax errors"
},
"timestamp": "2023-03-13T15:03:48.181384",
"visibility": {
"cliSummaryTable": true,
"statusPage": true,
"telemetry": true
}
}
{
"attributes": {
"args": [
"Syntax Error"
],
"traceback": [
"\"semmle/python/modules.py\", line 108, in py_ast",
"\"semmle/python/modules.py\", line 102, in old_py_ast",
"\"semmle/python/parser/__init__.py\", line 100, in parse",
"\"semmleFile \"<string>\", line 6",
"\"semmle/python/extractor.py\", line 84, in process_source_module",
"\"semmle/python/modules.py\", line 92, in ast",
"\"semmle/python/modules.py\", line 120, in py_ast",
"\"semmle/python/modules.py\", line 117, in py_ast",
"\"semmle/python/parser/tsg_parser.py\", line 221, in parse",
"\"semmleFile \"<string>\", line 5"
]
},
"location": {
"file": "<test-root-directory>/repo_dir/syntaxerror2.py",
"startColumn": 0,
"endColumn": 0,
"startLine": 5,
"endLine": 5
},
"markdownMessage": "A parse error occurred while processing `<test-root-directory>/repo_dir/syntaxerror2.py`, and as a result this file could not be analyzed. Check the syntax of the file using the `python -m py_compile` command and correct any invalid syntax.",
"severity": "warning",
"source": {
"extractorName": "python",
"id": "py/diagnostics/syntax-error",
"name": "Could not process some files due to syntax errors"
},
"timestamp": "2023-03-13T15:03:48.164991",
"visibility": {
"cliSummaryTable": true,
"statusPage": true,
"telemetry": true
}
}
{
"attributes": {
"args": [
"maximum recursion depth exceeded while calling a Python object"
],
"traceback": [
"\"semmle/worker.py\", line 235, in _extract_loop",
"\"semmle/extractors/super_extractor.py\", line 37, in process",
"\"semmle/extractors/py_extractor.py\", line 43, in process",
"\"semmle/python/extractor.py\", line 227, in process_source_module",
"\"semmle/python/extractor.py\", line 84, in process_source_module",
"\"semmle/python/modules.py\", line 96, in ast",
"\"semmle/python/passes/labeller.py\", line 85, in apply",
"\"semmle/python/passes/labeller.py\", line 44, in __init__",
"\"semmle/python/passes/labeller.py\", line 14, in __init__",
"\"semmle/python/passes/ast_pass.py\", line 208, in visit",
"\"semmle/python/passes/ast_pass.py\", line 216, in generic_visit",
"\"semmle/python/passes/ast_pass.py\", line 213, in generic_visit",
"\"semmle/python/passes/ast_pass.py\", line 208, in visit",
"\"semmle/python/passes/ast_pass.py\", line 213, in generic_visit",
"\"semmle/python/passes/ast_pass.py\", line 208, in visit",
"... 3930 lines skipped",
"\"semmle/python/passes/ast_pass.py\", line 213, in generic_visit",
"\"semmle/python/passes/ast_pass.py\", line 208, in visit",
"\"semmle/python/passes/ast_pass.py\", line 213, in generic_visit",
"\"semmle/python/passes/ast_pass.py\", line 208, in visit",
"\"semmle/python/passes/ast_pass.py\", line 205, in _get_visit_method"
]
},
"location": {
"file": "<test-root-directory>/repo_dir/recursion_error.py"
},
"plaintextMessage": "maximum recursion depth exceeded while calling a Python object",
"severity": "error",
"source": {
"extractorName": "python",
"id": "py/diagnostics/recursion-error",
"name": "Recursion error in Python extractor"
},
"timestamp": "2023-03-13T15:03:47.468924",
"visibility": {
"cliSummaryTable": false,
"statusPage": false,
"telemetry": true
}
}

View File

@@ -0,0 +1,4 @@
# Creates a test file that will cause a RecursionError when run with the Python extractor.
with open('repo_dir/recursion_error.py', 'w') as f:
f.write("print({})\n".format("+".join(["1"] * 1000)))

View File

@@ -0,0 +1,6 @@
| filename |
+-----------------+
| safe.py |
| syntaxerror1.py |
| syntaxerror2.py |
| syntaxerror3.py |

View File

@@ -0,0 +1,3 @@
import python
select any(File f).getShortName() as filename order by filename

View File

@@ -0,0 +1 @@
print("No deeply nested structures here!")

View File

@@ -0,0 +1,3 @@
# This file contains a deliberate syntax error
2 +

View File

@@ -0,0 +1,25 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
set -x
CODEQL=${CODEQL:-codeql}
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd "$SCRIPTDIR"
rm -rf db
rm -f *.actual
python3 make_test.py
echo "Testing database with various errors during extraction"
$CODEQL database create db --language python --source-root repo_dir/
$CODEQL query run --database db query.ql > query.actual
diff query.expected query.actual
python3 test_diagnostics_output.py
rm -f *.actual
rm -f repo_dir/recursion_error.py
rm -rf db

View File

@@ -0,0 +1,7 @@
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), "..", "..", "..", "integration-tests"))
import diagnostics_test_utils
test_db = "db"
diagnostics_test_utils.check_diagnostics(".", test_db, skip_attributes=True)

View File

@@ -0,0 +1,126 @@
#!/usr/bin/env python
import os.path
import imp
import sys
import traceback
import re
SETUP_TAG = "LGTM_PYTHON_SETUP_SETUP_PY"
setup_file_path = "<default value>"
requirements_file_path = "<default value>"
if sys.version_info >= (3,):
basestring = str
def setup_interceptor(**args):
requirements = make_requirements(**args)
write_requirements_file(requirements)
def make_requirements(requires=(), install_requires=(), extras_require={}, dependency_links=[], **other_args):
# Install main requirements.
requirements = list(requires) + list(install_requires)
# Install requirements for all features.
for feature, feature_requirements in extras_require.items():
if isinstance(feature_requirements, basestring):
requirements += [feature_requirements]
else:
requirements += list(feature_requirements)
# Attempt to use dependency_links to find requirements first.
for link in dependency_links:
split_link = link.rsplit("#egg=", 1)
if len(split_link) != 2:
print("Invalid dependency link \"%s\" was ignored." % link)
continue
if not link.startswith("http"):
print("Dependency link \"%s\" is not an HTTP link so is being ignored." % link)
continue
package_name = split_link[1].rsplit("-", 1)[0]
for index, requirement in enumerate(requirements):
if requirement_name(requirement) == package_name:
print("Using %s to install %s." % (link, requirement))
requirements[index] = package_name + " @ " + link
print("Creating %s file from %s." % (requirements_file_path, setup_file_path))
requirements = [requirement.encode("ascii", "ignore").strip().decode("ascii") for requirement in requirements]
print("Requirements extracted from setup.py: %s" % requirements)
return requirements
REQUIREMENT = re.compile(r"^([\w-]+)")
def requirement_name(req_string):
req_string = req_string.strip()
if req_string[0] == '#':
return None
match = REQUIREMENT.match(req_string)
if match:
return match.group(1)
return None
def write_requirements_file(requirements):
if os.path.exists(requirements_file_path):
# Only overwrite the existing requirements if the new requirements are not empty.
if requirements:
print("%s already exists. It will be overwritten." % requirements_file_path)
else:
print("%s already exists and it will not be overwritten because the new requirements list is empty." % requirements_file_path)
return
elif not requirements:
print("%s will not be written because the new requirements list is empty." % requirements_file_path)
return
with open(requirements_file_path, "w") as requirements_file:
for requirement in requirements:
requirements_file.write(requirement + "\n")
print("Requirements have been written to " + requirements_file_path)
def convert_setup_to_requirements(root):
global setup_file_path
if SETUP_TAG in os.environ:
setup_file_path = os.environ[SETUP_TAG]
if setup_file_path == "false":
print("setup.py explicitly ignored")
return 0
else:
setup_file_path = os.path.join(root, "setup.py")
if not os.path.exists(setup_file_path):
print("%s does not exist. Not generating requirements.txt." % setup_file_path)
return 0
# Override the setuptools and distutils.core implementation of setup with our own.
import setuptools
setattr(setuptools, "setup", setup_interceptor)
import distutils.core
setattr(distutils.core, "setup", setup_interceptor)
# TODO: WHY are we inserting at index 1?
# >>> l = [1,2,3]; l.insert(1, 'x'); print(l)
# [1, 'x', 2, 3]
# Ensure the current directory is on path since setup.py might try and include some files in it.
sys.path.insert(1, root)
# Modify the arguments since the setup file sometimes checks them.
sys.argv = [setup_file_path, "build"]
# Run the setup.py file.
try:
imp.load_source("__main__", setup_file_path)
except BaseException as ex:
# We don't really care about errors so long as a requirements.txt exists in the next build step.
print("Running %s failed." % setup_file_path)
traceback.print_exc(file=sys.stdout)
if not os.path.exists(requirements_file_path):
print("%s failed, and a %s file does not exist. Exiting with error." % (setup_file_path, requirements_file_path))
return 1
return 0
def main():
global requirements_file_path
requirements_file_path = sys.argv[2]
sys.path.extend(sys.argv[3:])
sys.exit(convert_setup_to_requirements(sys.argv[1]))
if __name__ == "__main__":
main()

View File
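The interception trick used by `convert_setup.py` above boils down to replacing `setuptools.setup` before the project's `setup.py` runs; a minimal standalone sketch (with a made-up package) of that idea:

```python
# Sketch: capture the requirement lists instead of performing a real build.
import setuptools

captured = {}

def fake_setup(**kwargs):
    captured["install_requires"] = list(kwargs.get("install_requires", ()))

setuptools.setup = fake_setup

# Executing a project's setup.py now records its dependencies instead of
# building anything; calling setup() directly here stands in for that.
setuptools.setup(name="example", install_requires=["requests>=2.0"])
print(captured["install_requires"])  # ['requests>=2.0']
```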

@@ -0,0 +1,3 @@
This folder contains stubs for commonly used Python libraries, which have
the same interface as the original libraries, but are more amenable to
static analysis. The original licenses are noted in each subdirectory.

View File

@@ -0,0 +1,18 @@
Copyright (c) 2010-2019 Benjamin Peterson
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

View File

@@ -0,0 +1,240 @@
# Stub file for six.
#This should have the same interface as the six module,
#but be much more tractable for static analysis.
"""Utilities for writing code that runs on Python 2 and 3"""
# Copyright (c) 2015 Semmle Limited
# All rights reserved
# Note that the original six module is copyright Benjamin Peterson
#
import operator
import sys
import types
__author__ = "Benjamin Peterson <benjamin@python.org>"
__version__ = "1.14.0"
# Useful for very coarse version differentiation.
PY2 = sys.version_info < (3,)
PY3 = sys.version_info >= (3,)
if PY3:
string_types = str,
integer_types = int,
class_types = type,
text_type = str
binary_type = bytes
MAXSIZE = sys.maxsize
else:
string_types = basestring,
integer_types = (int, long)
class_types = (type, types.ClassType)
text_type = unicode
binary_type = str
#We can't compute MAXSIZE, but it doesn't really matter
MAXSIZE = int((1 << 63) - 1)
def _add_doc(func, doc):
"""Add documentation to a function."""
func.__doc__ = doc
def _import_module(name):
"""Import module, returning the module after the last dot."""
__import__(name)
return sys.modules[name]
import six.moves as moves
def add_move(move):
"""Add an item to six.moves."""
setattr(_MovedItems, move.name, move)
def remove_move(name):
"""Remove item from six.moves."""
try:
delattr(_MovedItems, name)
except AttributeError:
try:
del moves.__dict__[name]
except KeyError:
raise AttributeError("no such move, %r" % (name,))
if PY3:
_meth_func = "__func__"
_meth_self = "__self__"
_func_closure = "__closure__"
_func_code = "__code__"
_func_defaults = "__defaults__"
_func_globals = "__globals__"
_iterkeys = "keys"
_itervalues = "values"
_iteritems = "items"
_iterlists = "lists"
else:
_meth_func = "im_func"
_meth_self = "im_self"
_func_closure = "func_closure"
_func_code = "func_code"
_func_defaults = "func_defaults"
_func_globals = "func_globals"
_iterkeys = "iterkeys"
_itervalues = "itervalues"
_iteritems = "iteritems"
_iterlists = "iterlists"
try:
advance_iterator = next
except NameError:
def advance_iterator(it):
return it.next()
next = advance_iterator
try:
callable = callable
except NameError:
def callable(obj):
return any("__call__" in klass.__dict__ for klass in type(obj).__mro__)
if PY3:
def get_unbound_function(unbound):
return unbound
create_bound_method = types.MethodType
Iterator = object
else:
def get_unbound_function(unbound):
return unbound.im_func
def create_bound_method(func, obj):
return types.MethodType(func, obj, obj.__class__)
class Iterator(object):
def next(self):
return type(self).__next__(self)
callable = callable
_add_doc(get_unbound_function,
"""Get the function out of a possibly unbound function""")
get_method_function = operator.attrgetter(_meth_func)
get_method_self = operator.attrgetter(_meth_self)
get_function_closure = operator.attrgetter(_func_closure)
get_function_code = operator.attrgetter(_func_code)
get_function_defaults = operator.attrgetter(_func_defaults)
get_function_globals = operator.attrgetter(_func_globals)
def iterkeys(d, **kw):
"""Return an iterator over the keys of a dictionary."""
return iter(getattr(d, _iterkeys)(**kw))
def itervalues(d, **kw):
"""Return an iterator over the values of a dictionary."""
return iter(getattr(d, _itervalues)(**kw))
def iteritems(d, **kw):
"""Return an iterator over the (key, value) pairs of a dictionary."""
return iter(getattr(d, _iteritems)(**kw))
def iterlists(d, **kw):
"""Return an iterator over the (key, [values]) pairs of a dictionary."""
return iter(getattr(d, _iterlists)(**kw))
def byte2int(ch): #type bytes -> int
return int(unknown())
def b(s): #type str -> bytes
"""Byte literal"""
return bytes(unknown())
def u(s): #type str -> unicode
"""Text literal"""
if PY3:
unicode = str
return unicode(unknown())
if PY3:
unichr = chr
def int2byte(i): #type int -> bytes
return bytes(unknown())
indexbytes = operator.getitem
iterbytes = iter
import io
StringIO = io.StringIO
BytesIO = io.BytesIO
else:
unichr = unichr
int2byte = chr
def indexbytes(buf, i):
return int(unknown())
def iterbytes(buf):
return (int(unknown()) for byte in buf)
import StringIO
StringIO = BytesIO = StringIO.StringIO
if PY3:
exec_ = getattr(moves.builtins, "exec")
def reraise(tp, value, tb=None):
"""Reraise an exception."""
if value.__traceback__ is not tb:
raise value.with_traceback(tb)
raise value
else:
def exec_(_code_, _globs_=None, _locs_=None):
pass
def reraise(tp, value, tb=None):
"""Reraise an exception."""
exc = tp(value)
exc.__traceback__ = tb
raise exc
print_ = getattr(moves.builtins, "print", None)
if print_ is None:
def print_(*args, **kwargs):
"""The new-style print function for Python 2.4 and 2.5."""
pass
def with_metaclass(meta, *bases):
"""Create a base class with a metaclass."""
return meta("NewBase", bases, {})
def add_metaclass(metaclass):
"""Class decorator for creating a class with a metaclass."""
def wrapper(cls):
orig_vars = cls.__dict__.copy()
orig_vars.pop('__dict__', None)
orig_vars.pop('__weakref__', None)
slots = orig_vars.get('__slots__')
if slots is not None:
if isinstance(slots, str):
slots = [slots]
for slots_var in slots:
orig_vars.pop(slots_var)
return metaclass(cls.__name__, cls.__bases__, orig_vars)
return wrapper

View File

@@ -0,0 +1,239 @@
# six.moves
import sys
PY2 = sys.version_info < (3,)
PY3 = sys.version_info >= (3,)
# Generated (six_gen.py) from six version 1.14.0 with Python 2.7.17 (default, Nov 18 2019, 13:12:39)
if PY2:
import cStringIO as _1
cStringIO = _1.StringIO
import itertools as _2
filter = _2.filter
filterfalse = _2.filterfalse
import __builtin__ as _3
input = _3.raw_input
intern = _3.intern
map = _2.map
import os as _4
getcwd = _4.getcwdu
getcwdb = _4.getcwd
import commands as _5
getoutput = _5.getoutput
range = _3.xrange
reload_module = _3.reload
reduce = _3.reduce
import pipes as _6
shlex_quote = _6.quote
import StringIO as _7
StringIO = _7.StringIO
import UserDict as _8
UserDict = _8.UserDict
import UserList as _9
UserList = _9.UserList
import UserString as _10
UserString = _10.UserString
xrange = _3.xrange
zip = zip
zip_longest = _2.zip_longest
import __builtin__ as builtins
import ConfigParser as configparser
import collections as collections_abc
import copy_reg as copyreg
import gdbm as dbm_gnu
import dbm as dbm_ndbm
import dummy_thread as _dummy_thread
import cookielib as http_cookiejar
import Cookie as http_cookies
import htmlentitydefs as html_entities
import HTMLParser as html_parser
import httplib as http_client
import email.MIMEBase as email_mime_base
import email.MIMEImage as email_mime_image
import email.MIMEMultipart as email_mime_multipart
import email.MIMENonMultipart as email_mime_nonmultipart
import email.MIMEText as email_mime_text
import BaseHTTPServer as BaseHTTPServer
import CGIHTTPServer as CGIHTTPServer
import SimpleHTTPServer as SimpleHTTPServer
import cPickle as cPickle
import Queue as queue
import repr as reprlib
import SocketServer as socketserver
import thread as _thread
import Tkinter as tkinter
import Dialog as tkinter_dialog
import FileDialog as tkinter_filedialog
import ScrolledText as tkinter_scrolledtext
import SimpleDialog as tkinter_simpledialog
import Tix as tkinter_tix
import ttk as tkinter_ttk
import Tkconstants as tkinter_constants
import Tkdnd as tkinter_dnd
import tkColorChooser as tkinter_colorchooser
import tkCommonDialog as tkinter_commondialog
import tkFileDialog as tkinter_tkfiledialog
import tkFont as tkinter_font
import tkMessageBox as tkinter_messagebox
import tkSimpleDialog as tkinter_tksimpledialog
import xmlrpclib as xmlrpc_client
import SimpleXMLRPCServer as xmlrpc_server
del _1
del _5
del _7
del _8
del _6
del _3
del _9
del _2
del _10
del _4
# Generated (six_gen.py) from six version 1.14.0 with Python 3.8.0 (default, Nov 18 2019, 13:17:17)
if PY3:
import io as _1
cStringIO = _1.StringIO
import builtins as _2
filter = _2.filter
import itertools as _3
filterfalse = _3.filterfalse
input = _2.input
import sys as _4
intern = _4.intern
map = _2.map
import os as _5
getcwd = _5.getcwd
getcwdb = _5.getcwdb
import subprocess as _6
getoutput = _6.getoutput
range = _2.range
import importlib as _7
reload_module = _7.reload
import functools as _8
reduce = _8.reduce
import shlex as _9
shlex_quote = _9.quote
StringIO = _1.StringIO
import collections as _10
UserDict = _10.UserDict
UserList = _10.UserList
UserString = _10.UserString
xrange = _2.range
zip = _2.zip
zip_longest = _3.zip_longest
import builtins as builtins
import configparser as configparser
import collections.abc as collections_abc
import copyreg as copyreg
import dbm.gnu as dbm_gnu
import dbm.ndbm as dbm_ndbm
import _dummy_thread as _dummy_thread
import http.cookiejar as http_cookiejar
import http.cookies as http_cookies
import html.entities as html_entities
import html.parser as html_parser
import http.client as http_client
import email.mime.base as email_mime_base
import email.mime.image as email_mime_image
import email.mime.multipart as email_mime_multipart
import email.mime.nonmultipart as email_mime_nonmultipart
import email.mime.text as email_mime_text
import http.server as BaseHTTPServer
import http.server as CGIHTTPServer
import http.server as SimpleHTTPServer
import pickle as cPickle
import queue as queue
import reprlib as reprlib
import socketserver as socketserver
import _thread as _thread
import tkinter as tkinter
import tkinter.dialog as tkinter_dialog
import tkinter.filedialog as tkinter_filedialog
import tkinter.scrolledtext as tkinter_scrolledtext
import tkinter.simpledialog as tkinter_simpledialog
import tkinter.tix as tkinter_tix
import tkinter.ttk as tkinter_ttk
import tkinter.constants as tkinter_constants
import tkinter.dnd as tkinter_dnd
import tkinter.colorchooser as tkinter_colorchooser
import tkinter.commondialog as tkinter_commondialog
import tkinter.filedialog as tkinter_tkfiledialog
import tkinter.font as tkinter_font
import tkinter.messagebox as tkinter_messagebox
import tkinter.simpledialog as tkinter_tksimpledialog
import xmlrpc.client as xmlrpc_client
import xmlrpc.server as xmlrpc_server
del _1
del _2
del _3
del _4
del _5
del _6
del _7
del _8
del _9
del _10
# Not generated:
import six.moves.urllib as urllib
import six.moves.urllib_parse as urllib_parse
import six.moves.urllib_response as urllib_response
import six.moves.urllib_request as urllib_request
import six.moves.urllib_error as urllib_error
import six.moves.urllib_robotparser as urllib_robotparser
sys.modules['six.moves.builtins'] = builtins
sys.modules['six.moves.configparser'] = configparser
sys.modules['six.moves.collections_abc'] = collections_abc
sys.modules['six.moves.copyreg'] = copyreg
sys.modules['six.moves.dbm_gnu'] = dbm_gnu
sys.modules['six.moves.dbm_ndbm'] = dbm_ndbm
sys.modules['six.moves._dummy_thread'] = _dummy_thread
sys.modules['six.moves.http_cookiejar'] = http_cookiejar
sys.modules['six.moves.http_cookies'] = http_cookies
sys.modules['six.moves.html_entities'] = html_entities
sys.modules['six.moves.html_parser'] = html_parser
sys.modules['six.moves.http_client'] = http_client
sys.modules['six.moves.email_mime_base'] = email_mime_base
sys.modules['six.moves.email_mime_image'] = email_mime_image
sys.modules['six.moves.email_mime_multipart'] = email_mime_multipart
sys.modules['six.moves.email_mime_nonmultipart'] = email_mime_nonmultipart
sys.modules['six.moves.email_mime_text'] = email_mime_text
sys.modules['six.moves.BaseHTTPServer'] = BaseHTTPServer
sys.modules['six.moves.CGIHTTPServer'] = CGIHTTPServer
sys.modules['six.moves.SimpleHTTPServer'] = SimpleHTTPServer
sys.modules['six.moves.cPickle'] = cPickle
sys.modules['six.moves.queue'] = queue
sys.modules['six.moves.reprlib'] = reprlib
sys.modules['six.moves.socketserver'] = socketserver
sys.modules['six.moves._thread'] = _thread
sys.modules['six.moves.tkinter'] = tkinter
sys.modules['six.moves.tkinter_dialog'] = tkinter_dialog
sys.modules['six.moves.tkinter_filedialog'] = tkinter_filedialog
sys.modules['six.moves.tkinter_scrolledtext'] = tkinter_scrolledtext
sys.modules['six.moves.tkinter_simpledialog'] = tkinter_simpledialog
sys.modules['six.moves.tkinter_tix'] = tkinter_tix
sys.modules['six.moves.tkinter_ttk'] = tkinter_ttk
sys.modules['six.moves.tkinter_constants'] = tkinter_constants
sys.modules['six.moves.tkinter_dnd'] = tkinter_dnd
sys.modules['six.moves.tkinter_colorchooser'] = tkinter_colorchooser
sys.modules['six.moves.tkinter_commondialog'] = tkinter_commondialog
sys.modules['six.moves.tkinter_tkfiledialog'] = tkinter_tkfiledialog
sys.modules['six.moves.tkinter_font'] = tkinter_font
sys.modules['six.moves.tkinter_messagebox'] = tkinter_messagebox
sys.modules['six.moves.tkinter_tksimpledialog'] = tkinter_tksimpledialog
sys.modules['six.moves.xmlrpc_client'] = xmlrpc_client
sys.modules['six.moves.xmlrpc_server'] = xmlrpc_server
# Windows special
if PY2:
import _winreg as winreg
if PY3:
import winreg as winreg
sys.modules['six.moves.winreg'] = winreg
del sys

View File

@@ -0,0 +1,15 @@
import sys
import six.moves.urllib_error as error
import six.moves.urllib_parse as parse
import six.moves.urllib_request as request
import six.moves.urllib_response as response
import six.moves.urllib_robotparser as robotparser
sys.modules['six.moves.urllib.error'] = error
sys.modules['six.moves.urllib.parse'] = parse
sys.modules['six.moves.urllib.request'] = request
sys.modules['six.moves.urllib.response'] = response
sys.modules['six.moves.urllib.robotparser'] = robotparser
del sys


@@ -0,0 +1,21 @@
# six.moves.urllib_error
from six import PY2, PY3
# Generated (six_gen.py) from six version 1.14.0 with Python 2.7.17 (default, Nov 18 2019, 13:12:39)
if PY2:
import urllib2 as _1
URLError = _1.URLError
HTTPError = _1.HTTPError
import urllib as _2
ContentTooShortError = _2.ContentTooShortError
del _1
del _2
# Generated (six_gen.py) from six version 1.14.0 with Python 3.8.0 (default, Nov 18 2019, 13:17:17)
if PY3:
import urllib.error as _1
URLError = _1.URLError
HTTPError = _1.HTTPError
ContentTooShortError = _1.ContentTooShortError
del _1


@@ -0,0 +1,65 @@
# six.moves.urllib_parse
from six import PY2, PY3
# Generated (six_gen.py) from six version 1.14.0 with Python 2.7.17 (default, Nov 18 2019, 13:12:39)
if PY2:
import urlparse as _1
ParseResult = _1.ParseResult
SplitResult = _1.SplitResult
parse_qs = _1.parse_qs
parse_qsl = _1.parse_qsl
urldefrag = _1.urldefrag
urljoin = _1.urljoin
urlparse = _1.urlparse
urlsplit = _1.urlsplit
urlunparse = _1.urlunparse
urlunsplit = _1.urlunsplit
import urllib as _2
quote = _2.quote
quote_plus = _2.quote_plus
unquote = _2.unquote
unquote_plus = _2.unquote_plus
unquote_to_bytes = _2.unquote
urlencode = _2.urlencode
splitquery = _2.splitquery
splittag = _2.splittag
splituser = _2.splituser
splitvalue = _2.splitvalue
uses_fragment = _1.uses_fragment
uses_netloc = _1.uses_netloc
uses_params = _1.uses_params
uses_query = _1.uses_query
uses_relative = _1.uses_relative
del _1
del _2
# Generated (six_gen.py) from six version 1.14.0 with Python 3.8.0 (default, Nov 18 2019, 13:17:17)
if PY3:
import urllib.parse as _1
ParseResult = _1.ParseResult
SplitResult = _1.SplitResult
parse_qs = _1.parse_qs
parse_qsl = _1.parse_qsl
urldefrag = _1.urldefrag
urljoin = _1.urljoin
urlparse = _1.urlparse
urlsplit = _1.urlsplit
urlunparse = _1.urlunparse
urlunsplit = _1.urlunsplit
quote = _1.quote
quote_plus = _1.quote_plus
unquote = _1.unquote
unquote_plus = _1.unquote_plus
unquote_to_bytes = _1.unquote_to_bytes
urlencode = _1.urlencode
splitquery = _1.splitquery
splittag = _1.splittag
splituser = _1.splituser
splitvalue = _1.splitvalue
uses_fragment = _1.uses_fragment
uses_netloc = _1.uses_netloc
uses_params = _1.uses_params
uses_query = _1.uses_query
uses_relative = _1.uses_relative
del _1


@@ -0,0 +1,85 @@
# six.moves.urllib_request
from six import PY2, PY3
# Generated (six_gen.py) from six version 1.14.0 with Python 2.7.17 (default, Nov 18 2019, 13:12:39)
if PY2:
import urllib2 as _1
urlopen = _1.urlopen
install_opener = _1.install_opener
build_opener = _1.build_opener
import urllib as _2
pathname2url = _2.pathname2url
url2pathname = _2.url2pathname
getproxies = _2.getproxies
Request = _1.Request
OpenerDirector = _1.OpenerDirector
HTTPDefaultErrorHandler = _1.HTTPDefaultErrorHandler
HTTPRedirectHandler = _1.HTTPRedirectHandler
HTTPCookieProcessor = _1.HTTPCookieProcessor
ProxyHandler = _1.ProxyHandler
BaseHandler = _1.BaseHandler
HTTPPasswordMgr = _1.HTTPPasswordMgr
HTTPPasswordMgrWithDefaultRealm = _1.HTTPPasswordMgrWithDefaultRealm
AbstractBasicAuthHandler = _1.AbstractBasicAuthHandler
HTTPBasicAuthHandler = _1.HTTPBasicAuthHandler
ProxyBasicAuthHandler = _1.ProxyBasicAuthHandler
AbstractDigestAuthHandler = _1.AbstractDigestAuthHandler
HTTPDigestAuthHandler = _1.HTTPDigestAuthHandler
ProxyDigestAuthHandler = _1.ProxyDigestAuthHandler
HTTPHandler = _1.HTTPHandler
HTTPSHandler = _1.HTTPSHandler
FileHandler = _1.FileHandler
FTPHandler = _1.FTPHandler
CacheFTPHandler = _1.CacheFTPHandler
UnknownHandler = _1.UnknownHandler
HTTPErrorProcessor = _1.HTTPErrorProcessor
urlretrieve = _2.urlretrieve
urlcleanup = _2.urlcleanup
URLopener = _2.URLopener
FancyURLopener = _2.FancyURLopener
proxy_bypass = _2.proxy_bypass
parse_http_list = _1.parse_http_list
parse_keqv_list = _1.parse_keqv_list
del _1
del _2
# Generated (six_gen.py) from six version 1.14.0 with Python 3.8.0 (default, Nov 18 2019, 13:17:17)
if PY3:
import urllib.request as _1
urlopen = _1.urlopen
install_opener = _1.install_opener
build_opener = _1.build_opener
pathname2url = _1.pathname2url
url2pathname = _1.url2pathname
getproxies = _1.getproxies
Request = _1.Request
OpenerDirector = _1.OpenerDirector
HTTPDefaultErrorHandler = _1.HTTPDefaultErrorHandler
HTTPRedirectHandler = _1.HTTPRedirectHandler
HTTPCookieProcessor = _1.HTTPCookieProcessor
ProxyHandler = _1.ProxyHandler
BaseHandler = _1.BaseHandler
HTTPPasswordMgr = _1.HTTPPasswordMgr
HTTPPasswordMgrWithDefaultRealm = _1.HTTPPasswordMgrWithDefaultRealm
AbstractBasicAuthHandler = _1.AbstractBasicAuthHandler
HTTPBasicAuthHandler = _1.HTTPBasicAuthHandler
ProxyBasicAuthHandler = _1.ProxyBasicAuthHandler
AbstractDigestAuthHandler = _1.AbstractDigestAuthHandler
HTTPDigestAuthHandler = _1.HTTPDigestAuthHandler
ProxyDigestAuthHandler = _1.ProxyDigestAuthHandler
HTTPHandler = _1.HTTPHandler
HTTPSHandler = _1.HTTPSHandler
FileHandler = _1.FileHandler
FTPHandler = _1.FTPHandler
CacheFTPHandler = _1.CacheFTPHandler
UnknownHandler = _1.UnknownHandler
HTTPErrorProcessor = _1.HTTPErrorProcessor
urlretrieve = _1.urlretrieve
urlcleanup = _1.urlcleanup
URLopener = _1.URLopener
FancyURLopener = _1.FancyURLopener
proxy_bypass = _1.proxy_bypass
parse_http_list = _1.parse_http_list
parse_keqv_list = _1.parse_keqv_list
del _1


@@ -0,0 +1,21 @@
# six.moves.urllib_response
from six import PY2, PY3
# Generated (six_gen.py) from six version 1.14.0 with Python 2.7.17 (default, Nov 18 2019, 13:12:39)
if PY2:
import urllib as _1
addbase = _1.addbase
addclosehook = _1.addclosehook
addinfo = _1.addinfo
addinfourl = _1.addinfourl
del _1
# Generated (six_gen.py) from six version 1.14.0 with Python 3.8.0 (default, Nov 18 2019, 13:17:17)
if PY3:
import urllib.response as _1
addbase = _1.addbase
addclosehook = _1.addclosehook
addinfo = _1.addinfo
addinfourl = _1.addinfourl
del _1


@@ -0,0 +1,15 @@
# six.moves.urllib_robotparser
from six import PY2, PY3
# Generated (six_gen.py) from six version 1.14.0 with Python 2.7.17 (default, Nov 18 2019, 13:12:39)
if PY2:
import robotparser as _1
RobotFileParser = _1.RobotFileParser
del _1
# Generated (six_gen.py) from six version 1.14.0 with Python 3.8.0 (default, Nov 18 2019, 13:17:17)
if PY3:
import urllib.robotparser as _1
RobotFileParser = _1.RobotFileParser
del _1

(Two file diffs suppressed because one or more lines are too long; two binary image files added, 27 KiB and 31 KiB.)


@@ -0,0 +1,35 @@
import os
import sys
def pip_installed_folder():
try:
import pip
except ImportError:
print("ERROR: 'pip' not installed.")
sys.exit(2)
dirname, filename = os.path.split(pip.__file__)
if filename.startswith("__init__."):
dirname = os.path.dirname(dirname)
return dirname
def first_site_packages():
dist_packages = None
for path in sys.path:
if "site-packages" in path:
return path
if "dist-packages" in path and not dist_packages:
dist_packages = path
if dist_packages:
return dist_packages
    # Neither site-packages nor dist-packages found anywhere on sys.path.
    raise Exception("no site-packages or dist-packages found on sys.path")
def get_venv_lib():
try:
return pip_installed_folder()
    except Exception:
return first_site_packages()
if __name__=='__main__':
print(get_venv_lib())
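Because the script prints its result when executed directly, one plausible way to consume it (a sketch only; not necessarily how the extractor invokes it) is to run it under the interpreter of the environment being analysed and capture stdout. The interpreter path below is a hypothetical placeholder:

import subprocess

# Hypothetical: the interpreter of the virtualenv being analysed.
venv_python = "/path/to/venv/bin/python"

result = subprocess.run(
    [venv_python, "get_venv_lib.py"],  # assumes the script is in the CWD
    capture_output=True, text=True, check=True,
)
print("library folder:", result.stdout.strip())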

python/extractor/imp.py (new file, 344 lines)

@@ -0,0 +1,344 @@
"""This module provides the components needed to build your own __import__
function. Undocumented functions are obsolete.
In most cases you should prefer the importlib module's functionality over
this module.
This file was copied from `Lib/imp.py`, copyright PSF, with minor modifications made afterward.
"""
# (Probably) need to stay in _imp
from _imp import (lock_held, acquire_lock, release_lock,
get_frozen_object, is_frozen_package,
init_frozen, is_builtin, is_frozen,
_fix_co_filename)
try:
from _imp import create_dynamic
except ImportError:
# Platform doesn't support dynamic loading.
create_dynamic = None
from importlib._bootstrap import _ERR_MSG, _exec, _load, _builtin_from_name
from importlib._bootstrap_external import SourcelessFileLoader
from importlib import machinery
from importlib import util
import importlib
import os
import sys
import tokenize
import types
import warnings
# DEPRECATED
SEARCH_ERROR = 0
PY_SOURCE = 1
PY_COMPILED = 2
C_EXTENSION = 3
PY_RESOURCE = 4
PKG_DIRECTORY = 5
C_BUILTIN = 6
PY_FROZEN = 7
PY_CODERESOURCE = 8
IMP_HOOK = 9
def new_module(name):
"""**DEPRECATED**
Create a new module.
The module is not entered into sys.modules.
"""
return types.ModuleType(name)
def get_magic():
"""**DEPRECATED**
Return the magic number for .pyc files.
"""
return util.MAGIC_NUMBER
def get_tag():
"""Return the magic tag for .pyc files."""
return sys.implementation.cache_tag
def cache_from_source(path, debug_override=None):
"""**DEPRECATED**
Given the path to a .py file, return the path to its .pyc file.
The .py file does not need to exist; this simply returns the path to the
.pyc file calculated as if the .py file were imported.
If debug_override is not None, then it must be a boolean and is used in
place of sys.flags.optimize.
If sys.implementation.cache_tag is None then NotImplementedError is raised.
"""
with warnings.catch_warnings():
warnings.simplefilter('ignore')
return util.cache_from_source(path, debug_override)
def source_from_cache(path):
"""**DEPRECATED**
    Given the path to a .pyc file, return the path to its .py file.
The .pyc file does not need to exist; this simply returns the path to
the .py file calculated to correspond to the .pyc file. If path does
not conform to PEP 3147 format, ValueError will be raised. If
sys.implementation.cache_tag is None then NotImplementedError is raised.
"""
return util.source_from_cache(path)
def get_suffixes():
"""**DEPRECATED**"""
extensions = [(s, 'rb', C_EXTENSION) for s in machinery.EXTENSION_SUFFIXES]
source = [(s, 'r', PY_SOURCE) for s in machinery.SOURCE_SUFFIXES]
bytecode = [(s, 'rb', PY_COMPILED) for s in machinery.BYTECODE_SUFFIXES]
return extensions + source + bytecode
class NullImporter:
"""**DEPRECATED**
Null import object.
"""
def __init__(self, path):
if path == '':
raise ImportError('empty pathname', path='')
elif os.path.isdir(path):
raise ImportError('existing directory', path=path)
def find_module(self, fullname):
"""Always returns None."""
return None
class _HackedGetData:
"""Compatibility support for 'file' arguments of various load_*()
functions."""
def __init__(self, fullname, path, file=None):
super().__init__(fullname, path)
self.file = file
def get_data(self, path):
"""Gross hack to contort loader to deal w/ load_*()'s bad API."""
if self.file and path == self.path:
# The contract of get_data() requires us to return bytes. Reopen the
# file in binary mode if needed.
if not self.file.closed:
file = self.file
if 'b' not in file.mode:
file.close()
if self.file.closed:
self.file = file = open(self.path, 'rb')
with file:
return file.read()
else:
return super().get_data(path)
class _LoadSourceCompatibility(_HackedGetData, machinery.SourceFileLoader):
"""Compatibility support for implementing load_source()."""
def load_source(name, pathname, file=None):
loader = _LoadSourceCompatibility(name, pathname, file)
spec = util.spec_from_file_location(name, pathname, loader=loader)
if name in sys.modules:
module = _exec(spec, sys.modules[name])
else:
module = _load(spec)
# To allow reloading to potentially work, use a non-hacked loader which
# won't rely on a now-closed file object.
module.__loader__ = machinery.SourceFileLoader(name, pathname)
module.__spec__.loader = module.__loader__
return module
class _LoadCompiledCompatibility(_HackedGetData, SourcelessFileLoader):
"""Compatibility support for implementing load_compiled()."""
def load_compiled(name, pathname, file=None):
"""**DEPRECATED**"""
loader = _LoadCompiledCompatibility(name, pathname, file)
spec = util.spec_from_file_location(name, pathname, loader=loader)
if name in sys.modules:
module = _exec(spec, sys.modules[name])
else:
module = _load(spec)
# To allow reloading to potentially work, use a non-hacked loader which
# won't rely on a now-closed file object.
module.__loader__ = SourcelessFileLoader(name, pathname)
module.__spec__.loader = module.__loader__
return module
def load_package(name, path):
"""**DEPRECATED**"""
if os.path.isdir(path):
extensions = (machinery.SOURCE_SUFFIXES[:] +
machinery.BYTECODE_SUFFIXES[:])
for extension in extensions:
init_path = os.path.join(path, '__init__' + extension)
if os.path.exists(init_path):
path = init_path
break
else:
raise ValueError('{!r} is not a package'.format(path))
spec = util.spec_from_file_location(name, path,
submodule_search_locations=[])
if name in sys.modules:
return _exec(spec, sys.modules[name])
else:
return _load(spec)
def load_module(name, file, filename, details):
"""**DEPRECATED**
Load a module, given information returned by find_module().
The module name must include the full package name, if any.
"""
suffix, mode, type_ = details
if mode and (not mode.startswith(('r', 'U')) or '+' in mode):
raise ValueError('invalid file open mode {!r}'.format(mode))
elif file is None and type_ in {PY_SOURCE, PY_COMPILED}:
msg = 'file object required for import (type code {})'.format(type_)
raise ValueError(msg)
elif type_ == PY_SOURCE:
return load_source(name, filename, file)
elif type_ == PY_COMPILED:
return load_compiled(name, filename, file)
elif type_ == C_EXTENSION and load_dynamic is not None:
if file is None:
with open(filename, 'rb') as opened_file:
return load_dynamic(name, filename, opened_file)
else:
return load_dynamic(name, filename, file)
elif type_ == PKG_DIRECTORY:
return load_package(name, filename)
elif type_ == C_BUILTIN:
return init_builtin(name)
elif type_ == PY_FROZEN:
return init_frozen(name)
else:
msg = "Don't know how to import {} (type code {})".format(name, type_)
raise ImportError(msg, name=name)
def find_module(name, path=None):
"""**DEPRECATED**
Search for a module.
If path is omitted or None, search for a built-in, frozen or special
module and continue search in sys.path. The module name cannot
contain '.'; to search for a submodule of a package, pass the
submodule name and the package's __path__.
"""
if not isinstance(name, str):
raise TypeError("'name' must be a str, not {}".format(type(name)))
elif not isinstance(path, (type(None), list)):
# Backwards-compatibility
raise RuntimeError("'path' must be None or a list, "
"not {}".format(type(path)))
if path is None:
if is_builtin(name):
return None, None, ('', '', C_BUILTIN)
elif is_frozen(name):
return None, None, ('', '', PY_FROZEN)
else:
path = sys.path
for entry in path:
package_directory = os.path.join(entry, name)
for suffix in ['.py', machinery.BYTECODE_SUFFIXES[0]]:
package_file_name = '__init__' + suffix
file_path = os.path.join(package_directory, package_file_name)
if os.path.isfile(file_path):
return None, package_directory, ('', '', PKG_DIRECTORY)
for suffix, mode, type_ in get_suffixes():
file_name = name + suffix
file_path = os.path.join(entry, file_name)
if os.path.isfile(file_path):
break
else:
continue
break # Break out of outer loop when breaking out of inner loop.
else:
raise ImportError(_ERR_MSG.format(name), name=name)
encoding = None
if 'b' not in mode:
with open(file_path, 'rb') as file:
encoding = tokenize.detect_encoding(file.readline)[0]
file = open(file_path, mode, encoding=encoding)
return file, file_path, (suffix, mode, type_)
def reload(module):
"""**DEPRECATED**
Reload the module and return it.
The module must have been successfully imported before.
"""
return importlib.reload(module)
def init_builtin(name):
"""**DEPRECATED**
    Load and return a built-in module by name, or None if such a module doesn't
    exist.
"""
try:
return _builtin_from_name(name)
except ImportError:
return None
if create_dynamic:
def load_dynamic(name, path, file=None):
"""**DEPRECATED**
Load an extension module.
"""
import importlib.machinery
loader = importlib.machinery.ExtensionFileLoader(name, path)
# Issue #24748: Skip the sys.modules check in _load_module_shim;
# always load new extension
spec = importlib.machinery.ModuleSpec(
name=name, loader=loader, origin=path)
return _load(spec)
else:
load_dynamic = None
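For illustration, a minimal usage sketch of the deprecated find_module()/load_module() pairing that this shim preserves, run against the stdlib json package (this example is not part of the extractor):

import imp  # the shim above; the stdlib imp module was removed in Python 3.12

# json is a package, so find_module returns (None, <dir>, ('', '', PKG_DIRECTORY)).
file, pathname, description = imp.find_module("json")
try:
    json_mod = imp.load_module("json", file, pathname, description)
finally:
    if file is not None:  # file is None for packages and built-ins
        file.close()
print(json_mod.dumps({"ok": True}))  # -> {"ok": true}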

python/extractor/index.py (new file, 28 lines)

@@ -0,0 +1,28 @@
#!/usr/bin/env python
# This file needs to be able to handle all versions of Python we are likely to encounter,
# which is probably 3.6 and upwards. Python 3.6 itself is handled by exiting with an
# error, though: we require at least 3.7 to proceed.
'''Run index.py in buildtools'''
import os
import sys
if sys.version_info < (3, 7):
sys.exit("ERROR: Python 3.7 or later is required (currently running {}.{})".format(sys.version_info[0], sys.version_info[1]))
from python_tracer import getzipfilename
if 'SEMMLE_DIST' in os.environ:
if 'CODEQL_EXTRACTOR_PYTHON_ROOT' not in os.environ:
os.environ['CODEQL_EXTRACTOR_PYTHON_ROOT'] = os.environ['SEMMLE_DIST']
else:
os.environ["SEMMLE_DIST"] = os.environ["CODEQL_EXTRACTOR_PYTHON_ROOT"]
tools = os.path.join(os.environ['SEMMLE_DIST'], "tools")
zippath = os.path.join(tools, getzipfilename())
sys.path = [ zippath ] + sys.path
import buildtools.index
buildtools.index.main()
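index.py works by prepending a zip archive to sys.path and importing buildtools from inside it; Python's zipimport machinery treats any .zip on sys.path as an import location. A self-contained sketch of that mechanism, with hypothetical archive contents standing in for the real tools zip:

import os
import sys
import tempfile
import zipfile

# Build a tiny stand-in for the tools zip.
zippath = os.path.join(tempfile.mkdtemp(), "tools.zip")
with zipfile.ZipFile(zippath, "w") as zf:
    zf.writestr("buildtools/__init__.py", "")
    zf.writestr("buildtools/index.py", "def main():\n    print('indexing')\n")

sys.path = [zippath] + sys.path  # the same trick index.py uses
import buildtools.index
buildtools.index.main()  # -> indexing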


@@ -0,0 +1,19 @@
Copyright © 2017 Erez Shinan
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Some files were not shown because too many files have changed in this diff.