We've been observing some performance issues when using crate_universe on CI.
Therefore, we're moving to vendoring the auto-generated BUILD files
in our repository. This should provide a nice speed boost, while
getting rid of the complexity of the "rust cache" job we've been using
since we had a lot of git dependencies.
This PR includes a vendor script, and I'll put up a CI job internally
that runs that vendor script on Cargo.toml and Cargo.lock changes, to check
that the vendored files are in sync.
This commit does a bunch of things, unfortunately all in the same place,
so my apologies in advance for a slightly complicated commit.
As for the changes themselves, this commit:
- Adds timers for the old and new parsers. This means we get the overall
time spent on these parts of the extractor if the extractor is run with
`DEBUG` output shown.
- Adds logging information (at the `DEBUG` level) to show which
invocations of the parsers happen when, and whether they succeed or not.
- Adds support for using an environment variable named
`CODEQL_PYTHON_DISABLE_OLD_PARSER` to disable using the old parser
entirely. This makes it easier to test the new parser in isolation.
- Fixes a bug where we did not check whether a parse with the new parser
had already succeeded, and so would do a superfluous second parse.
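A minimal sketch of how the four changes above fit together. Only the
`CODEQL_PYTHON_DISABLE_OLD_PARSER` variable name comes from this commit; the
function names, the `SyntaxError` failure signal, the "any non-empty value
disables" rule, and the order in which parsers are tried are assumptions made
for illustration and are not the extractor's actual code.

```python
import logging
import os
import time
from contextlib import contextmanager

logger = logging.getLogger("extractor_sketch")  # hypothetical logger name

# Accumulated wall-clock seconds per parser name.
timings = {}


@contextmanager
def timed(name):
    """Add the time spent inside the block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


def old_parser_disabled(environ=os.environ):
    # Assumption: any non-empty value disables the old parser.
    return bool(environ.get("CODEQL_PYTHON_DISABLE_OLD_PARSER"))


def parse(source, parsers, environ=os.environ):
    """Try (name, parse_fn) pairs in order and stop at the first success."""
    calls = []
    for name, parse_fn in parsers:
        if name == "old" and old_parser_disabled(environ):
            logger.debug("skipping old parser (disabled by environment)")
            continue
        calls.append(name)
        with timed(name):
            try:
                result = parse_fn(source)
            except SyntaxError as ex:
                logger.debug("%s parser failed: %s", name, ex)
                continue
        logger.debug("%s parser succeeded", name)
        # Returning here is what prevents a superfluous second parse.
        return result, calls
    return None, calls
```

Returning the list of invoked parsers alongside the result makes it easy to
assert in a test that a successful parse is never followed by another one.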
Our logic for detecting the first and last item in a generator
expression was faulty, sometimes matching comments as well. Because
attributes (like `_location_start`) can only be written once, this
caused `tree-sitter-graph` to get unhappy.
To fix this, we now require the first item to be an `expression`, and
the last one to be either a `for_in_clause` or an `if_clause`.
Crucially, `comment` is neither of these, and this prevents the
unfortunate overlap.
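For illustration, here is an input of the shape described above: a generator
expression whose first and last children inside the parentheses are comments.
The file that originally triggered the bug is not shown in this commit, so
this snippet is an assumed reproduction; it uses only the standard `tokenize`
module to confirm that the comments really sit between the parentheses.

```python
import io
import tokenize

SOURCE = '''\
result = sum(
    # comment before the first item
    x * x
    for x in range(10)
    if x % 2 == 0
    # comment after the last clause
)
'''


def comment_tokens(source):
    """Return the text of every comment token in `source`."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.COMMENT]


# The snippet is valid Python, comments and all.
namespace = {}
exec(SOURCE, namespace)
```

With the stricter patterns, only `x * x` can match as the first item and only
the `if` clause as the last one, regardless of the surrounding comments.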
We were writing the `parenthesised` attribute twice on tuples, once
because of the explicit parenthesisation, and once because all non-empty
tuples are parenthesised. Because attributes can only be written once,
this made `tree-sitter-graph` unhappy.
To fix this, we now explicitly check whether a tuple is already
parenthesised, and do nothing if that is the case.
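The guard can be modelled in a few lines of Python. This is only an analogy
for the write-once behaviour of attributes: the `Node` class and its methods
are hypothetical, and the real fix lives in the `tree-sitter-graph` rules,
not in Python.

```python
class Node:
    """Toy graph node whose attributes may be written at most once."""

    def __init__(self):
        self._attrs = {}

    def has_attr(self, name):
        return name in self._attrs

    def set_attr(self, name, value):
        if name in self._attrs:
            # Stand-in for the duplicate-write failure described above.
            raise ValueError("attribute %r already set" % name)
        self._attrs[name] = value


def mark_parenthesised(node):
    """Set `parenthesised` unless an earlier rule already did so."""
    if not node.has_attr("parenthesised"):
        node.set_attr("parenthesised", True)
```

Calling `mark_parenthesised` twice on the same node is harmless, whereas two
direct `set_attr` calls would raise.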
Turns out we were not setting the `is_async` field on anything except
`async for` statements. This commit makes it so that we also do this for
`async def` and `async with`, and adds a test that this produces the
same behaviour as the old parser.
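As a quick reference, the snippet below contains all three constructs in one
coroutine. It uses the standard library `ast` module purely to enumerate
them; note that `ast` models these as separate node classes rather than as an
`is_async` field, so this is not a picture of the extractor's own AST.

```python
import ast

SOURCE = '''\
async def fetch(items, lock):
    async with lock:
        async for item in items:
            yield item
'''

ASYNC_NODES = (ast.AsyncFunctionDef, ast.AsyncWith, ast.AsyncFor)


def async_constructs(source):
    """Return the sorted class names of all async statements in `source`."""
    tree = ast.parse(source)
    return sorted(
        type(node).__name__
        for node in ast.walk(tree)
        if isinstance(node, ASYNC_NODES)
    )
```

Each of the three statements found this way should end up with `is_async`
set after this change.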
Found when parsing `Lib/test/test_coroutines.py` using the new parser.
For whatever reason, having `await` be an `expression` (with an argument
of the same kind) resulted in a bad parse. Consulting the official
grammar, we see that `await` should actually be a `primary_expression`
instead. This is also more in line with the other unary operators, whose
precedence is shared by the `await` syntax.
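To see how tightly CPython's own parser binds `await`, the sketch below
parses two small expressions with the standard `ast` module. The helper name
`body_expr` is made up for this example, and the file that exposed the bad
parse is not reproduced here.

```python
import ast


def body_expr(expr):
    """Parse `expr` as the only statement of a coroutine and return its node."""
    module = ast.parse("async def f():\n    " + expr)
    return module.body[0].body[0].value


# `await x ** 2` groups as `(await x) ** 2`.
power = body_expr("await x ** 2")

# `-await x` groups as `-(await x)`.
negated = body_expr("-await x")
```

In both cases the `Await` node ends up as an operand of the surrounding
operator rather than wrapping the whole expression.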
Quoting the Python documentation (last paragraph of
https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences):
"Even in a raw literal, quotes can be escaped with a backslash, but the
backslash remains in the result; for example, `r"\""` is a valid string
literal consisting of two characters: a backslash and a double quote;
`r"\"` is not a valid string literal (even a raw string cannot end in an
odd number of backslashes)."
We did not handle this correctly in the scanner, as we only consumed the
backslash but not the following single or double quote, resulting in
that character getting interpreted as the end of the string.
To fix this, we do a second lookahead after consuming the backslash, and
if the next character is the end character for the string, we advance
the lexer across it as well.
Similarly, backslashes in raw strings can escape other backslashes.
Thus, for a raw string like `r'\\'` we must consume the second backslash,
otherwise we'll interpret it as escaping the end quote.
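The lookahead logic can be sketched in Python as follows. The real scanner is
not written in Python and also has to deal with prefixes and triple-quoted
strings; `raw_string_end` is a hypothetical helper that only handles a
single-quoted or double-quoted raw string body.

```python
def raw_string_end(text, start, quote):
    """Return the index of the closing quote of a raw string, or -1.

    `start` is the index of the first character after the opening quote.
    """
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 1  # consume the backslash itself
            # Second lookahead: a quote or backslash directly after the
            # backslash is escaped, so step across it as well.
            if i < len(text) and text[i] in (quote, "\\"):
                i += 1
            continue
        if ch == quote:
            return i
        i += 1
    return -1


# r"\"" : the quote after the backslash does not end the string.
escaped_quote = raw_string_end('r"\\"" + x', 2, '"')

# r'\\' : the second backslash is consumed, so the next quote is the end.
escaped_backslash = raw_string_end("r'\\\\'", 2, "'")
```

Without the second lookahead, the first example would end at index 3 (the
escaped quote) instead of index 4.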