This previously only worked in certain circumstances. In particular,
assignments such as `match[1] = ...` or even just `match[1]` would fail
to parse correctly.
Fixing this turned out to be less trivial than anticipated. Consider the
fact that
```
match [1]: case (...)
```
can either look the start of a `match` statement, or it could be a type
ascription, ascribing the value of `case(...)` (a call) to the item at
index 1 of `match`.
To fix this, then, we give `match` the identifier and `match` the
statement the same precendence in the grammar, and additionally also
mark a conflict between `match_statement` and `primary_expression`. This
causes the conflict to be resolved dynamically, and seems to do the
right thing in all cases.
The previous version was tested on a version of the code where we had
temporarily removed the `glob.strip("/")` bit, and so the bug didn't
trigger then.
We now correctly remember if the glob ends in `/`, and add an extra part
in that case. This way, if the path ends with multiple slashes, they
effectively get consolidated into a single one, which results in the
correct semantics.
If you have a filter like `**/foo/**` set in the `paths-ignore` bit of
your config file, then currently the following happens:
- First, the CodeQL CLI observes that this string ends in `/**` and
strips off the `**` leaving `**/foo/`
- Then the Python extractor strips off leading and trailing `/`
characters and proceeds to convert `**/foo` into a regex that is
matched against files to (potentially) extract.
The trouble with this is that it leaves us unable to distinguish
between, say, a file `foo.py` and a file `foo/bar.py`. In other words,
we have lost the ability to exclude only the _folder_ `foo` and not any
files that happen to start with `foo`.
To fix this, we instead make a note of whether the glob ends in a
forward slash or not, and adjust the regex correspondingly.
Changes the default behaviour of the Python extractor so files inside
hidden directories are extracted by default.
Also adds an extractor option, `skip_hidden_directories`, which can be
set to `true` in order to revert to the old behaviour.
Finally, I made the logic surrounding what is logged in various cases a
bit more obvious.
Technically this changes the behaviour of the extractor (in that hidden
excluded files will now be logged as `(excluded)`, but I think this
makes more sense anyway.
This required some code changes because of some breaking changes in
`clap` and `tree-sitter`.
Also needed to assign a new bazel repo name to the `crates_vendor` to
avoid name conflicts in `MODULE.bazel`.
Extends the mechanism introduced in
https://github.com/github/codeql/pull/18030
to behave the same for _all_ `MatchLiteralPattern`s, not just the ones
that happen to be the constant `True` or `False`.
Co-authored-by: yoff <yoff@github.com>
Observed on some test files in Nuitka/Nuitka, having `break` and
`continue` outside of loops in Python is (to Python) a syntax error, but
our parser happily accepted this broken syntax.
This then caused issues further downstream in the control-flow
construction, as it broke some invariants.
To fix this we now skip the code that would previously fail when the
invariants are broken.
Co-authored-by: yoff <yoff@github.com>
Once again, the interaction between anchors and extras (specifically
comments) was causing trouble.
The root of the problem was the fact that in `a[b]`, we put `b` in the
`index` field of the subscript node, whereas in `a[b,c]`, we
additionally synthesize a `Tuple` node for `b,c` (which matches the
Python AST).
To fix this, we refactored the grammar slightly so as to make that tuple
explicit, such that a subscript node either contains a single expression
or the newly added tuple node. This greatly simplifies the logic.
Previously, we were using 8.0.0rc1.
In particular, this upgrade means we need to explicitly
import more rules, as they've been moved out of the core bazel repo.
We've been observing some performance issues using crate_universe on CI.
Therefore, we're moving to vendor the auto-generated BUILD files
in our repository. This should provide a nice speed boost, while
getting rid of the complexity of the "rust cache" job we've been using
when we had a lot of git dependencies.
This PR includes a vendor script, and I'll put up a CI job internally
that runs that vendor script on Cargo.toml and Cargo.lock changes, to check
that the vendored files are in sync.
Does a bunch of things, unfortunately all in the same place, so my
apologies in advance for a slightly complicated commit.
As for the changes themselves, this commit
- Adds timers for the old and new parsers. This means we get the overall
time spent on these parts of the extractor if the extractor is run with
`DEBUG` output shown.
- Adds logging information (at the `DEBUG` level) to show which
invocations of the parsers happen when, and whether they succeed or not.
- Adds support for using an environment variable named
`CODEQL_PYTHON_DISABLE_OLD_PARSER` to disable using the old parser
entirely. This makes it easier to test the new parser in isolation.
- Fixes a bug where we did not check whether a parse with the new parser
had already succeeded, and so would do a superfluous second parse.
Our logic for detecting the first and last item in a generator
expression was faulty, sometimes matching comments as well. Because
attributes (like `_location_start`) can only be written once, this
caused `tree-sitter-graph` to get unhappy.
To fix this, we now require the first item to be an `expression`, and
the last one to be either a `for_in_clause` or an `if_clause`.
Crucially, `comment` is neither of these, and this prevents the
unfortunate overlap.
We were writing the `parenthesised` attribute twice on tuples, once
because of the explicit parenthetisation, and once because all non-empty
tuples are parenthesised. This made `tree-sitter-graph` unhappy.
To fix this, we now explicitly check whether a tuple is already
parenthesised, and do nothing if that is the case.
Turns out we were not setting the `is_async` field on anything except
`async for` statements. This commit makes it so that we also do this for
`async def` and `async with`, and adds a test that this produces the
same behaviour as the old parser.
Found when parsing `Lib/test/test_coroutines.py` using the new parser.
For whatever reason, having `await` be an `expression` (with an argument
of the same kind) resulted in a bad parse. Consulting the official
grammar, we see that `await` should actually be a `primary_expression`
instead. This is also more in line with the other unary operators, whose
precedence is shared by the `await` syntax.
Quoting the Python documentation (last paragraph of
https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences):
"Even in a raw literal, quotes can be escaped with a backslash, but the
backslash remains in the result; for example, r"\"" is a valid string
literal consisting of two characters: a backslash and a double quote;
r"\" is not a valid string literal (even a raw string cannot end in an
odd number of backslashes)."
We did not handle this correctly in the scanner, as we only consumed the
backslash but not the following single or double quote, resulting in
that character getting interpreted as the end of the string.
To fix this, we do a second lookahead after consuming the backslash, and
if the next character is the end character for the string, we advance
the lexer across it as well.
Similarly, backslashes in raw strings can escape other backslashes.
Thus, for a string like '\\' we must consume the second backslash,
otherwise we'll interpret it as escaping the end quote.
A somewhat complicated solution that necessitated adding a new custom
function to `tsg-python`. See the comments in `python.tsg` for why this
was necessary.
Surprisingly, the new parser did not support these constructs (and the
relevant test was missing this case), so on files that required the new
parser we were unable to parse this construct.
To fix it, we add `list_pattern` (not to be confused with
`pattern_list`) as a `tree-sitter-python` node that results in a `List`
node in the AST.
Turns out, `except*` is actually not a token on its own according to the
Python grammar. This means it's legal to write `except *foo: ...`, which
we previously would consider a syntax error.
To fix it, we simply break up the `except*` into two separate tokens.
That is, the `*T` in `def foo(*args : *T): ...`.
This is apparently a piece of syntax we did not support correctly until
now.
In terms of the grammar, we simply add `list_splat` as a possible
alternative for `type` (which could previously only be an `expression`).
We also update `python.tsg` to not specify `expression` those places (as
the relevant stanzas will then not work for `list_splat`s).
This syntax is not supported by the old parser, hence we only add a new
parser test for it.
This is primarily useful for ensuring that errors where a node does not
have an appropriate context set in `python.tsg` actually have an effect
on the pass/fail status of the parser tests. Previously, these would
just be logged to stdout, but test could still succeed when there were
errors present.
Also fixes one of the logging lines in `tsg_parser.py` to be more
consistent with the others.
This caused a dataset check error on the `python/cpython` database, as
we had a `DictUnpacking` node whose parent was not a `dict_item_list`,
but rather an `expr_list`.
Investigating a bit further revealed that this was because in a
construction like
```python
class C[T](base, foo=bar, **kwargs): ...
```
we were mistakenly adding `**kwargs` to the same list as `base` (which
is just a list of expressions), rather than the same list as `foo=bar`
(which is a list of dictionary items)
The ultimate cause of this was the use of `! name` in `python.tsg` to
distinguish between bases and keyword arguments (only the latter of
which have the `name` field). Because `dictionary_splat` doesn't have a
`name` field either, these were mistakenly put in the wrong list,
leading to the error.
Also, because our previous test of `class` statements did not include a
`**kwargs` construction, we were not checking that the new parser
behaved correctly in this case. For the most part this was not a
problem, but on files that use syntax not supported by the old parser
(like type parameters on classes), this became an issue. This is also
why we did not see this error previously.
To fix this, we added `! value` (which is a field present on
`dictionary_splat` nodes) as a secondary filter, and added a third
stanza to handle `dictionary_splat` nodes.