Commit Graph

95 Commits

Author SHA1 Message Date
Taus
6546bb1b1d Merge branch 'main' into tausbn/python-fix-match-pruning-logic 2025-03-06 14:37:58 +01:00
Paolo Tranquilli
1bcc6ddb32 Rust/Ruby/Python: apply clippy lints 2025-02-25 13:21:28 +01:00
Paolo Tranquilli
6089a75262 Rust/Ruby/Python: format code 2025-02-25 13:19:03 +01:00
Paolo Tranquilli
e8799e346d Rust/Python: fix edition-related errors 2025-02-25 13:16:58 +01:00
Paolo Tranquilli
eff87d24fa Rust/Ruby/Python: update rustc and edition 2025-02-25 13:15:19 +01:00
Paolo Tranquilli
38efd4a8a2 Python: downgrade tree-sitter back to 0.20.4 2025-02-18 10:03:18 +01:00
Paolo Tranquilli
342bff6125 Python: undo tree-sitter update 2025-02-17 15:52:45 +01:00
Paolo Tranquilli
91b3d108bb Python: upgrade cargo dependencies
This required some code changes because of some breaking changes in
`clap` and `tree-sitter`.

Also needed to assign a new bazel repo name to the `crates_vendor` to
avoid name conflicts in `MODULE.bazel`.
2025-02-17 10:56:36 +01:00
Taus
918c05c538 Python: Don't prune any MatchLiteralPatterns
Extends the mechanism introduced in
https://github.com/github/codeql/pull/18030
to behave the same for _all_ `MatchLiteralPattern`s, not just the ones
that happen to be the constant `True` or `False`.

Co-authored-by: yoff <yoff@github.com>
2025-02-11 12:58:52 +00:00
Paolo Tranquilli
cc939e64fd Python: fix bazel rule 2025-02-07 14:42:26 +01:00
yoff
37ddaa36ad Merge pull request #18702 from github/tausbn/python-allow-comments-in-subscripts
Python: Allow comments in subscripts
2025-02-06 23:31:29 +01:00
Taus
131ec8d22f Python: Handle loop constructs outside of loops
Observed on some test files in Nuitka/Nuitka, having `break` and
`continue` outside of loops in Python is (to Python) a syntax error, but
our parser happily accepted this broken syntax.

This then caused issues further downstream in the control-flow
construction, as it broke some invariants.

To fix this we now skip the code that would previously fail when the
invariants are broken.

Co-authored-by: yoff <yoff@github.com>
2025-02-06 14:30:16 +00:00
Taus
7124e80f28 Python: Regenerate parser files 2025-02-06 14:05:40 +00:00
Taus
c5be2a3e2d Python: Allow comments in subscripts
Once again, the interaction between anchors and extras (specifically
comments) was causing trouble.

The root of the problem was the fact that in `a[b]`, we put `b` in the
`index` field of the subscript node, whereas in `a[b,c]`, we
additionally synthesize a `Tuple` node for `b,c` (which matches the
Python AST).

To fix this, we refactored the grammar slightly so as to make that tuple
explicit, such that a subscript node either contains a single expression
or the newly added tuple node. This greatly simplifies the logic.
2025-02-06 14:04:57 +00:00
Taus
60d97e0e16 Python: Print file path when logging context errors
This makes it _much_ easier to find the offending bit of syntax.
2025-02-05 13:13:39 +00:00
Cornelius Riemenschneider
53ca5083a9 Upgrade bazel to 8.0.0.
Previously, we were using 8.0.0rc1.
In particular, this upgrade means we need to explicitly
import more rules, as they've been moved out of the core bazel repo.
2024-12-10 12:05:37 +01:00
Taus
a9817a0281 Python: Add guide describing how to extend the parser 2024-11-28 12:32:00 +00:00
Taus
d779ae5c3e Python: Add change note for CFG pruning fix
... And also bump the extractor version.
2024-11-26 15:39:15 +00:00
Taus
a4ccda5fe3 Python: Fix pruning of literals in match pattern
Co-authored-by: yoff <lerchedahl@gmail.com>
2024-11-19 13:48:13 +00:00
Cornelius Riemenschneider
a66f8209f9 Rust: Vendor 3rdparty dependencies.
We've been observing some performance issues using crate_universe on CI.
Therefore, we're moving to vendor the auto-generated BUILD files
in our repository. This should provide a nice speed boost, while
getting rid of the complexity of the "rust cache" job we've been using
when we had a lot of git dependencies.

This PR includes a vendor script, and I'll put up a CI job internally
that runs that vendor script on Cargo.toml and Cargo.lock changes, to check
that the vendored files are in sync.
2024-11-13 13:22:14 +01:00
Taus
0bb5b4b9dc Merge pull request #17875 from github/tausbn/python-improve-parser-logging-and-timing
Python: Improve parser logging/timing/customisability
2024-11-01 12:47:46 +01:00
Taus
2892f0ff48 Merge pull request #17873 from github/tausbn/python-fix-generator-expression-locations
Python: Even more parser fixes
2024-11-01 12:47:19 +01:00
Taus
2ef3ae9860 Python: Improve parser logging/timing/customisability
Does a bunch of things, unfortunately all in the same place, so my
apologies in advance for a slightly complicated commit.

As for the changes themselves, this commit

- Adds timers for the old and new parsers. This means we get the overall
time spent on these parts of the extractor if the extractor is run with
`DEBUG` output shown.
- Adds logging information (at the `DEBUG` level) to show which
invocations of the parsers happen when, and whether they succeed or not.
- Adds support for using an environment variable named
`CODEQL_PYTHON_DISABLE_OLD_PARSER` to disable using the old parser
entirely. This makes it easier to test the new parser in isolation.
- Fixes a bug where we did not check whether a parse with the new parser
had already succeeded, and so would do a superfluous second parse.
2024-10-30 13:58:46 +00:00
Taus
f75615b913 Merge pull request #17822 from github/tausbn/python-more-parser-fixes
Python: A few more parser fixes
2024-10-30 13:47:10 +01:00
Taus
5d6600e61f Python: Fix generator expression locations
Our logic for detecting the first and last item in a generator
expression was faulty, sometimes matching comments as well. Because
attributes (like `_location_start`) can only be written once, this
caused `tree-sitter-graph` to get unhappy.

To fix this, we now require the first item to be an `expression`, and
the last one to be either a `for_in_clause` or an `if_clause`.
Crucially, `comment` is neither of these, and this prevents the
unfortunate overlap.
2024-10-28 14:53:09 +00:00
Taus
ef60b730ea Python: Fix parenthesized tuple parser bug
We were writing the `parenthesised` attribute twice on tuples, once
because of the explicit parenthetisation, and once because all non-empty
tuples are parenthesised. This made `tree-sitter-graph` unhappy.

To fix this, we now explicitly check whether a tuple is already
parenthesised, and do nothing if that is the case.
2024-10-28 14:49:45 +00:00
Taus
b4ecc7937d Python: Fix some more async parsing problems
Turns out we were not setting the `is_async` field on anything except
`async for` statements. This commit makes it so that we also do this for
`async def` and `async with`, and adds a test that this produces the
same behaviour as the old parser.
2024-10-28 14:44:02 +00:00
Taus
e710c0a6bf Python: Regenerate parser files 2024-10-28 14:44:01 +00:00
Taus
ac87868097 Python: Fix parsing of await inside expressions
Found when parsing `Lib/test/test_coroutines.py` using the new parser.

For whatever reason, having `await` be an `expression` (with an argument
of the same kind) resulted in a bad parse. Consulting the official
grammar, we see that `await` should actually be a `primary_expression`
instead. This is also more in line with the other unary operators, whose
precedence is shared by the `await` syntax.
2024-10-28 14:44:01 +00:00
Taus
1e51703ce9 Python: Allow escaped quotes/backslashes in raw strings
Quoting the Python documentation (last paragraph of
https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences):

"Even in a raw literal, quotes can be escaped with a backslash, but the
backslash remains in the result; for example, r"\"" is a valid string
literal consisting of two characters: a backslash and a double quote;
r"\" is not a valid string literal (even a raw string cannot end in an
odd number of backslashes)."

We did not handle this correctly in the scanner, as we only consumed the
backslash but not the following single or double quote, resulting in
that character getting interpreted as the end of the string.

To fix this, we do a second lookahead after consuming the backslash, and
if the next character is the end character for the string, we advance
the lexer across it as well.

Similarly, backslashes in raw strings can escape other backslashes.
Thus, for a string like '\\' we must consume the second backslash,
otherwise we'll interpret it as escaping the end quote.
2024-10-28 14:40:24 +00:00
Taus
5db601af3c Python: Allow comments in comprehensions
A somewhat complicated solution that necessitated adding a new custom
function to `tsg-python`. See the comments in `python.tsg` for why this
was necessary.
2024-10-23 14:24:47 +00:00
Taus
24ae54886f Merge pull request #17809 from github/tausbn/python-fix-kwargs-in-class-bases
Python: Fix bug in handling of `**kwargs` in class bases
2024-10-23 15:04:54 +02:00
Taus
4f60494019 Python: Support assignments of the form [x,y,z] = w
Surprisingly, the new parser did not support these constructs (and the
relevant test was missing this case), so on files that required the new
parser we were unable to parse this construct.

To fix it, we add `list_pattern` (not to be confused with
`pattern_list`) as a `tree-sitter-python` node that results in a `List`
node in the AST.
2024-10-22 16:06:35 +00:00
Taus
89ea4b8200 Python: Regenerate parser files 2024-10-22 15:39:41 +00:00
Taus
9c913902c5 Python: Allow except* to be written as except *
Turns out, `except*` is actually not a token on its own according to the
Python grammar. This means it's legal to write `except *foo: ...`, which
we previously would consider a syntax error.

To fix it, we simply break up the `except*` into two separate tokens.
2024-10-22 15:39:29 +00:00
Taus
7ceefb509b Python: Regenerate parser files 2024-10-22 15:17:34 +00:00
Taus
8053e0ed44 Python: Allow list_splats as type annotations
That is, the `*T` in `def foo(*args : *T): ...`.

This is apparently a piece of syntax we did not support correctly until
now.

In terms of the grammar, we simply add `list_splat` as a possible
alternative for `type` (which could previously only be an `expression`).
We also update `python.tsg` to not specify `expression` those places (as
the relevant stanzas will then not work for `list_splat`s).

This syntax is not supported by the old parser, hence we only add a new
parser test for it.
2024-10-22 15:17:12 +00:00
Taus
fcec8e0256 Python: Fail tests when errors/warnings are logged
This is primarily useful for ensuring that errors where a node does not
have an appropriate context set in `python.tsg` actually have an effect
on the pass/fail status of the parser tests. Previously, these would
just be logged to stdout, but test could still succeed when there were
errors present.

Also fixes one of the logging lines in `tsg_parser.py` to be more
consistent with the others.
2024-10-22 15:11:51 +00:00
Taus
9803bbdc4b Python: Update class parser test 2024-10-21 15:35:48 +00:00
Taus
1cd04c96c7 Python: Fix bug in handling of **kwargs in class bases
This caused a dataset check error on the `python/cpython` database, as
we had a `DictUnpacking` node whose parent was not a `dict_item_list`,
but rather an `expr_list`.

Investigating a bit further revealed that this was because in a
construction like

```python
class C[T](base, foo=bar, **kwargs): ...
```
we were mistakenly adding `**kwargs` to the same list as `base` (which
is just a list of expressions), rather than the same list as `foo=bar`
(which is a list of dictionary items)

The ultimate cause of this was the use of `! name` in `python.tsg` to
distinguish between bases and keyword arguments (only the latter of
which have the `name` field). Because `dictionary_splat` doesn't have a
`name` field either, these were mistakenly put in the wrong list,
leading to the error.

Also, because our previous test of `class` statements did not include a
`**kwargs` construction, we were not checking that the new parser
behaved correctly in this case. For the most part this was not a
problem, but on files that use syntax not supported by the old parser
(like type parameters on classes), this became an issue. This is also
why we did not see this error previously.

To fix this, we added `! value` (which is a field present on
`dictionary_splat` nodes) as a secondary filter, and added a third
stanza to handle `dictionary_splat` nodes.
2024-10-21 15:35:47 +00:00
Taus
ae4a4bb881 Python: Flip test expectation
This test should now validate that we no longer have dataset check
errors even when there are unencodable characters.
2024-10-21 15:32:23 +00:00
Taus
cc39ae57dc Python: Fix dataset check error for string encoding
Here's an example of one of these errors:
```
INVALID_KEY predicate py_cobjectnames(@py_cobject obj, string name)

The key set {obj} does not functionally determine all fields. Here is a
pair of tuples that agree on the key set but differ at index 1: Tuple 1
in row 63874: (72088,"u'<X>'") Tuple 2 in row 63875: (72088,"u'<?>'")
```
(Here, the substring `X` should really be the Unicode character U+FFFD,
but for some reason I'm not allowed to put that in this commit message.)

Inside the extractor, we assign IDs based on the string type (bytestring
or Unicode) and a hash of the UTF-8 encoded content of the string. In
this case, however, certain _different_ strings were receiving the same
hash, due to replacement characters in the encoding process.

In particular, we were converting unencodable characters to question
marks in one place, and to U+FFFD in another place. This caused a
discrepancy that lead to the dataset check error.

To fix this, we put in a custom error handler that always puts the
U+FFFD character in place of unencodable characters. With this, the
strings now agree, and hence there is no clash.
2024-10-21 15:31:16 +00:00
Taus
d01593e571 Python: Add test for string encoding dataset check
Note that this test checks that the current setup creates dataset check
violations. A later commit will fix this (and flip the negation in the
test).
2024-10-21 12:08:46 +00:00
Taus
417e60a466 Python: Update extractor version 2024-10-15 11:22:54 +00:00
Taus
819b3d77ab Python: Update test expectations
Note that this still includes the somewhat puzzling parsing of
`Spam[**P2]` as an exponentiation with an empty left hand side. When we
fix that bug, we should also update this test to contain actually valid
syntax.
2024-10-15 11:22:33 +00:00
Taus
36d89745f9 Python: Fix dbscheme/AST autogeneration
There was an errant `ql` in the relevant paths, a leftover from the move
from the internal repo. Also, we can no longer rely on an intree version
of the CodeQL CLI, so from now on we'll just assume it's present in the
path. (On Codespaces, `gh codeql` is a decent replacement, especially if
using the `install-stub` functionality.
2024-10-15 11:22:32 +00:00
Taus
2af0d78435 Python: Add default field to the relevant AST nodes 2024-10-15 11:22:32 +00:00
Taus
55ee3eb36b Python: Add TSG support for type defaults 2024-10-15 11:22:31 +00:00
Taus
6545bfffa7 Python: Regenerate parser files
Two new files -- alloc.h and array.h -- suddenly appeared. Presumably
they are used by the somewhat newer version of tree-sitter. To be safe,
I included them in this commit.
2024-10-15 11:22:31 +00:00
Taus
882249ef82 Python: Add grammar support for type defaults
Also fixes an oversight in the grammar: starred expressions should be
allowed inside the subscript of an `Index` expression.
2024-10-15 11:22:30 +00:00