Commit Graph

149 Commits

Author SHA1 Message Date
Paolo Tranquilli
ee13ea0f6b Harden _relative_path for Windows and mixed-form inputs
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 11:35:02 +02:00
Paolo Tranquilli
d28792537b Python extractor: use relative paths in diagnostic locations
Diagnostic `Location.file` fields contained absolute filesystem paths,
causing the GitHub UI to generate broken file links with runner paths
like `/home/runner/work/...`. Now paths are relativized against the
source root (`LGTM_SRC` or cwd), falling back to absolute if the file
is outside the source root.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 10:32:05 +02:00
Taus
6c675fcede Python: Consolidate duplicated code 2026-04-16 21:14:42 +00:00
Taus
fc5b3562c3 Python: Add parser test for comprehensions with unpacking 2026-04-14 13:27:31 +02:00
Taus
90b64616f7 Python: Also fix (value, key) bug in old parser 2026-04-14 13:27:31 +02:00
Taus
91d4cf6624 Python: Update python.tsg
First, we extend the various location overriding hacks to also accept
list and dict splats in various places. Having done this, we then have
to tackle how to actually desugar these new comprehension forms (as this
is what we currently do for the old forms).

As a reminder, a list comprehension like `[x for x in y]` currently gets
desugared into a small local function, something like

```python
def listcomp(a):
    for x in a:
        yield x
listcomp(y)
```

For `[*x for x in y]`, the behaviour we want is that we unpack `x`
before yielding its elements in turn. This is essentially what we would
get if we were to use `yield from x` instead of `yield x` in the above
desugaring, so that's what we do. This also works for set
comprehensions.

For dict comprehensions, it's slightly more complicated. Here, the
generator function instead yields a stream of `(key, value)` tuples.
(And apparently the old parser got this wrong and emitted `(value, key)`
pairs instead, which we faithfully recreated in the new parser as well.
We fix that bug in both parsers while we're at it). So, a bare `yield
from` is not enough, we also need a `.items()` call to get the
double-starred expression to emit its items as a stream of tuples (that
we then `yield from`.

To make this (hopefully) less verbose in the implementation, we defer
the decision of whether to use `yield` or `yield from` by introducing a
`yield_kind` scoped variable that determines the type of the actual AST
node. And of course for dict comprehensions with unpacking we need to
synthesise the extra machinery mentioned above.

On the plus side, this means we don't have to mess with control-flow, as
the existing machinery should be able to handle the desugared syntax
just fine.
2026-04-14 13:27:31 +02:00
Taus
97086c3cc9 Python: Regenerate parser files 2026-04-14 13:27:31 +02:00
Taus
4b5ff0b89e Python: Support unpacking in comprehensions in tree-sitter-python
This is the easy part -- we just allow `dictionary_splat` or
`list_splat` to appear in the same place as the expression.
2026-04-14 13:27:31 +02:00
Taus
1ddfed6b6b Python: Add QL support for lazy imports
Adds a new `isLazy` predicate to the relevant classes, and adds the
relevant dbscheme (and up/downgrade) changes. On upgrades we do nothing,
and on downgrades we remove the `is_lazy` bits.
2026-04-10 14:25:08 +00:00
Taus
fe94828fe4 Python: Add overlay annotations to AST template
Otherwise these will disappear every time we regenerate the AST.
2026-04-10 14:23:29 +00:00
Taus
2c79f9d828 Python: Regenerate parser files 2026-04-10 13:50:59 +00:00
Taus
ad4018f399 Python: Add parser support for lazy imports
As defined in PEP-810. We implement this in much the same way as how we
handle `async` annotations currently. The relevant nodes get an
`is_lazy` field that defaults to being false.
2026-04-10 13:50:43 +00:00
Taus
8c27437628 Python: Bump extractor version and add change note 2026-02-05 13:50:54 +00:00
Taus
12ee93042b Python: Add tests 2026-02-05 13:47:24 +00:00
Taus
bac356c9a1 Python: Regenerate parser files 2026-02-05 13:46:59 +00:00
Taus
68c1a3d389 Python: Fix syntax error when = is used as a format fill character
An example (provided by @redsun82) is the string `f"{x:=^20}"`. Parsing
this (with unnamed nodes shown) illustrates the problem:

```
module [0, 0] - [2, 0]
  expression_statement [0, 0] - [0, 11]
    string [0, 0] - [0, 11]
      string_start [0, 0] - [0, 2]
      interpolation [0, 2] - [0, 10]
        "{" [0, 2] - [0, 3]
        expression: named_expression [0, 3] - [0, 9]
          name: identifier [0, 3] - [0, 4]
          ":=" [0, 4] - [0, 6]
          ERROR [0, 6] - [0, 7]
            "^" [0, 6] - [0, 7]
          value: integer [0, 7] - [0, 9]
        "}" [0, 9] - [0, 10]
      string_end [0, 10] - [0, 11]
```
Observe that we've managed to combine the format specifier token `:` and
the fill character `=` in a single token (which doesn't match the `:` we
expect in the grammar rule), and hence we get a syntax error.

If we change the `=` to some other character (e.g. a `-`), we instead
get

```
module [0, 0] - [2, 0]
  expression_statement [0, 0] - [0, 11]
    string [0, 0] - [0, 11]
      string_start [0, 0] - [0, 2]
      interpolation [0, 2] - [0, 10]
        "{" [0, 2] - [0, 3]
        expression: identifier [0, 3] - [0, 4]
        format_specifier: format_specifier [0, 4] - [0, 9]
          ":" [0, 4] - [0, 5]
        "}" [0, 9] - [0, 10]
      string_end [0, 10] - [0, 11]
```
and in particular no syntax error.

To fix this, we want to ensure that the `:` is lexed on its own, and the
`token(prec(1, ...))` construction can be used to do exactly this.

Finally, you may wonder why `=` is special here. I think what's going on
is that the lexer knows that `:=` is a token on its own (because it's
used in the walrus operator), and so it greedily consumes the following
`=` with this in mind.
2026-02-05 13:45:54 +00:00
Ian Lynagh
d1175276ca python: Use more standard shared dbscheme sections
We now use the shared "Overlay support" and "Database metadata".
2026-01-20 11:56:13 +00:00
Ian Lynagh
d125e224ac python: Add dbscheme regeneration instructions 2026-01-20 11:56:13 +00:00
Taus
659ec3999b Mark generated files as generated 2026-01-12 15:24:01 +00:00
Taus
2c83b296a4 Python: Add parser test
Note in particular that the `exceptions.py` test is unaffected.
2026-01-06 13:40:38 +00:00
Taus
4db60df9dd Python: Regenerate parser files 2026-01-06 13:40:38 +00:00
Taus
2380bfd459 Python: Add support for PEP-758 exception syntax
See https://peps.python.org/pep-0758/ for more details.

We implement this by extending the syntax for exceptions and exception
groups so that the `type` field can now contain either an expression
(which matches the old behaviour), or a comma-separated list of at least
two elements (representing the new behaviour).

We model the latter case using a new node type `exception_list`, which
in `tsg-python` is simply mapped to a tuple. This means it matches the
existing behaviour (when the tuple is surrounded by parentheses)
exactly, hence we don't need to change any other code.

As a consequence of this, however, we cannot directly parse the Python
2.7 syntax `except Foo, e: ...` as `except Foo as e: ...`, as this would
introduce an ambiguity in the grammar. Thus, we have removed support for
the (deprecated) 2.7-style syntax, and only allow `as` to indicate
binding of the exception. The syntax `except Foo, e: ...` continues to
be parsed (in particular, it's not suddenly a syntax error), but it will
be parsed as if it were `except (Foo, e): ...`, which may not give the
correct results.

In principle we could extend the QL libraries to account for this case
(specifically when analysing Python 2 code). In practice, however, I
expect this to have a minor impact on results, and not worth the
additional investment at this time.
2026-01-06 13:40:37 +00:00
Taus
47c967a06c Python: Bump extractor version 2025-12-16 23:57:58 +01:00
Taus
28e733e335 Python: Support template strings in rest of extractor
Adds three new AST nodes to the mix:

- `TemplateString` represents a t-string in Python 3.14
- `TemplateStringPart` represents one of the string constituents of a
t-string. (The interpolated expressions are represented as `Expr` nodes,
just like f-strings.)
- `JoinedTemplateString` represents an implicit concatenation of
template strings.

Importantly, we _completely avoid_ the complicated construction we
currently do for format strings (as well as the confusing nomenclature).
No extra injection of empty strings (so that a template string is a
strict alternation of strings and expressions). A `JoinedTemplateString`
simply has a list of template string children, and a `TemplateString`
has a list of "values" which may be either `Expr` or
`TemplateStringPart` nodes.

If we ever find that we actually want the more complicated interface for
these strings, then I would much rather we reconstruct this inside of QL
rather than in the parser.
2025-12-16 23:57:58 +01:00
Taus
cd7ae34380 Python: Regenerate parser files 2025-12-16 23:57:58 +01:00
Taus
7768ebe8b8 Python: Add parser support for template strings
- Extends the scanner with a new token kind representing the start of a
template string. This is used to distinguish template strings from
regular strings (because only a template string will start with a
`_template_string_start` external token).

- Cleans up the logic surrounding interpolations (and the method names)
so that format strings and template strings behave the same in this
case.

Finally, we add two new node types in the tree-sitter grammar:

- `template_string` behaves like format strings, but is a distinct type
(mainly so that an implicit concatenation between template strings and
regular strings becomes a syntax error).
- `concatenated_template_string` is the counterpart of
`concatenated_string`.

However, internally, the string parts of a template strings are just the
same `string_content` nodes that are used in regular format strings. We
will disambiguate these inside `tsg-python`.
2025-12-16 23:57:58 +01:00
Taus
f55ff96674 Python: Bump extractor version and add change note 2025-11-27 13:52:37 +00:00
Alexander Köplinger
458f8570e8 Fix KeyError: 'name' in python/extractor/imp.py on Python 3.14
Follow-up to https://github.com/github/codeql/pull/20630

The fix didn't fully work since when we raise the ImportError in `find_module` we don't pass a named argument into the format string which causes a `KeyError`.

We need to use a format string without named arguments, like Python 3.13 and earlier did.
2025-11-25 12:38:55 +01:00
Nora Dimitrijević
e120e5c3ba Merge pull request #20337 from d10c/d10c/python-overlay-compilation-plus-extractor
Python: enable overlay compilation + extractor overlay support
2025-10-16 14:49:01 +02:00
Taus
c4b27d5f28 Python: Fix ImportError in imp.py under Python 3.14
It seems `_ERR_MSG` was silently removed in Python 3.14, leading to an
`ImportError` when running the extractor.

To fix this, we explicitly set `_ERR_MSG` when the existing import fails
(using `_ERR_MSG_PREFIX` which is available in Python 3.14+, along with
the bits that make up the difference between this and `_ERR_MSG`).
2025-10-13 13:50:43 +00:00
Nora Dimitrijević
c749607db8 Bump python extractor version to 7.1.5 2025-10-07 11:22:16 +02:00
Nora Dimitrijević
1a9683f986 Add @top database type 2025-10-06 11:47:14 +02:00
Nora Dimitrijević
6f208e9dec Write overlay metadata at end of extraction. 2025-10-06 11:47:12 +02:00
Nora Dimitrijević
49b18db044 Python extractor: in overlay mode, traverse only changed files
- fall back to full extraction on overlay changes json read error
- we filter both root modules and (transitive) imports against the overlay-changes json.
2025-10-06 11:47:09 +02:00
Nora Dimitrijević
e0cf719cb9 Path transformer: handle Windows-style paths
And don't add slash to start of path patterns on Windows.
2025-10-06 11:37:04 +02:00
Nora Dimitrijević
29b1a7403b Support CODEQL_PATH_TRANSFORMER env var in python path renamer
The new name is required by overlay support.
2025-10-06 11:37:02 +02:00
Nora Dimitrijević
a88d3397cd Add overlay builtins to python dbscheme 2025-10-06 11:36:56 +02:00
Taus
f5a06bef4a Merge pull request #19929 from github/tausbn/python-update-tree-sitter-dependency
Python: Update `tree-sitter` dependency
2025-09-17 13:40:13 +02:00
Arthur Baars
5d3ec35e29 Remove non-breaking spaces from code 2025-09-05 09:41:15 +02:00
Taus
f6732a927b Python: Bump extractor version 2025-09-03 11:56:54 +00:00
Taus
13a93c7e32 Python: Add suggestions from Copilot 2025-09-03 11:55:49 +00:00
Taus
9802ad77dc Python: Update types_new.py and test output 2025-09-02 12:41:57 +00:00
Taus
235822d782 Python: Improve handling of syntax errors
Rather than relying on matching arbitrary nodes inside tree-sitter-graph
and then checking whether they are of type ERROR or MISSING (which seems
to have stopped working in later versions of tree-sitter), we now
explicitly go through the tree-sitter tree, locating all of the error
and missing nodes along the way. We then add these on to the graph
output in the same format as was previously produced by
tree-sitter-graph.

Note that it's very likely that some of the syntax errors will move
around a bit as a consequence of this change. In general, we don't
expect syntax errors to have stable locations, as small changes in the
grammar can cause an error to appear in a different position, even if
the underlying (erroneous) code has not changed.
2025-09-02 12:41:57 +00:00
Taus
b108d47b26 Python: Update parser test output
It seems that with a newer version of tree-sitter, we no longer parse
the (not actually valid!) syntax `Spam[**P2]` as if the `**` is an
exponentiation operation (with a missing left operand).
2025-09-02 12:41:55 +00:00
Taus
76f15a890c Python: Update tree-sitter dependency
Updates the Python extractor to depend on version 0.24.7 of tree-sitter
(and 0.12.0 of tree-sitter-graph).

A few changes were needed in order to make the code build and run after
updating the dependencies:

- In `main.rs`, the `Language` parameter is now passed as a reference.
- In `python.tsg`, many queries had captures that were not actually used
in the body of the stanza. This is no longer allowed (unless the
captures start with an underscore), as it may indicate an error. To fix
this, I added underscores in the appropriate places (and verified that
none of these unused captures were in fact bugs).
2025-09-02 12:40:20 +00:00
Taus
ad53518644 Python: Regenerate parser files 2025-06-26 15:34:44 +00:00
Taus
e04821e9e3 Python: Allow use of match as an identifier
This previously only worked in certain circumstances. In particular,
assignments such as `match[1] = ...` or even just `match[1]` would fail
to parse correctly.

Fixing this turned out to be less trivial than anticipated. Consider the
fact that
```
match [1]: case (...)
```
can either look the start of a `match` statement, or it could be a type
ascription, ascribing the value of `case(...)` (a call) to the item at
index 1 of `match`.

To fix this, then, we give `match` the identifier and `match` the
statement the same precendence in the grammar, and additionally also
mark a conflict between `match_statement` and `primary_expression`. This
causes the conflict to be resolved dynamically, and seems to do the
right thing in all cases.
2025-06-26 15:33:00 +00:00
Taus
2158eaa34c Python: Fix a bug in glob regex creation
The previous version was tested on a version of the code where we had
temporarily removed the `glob.strip("/")` bit, and so the bug didn't
trigger then.

We now correctly remember if the glob ends in `/`, and add an extra part
in that case. This way, if the path ends with multiple slashes, they
effectively get consolidated into a single one, which results in the
correct semantics.
2025-05-15 15:34:11 +00:00
Taus
c8cca126a1 Python: Bump extractor version 2025-05-15 14:59:33 +00:00
Taus
96558b53b8 Python: Update test
The second test case now sets the `paths-ignore` setting in the config
file in order to skip files in hidden directories.
2025-05-15 14:53:15 +00:00