Adds three new AST nodes to the mix:
- `TemplateString` represents a t-string in Python 3.14
- `TemplateStringPart` represents one of the string constituents of a
t-string. (The interpolated expressions are represented as `Expr` nodes,
just like f-strings.)
- `JoinedTemplateString` represents an implicit concatenation of
template strings.
Importantly, we _completely avoid_ the complicated construction we
currently do for format strings (as well as the confusing nomenclature).
No extra injection of empty strings (so that a template string is a
strict alternation of strings and expressions). A `JoinedTemplateString`
simply has a list of template string children, and a `TemplateString`
has a list of "values" which may be either `Expr` or
`TemplateStringPart` nodes.
If we ever find that we actually want the more complicated interface for
these strings, then I would much rather we reconstruct this inside of QL
rather than in the parser.
- Extends the scanner with a new token kind representing the start of a
template string. This is used to distinguish template strings from
regular strings (because only a template string will start with a
`_template_string_start` external token).
- Cleans up the logic surrounding interpolations (and the method names)
so that format strings and template strings behave the same in this
case.
Finally, we add two new node types in the tree-sitter grammar:
- `template_string` behaves like format strings, but is a distinct type
(mainly so that an implicit concatenation between template strings and
regular strings becomes a syntax error).
- `concatenated_template_string` is the counterpart of
`concatenated_string`.
However, internally, the string parts of a template strings are just the
same `string_content` nodes that are used in regular format strings. We
will disambiguate these inside `tsg-python`.
Follow-up to https://github.com/github/codeql/pull/20630
The fix didn't fully work since when we raise the ImportError in `find_module` we don't pass a named argument into the format string which causes a `KeyError`.
We need to use a format string without named arguments, like Python 3.13 and earlier did.
It seems `_ERR_MSG` was silently removed in Python 3.14, leading to an
`ImportError` when running the extractor.
To fix this, we explicitly set `_ERR_MSG` when the existing import fails
(using `_ERR_MSG_PREFIX` which is available in Python 3.14+, along with
the bits that make up the difference between this and `_ERR_MSG`).
- fall back to full extraction on overlay changes json read error
- we filter both root modules and (transitive) imports against the overlay-changes json.
Rather than relying on matching arbitrary nodes inside tree-sitter-graph
and then checking whether they are of type ERROR or MISSING (which seems
to have stopped working in later versions of tree-sitter), we now
explicitly go through the tree-sitter tree, locating all of the error
and missing nodes along the way. We then add these on to the graph
output in the same format as was previously produced by
tree-sitter-graph.
Note that it's very likely that some of the syntax errors will move
around a bit as a consequence of this change. In general, we don't
expect syntax errors to have stable locations, as small changes in the
grammar can cause an error to appear in a different position, even if
the underlying (erroneous) code has not changed.
It seems that with a newer version of tree-sitter, we no longer parse
the (not actually valid!) syntax `Spam[**P2]` as if the `**` is an
exponentiation operation (with a missing left operand).
Updates the Python extractor to depend on version 0.24.7 of tree-sitter
(and 0.12.0 of tree-sitter-graph).
A few changes were needed in order to make the code build and run after
updating the dependencies:
- In `main.rs`, the `Language` parameter is now passed as a reference.
- In `python.tsg`, many queries had captures that were not actually used
in the body of the stanza. This is no longer allowed (unless the
captures start with an underscore), as it may indicate an error. To fix
this, I added underscores in the appropriate places (and verified that
none of these unused captures were in fact bugs).
This previously only worked in certain circumstances. In particular,
assignments such as `match[1] = ...` or even just `match[1]` would fail
to parse correctly.
Fixing this turned out to be less trivial than anticipated. Consider the
fact that
```
match [1]: case (...)
```
can either look the start of a `match` statement, or it could be a type
ascription, ascribing the value of `case(...)` (a call) to the item at
index 1 of `match`.
To fix this, then, we give `match` the identifier and `match` the
statement the same precendence in the grammar, and additionally also
mark a conflict between `match_statement` and `primary_expression`. This
causes the conflict to be resolved dynamically, and seems to do the
right thing in all cases.
The previous version was tested on a version of the code where we had
temporarily removed the `glob.strip("/")` bit, and so the bug didn't
trigger then.
We now correctly remember if the glob ends in `/`, and add an extra part
in that case. This way, if the path ends with multiple slashes, they
effectively get consolidated into a single one, which results in the
correct semantics.
If you have a filter like `**/foo/**` set in the `paths-ignore` bit of
your config file, then currently the following happens:
- First, the CodeQL CLI observes that this string ends in `/**` and
strips off the `**` leaving `**/foo/`
- Then the Python extractor strips off leading and trailing `/`
characters and proceeds to convert `**/foo` into a regex that is
matched against files to (potentially) extract.
The trouble with this is that it leaves us unable to distinguish
between, say, a file `foo.py` and a file `foo/bar.py`. In other words,
we have lost the ability to exclude only the _folder_ `foo` and not any
files that happen to start with `foo`.
To fix this, we instead make a note of whether the glob ends in a
forward slash or not, and adjust the regex correspondingly.
Changes the default behaviour of the Python extractor so files inside
hidden directories are extracted by default.
Also adds an extractor option, `skip_hidden_directories`, which can be
set to `true` in order to revert to the old behaviour.
Finally, I made the logic surrounding what is logged in various cases a
bit more obvious.
Technically this changes the behaviour of the extractor (in that hidden
excluded files will now be logged as `(excluded)`, but I think this
makes more sense anyway.
This required some code changes because of some breaking changes in
`clap` and `tree-sitter`.
Also needed to assign a new bazel repo name to the `crates_vendor` to
avoid name conflicts in `MODULE.bazel`.
Extends the mechanism introduced in
https://github.com/github/codeql/pull/18030
to behave the same for _all_ `MatchLiteralPattern`s, not just the ones
that happen to be the constant `True` or `False`.
Co-authored-by: yoff <yoff@github.com>
Observed on some test files in Nuitka/Nuitka, having `break` and
`continue` outside of loops in Python is (to Python) a syntax error, but
our parser happily accepted this broken syntax.
This then caused issues further downstream in the control-flow
construction, as it broke some invariants.
To fix this we now skip the code that would previously fail when the
invariants are broken.
Co-authored-by: yoff <yoff@github.com>
Once again, the interaction between anchors and extras (specifically
comments) was causing trouble.
The root of the problem was the fact that in `a[b]`, we put `b` in the
`index` field of the subscript node, whereas in `a[b,c]`, we
additionally synthesize a `Tuple` node for `b,c` (which matches the
Python AST).
To fix this, we refactored the grammar slightly so as to make that tuple
explicit, such that a subscript node either contains a single expression
or the newly added tuple node. This greatly simplifies the logic.
Previously, we were using 8.0.0rc1.
In particular, this upgrade means we need to explicitly
import more rules, as they've been moved out of the core bazel repo.