These tests consist of various Python constructions (hopefully a
somewhat comprehensive set) with specific timestamp annotations
scattered throughout. When the tests are run using the Python 3
interpreter, these annotations are checked and compared to the "current
timestamp" to see that they are in agreement. This is what makes the
tests "self-validating".
There are a few different kinds of annotations: the basic `t[4]` style
(meaning this is executed at timestamp 4), the `t[dead(4)]` variant
(meaning this _would_ happen at timestamp 4, but it is in a dead
branch), and `t[never]` (meaning this is never executed at all).
In addition to this, there is a query, MissingAnnotations, which checks
whether we have applied these annotations maximally. Many expression
nodes are not actually annotatable, so there is a sizeable list of
excluded nodes for that query.
Changes based on code review:
1. Remove redundant strings.Contains check in isExactTestPackage
The equality check on the next line handles both cases, making
the early return unnecessary.
2. Extract package selection logic into selectBestPackages function
This reduces code duplication and allows the test to call the
actual implementation rather than copying the logic.
3. Add TestSelectBestPackages to test the new function
Comprehensive test covering single packages, test vs production,
exact vs nested tests, and multiple packages.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Generated by manually applying the output from CI's Gazelle check.
This adds the go_test target for the new extractor_test.go file.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This test verifies that root internal test files (package foo, not
foo_test) are correctly extracted when the repository has both:
1. Root-level internal tests (main_test.go with package main)
2. Nested packages with tests (nested/nested_test.go)
This scenario reproduces the bug that was fixed: the old extractor
would select the wrong package variant and miss root internal test
files.
The test ensures:
- main_test.go (root internal test) is extracted
- nested/nested_test.go (nested test) is extracted
- All test functions from both files are present in the database
This prevents regression of the bug fix.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
When CODEQL_EXTRACTOR_GO_OPTION_EXTRACT_TESTS=true is set, the Go
extractor was incorrectly skipping internal test files (package foo)
at repository roots when the project contains nested test packages.
Root Cause:
The extractor selected package variants by longest ID string, but this
heuristic fails when nested packages have tests. For a package like
"github.com/go-git/go-git/v6", packages.Load returns multiple variants:
1. "github.com/go-git/go-git/v6" (19 files, production only)
2. "github.com/go-git/go-git/v6 [github.com/go-git/go-git/v6.test]"
(39 files, production + 20 root tests) ← Should select this
3. "github.com/go-git/go-git/v6 [github.com/go-git/go-git/v6/plumbing/format/packfile.test]"
(19 files, test dependency) ← Was incorrectly selected (longest string)
The old logic selected variant #3 (76 chars) over #2 (68 chars),
causing 20 root test files to be missing from the database.
Fix:
Replace string length comparison with a better heuristic that prefers:
1. Exact test packages (e.g., "pkg [pkg.test]") over nested dependencies
2. Packages with more Syntax nodes (more files to extract)
3. String length as a tiebreaker
This ensures the extractor selects the variant with the most complete
test coverage, particularly for root-level internal tests.
Testing:
- Added comprehensive unit tests covering the selection logic
- Tests simulate the real-world go-git scenario
- All tests pass
Impact:
Root-level external tests (package foo_test) were already extracted
correctly. This fix ensures internal tests (package foo) at the root
are now also extracted when they exist alongside nested test packages.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previously, apply_rules_inner snapshotted a node's fields by cloning
the BTreeMap into a Vec<(FieldId, Vec<Id>)>, then built a fresh
BTreeMap of new_fields for the rewritten Ids. For a node with N
fields, this allocated 2N+1 things per visit (the snapshot Vec, N
cloned children Vecs, the new BTreeMap entries) — even when nothing
in the subtree was rewritten.
Use std::mem::take to swap the parent's fields out by ownership: the
recursion can mutate the AST (including pushing new nodes from rule
firings) without any conflict, since we hold the owned BTreeMap
locally. Iterate values_mut() and only allocate a fresh children Vec
on the first divergence (lazy alloc): unchanged children stay in the
existing slot. When done, swap the fields back.
For a subtree with no rewrites, this is now zero allocations per node
(modulo the recursion itself). For nodes with rewrites, it's one Vec
allocation per field that contains a rewritten child, instead of two
plus the BTreeMap rebuild.
apply_rules_inner used to handle the "child was rewritten, so the
parent needs new field IDs" case by cloning the parent node, swapping
in the new fields, pushing the clone onto the arena, and returning the
new Id. Every ancestor on the path from the rewrite up to the root was
duplicated this way, with the originals retained as garbage in the
arena.
Switch to in-place mutation: assign `ast.nodes[id].fields = new_fields`
and return the same Id. Rule firings still produce genuinely new nodes
via BuildCtx (their structure differs from the input), but the
ancestor-rebuild spine no longer copies anything.
This is safe because apply_rules_inner already works entirely by Id:
the field snapshot is cloned out before recursing, no &Node references
are held across mutations of the arena, and captures are scoped to a
single rule firing so the now-stable Ids do not break anything.
Memory effect: a desugaring pass that rewrites R leaves of a tree of
average depth d previously appended R*d ancestor clones to the arena.
Now appends 0.
With Ids stable for the lifetime of an Ast, the Node::id field becomes
truly redundant and is removed (along with the Node::id() accessor).
AstCursor switches from caching `node: &Node` to tracking `node_id:
Id` and looking the node up via the arena on each access; ChildrenIter
now yields Ids directly. A new AstCursor::node_id() method gives
callers access to the cursor position by Id.
Previously, a bare child pattern in a query took whatever the next
child of the iterator was and either matched or failed: it would not
scan ahead to find a match. So `(foo ("baz"))` against a `foo` whose
implicit `child` field was `["bar", "baz"]` would fail (the pattern
took "bar" first).
Switch to forward-scan semantics: a SingleNode matcher advances through
the iterator until it finds a child that matches its sub-query. Patterns
that are named-only continue to skip past unnamed children for free.
Order is preserved across multiple bare patterns at the same level —
each pattern advances the shared iterator past whatever it consumed —
so a query cannot match children out of source order.
Captures from a failed match attempt are rolled back via a snapshot, so
partial captures from a complex sub-query do not leak across attempts.
Add two regression tests against the `do` body wrapper in a Ruby
for-loop, whose implicit `child` field contains [do, identifier, end]:
- a query for ("end") matches by skipping past `do` and the identifier
- a query for ("end") then ("do") fails, demonstrating order preservation
Schema::from_language registered unnamed kinds via or_insert(id), where
`id` came from iterating 0..node_kind_count. For names with multiple
unnamed IDs (notably "end" in tree-sitter-ruby has IDs 0 and 13, where
ID 0 is the reserved error token), this picked the first encountered
ID — typically the wrong one.
The visitor sets node.kind via language.id_for_node_kind(name, false),
which returns the canonical ID. So a query for ("end") would compare
node.kind=13 against schema=0 and silently fail to match, with no
diagnostic.
Use language.id_for_node_kind(name, false) to obtain the canonical ID
when registering, mirroring the named-kind path that already does the
same with id_for_node_kind(name, true).
Three improvements to the query parser, all aimed at allowing query
patterns to refer to unnamed tokens:
1. Bare-literal capture: `"=" @op` now captures the unnamed `=` token,
matching the parenthesized form `("=") @op`. Previously the literal
branch in parse_query_list skipped the maybe_wrap_capture call, so
the `@op` was a leftover token and would error.
2. Bare `_` matches any node, named or unnamed. Previously bare `_` and
`(_)` both produced QueryNode::Any with the same matches_named_only
behaviour, so bare `_` would skip unnamed children. Now Any carries a
match_unnamed flag: false for `(_)` (named-only, tree-sitter default)
and true for bare `_` (any node).
3. Named fields and bare child patterns may be intermixed in any order.
Previously, once parse_query_fields saw a bare pattern it would stop
accepting named fields. The fix accumulates bare patterns into the
implicit `child` field and keeps parsing.
Each named field independently selects its target field for matching, so
the source-order of fields in the query is purely cosmetic and intermixing
is safe.
Add tests covering parenthesized capture, bare-literal capture, and the
named-vs-any distinction between `(_)` and bare `_`. Update query-syntax
docs to reflect all three.
Extend the desugaring config from a single flat list of rules to an
ordered sequence of named Phases. Each phase runs to completion (a
full traversal applying its rules) before the next phase starts.
Rules in different phases never compete for matches.
The config is built via the new chainable API:
DesugaringConfig::new()
.add_phase("cleanup", cleanup_rules)
.add_phase("desugar", desugar_rules)
.with_output_node_types_yaml(yaml);
Single-phase configs are just .add_phase(...) called once.
A single FreshScope is shared across phases so generated identifier
names (e.g. $tmp-N) are unique throughout the run.
Phase names appear in error messages, e.g. "Phase `desugar`:
exceeded maximum rewrite depth".
Add two regression tests: one verifying basic two-phase chained
desugaring, and one verifying that errors include the failing phase
name.
Previously, after a rule fired the engine would always re-try that
same rule on the result root. A rule whose output matched its own
query (intentionally or by accident) would loop until the global
MAX_REWRITE_DEPTH safety net kicked in.
Make the default behavior fire-once-per-node: after a rule fires on
node N, the engine no longer tries that same rule on the result root.
Other rules and child traversal are unaffected. Rules that
intentionally rewrite iteratively can opt into the old behavior via
the new Rule::repeated() builder method.
Add two regression tests using a self-swapping assignment rule:
- with .repeated(), the swap loops and trips the depth limit
- without it (default), the swap fires once and terminates
Adds a regression test verifying that desugaring rules can chain across
output-only node kinds: a first rule rewrites an input kind to an
output-only kind, and a second rule then rewrites that output-only
kind into another output-only kind. This exercises the schema lookup
for query patterns whose root kind is not present in the input
tree-sitter grammar.
Add BUILD.bazel files for the yeast and yeast-macros crates, register
them as dependencies of the shared tree-sitter extractor, and refresh
the vendored crate dependencies via update_tree_sitter_extractors_deps.sh.