Add yeast documentation

Covers architecture, query language, template language (tree!/trees!/rule!),
capture semantics, fresh identifiers, and extractor integration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Taus
2026-05-04 13:13:46 +00:00
parent 6b229b9a6e
commit c2dee1884c

shared/yeast/doc/yeast.md Normal file

@@ -0,0 +1,314 @@
# YEAST — YEAST Elaborates Abstract Syntax Trees
YEAST is a framework for transforming tree-sitter parse trees before they are
extracted into a CodeQL database. It sits between the tree-sitter parser and
the TRAP extractor, rewriting parts of the AST according to declarative rules.
## Motivation
Tree-sitter grammars describe the **concrete syntax** of a language — every
keyword, operator, and punctuation token appears in the parse tree. CodeQL
analyses often prefer a **simplified abstract syntax** where syntactic sugar
has been removed. YEAST bridges this gap by desugaring the tree-sitter output
into a cleaner form before extraction.
For example, Ruby's `for x in list do ... end` is syntactic sugar for
`list.each { |x| ... }`. A YEAST rule can rewrite the former into the latter
so that CodeQL queries only need to reason about the `.each` form.
## Architecture
```
  Source code
┌──────────────┐
│ tree-sitter  │  Parse source into a concrete syntax tree
│    parser    │
└──────┬───────┘
       │ tree_sitter::Tree
┌──────────────┐
│    YEAST     │  Apply desugaring rules, producing a new AST
│    Runner    │
└──────┬───────┘
       │ yeast::Ast
┌──────────────┐
│     TRAP     │  Walk the (possibly rewritten) AST and emit TRAP tuples
│  extractor   │
└──────────────┘
```
The entry point is `extract_and_desugar()` in the shared tree-sitter
extractor, which passes a set of rules to the YEAST `Runner`. The original
`extract()` function passes an empty rule set, leaving the tree unchanged.
## How desugaring works
A YEAST `Rule` has two parts:
1. A **query** that matches nodes in the AST using a tree-sitter-inspired
pattern language.
2. A **transform** that produces replacement nodes from the match captures.
The `Runner` applies rules by walking the tree top-down. At each node, it
tries each rule in order. If a rule's query matches, the node is replaced by
the transform's output, and the rules are re-applied to the result. If no
rule matches, the node is kept and its children are processed recursively.
A rule can replace one node with zero nodes (deletion), one node (rewriting),
or multiple nodes (expansion).
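
The traversal described above can be sketched with a toy model. This is not the YEAST implementation: `Node`, `Rule`, `rewrite`, and the three example rules are hypothetical names invented for illustration, and real rules match with queries rather than plain functions. The sketch only shows the control flow: the first matching rule wins, its output is rewritten again, and unmatched nodes recurse into their children.

```rust
// Toy model only: every name here is hypothetical, not the YEAST API.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    kind: String,
    children: Vec<Node>,
}

impl Node {
    fn new(kind: &str, children: Vec<Node>) -> Node {
        Node { kind: kind.to_string(), children }
    }
}

/// A rule returns `None` when it does not match, or `Some(replacements)`
/// when it does. The replacement list may hold zero, one, or many nodes.
type Rule = fn(&Node) -> Option<Vec<Node>>;

/// Rewrite `node` top-down and return the nodes that take its place.
fn rewrite(node: Node, rules: &[Rule]) -> Vec<Node> {
    // Try each rule in order; the first match wins.
    for rule in rules {
        if let Some(replacements) = rule(&node) {
            // Re-apply the rules to everything the transform produced.
            return replacements
                .into_iter()
                .flat_map(|n| rewrite(n, rules))
                .collect();
        }
    }
    // No rule matched: keep the node and process its children.
    let children = node
        .children
        .into_iter()
        .flat_map(|child| rewrite(child, rules))
        .collect();
    vec![Node { kind: node.kind, children }]
}

/// Deletion: a `comment` node is replaced by zero nodes.
fn drop_comments(node: &Node) -> Option<Vec<Node>> {
    if node.kind == "comment" { Some(Vec::new()) } else { None }
}

/// Rewriting: a `for` node is replaced by one `call` node.
fn rename_for(node: &Node) -> Option<Vec<Node>> {
    if node.kind == "for" {
        Some(vec![Node::new("call", node.children.clone())])
    } else {
        None
    }
}

/// Expansion: a `group` node is replaced by all of its children.
fn flatten_group(node: &Node) -> Option<Vec<Node>> {
    if node.kind == "group" { Some(node.children.clone()) } else { None }
}
```

In this model, a rule whose output matches its own pattern would recurse forever, so each rule has to produce something it no longer matches.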
## Query language
Queries use a syntax inspired by
[tree-sitter queries](https://tree-sitter.github.io/tree-sitter/using-parsers/queries/index.html),
written inside the `yeast::query!()` proc macro.
### Node patterns
```rust
// Match any named node
(_)
// Match a node of a specific kind
(assignment)
// Match an unnamed token by its text
("end")
```
### Fields
```rust
// Match a node with specific fields
(assignment
  left: (identifier) @lhs
  right: (_) @rhs
)
```
Fields are matched by name. Unmentioned fields are ignored — the pattern
`(assignment left: (_) @x)` matches any `assignment` node regardless of
what's in `right`.
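
A minimal sketch of this "unmentioned fields are ignored" behaviour, using a toy matcher. The types `Node` and `Pattern` and the function `matches` are hypothetical and are not part of YEAST. The sketch also assumes that a field named in the pattern but absent from the node makes the match fail.

```rust
use std::collections::HashMap;

// Toy model only: these types are hypothetical, not the YEAST API.
struct Node {
    kind: &'static str,
    fields: HashMap<&'static str, Node>,
}

struct Pattern {
    /// `None` plays the role of the wildcard `(_)`.
    kind: Option<&'static str>,
    /// Only the fields the pattern mentions.
    fields: Vec<(&'static str, Pattern)>,
}

/// A leaf node with no fields.
fn leaf(kind: &'static str) -> Node {
    Node { kind, fields: HashMap::new() }
}

/// The wildcard pattern `(_)`.
fn any() -> Pattern {
    Pattern { kind: None, fields: Vec::new() }
}

fn matches(pattern: &Pattern, node: &Node) -> bool {
    if let Some(kind) = pattern.kind {
        if kind != node.kind {
            return false;
        }
    }
    // Check only the fields the pattern names. Fields that exist on the
    // node but are not mentioned in the pattern are never inspected.
    pattern.fields.iter().all(|(name, sub)| {
        node.fields
            .get(name)
            .map_or(false, |child| matches(sub, child))
    })
}
```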
### Captures
Captures bind matched nodes to names for use in the transform. A capture
`@name` always follows the pattern it captures:
```rust
(identifier) @name // capture an identifier node
(_) @value // capture any named node
(identifier)* @items // capture each repeated match
```
### Unnamed children
Patterns that appear after all named fields match children that are not
attached to any field (positional children). Named node patterns like `(_)`
automatically skip anonymous tokens (keywords, operators, punctuation),
matching tree-sitter semantics:
```rust
(for
  pattern: (_) @pat        // named field
  value: (in (_) @val)     // "in" token is skipped automatically
  body: (do (_)* @body)    // "do" and "end" tokens skipped
)
```
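
The token-skipping behaviour can be pictured as a filter over a node's children. The sketch below is a toy model under that assumption: `Child` and `capture_named` are hypothetical names, and the real matcher works on AST node ids rather than on a slice of structs.

```rust
// Toy model only: hypothetical names, not the YEAST API.
struct Child {
    kind: &'static str,
    /// `false` for anonymous tokens such as keywords and punctuation.
    named: bool,
}

/// Model of `(_)* @body`: collect every named child in order and skip
/// anonymous tokens such as `do` and `end`.
fn capture_named(children: &[Child]) -> Vec<&Child> {
    children.iter().filter(|c| c.named).collect()
}
```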
### Repetitions
```rust
(_)* // zero or more
(_)+ // one or more
(_)? // zero or one
(identifier)* @names // capture each repeated match
```
## Template language
Templates construct new AST nodes using the `tree!` and `trees!` macros.
All children in a template must be in named fields — output AST nodes are
always fully fielded.
Inside a `rule!` macro the context is implicit, so `tree!` and `trees!` need
no explicit `BuildCtx` argument. Used standalone, both macros take a
`BuildCtx` as the first argument:
```rust
// Inside rule! — implicit context, captures are Rust variables
yeast::rule!(
    (assignment left: (_) @left right: (_) @right)
    =>
    (assignment left: {right} right: {left})
);
// Standalone — explicit context
let fresh = yeast::tree_builder::FreshScope::new();
let mut ctx = BuildCtx::new(ast, &captures, &fresh);
let id = yeast::tree!(ctx,
    (assignment
        left: {ctx.capture("lhs")}
        right: {ctx.capture("rhs")}
    )
);
```
### `tree!` — build a single node
`tree!(...)` returns a single node `Id`:
```rust
yeast::tree!(ctx,
    (assignment
        left: {ctx.capture("lhs")}
        right: {ctx.capture("rhs")}
    )
)
```
### `trees!` — build multiple nodes
`trees!(...)` returns `Vec<Id>`:
```rust
yeast::trees!(ctx,
    (assignment left: {tmp} right: {right})
    {..body}
)
```
### Literal nodes
`(kind "text")` creates a leaf node with fixed text content:
```rust
(identifier "each") // an identifier node whose text is "each"
```
### Computed literals
`(kind #{expr})` creates a leaf node whose content is `expr.to_string()`:
```rust
(integer #{i}) // an integer node with the value of i
(identifier #{name}) // an identifier from a Rust variable
```
### Fresh identifiers
`(kind $name)` creates a leaf node with an auto-generated unique name. All
occurrences of the same `$name` within one `BuildCtx` share the same value:
```rust
(block
    parameters: (block_parameters
        (identifier $tmp)             // generates e.g. "$tmp-0"
    )
    body: (block_body
        (assignment
            left: {pat}
            right: (identifier $tmp)  // same "$tmp-0" value
        )
    )
)
```
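
A minimal sketch of how "same `$name`, same value within one context" could work. It assumes a counter shared across contexts and a per-context name table. `ToyFreshScope` and `ToyCtx` are hypothetical stand-ins, not the real `FreshScope` and `BuildCtx`, and the numbering scheme is a guess modelled on the `"$tmp-0"` example above.

```rust
use std::collections::HashMap;

/// Hands out unique numeric suffixes. Hypothetical stand-in for a scope
/// that outlives individual template instantiations.
struct ToyFreshScope {
    next: usize,
}

/// One template instantiation. It remembers the generated text for each
/// `$name`, so repeated uses of the same name agree.
struct ToyCtx<'a> {
    scope: &'a mut ToyFreshScope,
    names: HashMap<String, String>,
}

impl<'a> ToyCtx<'a> {
    fn new(scope: &'a mut ToyFreshScope) -> Self {
        ToyCtx { scope, names: HashMap::new() }
    }

    /// Return the identifier text for `$name`, generating it on first use.
    fn fresh(&mut self, name: &str) -> String {
        if let Some(existing) = self.names.get(name) {
            return existing.clone();
        }
        // Assumed format, following the "$tmp-0" example in the text.
        let generated = format!("${}-{}", name, self.scope.next);
        self.scope.next += 1;
        self.names.insert(name.to_string(), generated.clone());
        generated
    }
}
```

Because the counter lives in the scope and the name table lives in the context, two instantiations of the same template get different identifiers in this model, while both uses of `$tmp` inside one instantiation agree.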
### Embedded Rust expressions
`{expr}` embeds a Rust expression that returns a single node `Id`:
```rust
(assignment
    left: {some_node_id}  // insert a pre-built node
    right: {rhs}          // insert a captured value (inside rule!)
)
```
`{..expr}` splices a `Vec<Id>` (or any iterable of `Id`):
```rust
yeast::trees!(ctx,
    (assignment left: {tmp} right: {right})
    {..extra_nodes}  // splice a Vec<Id>
)
```
Inside `rule!`, captures are Rust variables, so `{name}` inserts a
single capture (`Id`) and `{..name}` splices a repeated capture
(`Vec<Id>`).
## Complete example: for-loop desugaring
This rule rewrites Ruby's `for pat in val do body end` into
`val.each { |tmp| pat = tmp; body }`:
```rust
let for_rule = yeast::rule!(
    (for
        pattern: (_) @pat
        value: (in (_) @val)
        body: (do (_)* @body)
    )
    =>
    (call
        receiver: {val}
        method: (identifier "each")
        block: (block
            parameters: (block_parameters
                (identifier $tmp)
            )
            body: (block_body
                (assignment
                    left: {pat}
                    right: (identifier $tmp)
                )
                {..body}
            )
        )
    )
);
```
Captures from the query (`@pat`, `@val`, `@body`) become Rust variables
automatically: single captures bind as `Id`, repeated captures (after
`*` or `+`) as `Vec<Id>`, and optional captures (after `?`) as
`Option<Id>`.
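
The mapping from quantifier to binding type can be modelled as follows. This is a toy sketch: `Id` is aliased to `usize`, and `Capture`, `Quantifier`, and `bind` are hypothetical names rather than what `rule!` generates. It assumes a match fails when the number of matched nodes does not fit the quantifier.

```rust
// Toy model only: hypothetical names, not what `rule!` expands to.
type Id = usize;

#[derive(Debug, PartialEq)]
enum Capture {
    Single(Id),           // `(_) @x`
    Repeated(Vec<Id>),    // `(_)* @x` or `(_)+ @x`
    Optional(Option<Id>), // `(_)? @x`
}

#[derive(Clone, Copy)]
enum Quantifier {
    One,
    ZeroOrMore,
    OneOrMore,
    ZeroOrOne,
}

/// Shape the matched ids according to the quantifier. Returns `None`
/// when the number of matches is not allowed by the quantifier.
fn bind(quantifier: Quantifier, matched: Vec<Id>) -> Option<Capture> {
    match quantifier {
        Quantifier::One if matched.len() == 1 => Some(Capture::Single(matched[0])),
        Quantifier::ZeroOrOne if matched.len() <= 1 => {
            Some(Capture::Optional(matched.first().copied()))
        }
        Quantifier::ZeroOrMore => Some(Capture::Repeated(matched)),
        Quantifier::OneOrMore if !matched.is_empty() => Some(Capture::Repeated(matched)),
        _ => None,
    }
}
```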
## The `rule!` macro
`rule!` combines a query and a transform into a single declaration:
```rust
// Full template form
yeast::rule!(
    (query_pattern field: (_) @capture)
    =>
    (output_template field: {capture})
)
// Shorthand form — captures become fields on the output node
yeast::rule!(
    (query_pattern field: (_) @capture)
    => output_kind
)
```
The shorthand `=> kind` form auto-generates the template, mapping each
capture name to a field of the same name on the output node.
## Integration with the extractor
YEAST integrates with the shared tree-sitter extractor via two mechanisms:
1. **`extract_and_desugar()`** — like `extract()`, but takes a
`Vec<yeast::Rule>` to apply before TRAP extraction.
2. **`LanguageSpec::output_node_types`** — when desugaring produces an AST
with different node types than the tree-sitter grammar, this field points
to a separate `node-types.json` describing the output schema.
Languages that don't use desugaring simply call `extract()`, which passes
an empty rule set internally.