diff --git a/shared/yeast/doc/yeast.md b/shared/yeast/doc/yeast.md
new file mode 100644
index 00000000000..59526c38ec8
--- /dev/null
+++ b/shared/yeast/doc/yeast.md
@@ -0,0 +1,314 @@
+# YEAST: YEAST Elaborates Abstract Syntax Trees
+
+YEAST is a framework for transforming tree-sitter parse trees before they are
+extracted into a CodeQL database. It sits between the tree-sitter parser and
+the TRAP extractor, rewriting parts of the AST according to declarative rules.
+
+## Motivation
+
+Tree-sitter grammars describe the **concrete syntax** of a language: every
+keyword, operator, and punctuation token appears in the parse tree. CodeQL
+analyses often prefer a **simplified abstract syntax** where syntactic sugar
+has been removed. YEAST bridges this gap by desugaring the tree-sitter output
+into a cleaner form before extraction.
+
+For example, Ruby's `for x in list do ... end` is syntactic sugar for
+`list.each { |x| ... }`. A YEAST rule can rewrite the former into the latter
+so that CodeQL queries only need to reason about the `.each` form.
+
+## Architecture
+
+```
+Source code
+       │
+       ▼
+┌──────────────┐
+│ tree-sitter  │  Parse source into a concrete syntax tree
+│ parser       │
+└──────┬───────┘
+       │ tree_sitter::Tree
+       ▼
+┌──────────────┐
+│ YEAST        │  Apply desugaring rules, producing a new AST
+│ Runner       │
+└──────┬───────┘
+       │ yeast::Ast
+       ▼
+┌──────────────┐
+│ TRAP         │  Walk the (possibly rewritten) AST and emit TRAP tuples
+│ extractor    │
+└──────────────┘
+```
+
+The entry point is `extract_and_desugar()` in the shared tree-sitter
+extractor, which passes a set of rules to the YEAST `Runner`. The original
+`extract()` function passes empty rules, leaving the tree unchanged.
+
+## How desugaring works
+
+A YEAST `Rule` has two parts:
+
+1. A **query** that matches nodes in the AST using a tree-sitter-inspired
+   pattern language.
+2. A **transform** that produces replacement nodes from the match captures.
+
+The `Runner` applies rules by walking the tree top-down. At each node, it
+tries each rule in order. If a rule's query matches, the node is replaced by
+the transform's output, and the rules are re-applied to the result. If no
+rule matches, the node is kept and its children are processed recursively.
+
+A rule can replace one node with zero nodes (deletion), one node (rewriting),
+or multiple nodes (expansion).
+
+## Query language
+
+Queries use a syntax inspired by
+[tree-sitter queries](https://tree-sitter.github.io/tree-sitter/using-parsers/queries/index.html),
+written inside the `yeast::query!()` proc macro.
+
+### Node patterns
+
+```rust
+// Match any named node
+(_)
+
+// Match a node of a specific kind
+(assignment)
+
+// Match an unnamed token by its text
+("end")
+```
+
+### Fields
+
+```rust
+// Match a node with specific fields
+(assignment
+    left: (identifier) @lhs
+    right: (_) @rhs
+)
+```
+
+Fields are matched by name. Unmentioned fields are ignored: the pattern
+`(assignment left: (_) @x)` matches any `assignment` node regardless of
+what's in `right`.
+
+### Captures
+
+Captures bind matched nodes to names for use in the transform. A capture
+`@name` always follows the pattern it captures:
+
+```rust
+(identifier) @name      // capture an identifier node
+(_) @value              // capture any named node
+(identifier)* @items    // capture each repeated match
+```
+
+### Unnamed children
+
+Patterns that appear after all named fields match unnamed (positional)
+children. Named node patterns like `(_)` automatically skip unnamed tokens
+(keywords, operators, punctuation), matching tree-sitter semantics:
+
+```rust
+(for
+    pattern: (_) @pat        // named field
+    value: (in (_) @val)     // "in" token is skipped automatically
+    body: (do (_)* @body)    // "do" and "end" tokens skipped
+)
+```
+
+### Repetitions
+
+```rust
+(_)*                    // zero or more
+(_)+                    // one or more
+(_)?                    // zero or one
+(identifier)* @names    // capture each repeated match
+```
+
+## Template language
+
+Templates construct new AST nodes using the `tree!` and `trees!` macros.
+All children in a template must be in named fields, because output AST nodes
+are always fully fielded.
+
+When used inside a `rule!` macro, the context is implicit, so no explicit
+`BuildCtx` argument is needed. When used standalone, they take a `BuildCtx`
+as the first argument:
+
+```rust
+// Inside rule!: implicit context, captures are Rust variables
+yeast::rule!(
+    (assignment left: (_) @left right: (_) @right)
+    =>
+    (assignment left: {right} right: {left})
+);
+
+// Standalone: explicit context
+let fresh = yeast::tree_builder::FreshScope::new();
+let mut ctx = BuildCtx::new(ast, &captures, &fresh);
+let id = yeast::tree!(ctx,
+    (assignment
+        left: {ctx.capture("lhs")}
+        right: {ctx.capture("rhs")}
+    )
+);
+```
+
+### `tree!`: build a single node
+
+`tree!(...)` returns a single node `Id`:
+
+```rust
+yeast::tree!(ctx,
+    (assignment
+        left: {ctx.capture("lhs")}
+        right: {ctx.capture("rhs")}
+    )
+)
+```
+
+### `trees!`: build multiple nodes
+
+`trees!(...)` returns `Vec<Id>`:
+
+```rust
+yeast::trees!(ctx,
+    (assignment left: {tmp} right: {right})
+    {..body}
+)
+```
+
+### Literal nodes
+
+`(kind "text")` creates a leaf node with fixed text content:
+
+```rust
+(identifier "each")    // an identifier node whose text is "each"
+```
+
+### Computed literals
+
+`(kind #{expr})` creates a leaf node whose content is `expr.to_string()`:
+
+```rust
+(integer #{i})          // an integer node with the value of i
+(identifier #{name})    // an identifier from a Rust variable
+```
+
+### Fresh identifiers
+
+`(kind $name)` creates a leaf node with an auto-generated unique name. All
+occurrences of the same `$name` within one `BuildCtx` share the same value:
+
+```rust
+(block
+    parameters: (block_parameters
+        (identifier $tmp)               // generates e.g. "$tmp-0"
+    )
+    body: (block_body
+        (assignment
+            left: {pat}
+            right: (identifier $tmp)    // same "$tmp-0" value
+        )
+    )
+)
+```
+
+### Embedded Rust expressions
+
+`{expr}` embeds a Rust expression that returns a single node `Id`:
+
+```rust
+(assignment
+    left: {some_node_id}    // insert a pre-built node
+    right: {rhs}            // insert a captured value (inside rule!)
+)
+```
+
+`{..expr}` splices a `Vec<Id>` (or any iterable of `Id`):
+
+```rust
+yeast::trees!(ctx,
+    (assignment left: {tmp} right: {right})
+    {..extra_nodes}    // splice a Vec<Id>
+)
+```
+
+Inside `rule!`, captures are Rust variables, so `{name}` inserts a
+single capture (`Id`) and `{..name}` splices a repeated capture
+(`Vec<Id>`).
+
+## Complete example: for-loop desugaring
+
+This rule rewrites Ruby's `for pat in val do body end` into
+`val.each { |tmp| pat = tmp; body }`:
+
+```rust
+let for_rule = yeast::rule!(
+    (for
+        pattern: (_) @pat
+        value: (in (_) @val)
+        body: (do (_)* @body)
+    )
+    =>
+    (call
+        receiver: {val}
+        method: (identifier "each")
+        block: (block
+            parameters: (block_parameters
+                (identifier $tmp)
+            )
+            body: (block_body
+                (assignment
+                    left: {pat}
+                    right: (identifier $tmp)
+                )
+                {..body}
+            )
+        )
+    )
+);
+```
+
+Captures from the query (`@pat`, `@val`, `@body`) become Rust variables
+automatically: single captures bind as `Id`, repeated captures (after
+`*` or `+`) as `Vec<Id>`, and optional captures (after `?`) as
+`Option<Id>`.
+
+## The `rule!` macro
+
+`rule!` combines a query and a transform into a single declaration:
+
+```rust
+// Full template form
+yeast::rule!(
+    (query_pattern field: (_) @capture)
+    =>
+    (output_template field: {capture})
+)
+
+// Shorthand form: captures become fields on the output node
+yeast::rule!(
+    (query_pattern field: (_) @capture)
+    => output_kind
+)
+```
+
+The shorthand `=> kind` form auto-generates the template, mapping each
+capture name to a field of the same name on the output node.
+
+## Integration with the extractor
+
+YEAST integrates with the shared tree-sitter extractor via two mechanisms:
+
+1. **`extract_and_desugar()`**: like `extract()`, but takes a
+   `Vec` of rules to apply before TRAP extraction.
+
+2. **`LanguageSpec::output_node_types`**: when desugaring produces an AST
+   with different node types than the tree-sitter grammar, this field points
+   to a separate `node-types.json` describing the output schema.
+
+Languages that don't use desugaring simply call `extract()`, which passes
+empty rules internally.
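
The traversal described under "How desugaring works" can be illustrated with a toy model. The sketch below is not YEAST's implementation and does not use the `yeast` crate: the `Node` struct, the `Rule` function-pointer alias, `rewrite`, and the three sample rules are hypothetical names invented here. It assumes only the behavior stated in the document: rules are tried in order at each node, a match replaces the node with zero, one, or several nodes, the rules are re-applied to that output, and unmatched nodes are kept while their children are processed recursively.

```rust
// Toy model of top-down rewriting. Every name here is hypothetical and
// exists only for illustration; none of it is YEAST's real API.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    kind: String,
    children: Vec<Node>,
}

impl Node {
    fn new(kind: &str, children: Vec<Node>) -> Node {
        Node { kind: kind.to_string(), children }
    }
}

/// A rule returns `Some(replacements)` when it matches and `None` otherwise.
/// The replacement list may hold zero, one, or several nodes.
type Rule = fn(&Node) -> Option<Vec<Node>>;

/// Rewrite `node` top-down and return the nodes that take its place.
fn rewrite(node: Node, rules: &[Rule]) -> Vec<Node> {
    // Try each rule in order; the first one that matches wins.
    for rule in rules {
        if let Some(replacements) = rule(&node) {
            // Re-apply the rules to everything the transform produced.
            return replacements
                .into_iter()
                .flat_map(|n| rewrite(n, rules))
                .collect();
        }
    }
    // No rule matched: keep the node and process its children recursively.
    let children = node
        .children
        .into_iter()
        .flat_map(|c| rewrite(c, rules))
        .collect();
    vec![Node { kind: node.kind, children }]
}

// Deletion: a `comment` node is replaced by zero nodes.
fn drop_comments(n: &Node) -> Option<Vec<Node>> {
    if n.kind == "comment" { Some(vec![]) } else { None }
}

// Rewriting: a `for` node becomes one `call` node with the same children.
fn for_to_call(n: &Node) -> Option<Vec<Node>> {
    if n.kind == "for" {
        Some(vec![Node::new("call", n.children.clone())])
    } else {
        None
    }
}

// Expansion: a `group` node is replaced by all of its children.
fn splice_group(n: &Node) -> Option<Vec<Node>> {
    if n.kind == "group" { Some(n.children.clone()) } else { None }
}

/// Build a small tree and run all three rules over it.
fn demo() -> Vec<Node> {
    let tree = Node::new(
        "program",
        vec![
            Node::new("comment", vec![]),
            Node::new("for", vec![Node::new("identifier", vec![])]),
            Node::new(
                "group",
                vec![
                    Node::new("integer", vec![]),
                    Node::new("comment", vec![]),
                    Node::new("integer", vec![]),
                ],
            ),
        ],
    );
    let rules: [Rule; 3] = [drop_comments, for_to_call, splice_group];
    rewrite(tree, &rules)
}
```

Calling `demo()` from a `main` function and printing the result with `{:#?}` shows a single `program` node whose children are `call`, `integer`, `integer`. The `comment` inside `group` disappears because the rules are re-applied to the spliced children. As in any scheme that re-applies rules to its own output, a rule whose output matches its own query would recurse forever in this toy version.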