Files
codeql/shared/yeast/doc/yeast.md
2026-05-13 10:35:09 +02:00

12 KiB

YEAST — YEAST Elaborates Abstract Syntax Trees

YEAST is a framework for transforming tree-sitter parse trees before they are extracted into a CodeQL database. It sits between the tree-sitter parser and the TRAP extractor, rewriting parts of the AST according to declarative rules.

Motivation

Tree-sitter grammars describe the concrete syntax of a language — every keyword, operator, and punctuation token appears in the parse tree. CodeQL analyses often prefer a simplified abstract syntax where syntactic sugar has been removed. YEAST bridges this gap by desugaring the tree-sitter output into a cleaner form before extraction.

For example, Ruby's for x in list do ... end is syntactic sugar for list.each { |x| ... }. A YEAST rule can rewrite the former into the latter so that CodeQL queries only need to reason about the .each form.

Architecture

Source code
    │
    ▼
┌──────────────┐
│  tree-sitter │  Parse source into a concrete syntax tree
│    parser    │
└──────┬───────┘
       │ tree_sitter::Tree
       ▼
┌──────────────┐
│    YEAST     │  Apply desugaring rules, producing a new AST
│   Runner     │
└──────┬───────┘
       │ yeast::Ast
       ▼
┌──────────────┐
│    TRAP      │  Walk the (possibly rewritten) AST and emit TRAP tuples
│  extractor   │
└──────────────┘

The entry point is extract() in the shared tree-sitter extractor. When called with a non-empty rules vector, the parsed tree is run through the YEAST Runner before TRAP extraction; with an empty rules vector the tree is extracted unchanged.

How desugaring works

A YEAST Rule has two parts:

  1. A query that matches nodes in the AST using a tree-sitter-inspired pattern language.
  2. A transform that produces replacement nodes from the match captures.

The Runner applies rules by walking the tree top-down. At each node, it tries each rule in order. If a rule's query matches, the node is replaced by the transform's output, and the rules are re-applied to the result. If no rule matches, the node is kept and its children are processed recursively.

A rule can replace one node with zero nodes (deletion), one node (rewriting), or multiple nodes (expansion).

By default a rule fires at most once on a given node: after firing, the engine will not re-try that same rule on the result root. Other rules may still fire on the result, and the rule may still fire on different nodes (including the result's children). To opt into iterative behaviour — when a rule's output is intentionally re-matched by the same rule — call .repeated() on the constructed Rule:

let r = yeast::rule!((foo ...) => (foo ...)).repeated();

Without .repeated(), a rule whose output happens to match its own query simply fires once and stops. With .repeated(), the rule is allowed to re-match indefinitely; the runner still enforces a global rewrite-depth limit (currently 100) as a safety net against accidental cycles.

Query language

Queries use a syntax inspired by tree-sitter queries, written inside the yeast::query!() proc macro.

Node patterns

// Match any named node
(_)

// Match a node of a specific kind
(assignment)

// Match an unnamed token by its text
("end")

Fields

// Match a node with specific fields
(assignment
    left: (identifier) @lhs
    right: (_) @rhs
)

Fields are matched by name. Unmentioned fields are ignored — the pattern (assignment left: (_) @x) matches any assignment node regardless of what's in right.

Captures

Captures bind matched nodes to names for use in the transform. A capture @name always follows the pattern it captures:

(identifier) @name          // capture an identifier node
(_) @value                  // capture any named node
(identifier)* @items        // capture each repeated match
("=") @op                   // capture an unnamed token by its text
"=" @op                     // shorthand for the line above
_ @anything                 // capture any node, named or unnamed

Named vs unnamed children

The two wildcard forms (_) and bare _ differ:

  • (_) matches only named nodes. When used as a positional pattern, unnamed children (keywords, operators, punctuation) are skipped over.
  • Bare _ matches any node, named or unnamed, taking whatever is next in the child list.

Bare child patterns are matched forward-scan: each pattern advances through the iterator until it finds a child that matches, skipping non-matching children along the way. So (foo ("baz")) against a foo whose children are [bar, baz] succeeds — the matcher scans past bar and matches baz. The iterator advances as it goes, so subsequent patterns can never match children that appear earlier in source order than already-matched ones.

For named-only patterns ((_), (some_kind ...)), the scan additionally skips past unnamed tokens without trying to match them, since they can never match anyway.

Anchors (.) for forcing immediate adjacency, like in tree-sitter queries, are not supported.

(for
    pattern: (_) @pat             // named field, captures any named node
    value: (in (_) @val)          // "in" wrapper is a named node here
    body: (do (_)* @body)         // "do" and "end" tokens skipped by (_)
)

Repetitions

(_)*                   // zero or more
(_)+                   // one or more
(_)?                   // zero or one
(identifier)* @names   // capture each repeated match

Template language

Templates construct new AST nodes using the tree! and trees! macros. All children in a template must be in named fields — output AST nodes are always fully fielded.

When used inside a rule! macro, the context is implicit — no explicit BuildCtx argument is needed. When used standalone, they take a BuildCtx as the first argument:

// Inside rule! — implicit context, captures are Rust variables
yeast::rule!(
    (assignment left: (_) @left right: (_) @right)
    =>
    (assignment left: {right} right: {left})
);

// Standalone — explicit context
let fresh = yeast::tree_builder::FreshScope::new();
let mut ctx = BuildCtx::new(ast, &captures, &fresh);
let id = yeast::tree!(ctx,
    (assignment
        left: {ctx.capture("lhs")}
        right: {ctx.capture("rhs")}
    )
);

tree! — build a single node

tree!(...) returns a single node Id:

yeast::tree!(ctx,
    (assignment
        left: {ctx.capture("lhs")}
        right: {ctx.capture("rhs")}
    )
)

trees! — build multiple nodes

trees!(...) returns Vec<Id>:

yeast::trees!(ctx,
    (assignment left: {tmp} right: {right})
    {..body}
)

Literal nodes

(kind "text") creates a leaf node with fixed text content:

(identifier "each")          // an identifier node whose text is "each"

Computed literals

(kind #{expr}) creates a leaf node whose content is expr.to_string():

(integer #{i})               // an integer node with the value of i
(identifier #{name})         // an identifier from a Rust variable

Fresh identifiers

(kind $name) creates a leaf node with an auto-generated unique name. All occurrences of the same $name within one BuildCtx share the same value:

(block
    parameters: (block_parameters
        (identifier $tmp)         // generates e.g. "$tmp-0"
    )
    body: (block_body
        (assignment
            left: {pat}
            right: (identifier $tmp)   // same "$tmp-0" value
        )
    )
)

Embedded Rust expressions

{expr} embeds a Rust expression that returns a single node Id:

(assignment
    left: {some_node_id}       // insert a pre-built node
    right: {rhs}               // insert a captured value (inside rule!)
)

{..expr} splices a Vec<Id> (or any iterable of Id):

yeast::trees!(ctx,
    (assignment left: {tmp} right: {right})
    {..extra_nodes}                        // splice a Vec<Id>
)

Inside rule!, captures are Rust variables, so {name} inserts a single capture (Id) and {..name} splices a repeated capture (Vec<Id>).

Complete example: for-loop desugaring

This rule rewrites Ruby's for pat in val do body end into val.each { |tmp| pat = tmp; body }:

let for_rule = yeast::rule!(
    (for
        pattern: (_) @pat
        value: (in (_) @val)
        body: (do (_)* @body)
    )
    =>
    (call
        receiver: {val}
        method: (identifier "each")
        block: (block
            parameters: (block_parameters
                (identifier $tmp)
            )
            body: (block_body
                (assignment
                    left: {pat}
                    right: (identifier $tmp)
                )
                {..body}
            )
        )
    )
);

Captures from the query (@pat, @val, @body) become Rust variables automatically: single captures bind as Id, repeated captures (after * or +) as Vec<Id>, and optional captures (after ?) as Option<Id>.

The rule! macro

rule! combines a query and a transform into a single declaration:

// Full template form
yeast::rule!(
    (query_pattern field: (_) @capture)
    =>
    (output_template field: {capture})
)

// Shorthand form — captures become fields on the output node
yeast::rule!(
    (query_pattern field: (_) @capture)
    => output_kind
)

The shorthand => kind form auto-generates the template, mapping each capture name to a field of the same name on the output node.

Integration with the extractor

A YEAST desugaring pass is configured with a [DesugaringConfig], which carries one or more named [Phase]s of rules and an optional output node-types schema (in YAML format). Each phase is a complete traversal that runs to completion before the next phase starts; only the current phase's rules are considered during that traversal. Attach the config to a language spec to enable rewriting:

let desugar = yeast::DesugaringConfig::new()
    .add_phase("cleanup", yeast::PhaseKind::Repeating, cleanup_rules())
    .add_phase("translate", yeast::PhaseKind::OneShot, translate_rules())
    .with_output_node_types_yaml(include_str!("output-node-types.yml"));

let lang = simple::LanguageSpec {
    prefix: "ruby",
    ts_language: tree_sitter_ruby::LANGUAGE.into(),
    node_types: tree_sitter_ruby::NODE_TYPES,
    desugar: Some(desugar),
    file_globs: vec!["*.rb".into()],
};

A single-phase config is just .add_phase(...) called once. Phase names appear in error messages so you can tell which phase failed.

There are two kinds of phases:

  • Repeating: Each node is re-processed until none of the rules in the phase matches. When a node no longer matches any rules, its children are recursively processed. In practice this is used to desugar or simplify an AST, while staying mostly within the same schema.
  • One-shot: Each node is processed by the first matching rule, and the engine panics if no rule matches. Rules are then recursively applied to every captured node. In practice this is used when translating from one AST schema to another, where an exhaustive match is required.

The same YAML node-types is used for both the runtime yeast Schema (so rules can refer to output-only kinds and fields) and TRAP validation (it is converted to JSON internally).

For the dbscheme/QL code generator, set Language::desugar to a DesugaringConfig carrying the same YAML; the generator converts it to JSON for downstream code generation. The phases field of the config is unused at code-generation time.