mirror of
https://github.com/github/codeql.git
synced 2025-12-17 09:13:20 +01:00
625 lines
20 KiB
Markdown
625 lines
20 KiB
Markdown
# `tsg-python`
|
|
|
|
Run `tree-sitter-graph` queries against Python source files.
|
|
|
|
## How to build
|
|
|
|
Run `cargo build --release`. The resulting binary can be found in the `target/release` directory.
|
|
|
|
## How to invoke
|
|
|
|
`tsg-python tsg-file.tsg python-file.py`
|
|
|
|
Output is emitted on `stdout`.
|
|
|
|
If you're impatient, you can also build and run using `cargo run` followed by the arguments given
|
|
above.
|
|
|
|
## How to use
|
|
|
|
To use `tsg-python`, you must have an appropriate `.tsg` file containing the directions for how to
|
|
construct a Python AST from the output of `tree-sitter-python`.
|
|
|
|
### A quick primer on `tree-sitter-graph` syntax
|
|
|
|
A file consists of a sequence of stanzas. Each stanza consists of a query (using the [tree-sitter
|
|
query syntax](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries)) and a sequence of nodes and edges to define for each query match in the source file.
|
|
Queries will (almost always) include captures like `@foo`, which means any occurrence of `@foo` in
|
|
the corresponding stanza will refer to a particular syntax node in the bit that the query matches.
|
|
|
|
Stanzas are executed in order, and a stanza is only run when all possible matches have been
|
|
exhausted for all preceding stanzas. (Since the syntax tree that is matched against never changes,
|
|
execution never jumps back to an earlier stanza.)
|
|
|
|
Inside stanzas, scoped variables have the form `@foo.bar` where `@foo` is a capture in the
|
|
associated query, and `bar` is an identifier. This should be thought of as a variable that is
|
|
"attached" to the `tree-sitter` node that `@foo` refers to. If `@baz` is another reference to the same node as
|
|
`@foo` (perhaps even in a different stanza), then `@baz.bar` will be a reference to the _same_
|
|
scoped variable. This permits information to be linked across different stanzas.
|
|
|
|
Assigning a value to a scoped variable is done using the syntax `let @foo.bar = some-expr` (`let`
|
|
for immutable variables, `var` for mutable variables, which may be mutated using `set`). Note that
|
|
scoped variables only exist during the execution of the stack graph, and are not immediately part of
|
|
the output graph.
|
|
|
|
To actually produce output, we must specify some `node`s or `edge`s and possibly `attr`ibutes
|
|
thereof.
|
|
|
|
To produce a node, we declare `node @foo.bar` (which is equivalent to `let @foo.bar = (node)`, the
|
|
right hand side being a function that creates a new node). In the output, nodes are simply integers.
|
|
|
|
To assign an attribute to a node, we write `attr (@foo.bar) identifier = expr`, for some suitable
|
|
choice of `identifier` and `expr`. In the output, attributes are given alongside nodes in a `key:
|
|
value` notation.
|
|
|
|
For edges and their attributes, the syntax is similar:
|
|
|
|
`edge @foo.bar -> @baz.quux`
|
|
|
|
and
|
|
|
|
`attr (@foo.bar -> @baz.quux) identifier = expr`.
|
|
|
|
Note that it is an error to declare the same node, edge, (or attribute of either of these) twice.
|
|
|
|
### The general scheme:
|
|
|
|
|
|
For fields that point to some literal value
|
|
```tsg
|
|
<some capture involving @nd>
|
|
{
|
|
attr (@nd.node) field_name = some_value
|
|
}
|
|
```
|
|
|
|
For fields that point directly to an AST node:
|
|
|
|
```tsg
|
|
<some capture involving @parent and @child>
|
|
{
|
|
attr (@parent.node) field_name = @child.node
|
|
}
|
|
```
|
|
|
|
For fields that point to lists of AST nodes:
|
|
|
|
```tsg
|
|
<some capture involving @parent and @child>
|
|
{
|
|
edge @parent.node -> @child.node
|
|
attr (@parent.node -> @child.node) field_name = <index of @child in the resulting list>
|
|
}
|
|
```
|
|
|
|
Scoped variables of the form `@foo.node` are used to tie the AST together, and so it's important
|
|
that this is set for nodes that map directly onto `tree-sitter-python` nodes. Thus, for instance
|
|
for binary operators, the stanza could look as follows:
|
|
|
|
```tsg
|
|
(binary_operator
|
|
left: (_) @left
|
|
right: (_) @right
|
|
) @bin
|
|
{
|
|
attr (@bin.node) left = @left.node
|
|
attr (@bin.node) right = @right.node
|
|
}
|
|
```
|
|
|
|
Note in particular the `@left.node` and `@right.node` references. In order for the above stanza to
|
|
work, these scoped variables _must_ exist and point to suitable graph `node`s.
|
|
|
|
In practice, the setting up of all of these scoped variables (and creation of output graph nodes)
|
|
will happen at the very top of the `.tsg` file, to ensure that these scoped variables are defined
|
|
for the remainder of the file.
|
|
|
|
To ease the creation of these variables, we have the `ast-node` convenience function. For binary
|
|
operators, it would take the following form:
|
|
|
|
```tsg
|
|
(binary_operator) @bin
|
|
{
|
|
let @bin.node = (ast-node @bin "BinOp")
|
|
}
|
|
```
|
|
Here, the two arguments are respectively
|
|
- a `tree-sitter` node (which is used to set the location of `@bin.node`), and
|
|
- a string (which is used to set the "kind" of `@bin.node`)
|
|
|
|
In effect, the call
|
|
|
|
```tsg
|
|
let @bin.node = (ast-node @bin "BinOp")
|
|
```
|
|
|
|
is exactly equivalent to the more verbose
|
|
|
|
```tsg
|
|
node @bin.node ; or equivalently `let @bin.node = (node)`
|
|
attr (@bin.node) _location = (location @bin)
|
|
attr (@bin.node) _kind = "BinOp"
|
|
```
|
|
|
|
As the above suggests, attributes that start with an underscore are interpreted in a special way
|
|
when reconstructing the AST.
|
|
|
|
### Special attributes
|
|
|
|
#### The `_kind` attribute (mandatory)
|
|
Should be set to a string consisting of the name of the corresponding Python AST class. This
|
|
information will be used to build the AST, and so it is an error if this is left out.
|
|
|
|
Generally, this (and `_location`) will be set using the `ast-node` function.
|
|
|
|
#### The `_skip_to` attribute (optional)
|
|
This is used to indicate that the present graph node should _not_ be turned into an AST node, but that the
|
|
graph node contained in this attribute should be used instead. That graph node may _also_ contain a
|
|
`_skip_to` field, in which case the entire chain is followed until a node is encountered that does
|
|
not have a `_skip_to` field. (Please ensure that there are no cycles of `_skip_to` pointers.)
|
|
|
|
Example:
|
|
|
|
In `tree-sitter-python`, assignment statements are a form of `expression_statement`, and this node
|
|
type also encompasses things like expressions (e.g. `2+2`) appearing at the level of statements. In
|
|
the internal Python AST, we need to separate the assignment from such expressions. The assignment should be present as an `Assign` node, but `2+2` should be
|
|
wrapped in an `Expr` node. To solve this, we create an `Expr` for each `expression_statement`, and
|
|
then explicitly skip this node in the AST if it contains an `assignment`. This is implemented as
|
|
follows:
|
|
```tsg
|
|
(expression_statement (assignment) @inner) @outer
|
|
{
|
|
attr (@outer.node) _skip_to = @inner.node
|
|
}
|
|
```
|
|
|
|
#### The `_location` attribute (optional)
|
|
This attribute is used to indicate the location of the corresponding AST node. As with `_kind` it
|
|
should be set using the `ast-node` function.
|
|
|
|
#### The `_location_start` and `_location_end` attributes (optional)
|
|
These attributes are used to indicate the start or end of the location of the AST node. They can be
|
|
used for nodes where `_location` has already been set, in which case they override the relevant part
|
|
of that location. For an example of this see the worked example on `if` statements below.
|
|
#### The `_start_line`, `_start_column`, `_end_line`, and `_end_column` attributes (optional)
|
|
These can be used to set the start or end position of an AST node with even greater detail than the
|
|
preceding attributes. As with the `_location_start` and `_location_end` attributes, these will
|
|
override the values of the corresponding part of the location.
|
|
|
|
In general, these attributes should be used sparingly, as they are quite verbose.
|
|
|
|
### Built-in functions
|
|
#### `(source-text` _`tree-sitter-node`_`)` (built-in)
|
|
This function returns the source text of the `tree-sitter` node it receives as an argument.
|
|
|
|
Example:
|
|
|
|
Extracting the operator from a binary expression:
|
|
```tsg
|
|
(binary_operator
|
|
operator: _ @op
|
|
) @bin
|
|
{
|
|
attr (@bin.node) op = (source-text @op)
|
|
}
|
|
```
|
|
|
|
#### `(ast-node` _`tree-sitter-node`_ _`string`_`)` (`tsg-python` only)
|
|
Creates a new graph node with the given `_kind` and sets the `_location` attribute to the location
|
|
of the given `tree-sitter` node.
|
|
#### `(child-index` _`tree-sitter-node`_`)` (built-in)
|
|
Returns the index of the given `tree-sitter` node in its parent.
|
|
#### `(location` _`tree-sitter-node`_`)` (`tsg-python` only)
|
|
Returns the location of the given `tree-sitter` node as a list containing four integers
|
|
corresponding to the start row and column, followed by the end row and column.
|
|
#### `(location-start` _`tree-sitter-node`_`)` and `(location-end` _`tree-sitter-node`_`)` (`tsg-python` only)
|
|
Returns the start or end position (row followed by column) of the given `tree-sitter` node (as a list containing two integers).
|
|
#### `start-row`, `start-column`, `end-row`, and `end-column` (built-in)
|
|
(All of these take a `tree-sitter-node` as an argument.)
|
|
|
|
Returns an integer corresponding to the appropriate part of the location of the given `tree-sitter` node.
|
|
|
|
### A worked example: `if` statements
|
|
|
|
The way the current parser handles `if` statements means we cannot do a straight mapping from the tree-sitter grammar to the AST. In particular, a block of code such as
|
|
|
|
```python
|
|
if x: do_x
|
|
elif y: do_y
|
|
elif z: do_z
|
|
else: do_else
|
|
```
|
|
|
|
is unrolled into the following form by the current parser:
|
|
|
|
```python
|
|
if x: do_x
|
|
else:
|
|
if y: do_y
|
|
else:
|
|
if z: do_z
|
|
else: do_else
|
|
```
|
|
|
|
This means we have to synthesise nodes for the inner `if` statements.
|
|
|
|
However, this should be straightforward -- we simply have to make sure that `elif_clause`s also
|
|
produce the appropriate kind of node, and that everything is linked up correctly.
|
|
|
|
For references, here are the productions for `if_statement`, `else_clause` and `elif_clause` in
|
|
`tree-sitter-python`
|
|
|
|
```javascript
|
|
if_statement: $ => seq(
|
|
'if',
|
|
field('condition', $.expression),
|
|
':',
|
|
field('consequence', $._suite),
|
|
repeat(field('alternative', $.elif_clause)),
|
|
optional(field('alternative', $.else_clause))
|
|
),
|
|
|
|
elif_clause: $ => seq(
|
|
'elif',
|
|
field('condition', $.expression),
|
|
':',
|
|
field('consequence', $._suite)
|
|
),
|
|
|
|
else_clause: $ => seq(
|
|
'else',
|
|
':',
|
|
field('body', $._suite)
|
|
),
|
|
```
|
|
|
|
First, we'll set up all of the relevant nodes with corresponding nodes in the AST:
|
|
|
|
```tsg
|
|
|
|
(if_statement)
|
|
@tree_sitter_node
|
|
{
|
|
let @tree_sitter_node.node = (ast-node @tree_sitter_node "If")
|
|
}
|
|
```
|
|
|
|
This ensures that we can reference the `.node` scoped variable on the above nodes.
|
|
|
|
(We named the capture `@tree_sitter_node` above to make it more clear, but in general something like
|
|
`@if` would be more appropriate.)
|
|
|
|
In particular, since we want `elif`s to be turned into nested `if`s, it makes sense to apply the
|
|
`If` kind to `elif_clauses` as well:
|
|
|
|
```tsg
|
|
(elif_clause) @elif
|
|
{
|
|
let @elif.node = (ast-node @elif "If")
|
|
}
|
|
```
|
|
Whenever we refer to a node, we must ensure that it has first been defined, however there is no
|
|
need to do this separately for each node.
|
|
|
|
Next, for both `if`s and `elif`s, we want to record the `test` and the `body`. The `test` we do as follows:
|
|
|
|
```tsg
|
|
[
|
|
(if_statement
|
|
condition: (_) @test) @if
|
|
(elif_clause
|
|
condition: (_) @test) @if
|
|
]
|
|
{
|
|
attr (@if.node) test = @test.node
|
|
}
|
|
```
|
|
For `body`, in the Python AST this is simply a list of nodes, whereas for the `tree-sitter` parse tree, it
|
|
will contain a `block` node. Because there is no Python AST equivalent for `block`, we skip over
|
|
this node when linking the `if`-statement to its body:
|
|
```tsg
|
|
[
|
|
(if_statement
|
|
consequence: (block (_) @stmt)) @parent
|
|
(elif_clause
|
|
consequence: (block (_) @stmt)) @parent
|
|
]
|
|
{
|
|
edge @parent.node -> @stmt.node
|
|
attr (@parent.node -> @stmt.node) body = (child-index @stmt)
|
|
}
|
|
```
|
|
The above shows how we handle fields containing lists of items: we add an edge from the parent node
|
|
to each child node, and put an attribute on that edge. The name of the attribute will be the name of
|
|
the field, and the value will be the index of this node among the children of its `tree-sitter` parent.
|
|
|
|
Now we can begin unwinding the nesting. First of all, the first `elif` should be the `orelse` of the
|
|
initial `if_statement`:
|
|
|
|
```tsg
|
|
(if_statement
|
|
consequence: (_)
|
|
.
|
|
(elif_clause) @elif
|
|
) @if
|
|
{
|
|
edge @if.node -> @elif.node
|
|
attr (@if.node -> @elif.node) orelse = 0
|
|
}
|
|
```
|
|
(The `.` acts as an anchor, forcing its two neighbours to be adjancent in the tree. So in this case,
|
|
we get the first `elif` after the body of the `if`)
|
|
|
|
Next, whenever we have two adjacent `elif`s, we want the `orelse` of the first one to be the second one:
|
|
|
|
```tsg
|
|
(
|
|
(elif_clause) @elif1
|
|
.
|
|
(elif_clause) @elif2
|
|
)
|
|
{
|
|
edge @elif1.node -> @elif2.node
|
|
attr (@elif1.node -> @elif2.node) orelse = 0
|
|
}
|
|
```
|
|
|
|
Finally, the `else` branch of the outermost `if` should be the `orelse` of the _last_ `elif`:
|
|
|
|
```tsg
|
|
(if_statement
|
|
(elif_clause) @elif
|
|
.
|
|
alternative: (else_clause body: (block (_) @orelse))
|
|
)
|
|
{
|
|
edge @elif.node -> @orelse.node
|
|
attr (@elif.node -> @orelse.node) orelse = (child-index @orelse)
|
|
}
|
|
```
|
|
|
|
The above gives us the correct tree structure, but we're still missing a few bits (such as
|
|
locations). To capture location information we use the following stanza:
|
|
```tsg
|
|
[
|
|
(if_statement
|
|
condition: (_)
|
|
":" @colon) @if
|
|
(elif_clause
|
|
condition: (_)
|
|
":" @colon) @if
|
|
]
|
|
{
|
|
attr (@if.node) _location_end = (location-end @colon)
|
|
}
|
|
```
|
|
Because `tree-sitter-python` disagrees with the Python AST about the location of the `If` node, we
|
|
have to adjust it. We do this by setting the `_location_end` attribute to the end of the `:` token.
|
|
(Note that the _start_ of this location was set when we called `ast-node` above. As we don't have to
|
|
change this part of the location, we simply leave it as is.)
|
|
|
|
|
|
|
|
### Synthesizing nodes
|
|
In many cases it will be sufficient to hook up AST nodes to the corresponding `tree-sitter` nodes,
|
|
but occasionally we want the tree structure to be different. One example of this would be the
|
|
`class` statement. For instance, a class declaration such as
|
|
|
|
```python
|
|
class Foo(int, object, metaclass=type):
|
|
x = 5
|
|
```
|
|
|
|
has a `tree-sitter-python` parse tree that looks like this:
|
|
|
|
```
|
|
module [0, 0] - [2, 0]
|
|
class_definition [0, 0] - [1, 9]
|
|
name: identifier [0, 6] - [0, 9]
|
|
superclasses: argument_list [0, 9] - [0, 38]
|
|
identifier [0, 10] - [0, 13]
|
|
identifier [0, 15] - [0, 21]
|
|
keyword_argument [0, 23] - [0, 37]
|
|
name: identifier [0, 23] - [0, 32]
|
|
value: identifier [0, 33] - [0, 37]
|
|
body: block [1, 4] - [1, 9]
|
|
expression_statement [1, 4] - [1, 9]
|
|
assignment [1, 4] - [1, 9]
|
|
left: identifier [1, 4] - [1, 5]
|
|
right: integer [1, 8] - [1, 9]
|
|
```
|
|
|
|
but the Python AST looks like _this_:
|
|
|
|
```
|
|
Module: [1, 0] - [3, 0]
|
|
body: [
|
|
Assign: [1, 0] - [1, 39]
|
|
targets: [
|
|
Name: [1, 6] - [1, 9]
|
|
variable: Variable('Foo', None)
|
|
ctx: Store
|
|
]
|
|
value:
|
|
ClassExpr: [1, 0] - [1, 39]
|
|
name: 'Foo'
|
|
bases: [
|
|
Name: [1, 10] - [1, 13]
|
|
variable: Variable('int', None)
|
|
ctx: Load
|
|
Name: [1, 15] - [1, 21]
|
|
variable: Variable('object', None)
|
|
ctx: Load
|
|
]
|
|
keywords: [
|
|
keyword: [1, 23] - [1, 37]
|
|
arg: 'metaclass'
|
|
value:
|
|
Name: [1, 33] - [1, 37]
|
|
variable: Variable('type', None)
|
|
ctx: Load
|
|
]
|
|
inner_scope:
|
|
Class: [1, 0] - [1, 39]
|
|
name: 'Foo'
|
|
body: [
|
|
Assign: [2, 4] - [2, 9]
|
|
targets: [
|
|
Name: [2, 4] - [2, 5]
|
|
variable: Variable('x', None)
|
|
ctx: Store
|
|
]
|
|
value:
|
|
Num: [2, 8] - [2, 9]
|
|
n: 5
|
|
text: '5'
|
|
]
|
|
]
|
|
```
|
|
|
|
In particular, we unroll the `class` statement into an explicit assignment (which is the top node
|
|
for this statement in the AST) of a synthetic `ClassExpr`, which in turn contains a `Class` node
|
|
(which holds things like the body of the class). This requires too many nodes to simply reuse what's given to
|
|
us by `tree-sitter-python`, and so we must _synthesize_ additional nodes.
|
|
|
|
First of all, let us set up the outer node to be an `Assign` node:
|
|
```tsg
|
|
(class_definition) @class
|
|
{
|
|
let @class.node = (ast-node @class "Assign")
|
|
}
|
|
```
|
|
|
|
Next, we can do most of the work in a single stanza:
|
|
|
|
```tsg
|
|
(class_definition
|
|
name: (identifier) @name
|
|
":" @colon
|
|
) @class
|
|
{
|
|
|
|
; To make it clearer that the outer node is an assignment, we create an alias for it.
|
|
let @class.assign = @class.node
|
|
|
|
; Synthesized nodes: the left-hand side of the assignment, the class_expr node, and the class
|
|
; node.
|
|
|
|
let @class.assign_lhs = (ast-node @name "Name")
|
|
let @class.class_expr = (ast-node @class "ClassExpr")
|
|
let @class.inner_scope = (ast-node @class "Class")
|
|
|
|
edge @class.assign -> @class.assign_lhs
|
|
attr (@class.assign -> @class.assign_lhs) targets = 0
|
|
attr (@class.assign) value = @class.class_expr
|
|
attr (@class.assign) _location_end = (location-end @colon)
|
|
|
|
let class_name = (source-text @name)
|
|
|
|
; The left-hand side of the assignment, a `Name`.
|
|
attr (@class.assign_lhs) variable = class_name
|
|
attr (@class.assign_lhs) ctx = "store"
|
|
|
|
; The right hand side of the assignment, a `ClassExpr`.
|
|
attr (@class.class_expr) name = class_name
|
|
attr (@class.class_expr) inner_scope = @class.inner_scope
|
|
; `bases` will be set elsewhere
|
|
; `keywords` will be set elsewhere
|
|
attr (@class.class_expr) _location_end = (location-end @colon)
|
|
|
|
; The inner scope of the class_expr, a `Class`.
|
|
attr (@class.inner_scope) name = class_name
|
|
; body will be set in a separate stanza.
|
|
attr (@class.inner_scope) _location_end = (location-end @colon)
|
|
|
|
}
|
|
```
|
|
|
|
Let's go over these lines bit by bit. First, we create an alias for the outermost node (which will
|
|
become an assignment node) in order to make it clearer that it's an assignment. Next, we create
|
|
_new_ nodes for the inner synthesized nodes. Note that we can't assign these to `@class.node` as
|
|
that already points to the node that will become the assignment node. Instead, we create new scoped
|
|
variables (with suitable names), and assign them nodes (with appropriate kinds and locations using
|
|
`ast-node`).
|
|
```tsg
|
|
; To make it clearer that the outer node is an assignment, we create an alias for it.
|
|
let @class.assign = @class.node
|
|
|
|
; Synthesized nodes: the left-hand side of the assignment, the class_expr node, and the class
|
|
; node.
|
|
|
|
let @class.assign_lhs = (ast-node @name "Name")
|
|
let @class.class_expr = (ast-node @class "ClassExpr")
|
|
let @class.inner_scope = (ast-node @class "Class")
|
|
```
|
|
|
|
Next, we set up the outer assignment:
|
|
```tsg
|
|
edge @class.assign -> @class.assign_lhs
|
|
attr (@class.assign -> @class.assign_lhs) targets = 0
|
|
attr (@class.assign) value = @class.class_expr
|
|
attr (@class.assign) _location_end = (location-end @colon)
|
|
```
|
|
|
|
The remaining nodes all contain a field that refers to the name of the class, so put this in a local
|
|
variable for convenience:
|
|
```tsg
|
|
let class_name = (source-text @name)
|
|
```
|
|
We set up the left hand side of the assignment:
|
|
```tsg
|
|
; The left-hand side of the assignment, a `Name`.
|
|
attr (@class.assign_lhs) variable = class_name
|
|
attr (@class.assign_lhs) ctx = "store"
|
|
```
|
|
The `ClassExpr`:
|
|
```tsg
|
|
; The right hand side of the assignment, a `ClassExpr`.
|
|
attr (@class.class_expr) name = class_name
|
|
attr (@class.class_expr) inner_scope = @class.inner_scope
|
|
; `bases` will be set elsewhere
|
|
; `keywords` will be set elsewhere
|
|
attr (@class.class_expr) _location_end = (location-end @colon)
|
|
```
|
|
|
|
The `Class`:
|
|
```tsg
|
|
; The inner scope of the class_expr, a `Class`.
|
|
attr (@class.inner_scope) name = class_name
|
|
; body will be set elsewhere
|
|
attr (@class.inner_scope) _location_end = (location-end @colon)
|
|
|
|
```
|
|
|
|
The remaining stanzas take care of setting up the fields that contain lists of nodes, and these
|
|
follow the same scheme as before.
|
|
```tsg
|
|
; Class.body
|
|
(class_definition
|
|
body: (block (_) @stmt)
|
|
) @class
|
|
{
|
|
edge @class.inner_scope -> @stmt.node
|
|
attr (@class.inner_scope -> @stmt.node) body = (child-index @stmt)
|
|
}
|
|
|
|
; Class.bases
|
|
(class_definition
|
|
superclasses: (argument_list (identifier) @arg)
|
|
) @class
|
|
{
|
|
edge @class.class_expr -> @arg.node
|
|
attr (@class.class_expr -> @arg.node) bases = (child-index @arg)
|
|
attr (@arg.node) ctx = "load"
|
|
}
|
|
|
|
; Class.keywords
|
|
(class_definition
|
|
superclasses: (argument_list (keyword_argument) @arg)
|
|
) @class
|
|
{
|
|
edge @class.class_expr -> @arg.node
|
|
attr (@class.class_expr -> @arg.node) keywords = (child-index @arg)
|
|
}
|
|
```
|