mirror of
https://github.com/github/codeql.git
synced 2025-12-24 12:46:34 +01:00
173 lines
6.9 KiB
Markdown
173 lines
6.9 KiB
Markdown
# The Python tokenizer
|
|
|
|
This file describes the syntax and operational semantics of the state machine
|
|
that underlies our tokenizer.
|
|
|
|
## The state machine syntax
|
|
|
|
The state machine is described in a declarative fashion in the
|
|
`state_transition.txt` file. This file contains a sequence of declarations, as
|
|
described in the following subsections.
|
|
|
|
Additionally, lines may contain comments indicated using the `#` character, as
|
|
in Python itself.
|
|
|
|
In the remainder of the document, "identifier" means any sequence of characters
|
|
starting with a letter (`a-z` or `A-Z`) and followed by a sequence of letters,
|
|
digits, and/or underscores.
|
|
|
|
### Start declarations
|
|
This has the form `start: ` followed by the name of a table. It is used to
|
|
indicate what table is used as the starting point for the tokenization.
|
|
|
|
There should be exactly one of these declarations in the file.
|
|
|
|
### Alias declarations
|
|
These have the form
|
|
```
|
|
identifier = id_or_char or id_or_char or ...
|
|
```
|
|
where `id_or_char` is either a single character surrounded by single quotes
|
|
(e.g. `'a'`) or an identifier defined in another alias declaration.
|
|
|
|
Thus, aliases define _sets_ of characters: single-quoted characters representing
|
|
singleton sets, and `or` being set union.
|
|
>Note: A few character classes are predefined:
|
|
> - `ERROR` representing the error state of the state machine,
|
|
> - `IDENTIFIER` representing characters that can appear at the start of
|
|
> a Unicode identifier, and
|
|
> - `IDENTIFIER_CONTINUE` representing characters that can appear
|
|
> within a Unicode identifier.
|
|
|
|
### Table declarations
|
|
These have the form
|
|
|
|
```
|
|
table header {
|
|
state_transition
|
|
state_transition
|
|
...
|
|
}
|
|
```
|
|
where `header` is either an identifier or an identifier followed by another
|
|
identifier surrounded by parentheses. The latter implements a form of
|
|
"inheritance" between tables, and is explained in a later section.
|
|
|
|
The format of `state_transition`s is described in the next subsection.
|
|
|
|
### State transitions
|
|
Each state transition has the following form:
|
|
```
|
|
set_of_before_states -> after_state for set_of_characters optional_actions
|
|
```
|
|
Here, `set_of_before_states` is either a single identifier or a list of identifiers
|
|
with `or`s interspersed (mimicking the way sets of characters are specified) and
|
|
`after_state` is an identifier. These identifiers do not have to be declared
|
|
separately — they are implicitly declared when used.
|
|
>Note: A special state `0` (in the table indicated with the `start: `
|
|
>declaration represents the starting state for the entire tokenization.
|
|
|
|
The `set_of_characters` can either be
|
|
- the identifier corresponding to an alias,
|
|
- a single character (e.g. `'a'`),
|
|
- a list of sets of characters with `or`s interspersed, or
|
|
- an asterisk `*` representing _all_ characters that do not already have a
|
|
transition defined for the set of "before" states.
|
|
|
|
After the state transition is an optional list of actions, described next.
|
|
|
|
### Actions
|
|
Actions are specified using the keyword `do`. After this keyword, one or more
|
|
actions may be specified, each terminated with `;`, e.g.
|
|
```
|
|
foo -> bar for 'a' do action1; action2;
|
|
```
|
|
As the actions are very operational in nature, they will be described when we go
|
|
into the operational semantics of the state machine.
|
|
|
|
## Informal operational semantics
|
|
>Note: What follows is not based on a reading of the source code, but just
|
|
>experience with working with modifying the state machine. There may be
|
|
>significant inaccuracies.
|
|
|
|
At a high level, the purpose of the tokenizer is to partition the given input
|
|
into a sequence of strings representing tokens. The decision of where to put the
|
|
boundaries between these strings is done on a character-by-character basis. To
|
|
mark the start of a token, the action `mark` is used. Note that the mark is
|
|
placed _before_ the character that caused the action to be executed. That is, in
|
|
the following transition rule
|
|
```
|
|
foo -> bar for 'a' do mark;
|
|
```
|
|
the mark is placed _before_ the `a`.
|
|
|
|
Once the end of a token has been reached, the `emit` action is used. This
|
|
creates a token from the part of the input spanning from the most recent `mark`
|
|
up to (and including) the character that caused the transition to which the
|
|
`emit` action is attached.
|
|
|
|
As an example, consider the following state machine that splits a sequence of
|
|
zeroes and ones into tokens consisting of (maximal) runs of each character:
|
|
|
|
```
|
|
start: default
|
|
table default {
|
|
|
|
# This is essentially just an unconditional state transition.
|
|
0 -> zero_or_one for * do pushback;
|
|
|
|
zero_or_one -> zeros for '0' do mark;
|
|
zero_or_one -> ones for '1' do mark;
|
|
|
|
zeros -> zeros for '0'
|
|
zeros -> zero_or_one for * do pushback; emit(ZEROS);
|
|
|
|
ones -> zero_or_one for * do pushback; emit(ONES);
|
|
ones -> ones for '1'
|
|
}
|
|
```
|
|
The `pushback` action has the effect of "pushing back" the current character.
|
|
(In reality, all this does is move the pointer to the current character one step
|
|
back. It is thus not a problem to have several pushbacks in a row.)
|
|
|
|
> Note: The order in which the transition rules for a state is specified does
|
|
> not matter. Even if the `*` state is listed first, as with `ones` above, it
|
|
> does not take precedence over other more specific character sets.
|
|
|
|
After tokenizing a string with the above grammar, the result will be a sequence
|
|
of `ZEROS` and `ONES` tokens. Each of these will have three pieces of data
|
|
associated with it: the starting point (line and column), the end point (also
|
|
line and column), and the characters that make up the token. Note that `emit`
|
|
accepts a second argument (which must be a string) as well. For example, the
|
|
transition for code when reaching a newline is:
|
|
```
|
|
feed = '\r' or '\n'
|
|
...
|
|
code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
|
|
```
|
|
This has the effect of normalizing end of line characters to be `\n`.
|
|
|
|
>Note: The replacement text may have a different length than the distance to the
|
|
>most recent `span`. This may not be desirable.
|
|
|
|
The above snippet introduces another action: `newline`. This has the effect of
|
|
resetting the column counter to zero and incrementing the line counter.
|
|
|
|
>Note: There are some peculiarities about newlines, and the tokenizer will get
|
|
>confused if they are not handled through the `newline` action.
|
|
|
|
The last two actions have to do with maintaining a stack of parsing tables. At
|
|
all points, the behavior of the tokenizer is governed by the table that is on
|
|
top of the stack. The `push` action pushes the specified table (given as an
|
|
argument) on top of this stack. Naturally, the `pop` action does the opposite,
|
|
discarding the top element.
|
|
|
|
This leaves the final point of interest: what decides which transitions are
|
|
"active" at a given point?
|
|
|
|
The way this functions is essentially like method dispatch in Python (though
|
|
thankfully there is no multiple inheritance). Thus, given the current state and
|
|
the current character, we first look in the table on top of the stack. If this
|
|
table does not have a transition for the given state and character, we next look
|
|
at the table it inherits from, and so forth.
|