# The Python tokenizer
This file describes the syntax and operational semantics of the state machine
that underlies our tokenizer.
## The state machine syntax
The state machine is described in a declarative fashion in the
`state_transition.txt` file. This file contains a sequence of declarations, as
described in the following subsections.
Additionally, lines may contain comments indicated using the `#` character, as
in Python itself.
In the remainder of the document, "identifier" means any sequence of characters
starting with a letter (`a-z` or `A-Z`) and followed by a sequence of letters,
digits, and/or underscores.
### Start declarations
A start declaration has the form `start: ` followed by the name of a table. It
indicates which table is used as the starting point for the tokenization.
There should be exactly one such declaration in the file.
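For instance, assuming the file declares a table named `default` (the name here
is illustrative), the declaration would look like:
```
start: default
```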
### Alias declarations
These have the form
```
identifier = id_or_char or id_or_char or ...
```
where `id_or_char` is either a single character surrounded by single quotes
(e.g. `'a'`) or an identifier defined in another alias declaration.
Thus, aliases define _sets_ of characters: single-quoted characters representing
singleton sets, and `or` being set union.
>Note: A few character classes are predefined:
> - `ERROR` representing the error state of the state machine,
> - `IDENTIFIER` representing characters that can appear at the start of
> a Unicode identifier, and
> - `IDENTIFIER_CONTINUE` representing characters that can appear
> within a Unicode identifier.
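As a hypothetical illustration (these alias names are invented for this example,
not taken from the actual `state_transition.txt`), a set of binary digits and a
union with sign characters could be declared as:
```
binary_digit = '0' or '1'
sign = '+' or '-'
sign_or_binary_digit = sign or binary_digit
```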
### Table declarations
These have the form
```
table header {
    state_transition
    state_transition
    ...
}
```
where `header` is either an identifier or an identifier followed by another
identifier surrounded by parentheses. The latter implements a form of
"inheritance" between tables, and is explained in a later section.
The format of `state_transition`s is described in the next subsection.
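The two header forms might look as follows (the table names are made up for
illustration); the second form declares that `strings` inherits from `default`:
```
table default {
    # state transitions go here
}

table strings (default) {
    # overrides of, and additions to, the default table
}
```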
### State transitions
Each state transition has the following form:
```
set_of_before_states -> after_state for set_of_characters optional_actions
```
Here, `set_of_before_states` is either a single identifier or a list of identifiers
with `or`s interspersed (mimicking the way sets of characters are specified) and
`after_state` is an identifier. These identifiers do not have to be declared
separately — they are implicitly declared when used.
>Note: A special state `0` (in the table indicated with the `start: `
>declaration) represents the starting state for the entire tokenization.
The `set_of_characters` can either be
- the identifier corresponding to an alias,
- a single character (e.g. `'a'`),
- a list of sets of characters with `or`s interspersed, or
- an asterisk `*` representing _all_ characters that do not already have a
transition defined for the set of "before" states.
After the state transition is an optional list of actions, described next.
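To illustrate the different forms (all state names are hypothetical, and an
alias `digit` is assumed to have been declared), the following transitions use a
set of before states, a single character, an alias, a list of sets, and the `*`
wildcard, respectively:
```
foo or bar -> baz for 'a'
baz -> qux for digit
qux -> baz for 'x' or digit
foo -> foo for *
```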
### Actions
Actions are specified using the keyword `do`. After this keyword, one or more
actions may be specified, each terminated with `;`, e.g.
```
foo -> bar for 'a' do action1; action2;
```
As the actions are very operational in nature, they will be described when we go
into the operational semantics of the state machine.
## Informal operational semantics
>Note: What follows is not based on a reading of the source code, but on
>experience working with and modifying the state machine. There may be
>significant inaccuracies.
At a high level, the purpose of the tokenizer is to partition the given input
into a sequence of strings representing tokens. The decision of where to put the
boundaries between these strings is made on a character-by-character basis. To
mark the start of a token, the action `mark` is used. Note that the mark is
placed _before_ the character that caused the action to be executed. That is, in
the following transition rule
```
foo -> bar for 'a' do mark;
```
the mark is placed _before_ the `a`.
Once the end of a token has been reached, the `emit` action is used. This
creates a token from the part of the input spanning from the most recent `mark`
up to (and including) the character that caused the transition to which the
`emit` action is attached.
As an example, consider the following state machine that splits a sequence of
zeroes and ones into tokens consisting of (maximal) runs of each character:
```
start: default
table default {
    # This is essentially just an unconditional state transition.
    0 -> zero_or_one for * do pushback;
    zero_or_one -> zeros for '0' do mark;
    zero_or_one -> ones for '1' do mark;
    zeros -> zeros for '0'
    zeros -> zero_or_one for * do pushback; emit(ZEROS);
    ones -> zero_or_one for * do pushback; emit(ONES);
    ones -> ones for '1'
}
```
The `pushback` action has the effect of "pushing back" the current character.
(In reality, all this does is move the pointer to the current character one step
back. It is thus not a problem to have several pushbacks in a row.)
> Note: The order in which the transition rules for a state are specified does
> not matter. Even if the `*` transition is listed first, as with `ones` above,
> it does not take precedence over the more specific character sets.
After tokenizing a string with the above state machine, the result will be a sequence
of `ZEROS` and `ONES` tokens. Each of these will have three pieces of data
associated with it: the starting point (line and column), the end point (also
line and column), and the characters that make up the token. Note that `emit`
accepts a second argument (which must be a string) as well. For example, the
transition for the `code` state when a newline is reached is:
```
feed = '\r' or '\n'
...
code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
```
This has the effect of normalizing end-of-line characters to `\n`.
>Note: The replacement text may have a different length than the distance to the
>most recent `mark`. This may not be desirable.
The above snippet introduces another action: `newline`. This has the effect of
resetting the column counter to zero and incrementing the line counter.
>Note: There are some peculiarities about newlines, and the tokenizer will get
>confused if they are not handled through the `newline` action.
The last two actions have to do with maintaining a stack of parsing tables. At
all points, the behavior of the tokenizer is governed by the table that is on
top of the stack. The `push` action pushes the specified table (given as an
argument) on top of this stack. Naturally, the `pop` action does the opposite,
discarding the top element.
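A typical use of this pair of actions (sketched here with invented table, state,
and token names; the exact argument syntax of `push` is an assumption) is
switching to a dedicated table while inside a string literal and returning to
the previous table when the literal ends:
```
table default {
    code -> in_string for '"' do mark; push(string_table);
}

table string_table {
    in_string -> in_string for *
    in_string -> code for '"' do emit(STRING); pop;
}
```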
This leaves the final point of interest: what decides which transitions are
"active" at a given point?
The way this functions is essentially like method dispatch in Python (though
thankfully there is no multiple inheritance). Thus, given the current state and
the current character, we first look in the table on top of the stack. If this
table does not have a transition for the given state and character, we next look
at the table it inherits from, and so forth.
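As an illustration (tables and states invented for this example), suppose
`child` inherits from `base` and is on top of the stack. A `'b'` in state `foo`
is handled by `child`'s own transition, whereas an `'a'` finds no transition in
`child`, so the lookup falls through to `base`:
```
table base {
    foo -> bar for 'a' do emit(A);
    foo -> bar for 'b' do emit(B);
}

table child (base) {
    # Overrides the 'b' transition; 'a' is still handled by base.
    foo -> baz for 'b' do emit(OTHER);
}
```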