The Python tokenizer
This file describes the syntax and operational semantics of the state machine that underlies our tokenizer.
The state machine syntax
The state machine is described in a declarative fashion in the
state_transition.txt file. This file contains a sequence of declarations, as
described in the following subsections.
Additionally, lines may contain comments indicated using the # character, as
in Python itself.
In the remainder of this document, "identifier" means any sequence of characters
starting with a letter (a-z or A-Z), followed by a sequence of letters,
digits, and/or underscores.
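For a quick sanity check, this definition corresponds to a simple regular expression; the sketch below is ours for illustration (IDENTIFIER_RE is not a name from the tokenizer):

```python
import re

# "identifier" as defined above: a letter, then letters, digits, and/or underscores.
IDENTIFIER_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
```

For instance, IDENTIFIER_RE.fullmatch("foo_1") matches, while IDENTIFIER_RE.fullmatch("1foo") does not.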
Start declarations
This has the form start: followed by the name of a table. It indicates
which table is used as the starting point for tokenization.
There should be exactly one of these declarations in the file.
Alias declarations
These have the form
identifier = id_or_char or id_or_char or ...
where id_or_char is either a single character surrounded by single quotes
(e.g. 'a') or an identifier defined in another alias declaration.
Thus, aliases define sets of characters: single-quoted characters representing
singleton sets, and or being set union.
Note: A few character classes are predefined:
ERROR, representing the error state of the state machine; IDENTIFIER, representing characters that can appear at the start of a Unicode identifier; and IDENTIFIER_CONTINUE, representing characters that can appear within a Unicode identifier.
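For illustration, here are a few alias declarations in this syntax; feed appears later in this document, while the other names are invented here:

```
# Singleton sets from quoted characters:
feed = '\r' or '\n'
zero = '0'
one = '1'
# Aliases may also reference previously declared aliases (set union):
bindigit = zero or one
```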
Table declarations
These have the form
table header {
state_transition
state_transition
...
}
where header is either an identifier or an identifier followed by another
identifier surrounded by parentheses. The latter implements a form of
"inheritance" between tables, and is explained in a later section.
The format of state_transitions is described in the next subsection.
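For illustration, the two header forms might look like this (the table names and contents are invented, not taken from the actual state_transition.txt):

```
table base {
# transitions go here
}

# This table inherits the transitions of base.
table punctuation (base) {
# transitions go here
}
```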
State transitions
Each state transition has the following form:
set_of_before_states -> after_state for set_of_characters optional_actions
Here, set_of_before_states is either a single identifier or a list of identifiers
with ors interspersed (mimicking the way sets of characters are specified) and
after_state is an identifier. These identifiers do not have to be declared
separately — they are implicitly declared when used.
Note: A special state 0 (in the table indicated with the start: declaration) represents the starting state for the entire tokenization.
The set_of_characters can either be
- the identifier corresponding to an alias,
- a single character (e.g. 'a'),
- a list of sets of characters with ors interspersed, or
- an asterisk * representing all characters that do not already have a transition defined for the set of "before" states.
After the state transition is an optional list of actions, described next.
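A few hypothetical transitions illustrating these forms (all state and alias names here are invented):

```
vowel = 'a' or 'e'

# Multiple "before" states, and a character set built with or:
word or number -> spacing for ' ' or vowel

# A wildcard covering all remaining characters; actions are optional:
word -> word for *
```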
Actions
Actions are specified using the keyword do. After this keyword, one or more
actions may be specified, each terminated with ;, e.g.
foo -> bar for 'a' do action1; action2;
As the actions are very operational in nature, they will be described when we go into the operational semantics of the state machine.
Informal operational semantics
Note: What follows is not based on a reading of the source code, but only on experience working with and modifying the state machine. There may be significant inaccuracies.
At a high level, the purpose of the tokenizer is to partition the given input
into a sequence of strings representing tokens. The decision of where to place the
boundaries between these strings is made on a character-by-character basis. To
mark the start of a token, the action mark is used. Note that the mark is
placed before the character that caused the action to be executed. That is, in
the following transition rule
foo -> bar for 'a' do mark;
the mark is placed before the a.
Once the end of a token has been reached, the emit action is used. This
creates a token from the part of the input spanning from the most recent mark
up to (and including) the character that caused the transition to which the
emit action is attached.
As an example, consider the following state machine that splits a sequence of zeroes and ones into tokens consisting of (maximal) runs of each character:
start: default
table default {
# This is essentially just an unconditional state transition.
0 -> zero_or_one for * do pushback;
zero_or_one -> zeros for '0' do mark;
zero_or_one -> ones for '1' do mark;
zeros -> zeros for '0'
zeros -> zero_or_one for * do pushback; emit(ZEROS);
ones -> zero_or_one for * do pushback; emit(ONES);
ones -> ones for '1'
}
The pushback action has the effect of "pushing back" the current character.
(In reality, all this does is move the pointer to the current character one step
back. It is thus not a problem to have several pushbacks in a row.)
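To make these semantics concrete, here is a minimal Python sketch (ours, not the actual implementation) that executes the zeros-and-ones machine above. It treats a missing transition as the end of tokenization, and appends a sentinel character so the final wildcard transition fires:

```python
# A minimal simulation of the zeros-and-ones machine above.
# Keys are (state, character); None stands in for the '*' wildcard.
TRANSITIONS = {
    ("0", None): ("zero_or_one", ["pushback"]),
    ("zero_or_one", "0"): ("zeros", ["mark"]),
    ("zero_or_one", "1"): ("ones", ["mark"]),
    ("zeros", "0"): ("zeros", []),
    ("zeros", None): ("zero_or_one", ["pushback", "emit:ZEROS"]),
    ("ones", "1"): ("ones", []),
    ("ones", None): ("zero_or_one", ["pushback", "emit:ONES"]),
}

def tokenize(text):
    tokens, state, mark, i = [], "0", 0, 0
    text += "\x00"  # sentinel so the final wildcard transition fires
    while i < len(text):
        ch = text[i]
        key = (state, ch) if (state, ch) in TRANSITIONS else (state, None)
        if key not in TRANSITIONS:
            break  # no transition defined: stop tokenizing
        state, actions = TRANSITIONS[key]
        for action in actions:
            if action == "mark":
                mark = i        # the mark is placed before the current character
            elif action == "pushback":
                i -= 1          # re-examine the current character next iteration
            elif action.startswith("emit:"):
                # spans from the mark up to and including the current character
                tokens.append((action[5:], text[mark:i + 1]))
        i += 1
    return tokens
```

For example, tokenize("0011101") produces the tokens ("ZEROS", "00"), ("ONES", "111"), ("ZEROS", "0"), ("ONES", "1").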
Note: The order in which the transition rules for a state are specified does not matter. Even if the * transition is listed first, as with ones above, it does not take precedence over other, more specific character sets.
After tokenizing a string with the above grammar, the result will be a sequence
of ZEROS and ONES tokens. Each of these will have three pieces of data
associated with it: the starting point (line and column), the end point (also
line and column), and the characters that make up the token. Note that emit
accepts a second argument (which must be a string) as well. For example, the
transition for code when reaching a newline is:
feed = '\r' or '\n'
...
code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
This has the effect of normalizing end of line characters to be \n.
Note: The replacement text may have a different length than the span from the most recent
mark. This may not be desirable.
The above snippet introduces another action: newline. This has the effect of
resetting the column counter to zero and incrementing the line counter.
Note: There are some peculiarities about newlines, and the tokenizer will get confused if they are not handled through the newline action.
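A sketch of the bookkeeping the newline action performs, assuming lines are counted from 1 and columns from 0, and that each ordinary character advances the column by one (the class and method names are ours):

```python
# Illustrative position tracking for tokens.
class Position:
    def __init__(self):
        self.line = 1
        self.column = 0

    def advance(self):
        # An ordinary character moves the column forward.
        self.column += 1

    def newline(self):
        # The newline action: bump the line counter, reset the column.
        self.line += 1
        self.column = 0
```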
The last two actions have to do with maintaining a stack of parsing tables. At
all points, the behavior of the tokenizer is governed by the table that is on
top of the stack. The push action pushes the specified table (given as an
argument) on top of this stack. Naturally, the pop action does the opposite,
discarding the top element.
This leaves the final point of interest: what decides which transitions are "active" at a given point?
The way this functions is essentially like method dispatch in Python (though thankfully there is no multiple inheritance). Thus, given the current state and the current character, we first look in the table on top of the stack. If this table does not have a transition for the given state and character, we next look at the table it inherits from, and so forth.
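This lookup order can be sketched in Python as follows; the table contents, names, and data layout here are invented for illustration:

```python
# Each table has an optional parent and a rule map keyed by (state, char).
TABLES = {
    "base":  {"parent": None,   "rules": {("code", "a"): "in_word"}},
    "child": {"parent": "base", "rules": {("code", "#"): "comment"}},
}

def find_transition(stack, state, char):
    """Look up a transition, walking the inheritance chain of the top table."""
    table = stack[-1]  # only the table on top of the stack governs behavior
    while table is not None:
        rules = TABLES[table]["rules"]
        if (state, char) in rules:
            return rules[(state, char)]
        table = TABLES[table]["parent"]  # fall back to the inherited table
    return None
```

Here find_transition(["child"], "code", "a") falls through to the base table and finds "in_word", while a character with no rule anywhere in the chain yields no transition at all.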