mirror of
https://github.com/github/codeql.git
synced 2026-04-27 17:55:19 +02:00
Python: Copy Python extractor to codeql repo
This commit is contained in:
172
python/extractor/tokenizer_generator/README.md
Normal file
@@ -0,0 +1,172 @@
# The Python tokenizer

This file describes the syntax and operational semantics of the state machine
that underlies our tokenizer.

## The state machine syntax

The state machine is described in a declarative fashion in the
`state_transition.txt` file. This file contains a sequence of declarations, as
described in the following subsections.

Additionally, lines may contain comments indicated using the `#` character, as
in Python itself.

In the remainder of the document, "identifier" means any sequence of characters
starting with a letter (`a-z` or `A-Z`) and followed by a sequence of letters,
digits, and/or underscores.

### Start declarations
This has the form `start: ` followed by the name of a table. It is used to
indicate which table is used as the starting point for the tokenization.

There should be exactly one of these declarations in the file.

### Alias declarations
These have the form
```
identifier = id_or_char or id_or_char or ...
```
where `id_or_char` is either a single character surrounded by single quotes
(e.g. `'a'`) or an identifier defined in another alias declaration.

Thus, aliases define _sets_ of characters: single-quoted characters representing
singleton sets, and `or` being set union.
> Note: A few character classes are predefined:
> - `ERROR` representing the error state of the state machine,
> - `IDENTIFIER` representing characters that can appear at the start of
>   a Unicode identifier, and
> - `IDENTIFIER_CONTINUE` representing characters that can appear
>   within a Unicode identifier.

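To make the syntax concrete, here is a small, invented example of alias declarations (the names are illustrative and do not come from `state_transition.txt`):

```
zero    = '0'
nonzero = '1' or '2' or '3' or '4' or '5' or '6' or '7' or '8' or '9'
digit   = zero or nonzero
```

Here `zero` and `nonzero` are singleton and nine-element sets respectively, and `digit` is their union.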
### Table declarations
These have the form

```
table header {
    state_transition
    state_transition
    ...
}
```
where `header` is either an identifier or an identifier followed by another
identifier surrounded by parentheses. The latter implements a form of
"inheritance" between tables, and is explained in a later section.

The format of `state_transition`s is described in the next subsection.

### State transitions
Each state transition has the following form:
```
set_of_before_states -> after_state for set_of_characters optional_actions
```
Here, `set_of_before_states` is either a single identifier or a list of identifiers
with `or`s interspersed (mimicking the way sets of characters are specified) and
`after_state` is an identifier. These identifiers do not have to be declared
separately; they are implicitly declared when used.
> Note: A special state `0` (in the table indicated with the `start: `
> declaration) represents the starting state for the entire tokenization.

The `set_of_characters` can either be
- the identifier corresponding to an alias,
- a single character (e.g. `'a'`),
- a list of sets of characters with `or`s interspersed, or
- an asterisk `*` representing _all_ characters that do not already have a
  transition defined for the set of "before" states.

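For instance, a single hypothetical rule combining several of these elements (the state names and the `digit` alias are invented for illustration) might read:

```
# Both foo and bar move to baz when a digit or an underscore is seen.
foo or bar -> baz for digit or '_'
```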
After the state transition is an optional list of actions, described next.

### Actions
Actions are specified using the keyword `do`. After this keyword, one or more
actions may be specified, each terminated with `;`, e.g.
```
foo -> bar for 'a' do action1; action2;
```
As the actions are very operational in nature, they will be described when we go
into the operational semantics of the state machine.

## Informal operational semantics
> Note: What follows is not based on a reading of the source code, but on
> experience from working with and modifying the state machine. There may be
> significant inaccuracies.

At a high level, the purpose of the tokenizer is to partition the given input
into a sequence of strings representing tokens. The decision of where to put the
boundaries between these strings is made on a character-by-character basis. To
mark the start of a token, the action `mark` is used. Note that the mark is
placed _before_ the character that caused the action to be executed. That is, in
the following transition rule
```
foo -> bar for 'a' do mark;
```
the mark is placed _before_ the `a`.

Once the end of a token has been reached, the `emit` action is used. This
creates a token from the part of the input spanning from the most recent `mark`
up to (and including) the character that caused the transition to which the
`emit` action is attached.

As an example, consider the following state machine that splits a sequence of
zeroes and ones into tokens consisting of (maximal) runs of each character:

```
start: default
table default {

    # This is essentially just an unconditional state transition.
    0 -> zero_or_one for * do pushback;

    zero_or_one -> zeros for '0' do mark;
    zero_or_one -> ones for '1' do mark;

    zeros -> zeros for '0'
    zeros -> zero_or_one for * do pushback; emit(ZEROS);

    ones -> zero_or_one for * do pushback; emit(ONES);
    ones -> ones for '1'
}
```
The `pushback` action has the effect of "pushing back" the current character.
(In reality, all this does is move the pointer to the current character one step
back. It is thus not a problem to have several pushbacks in a row.)

> Note: The order in which the transition rules for a state are specified does
> not matter. Even if the `*` transition is listed first, as with `ones` above, it
> does not take precedence over other more specific character sets.

After tokenizing a string with the above grammar, the result will be a sequence
of `ZEROS` and `ONES` tokens. Each of these will have three pieces of data
associated with it: the starting point (line and column), the end point (also
line and column), and the characters that make up the token. Note that `emit`
also accepts an optional second argument (which must be a string). For example, the
transition for code when reaching a newline is:
```
feed = '\r' or '\n'
...
code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
```
This has the effect of normalizing end-of-line characters to be `\n`.

> Note: The replacement text may have a different length than the distance to the
> most recent `mark`. This may not be desirable.

The above snippet introduces another action: `newline`. This has the effect of
resetting the column counter to zero and incrementing the line counter.

> Note: There are some peculiarities about newlines, and the tokenizer will get
> confused if they are not handled through the `newline` action.

The last two actions have to do with maintaining a stack of parsing tables. At
all points, the behavior of the tokenizer is governed by the table that is on
top of the stack. The `push` action pushes the specified table (given as an
argument) on top of this stack. Naturally, the `pop` action does the opposite,
discarding the top element.

This leaves the final point of interest: what decides which transitions are
"active" at a given point?

The way this functions is essentially like method dispatch in Python (though
thankfully there is no multiple inheritance). Thus, given the current state and
the current character, we first look in the table on top of the stack. If this
table does not have a transition for the given state and character, we next look
at the table it inherits from, and so forth.

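This lookup rule can be sketched as follows. This is an illustrative model, not the actual implementation; the table and state names are invented for the example:

```python
# Sketch of transition lookup: only the table on top of the stack is active,
# and a missing entry falls back along the (single) inheritance chain.

class Table:
    def __init__(self, name, transitions, parent=None):
        self.name = name
        self.transitions = transitions  # (state, char) -> next_state
        self.parent = parent            # single inheritance only

def lookup(table_stack, state, char):
    table = table_stack[-1]             # only the top of the stack is consulted
    while table is not None:
        if (state, char) in table.transitions:
            return table.transitions[(state, char)]
        table = table.parent            # fall back, like attribute lookup
    raise KeyError((state, char))

base = Table("base", {("code", "a"): "code"})
strings = Table("strings", {("code", '"'): "end_string"}, parent=base)

stack = [base, strings]                 # `strings` has been pushed
print(lookup(stack, "code", '"'))       # found in `strings` itself
print(lookup(stack, "code", "a"))       # inherited from `base`
```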
155
python/extractor/tokenizer_generator/compiled.py
Normal file
@@ -0,0 +1,155 @@

import unicodedata
from . import machine


class SuperState:

    def __init__(self, name, mapping):
        self.name = name
        self.mapping = mapping

    def as_list_of_bytes(self):
        lst = dict_to_list(self.mapping)
        return [ table.as_bytes() for table in lst ]

    def as_list_of_transitions(self):
        return dict_to_list(self.mapping)

action_id = 0
all_actions = {}

class ActionList:

    def __init__(self, actions, id):
        self.actions = actions
        self.id = id

    @staticmethod
    def get(actions):
        global action_id
        assert isinstance(actions, tuple)
        if actions not in all_actions:
            all_actions[actions] = ActionList(actions, action_id)
            action_id += 1
        return all_actions[actions]

    @staticmethod
    def listall():
        return sorted(all_actions.values(), key = lambda al: al.id)

next_pair_id = 0
pairs = {}

class StateActionListPair:

    def __init__(self, state, actionlist, id):
        self.state = state
        self.actionlist = actionlist
        self.id = id

    @staticmethod
    def get(state, actionlist):
        global next_pair_id
        if actionlist is not None and not isinstance(actionlist, ActionList):
            actionlist = ActionList.get(actionlist)
        if (state, actionlist) not in pairs:
            pairs[(state, actionlist)] = StateActionListPair(state, actionlist, next_pair_id)
            next_pair_id += 1
        return pairs[(state, actionlist)]

    @staticmethod
    def listall():
        return sorted(pairs.values(), key = lambda pair: pair.id)

next_table_id = 0
table_ids = {}

class StateTransitionTable:

    def __init__(self, mapping):
        self.mapping = mapping

    def as_bytes(self):
        lst = dict_to_list(self.mapping)
        return bytes(pair.id for pair in lst)

    def __getitem__(self, key):
        return self.mapping[key]

    @property
    def id(self):
        global next_table_id
        b = self.as_bytes()
        if b not in table_ids:
            table_ids[b] = next_table_id
            next_table_id += 1
        return table_ids[b]

def dict_to_list(mapping):
    assert isinstance(mapping, dict)
    result = []
    for key, value in mapping.items():
        while key.id >= len(result):
            result.append(None)
        result[key.id] = value
    return result


#Each character is one of id-start, id-continuation or other. Represent "other" as ERROR for all non-ascii characters.
#See https://www.python.org/dev/peps/pep-3131 for an explanation of what is an identifier.
OTHER_START = {0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C}
OTHER_CONTINUE = {0x00B7, 0x0387, 0x19DA}
OTHER_CONTINUE.update(range(0x1369, 0x1372))
ID_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONT_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}

CHUNK_SIZE = 64

class IdentifierTable:

    def __init__(self):
        classes = []
        for i in range(0x110000):
            try:
                c = chr(i)
            except:
                continue
            cat = unicodedata.category(c)
            if cat in ID_CATEGORIES or i in OTHER_START:
                cls = machine.IDENTIFIER_CLASS.id
            elif cat in CONT_CATEGORIES or i in OTHER_CONTINUE:
                cls = machine.IDENTIFIER_CONTINUE_CLASS.id
            else:
                cls = machine.ERROR_CLASS.id
            assert cls in (0,1,2,3)
            classes.append(cls)
        result = []
        for i, cls in enumerate(classes):
            byte, bits = i>>2, cls<<((i&3)*2)
            while byte >= len(result):
                result.append(0)
            result[byte] |= bits
        while result[-1] == 0:
            result.pop()
        while len(result) % CHUNK_SIZE:
            result.append(0)
        self.table = result

    def as_bytes(self):
        return bytes(self.table)

    def as_two_level_table(self):
        index = []
        chunks = {}
        next_id = 0
        the_bytes = self.as_bytes()
        for n in range(0, len(the_bytes), CHUNK_SIZE):
            chunk = the_bytes[n:n+CHUNK_SIZE]
            if chunk in chunks:
                index.append(chunks[chunk])
            else:
                index.append(next_id)
                chunks[chunk] = next_id
                next_id += 1
        chunks = [ chunk for (i, chunk) in sorted((i, chunk) for chunk, i in chunks.items())]
        return chunks, index
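The two-level table built by `as_two_level_table` can be consulted as sketched below. This lookup routine is illustrative, not code from this package; it assumes the packing used in `IdentifierTable` (four 2-bit classes per byte, `CHUNK_SIZE`-byte chunks, trailing all-zero bytes trimmed, with class 0 being the ERROR class):

```python
# Illustrative lookup into a two-level identifier table.

CHUNK_SIZE = 64

def classify(chunks, index, codepoint):
    byte_pos = codepoint >> 2                       # which packed byte
    chunk_no, offset = divmod(byte_pos, CHUNK_SIZE)
    if chunk_no >= len(index):                      # beyond the trimmed table
        return 0                                    # class 0 == ERROR
    byte = chunks[index[chunk_no]][offset]
    return (byte >> ((codepoint & 3) * 2)) & 3      # extract the 2-bit class

# Tiny hand-built table: code point 5 gets class 2, everything else class 0.
chunk = bytearray(CHUNK_SIZE)
chunk[5 >> 2] = 2 << ((5 & 3) * 2)
chunks, index = [bytes(chunk)], [0]
print(classify(chunks, index, 5))   # -> 2
print(classify(chunks, index, 6))   # -> 0
```

Sharing identical chunks through the index is what keeps the full-Unicode table small.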
225
python/extractor/tokenizer_generator/gen_state_machine.py
Normal file
@@ -0,0 +1,225 @@
'''
Generate a state-machine based tokenizer from a state transition description and a template.

Parses the state transition description to compute a set of transition tables.
Each table maps (state, character-class) pairs to (state, action) pairs.
During tokenization each input character is converted to a class, then a new state and action is
looked up using the current state and character-class.

The generated tables are:
CLASS_TABLE:
    Maps ASCII code points to character class.
ID_TABLE:
    Maps all unicode points to one of Identifier, Identifier-continuation, or other.
The transition tables:
    Each table maps each state to a per-class transition table.
    Each per-class transition table maps each character-class to an index in the action table.
ACTION_TABLE:
    Embedded in code as `action_table`; maps each index to a (state, action) pair.

Since the number of character-classes, states and (state, action) pairs is small, everything is represented as
a byte, and tables as `bytes` objects for Python 3, or `array.array` objects for Python 2.
'''


from .parser import parse
from . import machine
from .compiled import StateActionListPair, IdentifierTable

def emit_id_bytes(id_table):
    chunks, index = id_table.as_two_level_table()
    print("# %d entries in ID index" % len(index))
    index_bytes = bytes(index)
    print("ID_INDEX = toarray(")
    for n in range(0, len(index_bytes), 32):
        print("    %r" % index_bytes[n:n+32])
    print(")")
    print("ID_CHUNKS = (")
    for chunk in chunks:
        print("    toarray(%r)," % chunk)
    print(")")

def emit_transition_table(table, verbose=False):
    print("%s = (" % table.name.upper(), end="")
    for trans in table.as_list_of_transitions():
        print("B%02d," % trans.id, end=" ")
    print(")")

emitted_rows = set()

def emit_rows(table):
    for trans in table.as_list_of_transitions():
        id = trans.id
        if id in emitted_rows:
            continue
        emitted_rows.add(id)
        print("B%02d = toarray(%r)" % (id, trans.as_bytes()))

action_names = {}
next_action_id = 0

def get_action_id(action):
    global next_action_id
    assert action is not None
    if action in action_names:
        return action_names[action]
    result = next_action_id
    next_action_id += 1
    action_names[action] = result
    return result

def emit_actions(table, indent=""):
    for pair in table:
        if pair.actionlist is None:
            continue
        action = pair.actionlist
        get_action_function(action, indent)

def generate_action_table(table, indent):
    result = []
    result.append(indent + "action_table = [\n    " + indent)
    for i, pair in enumerate(table):
        if pair.actionlist is None:
            result.append("(%d, None), " % pair.state.id)
        else:
            result.append("(%d, self.action_%s), " % (pair.state.id, pair.actionlist.id))
        if (i & 3) == 3:
            result.append("\n    " + indent)
    result.append("\n" + indent + "]")
    return "".join(result)

action_functions = set()

def get_action_function(actionlist, indent=""):
    if actionlist in action_functions:
        return
    action_functions.add(actionlist)
    last = actionlist.actions[-1]
    print(indent + "def action_%d(self):" % actionlist.id)
    emit = False
    for action in actionlist.actions:
        if action is machine.PUSHBACK:
            print(indent + "    self.index -= 1")
            continue
        elif action is machine.POP:
            print(indent + "    self.super_state = self.state_stack.pop()")
        elif isinstance(action, machine.Push):
            print(indent + "    self.state_stack.append(self.super_state)")
            print(indent + "    self.super_state = %s" % action.state.name.upper())
        elif action is machine.MARK:
            print(indent + "    self.token_start_index = self.index")
            print(indent + "    self.token_start = self.line, self.index-self.line_start_index")
        elif isinstance(action, machine.Emit):
            emit = True
            print(indent + "    end = self.line, self.index-self.line_start_index+1")
            if action.text is None:
                print(indent + "    result = [%s, self.text[self.token_start_index:self.index+1], self.token_start, end]" % action.kind)
            else:
                print(indent + "    result = [%s, u%s, (self.line, self.index-self.line_start_index), end]" % (action.kind, action.text))
            print(indent + "    self.token_start = end")
            print(indent + "    self.token_start_index = self.index+1")
        elif action is machine.NEWLINE:
            print(indent + "    self.line_start_index = self.index+1")
            print(indent + "    self.line += 1")
        elif action is machine.EMIT_INDENT:
            assert action is last
            print(indent + "    return self.emit_indent()")
            print()
            return
        else:
            assert False, "Unexpected action: %s" % action
    print(indent + "    self.index += 1")
    if emit:
        print(indent + "    return result")
    else:
        print(indent + "    return None")
    print()
    return

def emit_char_classes(char_classes, verbose=False):
    for cls in sorted(set(char_classes.values()), key=lambda x : x.id):
        print("#%d = %r" % (cls.id, cls))
    table = [None] * 128
    by_id = {
        machine.IDENTIFIER_CLASS.id : machine.IDENTIFIER_CLASS,
        machine.IDENTIFIER_CONTINUE_CLASS.id : machine.IDENTIFIER_CONTINUE_CLASS,
        machine.ERROR_CLASS.id : machine.ERROR_CLASS
    }
    for c, cls in char_classes.items():
        by_id[cls.id] = cls
        if c is machine.IDENTIFIER or c is machine.IDENTIFIER_CONTINUE:
            continue
        table[ord(c)] = cls.id
        by_id[cls.id] = cls
    for i in range(128):
        assert table[i] is not None
    bytes_table = bytes(table)
    if verbose:
        print("# Class Table")
        for i in range(len(bytes_table)):
            b = bytes_table[i]
            print("# %r -> %s" % (chr(i), by_id[b]))
    print("CLASS_TABLE = toarray(%r)" % bytes_table)



PREFACE = """
import codecs
import re
import sys

from blib2to3.pgen2.token import *

if sys.version < '3':
    from array import array
    def toarray(b):
        return array('B', b)
else:
    def toarray(b):
        return b
"""

def main():
    verbose = False
    import sys
    if len(sys.argv) != 3:
        print("Usage %s DESCRIPTION TEMPLATE" % sys.argv[0])
        sys.exit(1)
    descriptionfile = sys.argv[1]
    with open(descriptionfile) as fd:
        m = machine.Machine.load(fd.read())
    templatefile = sys.argv[2]
    with open(templatefile) as fd:
        template = fd.read()
    print("# This file is AUTO-GENERATED. DO NOT MODIFY")
    print('# To regenerate: run "python3 -m tokenizer_generator.gen_state_machine %s %s"' % (descriptionfile, templatefile))
    print(PREFACE)
    print("IDENTIFIER_CLASS = %d" % machine.IDENTIFIER_CLASS.id)
    print("IDENTIFIER_CONTINUE_CLASS = %d" % machine.IDENTIFIER_CONTINUE_CLASS.id)
    print("ERROR_CLASS = %d" % machine.ERROR_CLASS.id)
    emit_id_bytes(IdentifierTable())
    char_classes = m.get_classes()
    emit_char_classes(char_classes, verbose)
    print()
    tables = [state.compile(char_classes) for state in m.states.values() ]
    for table in tables:
        emit_rows(table)
    print()
    for table in tables:
        #pprint(table)
        emit_transition_table(table, verbose)
    print()
    print("TRANSITION_STATE_NAMES = {")
    for state in m.states.values():
        print("    id(%s): '%s'," % (state.name.upper(), state.name))
    print("}")
    print("START_SUPER_STATE = %s" % m.start.name.upper())
    prefix, suffix = template.split("#ACTIONS-HERE")
    print(prefix)
    actions = StateActionListPair.listall()
    emit_actions(actions, "    ")
    action_table = generate_action_table(actions, "    ")
    print(suffix.replace("#ACTION_TABLE_HERE", action_table))

if __name__ == "__main__":
    main()
485
python/extractor/tokenizer_generator/machine.py
Normal file
@@ -0,0 +1,485 @@

import ast

from .parser import parse
from collections import defaultdict
from .compiled import SuperState, StateTransitionTable, StateActionListPair


class Transition:

    def __init__(self, from_state, to_state, what, do):
        assert isinstance(from_state, State)
        assert isinstance(to_state, State)
        self.from_state = from_state
        self.what = what
        if not do:
            do = None
        else:
            assert isinstance(do, list)
            for item in do:
                assert isinstance(item, Action)
            do = tuple(do)
        self.action = StateActionListPair.get(to_state, do)

    def dump(self):
        if self.action.actionlist:
            return "%s -> %s for %s do %s" % (
                self.from_state,
                self.action.state,
                self.what,
                "; ".join(str(do) for do in self.action.actionlist.actions)
            )
        else:
            return "%s -> %s for %s" % (
                self.from_state,
                self.action.state,
                self.what
            )

next_state_id = 1
states = {}

class State:

    def __init__(self, name):
        global next_state_id
        if name.isdigit():
            assert name == "0"
            self.id = 0
            self.name = "START"
        else:
            self.name = name
            self.id = next_state_id
            next_state_id += 1

    @staticmethod
    def get(name):
        if name not in states:
            states[name] = State(name)
        return states[name]

    @staticmethod
    def count():
        return len(states)

    def __repr__(self):
        return "state_%s(%s)" % (self.id, self.name)

    @staticmethod
    def from_id(id):
        for state in states.values():
            if state.id == id:
                return state
        raise ValueError(id)

State.get("0")
ERROR_ACTION = StateActionListPair.get(State.get("error"), None)

next_super_state_id = 0
super_states = {}

class TransitionTable:

    def __init__(self, name):
        global next_super_state_id
        self.name = name
        self.id = next_super_state_id
        next_super_state_id += 1
        self.parent = None
        self.transitions = []
        self._table = None

    def add_transition(self, trans):
        self.transitions.append(trans)

    def dump(self):
        if self.parent:
            lines = [ "TransitionTable %s(%s extends %s)" % (self.id, self.name, self.parent.name) ]
        else:
            lines = [ "TransitionTable %s(%s):" % (self.id, self.name) ]
        lines.extend("    " + t.dump() for t in self.transitions)
        return "\n".join(lines)

    @staticmethod
    def get(name, parent=None):
        if name not in super_states:
            super_states[name] = TransitionTable(name)
        return super_states[name]

    @staticmethod
    def count():
        return len(super_states)

    def get_table(self, character_classes):
        '''Returns the transition table for all states in this super-state'''
        if self._table is None:
            from_transitions = defaultdict(list)
            for t in self.transitions:
                from_transitions[t.from_state].append(t)
            self._table = { state: self.get_transition_table(state, from_transitions.get(state, ()), character_classes) for state in states.values() }
        return self._table

    def get_transition_table(self, state, transition_list, character_classes):
        table = {}
        if self.parent:
            parent_table = self.parent.get_table(character_classes)
        else:
            parent_table = None
        default = None
        for t in transition_list:
            assert state == t.from_state
            if isinstance(t.what, Any):
                default = t.action
                continue
            action = t.action
            classes = set(character_classes[c] for c in t.what)
            for cls in classes:
                if cls in table:
                    raise ValueError("Duplicate transition from %s on %s" % (state, cls))
                else:
                    table[cls] = action
        on_identifier = table.get(IDENTIFIER_CLASS, None)
        for cls in character_classes.values():
            if cls in table:
                continue
            if on_identifier and cls.is_identifier:
                table[cls] = on_identifier
            elif default:
                table[cls] = default
            elif parent_table and state in parent_table:
                table[cls] = parent_table[state][cls]
            else:
                table[cls] = ERROR_ACTION
        return StateTransitionTable(table)

    def compile(self, character_classes):
        return SuperState(self.name, self.get_table(character_classes))

class Any:

    def __repr__(self):
        return "*"

class Action:

    def __repr__(self):
        return self.__class__.__name__.lower()

class Emit(Action):

    def __init__(self, kind, text):
        assert isinstance(kind, str)
        assert kind.upper() == kind
        self.kind = kind
        self.text = text

    def __repr__(self):
        if self.text is None:
            return "emit(" + self.kind + ")"
        else:
            return "emit(%s, %r)" % (self.kind, self.text)

    def __eq__(self, other):
        return type(other) is Emit and other.kind == self.kind and other.text == self.text

    def __hash__(self):
        return 353 ^ hash(self.kind) ^ hash(self.text)

class Push(Action):

    def __init__(self, state):
        assert isinstance(state, TransitionTable)
        self.state = state

    def __repr__(self):
        return "push(%s)" % self.state.name

    def __eq__(self, other):
        return type(other) is Push and other.state == self.state

    def __hash__(self):
        return 59 ^ hash(self.state)

class EmitIndent(Action):
    pass
EMIT_INDENT = EmitIndent()

class Pop(Action):
    pass
POP = Pop()

class Pushback(Action):
    pass
PUSHBACK = Pushback()

class Mark(Action):
    pass
MARK = Mark()

class Newline(Action):
    pass
NEWLINE = Newline()

class Identifier:

    def __repr__(self):
        return "UnicodeIdentifiers()"

IDENTIFIER = Identifier()

class IdentifierContinue:

    def __repr__(self):
        return "IdentifierContinue()"

IDENTIFIER_CONTINUE = IdentifierContinue()

next_char_class_id = 0

class CharacterClass:

    def __init__(self, chars, is_identifier = None):
        global next_char_class_id
        self.chars = chars
        self.id = next_char_class_id
        next_char_class_id += 1
        if is_identifier is None:
            self.is_identifier = chars.copy().pop().isidentifier()
        else:
            self.is_identifier = is_identifier

    def __repr__(self):
        if self == IDENTIFIER_CLASS:
            return "IDENTIFIER_CLASS(%d)" % self.id
        elif self == ERROR_CLASS:
            return "ERROR_CLASS(%d)" % self.id
        else:
            return "CharacterClass %s %r" % (self.id, sorted(self.chars))

ERROR_CLASS = CharacterClass(set(), False)
assert ERROR_CLASS.id == 0
IDENTIFIER_CLASS = CharacterClass(set(), True)
IDENTIFIER_CONTINUE_CLASS = CharacterClass(set(), False)

class Machine:

    def __init__(self):
        self.aliases = {}
        self.states = {}
        self.aliases["IDENTIFIER"] = IDENTIFIER
        self.aliases["IDENTIFIER_CONTINUE"] = IDENTIFIER_CONTINUE
        self.aliases['SPACE'] = {' '}
        self.start = None

    def add_state(self, name):
        assert name not in self.states
        result = TransitionTable.get(name)
        self.states[name] = result
        return result

    def add_alias(self, name, choices):
        assert name not in self.aliases
        assert isinstance(choices, set), choices
        self.aliases[name] = choices

    def dump(self):
        r = []
        a = r.append
        a("Starting super-state: %s" % self.start.name)
        a("")
        a("Aliases:")
        for name_alias in self.aliases.items():
            a("    %s = %r" % name_alias)
        a("")
        for name, state in self.states.items():
            a(state.dump())
        return "\n".join(r)

    @staticmethod
    def load(src):
        tree = parse(src)
        m = Machine()
        w = Walker(m)
        w.visit(tree)
        return m

    def get_classes(self):
        '''Get the character classes for this machine'''
        #There are two predefined classes: Unicode identifiers, and ERROR.
        #A character class is a set of characters, such that the transitions
        #and actions of the machine are identical for all characters in that class.
        char_to_transitions = defaultdict(set)
        for s in self.states.values():
            for t in s.transitions:
                w = t.what
                if isinstance(w, Any):
                    continue
                for c in w:
                    if c is IDENTIFIER or c is IDENTIFIER_CONTINUE:
                        continue
                    char_to_transitions[c].add((s, t.from_state, t.action))
        equivalence_sets = defaultdict(set)
        for c, transition_set in sorted(char_to_transitions.items()):
            equivalence_sets[frozenset(transition_set)].add(c)
        classes = {}
        for char_set in sorted(equivalence_sets.values()):
            charcls = CharacterClass(char_set)
            for c in char_set:
                classes[c] = charcls
        classes[IDENTIFIER] = IDENTIFIER_CLASS
        classes[IDENTIFIER_CONTINUE] = IDENTIFIER_CONTINUE_CLASS
        for i in range(128):
            c = chr(i)
            if c not in classes:
                if c.isidentifier():
                    classes[c] = IDENTIFIER_CLASS
                elif c in "0123456789":
                    classes[c] = IDENTIFIER_CONTINUE_CLASS
                else:
                    classes[c] = ERROR_CLASS
        for cls in classes.values():
            if cls is IDENTIFIER_CLASS or cls is IDENTIFIER_CONTINUE_CLASS or cls is ERROR_CLASS:
                continue
            assert { c for c in cls.chars if c.isidentifier() } == cls.chars or not { c for c in cls.chars if c.isidentifier() }
        return classes

class Walker:

    def __init__(self, machine):
        self.machine = machine

    def visit(self, node):
        if hasattr(node, "type"):
            tag = node.type
        else:
            tag = node.data
        meth = getattr(self, "visit_" + tag, None)
        if meth is None:
            self.fail(node, tag)
        else:
            return meth(node)

    def fail(self, node, tag):
        print(node)
        raise NotImplementedError(tag)

    def visit_first_child(self, node):
        assert len(node.children) == 1
        return self.visit(node.children[0])

    def visit_children(self, node):
        return [ self.visit(child) for child in node.children ]

    visit_start = visit_first_child
    visit_machine = visit_children
    visit_declaration = visit_first_child

    def visit_alias_decl(self, node):
        assert len(node.children) == 2
        choice = self.visit(node.children[1])
        self.machine.add_alias(node.children[0].value, choice)

    def visit_alias(self, node):
        return self.machine.aliases[node.children[0].value]

    def visit_char(self, node):
        c = ast.literal_eval(node.children[0].value)
        assert isinstance(c, str), c
        assert len(c) == 1, c
        return c

    def visit_choice(self, node):
        #Convert choices into a set of characters
        result = set()
        for child in node.children:
            item = self.visit(child)
            if isinstance(item, set):
                result.update(item)
            else:
                result.add(item)
        return result

    visit_item = visit_first_child

    def visit_table_decl(self, node):
        self.current_state = self.visit(node.children[0])
        for transition in node.children[1:]:
            self.visit(transition)

    def visit_table_header(self, node):
|
||||
name = node.children[0].value
|
||||
state = self.machine.add_state(name)
|
||||
if len(node.children) > 1:
|
||||
base = TransitionTable.get(node.children[1].value)
|
||||
state.parent = base
|
||||
return state
|
||||
|
||||
def visit_transition(self, node):
|
||||
# state_choice "->" state "for" (choice | "*") action_list?
|
||||
from_states = self.visit(node.children[0])
|
||||
to_state = self.visit(node.children[1])
|
||||
what = self.visit(node.children[2])
|
||||
if len(node.children) > 3:
|
||||
do = self.visit(node.children[3])
|
||||
else:
|
||||
do = []
|
||||
for state in from_states:
|
||||
trans = Transition(state, to_state, what, do)
|
||||
self.current_state.add_transition(trans)
|
||||
|
||||
visit_state_choice = visit_children
|
||||
|
||||
def visit_state(self, node):
|
||||
return State.get(node.children[0].value)
|
||||
|
||||
def visit_any(self, node):
|
||||
return Any()
|
||||
|
||||
visit_action_list = visit_children
|
||||
visit_action = visit_first_child
|
||||
|
||||
def visit_emit(self, node):
|
||||
if len(node.children) == 2:
|
||||
return Emit(node.children[0].value, self.visit(node.children[1]))
|
||||
else:
|
||||
return Emit(node.children[0].value, None)
|
||||
|
||||
def visit_optional_text(self, node):
|
||||
return node.children[0].value
|
||||
|
||||
def visit_push(self, node):
|
||||
state = TransitionTable.get(node.children[0].value)
|
||||
return Push(state)
|
||||
|
||||
def visit_emit_indent(self, node):
|
||||
return EMIT_INDENT
|
||||
|
||||
def visit_pushback(self, node):
|
||||
return PUSHBACK
|
||||
|
||||
def visit_pop(self, node):
|
||||
return POP
|
||||
|
||||
def visit_mark(self, node):
|
||||
return MARK
|
||||
|
||||
def visit_newline(self, node):
|
||||
return NEWLINE
|
||||
|
||||
def visit_start_decl(self, node):
|
||||
self.machine.start = TransitionTable.get(node.children[0].value)
|
||||
|
||||
|
||||
def main():
|
||||
import sys
|
||||
file = sys.argv[1]
|
||||
with open(file) as fd:
|
||||
tree = parse(fd.read())
|
||||
m = Machine()
|
||||
w = Walker(m)
|
||||
w.visit(tree)
|
||||
print(m.dump())
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
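The equivalence-class computation in `get_classes` above can be illustrated in isolation: characters are grouped by the set of transitions they trigger, so the generated tables need only one column per class rather than one per character. A minimal stdlib-only sketch (the transition data here is invented for illustration, not taken from the real machine):

```python
from collections import defaultdict

# Hypothetical per-character transitions: char -> set of (state, target) pairs.
transitions = {
    "a": {("0", "ident")}, "b": {("0", "ident")},  # behave identically
    "0": {("0", "int")},   "1": {("0", "int")},    # behave identically
    "+": {("0", "op")},
}

# Group characters whose transition sets are identical into one class,
# mirroring the frozenset-keyed equivalence_sets dict in get_classes.
equivalence = defaultdict(set)
for char, trans in transitions.items():
    equivalence[frozenset(trans)].add(char)

classes = sorted(sorted(chars) for chars in equivalence.values())
print(classes)  # → [['+'], ['0', '1'], ['a', 'b']]
```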
92
python/extractor/tokenizer_generator/parser.py
Normal file
@@ -0,0 +1,92 @@
'''
Explanation of the syntax

start_decl: The starting transition table
alias_decl: Declare a shorthand, e.g. digits = '0' or '1' or ...
table_decl: Declare a transition table: name and list of transitions.
transition: Transitions from one state to another. The form is: state (or choice of states) -> new-state for possible-characters [ do action or actions; ]
action: Actions are:
    "emit(kind [, text])": emits a token of kind using the given text or text from the stream. The token starts at the last mark and ends at the current location.
    "push(table)": pushes a transition table to the stack.
    "pop": pops a transition table from the stack.
    "pushback": pushes the last character back to the stream.
    "mark": marks the current location as the start of the next token.
    "emit_indent": Emits zero or more INDENT or DEDENT tokens depending on current indentation.
    "newline": Increments the line number and sets the column offset back to zero.

States:
All states are given names.
The state "0" is the start state and always exists.
All other states are implicitly defined when used (this is for Python after all :)
'*' means all states for which a transition is not explicitly defined.
So the transitions:
    0 -> end for '\n'
    0 -> other for *
    0 -> a_b for 'a' or 'b'
mean that '0' will transition to 'other' for all characters other than 'a', 'b' and '\n'.
The order of transitions in the state machine description is irrelevant.
'''


grammar = r"""
start : machine
machine : declaration+
declaration : alias_decl | table_decl | start_decl
start_decl : "start" ":" IDENTIFIER
table_decl : table_header "{" transition+ "}"
table_header : "table" IDENTIFIER ( "(" IDENTIFIER ")" )?
alias_decl : IDENTIFIER "=" choice
choice : item ( "or" item)*
item : alias | char
alias : IDENTIFIER
char : LITERAL
transition : state_choice "->" state "for" (choice | any) action_list?
any : "*"
state_choice : state ( "or" state)*
state : IDENTIFIER | DIGIT
action_list : "do" action ";" (action ";")*
action : emit | pop | push | pushback | mark | emit_indent | newline
emit : "emit" "(" IDENTIFIER optional_text? ")"
optional_text : "," LITERAL
pop : "pop"
push : "push" "(" IDENTIFIER ")"
pushback : "pushback"
mark : "mark"
emit_indent : "emit_indent"
newline : "newline"

LITERAL : ("\"" /[^"]/* "\"") | ("'" /[^']/* "'")
IDENTIFIER : LETTER ( LETTER | DIGIT | "_" )*
LETTER : "A".."Z" | "a".."z"
DIGIT : "0".."9"
WHITESPACE : (" " | "\t" | "\r" | "\n")+

%import common.NEWLINE
COMMENT : "#" /(.)*/ NEWLINE
%ignore WHITESPACE
%ignore COMMENT
"""


from lark import Lark


class Parser(Lark):

    def __init__(self):
        Lark.__init__(self, grammar, parser="earley", lexer="standard")


def parse(src):
    parser = Parser()
    return parser.parse(src)


def main():
    import sys
    file = sys.argv[1]
    with open(file) as fd:
        tree = parse(fd.read())
    print(tree.pretty())


if __name__ == "__main__":
    main()
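The `Walker` class in the generator dispatches on a node's tag via `getattr`, so each grammar rule gets its own `visit_<rule>` method. The pattern can be sketched standalone (the `Node` shape here is a stand-in for illustration, not lark's actual tree type):

```python
class Node:
    def __init__(self, data, children=()):
        self.data = data
        self.children = list(children)

class MiniWalker:
    # Dispatch to visit_<tag>, mirroring Walker.visit in the generator.
    def visit(self, node):
        meth = getattr(self, "visit_" + node.data, None)
        if meth is None:
            raise NotImplementedError(node.data)
        return meth(node)

    def visit_choice(self, node):
        # Union the character sets of all child items.
        result = set()
        for child in node.children:
            result.update(self.visit(child))
        return result

    def visit_char(self, node):
        return {node.children[0]}

tree = Node("choice", [Node("char", ["a"]), Node("char", ["b"])])
print(sorted(MiniWalker().visit(tree)))  # → ['a', 'b']
```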
385
python/extractor/tokenizer_generator/state_transition.txt
Normal file
@@ -0,0 +1,385 @@
# State machine specification for unified Python tokenizer
# Handles all tokens for all versions of Python, including partial string tokens for handling f-strings.
# Starting transition table is "default" and starting state is "0"
#
#


#declarations
prefix_chars = 'u' or 'U' or 'b' or 'B' or 'r' or 'R'
one_to_nine = '1' or '2' or '3' or '4' or '5' or '6' or '7' or '8' or '9'
digits = '0' or one_to_nine
oct_digits = '0' or '1' or '2' or '3' or '4' or '5' or '6' or '7'
hex_digits = digits or 'a' or 'A' or 'b' or 'B' or 'c' or 'C' or 'd' or 'D' or 'e' or 'E' or 'f' or 'F'
feed = '\n' or '\r'

#tables
table default {
    # 0 is starting state
    0 -> whitespace_line for * do pushback;

    #String prefix states

    # When we encounter a prefix character, we are faced with the possibility
    # that it is either the beginning of a string or of an identifier. With a
    # single character of lookahead available, we therefore have to be in an
    # intermediate state until we are able to determine which case we're in.

    code -> maybe_string1 for prefix_chars do mark;
    maybe_string1 -> maybe_string2 for prefix_chars
    maybe_string1 or maybe_string2 -> quote_s for "'"
    maybe_string1 or maybe_string2 -> quote_d for '"'
    code -> quote_s for "'" do mark;
    code -> quote_d for '"' do mark;
    maybe_string1 or maybe_string2 -> in_identifier for * do pushback;

    # In the following, `_s` means one single quote, `_ss` means two in a row,
    # etc. Likewise `_d` indicates double quotes.

    quote_s -> quote_ss for "'"
    quote_d -> quote_dd for '"'
    quote_s -> instring for * do pushback ; push(string_s);
    quote_ss -> instring for "'" do push(string_sss);
    quote_ss -> code for * do pushback ; emit(STRING);
    quote_d -> instring for * do pushback ; push(string_d);
    quote_dd -> instring for '"' do push(string_ddd);
    quote_dd -> code for * do pushback ; emit(STRING);

    #F-string prefix states

    # The prefixes `u` and `b` are specific to Python 2, and f-strings are only
    # valid for Python 3. Thus, the only potential prefixes are permutations of
    # `f` and `fr` (upper/lowercase notwithstanding).

    code -> maybe_fstring1 for 'f' or 'F' do mark;
    maybe_string1 -> maybe_fstring2 for 'f' or 'F'
    maybe_fstring1 -> maybe_fstring2 for 'r' or 'R'
    maybe_fstring1 or maybe_fstring2 -> fquote_s for "'"
    maybe_fstring1 or maybe_fstring2 -> fquote_d for '"'
    maybe_fstring1 or maybe_fstring2 -> in_identifier for * do pushback;
    fquote_s -> fquote_ss for "'"
    fquote_d -> fquote_dd for '"'
    fquote_s -> instring for * do pushback ; push(fstring_start_s);
    fquote_ss -> instring for "'" do push(fstring_start_sss);
    fquote_ss -> code for * do pushback ; emit(STRING);
    fquote_d -> instring for * do pushback ; push(fstring_start_d);
    fquote_dd -> instring for '"' do push(fstring_start_ddd);
    fquote_dd -> code for * do pushback ; emit(STRING);

    #String states
    instring -> instring for *
    instring -> unicode_or_escape for '\\'
    unicode_or_escape -> unicode_or_raw for 'N'
    unicode_or_raw -> unicode for '{'
    unicode_or_raw -> instring for *
    unicode -> instring for '}'
    unicode -> unicode for *
    unicode_or_escape -> escape for * do pushback;

    escape -> instring for feed do newline;
    escape -> instring for *

    # When inside a parenthesized expression, newlines indicate the continuation
    # of the expression, and not a return to a context where statements may
    # appear. This is captured using the `paren` table.

    code -> code for '(' do emit(LPAR, "("); push(paren);
    code -> code for '[' do emit(LSQB, "["); push(paren);
    code -> code for '{' do emit(LBRACE, "{"); push(paren);
    code -> code for ')' do emit(RPAR, ")");
    code -> code for ']' do emit(RSQB, "]");
    code -> code for '}' do emit(RBRACE, "}");
    code -> code for '`' do emit(BACKQUOTE, '`');

    # Operators
    code -> assign for '=' do mark;
    code -> le for '<' do mark;
    code -> ge for '>' do mark;
    code -> bang for '!' do mark;
    le -> binop for '<'
    le -> code for '>' do emit(OP);
    ge -> binop for '>'
    bang or le or ge or assign -> code for '=' do emit(OP);
    le or ge or assign -> code for * do pushback; emit(OP);
    bang -> code for 'r' or 'a' or 's' or 'd' do emit(CONVERSION);
    code -> colon for ':'
    colon -> code for '=' do emit(COLONEQUAL, ":=");
    colon -> code for * do pushback; emit(COLON, ":");
    code -> code for ',' do emit(COMMA, ",");
    code -> code for ';' do emit(SEMI, ";");
    code -> at for '@' do mark;
    at -> code for '=' do emit(OP);
    at -> code for * do pushback; emit(AT, "@");
    code -> dot for '.' do mark;
    dot -> float for digits
    dot -> code for * do pushback; emit(DOT, ".");
    binop or slash or star or dash -> code for '=' do emit(OP);
    binop or slash or star or dash -> code for * do pushback; emit(OP);
    code -> star for '*' do mark;
    star -> binop for '*'
    code -> slash for '/' do mark;
    slash -> binop for '/'
    code -> dash for '-' do mark;
    dash -> code for '>' do emit(RARROW);
    code -> binop for '+' or '%' or '&' or '|' or '^' do mark;
    code -> code for '~' do emit(OP, '~');

    # Numeric literals

    # Python admits a large variety of numeric literals, and the handling of
    # various constructs is a bit inconsistent. For instance, prefixed zeroes are
    # not allowed in front of integer numerals (unless all digits are between 0
    # and 7, in which case it is treated as an octal number), but _are_ allowed if
    # there is some other context that makes it a float or complex number. Thus,
    # `09` is invalid, but `09.` and `09j` are valid. This means we have to be
    # very careful in what we commit to in our tokenization, hence the rather
    # complicated construction below.

    code -> int for one_to_nine do mark;
    int -> int for digits
    zero or zero_int or binary or octal or int or hex -> code for 'l' or 'L' do emit(NUMBER);
    int -> int_sep for '_'
    int_sep -> int for digits
    int_sep -> error for * do emit(ERRORTOKEN);
    code -> zero for '0' do mark;
    zero -> zero_int for digits
    zero -> zero_int_sep for '_'
    zero_int -> zero_int for digits
    zero_int -> zero_int_sep for '_'
    zero_int_sep -> zero_int for digits
    zero_int_sep -> error for * do emit(ERRORTOKEN);
    zero -> octal for 'o' or 'O'
    octal -> octal for oct_digits
    octal -> octal_sep for '_'
    octal_sep -> octal for oct_digits
    octal_sep -> error for * do emit(ERRORTOKEN);
    zero or octal or hex or binary -> code for * do pushback; emit(NUMBER);
    zero -> binary for 'b' or 'B'
    binary -> binary for '0' or '1'
    binary -> binary_sep for '_'
    binary_sep -> binary for '0' or '1'
    binary_sep -> error for * do emit(ERRORTOKEN);
    zero -> hex for 'x' or 'X'
    hex -> hex for hex_digits
    hex -> hex_sep for '_'
    hex_sep -> hex for hex_digits
    hex_sep -> error for * do emit(ERRORTOKEN);
    zero or zero_int or int -> int_dot for '.'
    zero_int or int -> code for * do pushback; emit(NUMBER);
    int_dot or float -> float for digits
    float -> float_sep for '_'
    float_sep -> float for digits
    float_sep -> error for * do emit(ERRORTOKEN);
    int_dot -> code for * do pushback; emit(NUMBER);
    float or zero or zero_int or int or int_dot -> float_e for 'e'
    float or zero or zero_int or int or int_dot -> float_E for 'E'
    # `1 if 1else 0` is valid syntax, so we cannot assume 'e' always indicates a float.
    float_e -> code for 'l' do pushback; pushback; emit(NUMBER);
    float_e or float_E -> float_E for '+' or '-'
    float_e or float_E or float_x -> float_x for digits
    float_x -> float_x_sep for '_'
    float_x_sep -> float_x for digits
    float_x_sep -> error for * do emit(ERRORTOKEN);
    float or float_x -> code for * do pushback; emit(NUMBER);

    # Identifiers (e.g. names and keywords)
    code -> in_identifier for IDENTIFIER do mark;
    in_identifier -> in_identifier for IDENTIFIER or digits or IDENTIFIER_CONTINUE
    code -> dollar_name for '$' do mark;
    dollar_name -> dollar_name for IDENTIFIER or digits or IDENTIFIER_CONTINUE
    code -> in_identifier for '_' do mark;
    in_identifier -> in_identifier for '_'
    in_identifier -> code for * do pushback; emit(NAME);
    dollar_name -> code for * do pushback; emit(DOLLARNAME);

    # Comments
    code -> line_end_comment for '#' do mark;
    line_end_comment -> code for feed do pushback; emit(COMMENT);
    line_end_comment -> line_end_comment for *
    comment -> whitespace_line for feed do pushback; emit(COMMENT);
    comment -> comment for *
    code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
    whitespace_line -> whitespace_line for SPACE or '\t' or '\f'
    whitespace_line -> whitespace_line for feed do newline;
    whitespace_line -> code for * do emit_indent;
    whitespace_line -> comment for '#' do mark;
    code -> code for SPACE or '\t'

    # Line continuations and error states.
    code or float_e or float_E -> error for * do emit(ERRORTOKEN);
    code -> pending_continuation for '\\'
    pending_continuation -> line_continuation for feed do newline;
    line_continuation -> code for * do pushback; mark;
    pending_continuation -> error for * do emit(ERRORTOKEN);
    error -> code for * do pushback;
    code -> code for * do mark; emit(ERRORTOKEN);
    zero or int_dot or zero_int or int or float or float_x -> code for 'j' or 'J' do emit(NUMBER);
}

table paren(default) {
    code -> code for feed do mark; newline;
    code -> code for ')' do emit(RPAR, ")"); pop;
    code -> code for ']' do emit(RSQB, "]"); pop;
    code -> code for '}' do emit(RBRACE, "}"); pop;
}

#String starting with '
table string_s(default) {
    instring -> code for "'" do pop; emit(STRING);
    instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}

#String starting with "
table string_d(default) {
    instring -> code for '"' do pop; emit(STRING);
    instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}

#String starting with '''
table string_sss(default) {
    instring -> string_x for "'"
    instring -> instring for feed do newline;
    string_x -> string_xx for "'"
    string_x -> instring for feed do newline;
    string_x -> instring for * do pushback;
    string_xx -> code for "'" do pop; emit(STRING);
    string_xx -> instring for feed do newline;
    string_xx -> instring for * do pushback;
}

#String starting with """
table string_ddd(default) {
    instring -> string_x for '"'
    instring -> instring for feed do newline;
    string_x -> string_xx for '"'
    string_x -> instring for feed do newline;
    string_x -> instring for * do pushback;
    string_xx -> code for '"' do pop; emit(STRING);
    string_xx -> instring for feed do newline;
    string_xx -> instring for * do pushback;
}

#F-string part common to all fstrings
table fstring_sdsssddd(default) {
    instring -> brace for '{'

    escape -> brace for '{'

    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_MID); push(fstring_expr);
}

#F-string part common to ' and "
table fstring_sd(fstring_sdsssddd) {
    instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}

#F-string start for string starting with '
table fstring_start_s(fstring_sd) {
    instring -> code for "'" do pop; emit(STRING);

    # If this rule is removed or moved to a higher table, the QL tests start failing for unclear reasons.
    # It's identical to a rule in default.
    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_s); push(fstring_expr);
}

#F-string part for string starting with '
table fstring_s(fstring_sd) {
    instring -> code for "'" do pop; emit(FSTRING_END);
}

#F-string start for string starting with "
table fstring_start_d(fstring_sd) {
    instring -> code for '"' do pop; emit(STRING);

    # If this rule is removed or moved to a higher table, the QL tests start failing for unclear reasons.
    # It's identical to a rule in fstring_sdsssddd.
    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_d); push(fstring_expr);
}

#F-string part for string starting with "
table fstring_d(fstring_sd) {
    instring -> code for '"' do pop; emit(FSTRING_END);
}

#F-string part common to ''' and """
table fstring_sssddd(fstring_sdsssddd) {
    instring -> instring for feed do newline;

    string_x -> instring for feed do newline;
    string_x -> instring for * do pushback;

    string_xx -> instring for feed do newline;
    string_xx -> instring for * do pushback;
}

#F-string start for string starting with '''
table fstring_start_sss(fstring_sssddd) {
    instring -> string_x for "'"

    string_x -> string_xx for "'"

    string_xx -> code for "'" do pop; emit(STRING);

    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_sss); push(fstring_expr);
}

#F-string part for string starting with '''
table fstring_sss(fstring_sssddd) {
    instring -> string_x for "'"

    string_x -> string_xx for "'"

    string_xx -> code for "'" do pop; emit(FSTRING_END);
}

#F-string start for string starting with """
table fstring_start_ddd(fstring_sssddd) {
    instring -> string_x for '"'

    string_x -> string_xx for '"'

    string_xx -> code for '"' do pop; emit(STRING);

    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_ddd); push(fstring_expr);
}

#F-string part for string starting with """
table fstring_ddd(fstring_sssddd) {
    instring -> string_x for '"'

    string_x -> string_xx for '"'

    string_xx -> code for '"' do pop; emit(FSTRING_END);
}

#Expression within an f-string
table fstring_expr(paren) {
    code -> instring for '}' do pop; mark;
    code -> instring for ':' do emit(COLON); push(format_specifier);
    instring -> instring for '}' do pop; mark;
}

fspec_type = 'b' or 'c' or 'd' or 'e' or 'E' or 'f' or 'F' or 'g' or 'G' or 'n' or 'o' or 's' or 'x' or 'X' or '%'
fspec_align = '<' or '>' or '=' or '^'
fspec_sign = '+' or '-' or ' '

table format_specifier(default) {
    instring -> code for '{' do emit(FSTRING_SPEC);
    instring -> instring for '}' do pushback; emit(FSTRING_SPEC); pop;

    code -> instring for '}' do mark;
}


#Special state for when dedents are pending.
table pending_dedent(default) {
    code -> code for * do pop; emit_indent;
}

start: default
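The numeric-literal tables above defer committing to a token kind until a suffix is seen: per the file's own comment, `09` is invalid but `09j` is a valid complex literal. A toy dict-based DFA, with states loosely modeled on the `zero`/`zero_int` states (hand-written for illustration, not generated from the file):

```python
def kind(ch):
    # Character class, analogous to the generated class table.
    if ch == "0":
        return "0"
    if ch.isdigit():
        return "digit"
    if ch in "jJ":
        return "j"
    return "other"

# (state, class) -> next state; anything else falls into "error".
TABLE = {
    ("start", "0"): "zero",
    ("zero", "0"): "zero_int",
    ("zero", "digit"): "zero_int",
    ("zero_int", "0"): "zero_int",
    ("zero_int", "digit"): "zero_int",
    ("zero", "j"): "number",
    ("zero_int", "j"): "number",
}

def scan_number(text):
    """Classify a zero-prefixed numeral: NUMBER if legal, else ERRORTOKEN."""
    state = "start"
    for ch in text:
        state = TABLE.get((state, kind(ch)), "error")
    # "zero" alone is the literal 0; a bare zero-prefixed run is an error.
    return "NUMBER" if state in ("number", "zero") else "ERRORTOKEN"

print(scan_number("09j"), scan_number("09"))  # → NUMBER ERRORTOKEN
```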
25
python/extractor/tokenizer_generator/test.py
Normal file
@@ -0,0 +1,25 @@
from . import test_tokenizer
import sys
from blib2to3.pgen2.token import tok_name

def printtoken(type, token, start, end): # for testing
    token_range = "%d,%d-%d,%d:" % (start + end)
    print("%-20s%-15s%r" %
        (token_range, tok_name[type], token)
    )


def main():
    verbose = sys.argv[1] == "-v"
    if verbose:
        inputfile = sys.argv[2]
    else:
        inputfile = sys.argv[1]
    with open(inputfile, "r") as input:
        t = test_tokenizer.Tokenizer(input.read()+"\n")
        for tkn in t.tokens(verbose):
            printtoken(*tkn)

if __name__ == "__main__":
    main()
172
python/extractor/tokenizer_generator/tokenizer_template.py
Normal file
@@ -0,0 +1,172 @@
'''
Lookup table based tokenizer with state popping and pushing capabilities.
The ability to push and pop state is required for handling parenthesised expressions,
indentation, and f-strings. We also use it for handling the different quotation mark types,
but it is not essential for that, merely convenient.
'''

# Used by encoding_from_source below.
import codecs
import re


class Tokenizer(object):

    def __init__(self, text):
        self.text = text
        self.index = 0
        self.line_start_index = 0
        self.token_start_index = 0
        self.token_start = 1, 0
        self.line = 1
        self.super_state = START_SUPER_STATE
        self.state_stack = []
        self.indents = [0]

    #ACTIONS-HERE

    def tokens(self, debug=False):
        text = self.text
        cls_table = CLASS_TABLE
        id_index = ID_INDEX
        id_chunks = ID_CHUNKS
        max_id = len(id_index)*256
        #ACTION_TABLE_HERE
        state = 0
        try:
            if debug:
                while True:
                    c = ord(text[self.index])
                    if c < 128:
                        cls = cls_table[c]
                    elif c >= max_id:
                        cls = ERROR_CLASS
                    else:
                        b = id_chunks[id_index[c>>8]][(c>>2)&63]
                        cls = (b>>((c&3)*2))&3
                    prev_state = state
                    print("char = '%s', state=%d, cls=%d" % (text[self.index], state, cls))
                    state, transition = action_table[self.super_state[state][cls]]
                    print("%s -> %s on %r in %s" % (prev_state, state, text[self.index], TRANSITION_STATE_NAMES[id(self.super_state)]))
                    if transition:
                        tkn = transition()
                        if tkn:
                            yield tkn
                    else:
                        self.index += 1
            else:
                while True:
                    c = ord(text[self.index])
                    if c < 128:
                        cls = cls_table[c]
                    elif c >= max_id:
                        cls = ERROR_CLASS
                    else:
                        b = id_chunks[id_index[c>>8]][(c>>2)&63]
                        cls = (b>>((c&3)*2))&3
                    state, transition = action_table[self.super_state[state][cls]]
                    if transition:
                        tkn = transition()
                        if tkn:
                            yield tkn
                    else:
                        self.index += 1
        except IndexError as ex:
            if self.index != len(text):
                #Reraise index error
                cls = cls_table[c]
                trans = self.super_state[state]
                action_index = trans[cls]
                action_table[action_index]
                # Not raised? Must have been raised in transition function.
                raise ex
            tkn = self.emit_indent()
            while tkn is not None:
                yield tkn
                tkn = self.emit_indent()
            end = self.line, self.index-self.line_start_index
            yield ENDMARKER, u"", self.token_start, end
            return

    def emit_indent(self):
        indent = 0
        index = self.line_start_index
        current = self.index
        here = self.line, current-self.line_start_index
        while index < current:
            if self.text[index] == ' ':
                indent += 1
            elif self.text[index] == '\t':
                indent = (indent+8) & -8
            elif self.text[index] == '\f':
                indent = 0
            else:
                #Unexpected state. Emit error token
                while len(self.indents) > 1:
                    self.indents.pop()
                result = ERRORTOKEN, self.text[self.token_start_index:self.index+1], self.token_start, here
                self.token_start = here
                self.line_start_index = self.index
                return result
            index += 1
        if indent == self.indents[-1]:
            self.token_start = here
            self.token_start_index = self.index
            return None
        elif indent > self.indents[-1]:
            self.indents.append(indent)
            start = self.line, 0
            result = INDENT, self.text[self.line_start_index:current], start, here
            self.token_start = here
            self.token_start_index = current
            return result
        else:
            self.indents.pop()
            if indent > self.indents[-1]:
                #Illegal indent
                result = ILLEGALINDENT, u"", here, here
            else:
                result = DEDENT, u"", here, here
            if indent < self.indents[-1]:
                #More dedents to do
                self.state_stack.append(self.super_state)
                self.super_state = PENDING_DEDENT
            self.token_start = here
            self.token_start_index = self.index
            return result


ENCODING_RE = re.compile(br'.*coding[:=]\s*([-\w.]+).*')
NEWLINE_BYTES = b'\n'

def encoding_from_source(source):
    'Returns encoding of source (bytes), plus source stripped of any BOM markers.'
    #Check for BOM
    if source.startswith(codecs.BOM_UTF8):
        return 'utf8', source[len(codecs.BOM_UTF8):]
    if source.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16be', source[len(codecs.BOM_UTF16_BE):]
    if source.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16le', source[len(codecs.BOM_UTF16_LE):]
    try:
        first_new_line = source.find(NEWLINE_BYTES)
        first_line = source[:first_new_line]
        second_new_line = source.find(NEWLINE_BYTES, first_new_line+1)
        second_line = source[first_new_line+1:second_new_line]
        match = ENCODING_RE.match(first_line) or ENCODING_RE.match(second_line)
        if match:
            ascii_encoding = match.groups()[0]
            encoding = ascii_encoding.decode("ascii")
            # Handle non-standard encodings that are recognised by the interpreter.
            if encoding.startswith("utf-8-"):
                encoding = "utf-8"
            elif encoding == "iso-latin-1":
                encoding = "iso-8859-1"
            elif encoding.startswith("latin-1-"):
                encoding = "iso-8859-1"
            elif encoding.startswith("iso-8859-1-"):
                encoding = "iso-8859-1"
            elif encoding.startswith("iso-latin-1-"):
                encoding = "iso-8859-1"
            return encoding, source
    except Exception as ex:
        print(ex)
        #Failed to determine encoding -- Just treat as default.
        pass
    return 'utf-8', source