Python: Copy Python extractor to codeql repo

This commit is contained in:
Taus
2024-02-28 15:15:21 +00:00
parent 297a17975d
commit 6dec323cfc
369 changed files with 165346 additions and 0 deletions

View File

@@ -0,0 +1,172 @@
# The Python tokenizer
This file describes the syntax and operational semantics of the state machine
that underlies our tokenizer.
## The state machine syntax
The state machine is described in a declarative fashion in the
`state_transition.txt` file. This file contains a sequence of declarations, as
described in the following subsections.
Additionally, lines may contain comments indicated using the `#` character, as
in Python itself.
In the remainder of the document, "identifier" means any sequence of characters
starting with a letter (`a-z` or `A-Z`) and followed by a sequence of letters,
digits, and/or underscores.
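This definition of "identifier" can be expressed as a quick Python check (the regex here is purely illustrative and is not part of the generator):

```python
import re

# A letter (a-z or A-Z), followed by letters, digits, and/or underscores.
IDENTIFIER_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")

def is_identifier(s):
    return IDENTIFIER_RE.match(s) is not None
```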
### Start declarations
This has the form `start: ` followed by the name of a table. It indicates
which table is used as the starting point for the tokenization.
There should be exactly one of these declarations in the file.
### Alias declarations
These have the form
```
identifier = id_or_char or id_or_char or ...
```
where `id_or_char` is either a single character surrounded by single quotes
(e.g. `'a'`) or an identifier defined in another alias declaration.
Thus, aliases define _sets_ of characters: single-quoted characters representing
singleton sets, and `or` being set union.
>Note: A few character classes are predefined:
> - `ERROR` representing the error state of the state machine,
> - `IDENTIFIER` representing characters that can appear at the start of
> a Unicode identifier, and
> - `IDENTIFIER_CONTINUE` representing characters that can appear
> within a Unicode identifier.
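As a sketch of the semantics, aliases correspond directly to Python sets: a quoted character is a singleton set and `or` is set union. The names below mirror declarations that appear in `state_transition.txt`:

```python
# Illustrative Python equivalents of alias declarations.
one_to_nine = set("123456789")    # '1' or '2' or ... or '9'
digits = {"0"} | one_to_nine      # digits = '0' or one_to_nine
feed = {"\n"} | {"\r"}            # feed = '\n' or '\r'
```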
### Table declarations
These have the form
```
table header {
state_transition
state_transition
...
}
```
where `header` is either an identifier or an identifier followed by another
identifier surrounded by parentheses. The latter implements a form of
"inheritance" between tables, and is explained in a later section.
The format of `state_transition`s is described in the next subsection.
### State transitions
Each state transition has the following form:
```
set_of_before_states -> after_state for set_of_characters optional_actions
```
Here, `set_of_before_states` is either a single identifier or a list of identifiers
with `or`s interspersed (mimicking the way sets of characters are specified) and
`after_state` is an identifier. These identifiers do not have to be declared
separately — they are implicitly declared when used.
>Note: A special state `0` (in the table indicated by the `start: `
>declaration) represents the starting state for the entire tokenization.
The `set_of_characters` can either be
- the identifier corresponding to an alias,
- a single character (e.g. `'a'`),
- a list of sets of characters with `or`s interspersed, or
- an asterisk `*` representing _all_ characters that do not already have a
transition defined for the set of "before" states.
After the state transition is an optional list of actions, described next.
### Actions
Actions are specified using the keyword `do`. After this keyword, one or more
actions may be specified, each terminated with `;`, e.g.
```
foo -> bar for 'a' do action1; action2;
```
As the actions are very operational in nature, they will be described when we go
into the operational semantics of the state machine.
## Informal operational semantics
>Note: What follows is not based on a reading of the source code, but on
>experience with working on and modifying the state machine. There may be
>significant inaccuracies.
At a high level, the purpose of the tokenizer is to partition the given input
into a sequence of strings representing tokens. The decision of where to place the
boundaries between these strings is made on a character-by-character basis. To
mark the start of a token, the action `mark` is used. Note that the mark is
placed _before_ the character that caused the action to be executed. That is, in
the following transition rule
```
foo -> bar for 'a' do mark;
```
the mark is placed _before_ the `a`.
Once the end of a token has been reached, the `emit` action is used. This
creates a token from the part of the input spanning from the most recent `mark`
up to (and including) the character that caused the transition to which the
`emit` action is attached.
As an example, consider the following state machine that splits a sequence of
zeroes and ones into tokens consisting of (maximal) runs of each character:
```
start: default
table default {
# This is essentially just an unconditional state transition.
0 -> zero_or_one for * do pushback;
zero_or_one -> zeros for '0' do mark;
zero_or_one -> ones for '1' do mark;
zeros -> zeros for '0'
zeros -> zero_or_one for * do pushback; emit(ZEROS);
ones -> zero_or_one for * do pushback; emit(ONES);
ones -> ones for '1'
}
```
The `pushback` action has the effect of "pushing back" the current character.
(In reality, all this does is move the pointer to the current character one step
back. It is thus not a problem to have several pushbacks in a row.)
> Note: The order in which the transition rules for a state are specified does
> not matter. Even if the `*` transition is listed first, as with `ones` above, it
> does not take precedence over other more specific character sets.
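A minimal interpreter for the machine above can make the `mark`, `emit`, and `pushback` semantics concrete. This is an illustrative sketch only; the table layout and names are invented here and do not reflect the generated tokenizer's actual representation:

```python
def tokenize(text):
    # (state, char) -> (new_state, [actions]); "*" is the fallback transition.
    table = {
        ("start", "*"): ("zero_or_one", ["pushback"]),
        ("zero_or_one", "0"): ("zeros", ["mark"]),
        ("zero_or_one", "1"): ("ones", ["mark"]),
        ("zeros", "0"): ("zeros", []),
        ("zeros", "*"): ("zero_or_one", ["pushback", ("emit", "ZEROS")]),
        ("ones", "1"): ("ones", []),
        ("ones", "*"): ("zero_or_one", ["pushback", ("emit", "ONES")]),
    }
    text += "\x00"  # sentinel so the final run is emitted
    state, i, start, tokens = "start", 0, 0, []
    while i < len(text):
        c = text[i]
        key = (state, c) if (state, c) in table else (state, "*")
        if key not in table:
            break  # no transition: tokenization is finished
        state, actions = table[key]
        for action in actions:
            if action == "mark":
                start = i      # the mark sits before this character
            elif action == "pushback":
                i -= 1         # re-examine the current character
            else:              # ("emit", KIND)
                tokens.append((action[1], text[start:i + 1]))
        i += 1
    return tokens
```

Note that in `zeros -> zero_or_one for * do pushback; emit(ZEROS);` the `pushback` runs first, so the character that ended the run is excluded from the emitted token.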
After tokenizing a string with the above grammar, the result will be a sequence
of `ZEROS` and `ONES` tokens. Each of these will have three pieces of data
associated with it: the starting point (line and column), the end point (also
line and column), and the characters that make up the token. Note that `emit`
accepts a second argument (which must be a string) as well. For example, the
transition for code when reaching a newline is:
```
feed = '\r' or '\n'
...
code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
```
This has the effect of normalizing end of line characters to be `\n`.
>Note: The replacement text may have a different length than the distance to the
>most recent `mark`. This may not be desirable.
The above snippet introduces another action: `newline`. This has the effect of
resetting the column counter to zero and incrementing the line counter.
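The bookkeeping behind `newline` can be sketched as follows. The attribute names echo those used in the generated action code, but this class itself is a hypothetical illustration:

```python
class Position:
    """Illustrative sketch of the tokenizer's line/column bookkeeping."""

    def __init__(self):
        self.index = 0             # absolute offset into the input
        self.line = 0
        self.line_start_index = 0  # offset of the first character on this line

    def newline(self):
        # The newline action: the next character starts a new line, so the
        # column (index - line_start_index) resets to zero.
        self.line_start_index = self.index + 1
        self.line += 1

    @property
    def column(self):
        return self.index - self.line_start_index
```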
>Note: There are some peculiarities about newlines, and the tokenizer will get
>confused if they are not handled through the `newline` action.
The last two actions have to do with maintaining a stack of parsing tables. At
all points, the behavior of the tokenizer is governed by the table that is on
top of the stack. The `push` action pushes the specified table (given as an
argument) on top of this stack. Naturally, the `pop` action does the opposite,
discarding the top element.
This leaves the final point of interest: what decides which transitions are
"active" at a given point?
The way this functions is essentially like method dispatch in Python (though
thankfully there is no multiple inheritance). Thus, given the current state and
the current character, we first look in the table on top of the stack. If this
table does not have a transition for the given state and character, we next look
at the table it inherits from, and so forth.
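This lookup-with-fallback can be sketched with a hypothetical dict-based representation of tables, where each table carries a link to its parent:

```python
def lookup(table, state, char):
    # Walk the inheritance chain of the table on top of the stack,
    # much like attribute lookup on a single-inheritance class hierarchy.
    while table is not None:
        if (state, char) in table["transitions"]:
            return table["transitions"][(state, char)]
        table = table["parent"]
    return None  # no transition anywhere in the chain

# Usage: the child table shadows nothing here, so 'a' falls through to base.
base = {"parent": None, "transitions": {("code", "a"): "in_identifier"}}
child = {"parent": base, "transitions": {("code", "#"): "comment"}}
```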

View File

@@ -0,0 +1,155 @@
import unicodedata
from . import machine
class SuperState:
def __init__(self, name, mapping):
self.name = name
self.mapping = mapping
def as_list_of_bytes(self):
lst = dict_to_list(self.mapping)
return [ table.as_bytes() for table in lst ]
def as_list_of_transitions(self):
return dict_to_list(self.mapping)
action_id = 0
all_actions = {}
class ActionList:
def __init__(self, actions, id):
self.actions = actions
self.id = id
@staticmethod
def get(actions):
global action_id
assert isinstance(actions, tuple)
if actions not in all_actions:
all_actions[actions] = ActionList(actions, action_id)
action_id += 1
return all_actions[actions]
@staticmethod
def listall():
return sorted(all_actions.values(), key = lambda al: al.id)
next_pair_id = 0
pairs = {}
class StateActionListPair:
def __init__(self, state, actionlist, id):
self.state = state
self.actionlist = actionlist
self.id = id
@staticmethod
def get(state, actionlist):
global next_pair_id
if actionlist is not None and not isinstance(actionlist, ActionList):
actionlist = ActionList.get(actionlist)
if (state, actionlist) not in pairs:
pairs[(state, actionlist)] = StateActionListPair(state, actionlist, next_pair_id)
next_pair_id += 1
return pairs[(state, actionlist)]
@staticmethod
def listall():
return sorted(pairs.values(), key = lambda pair: pair.id)
next_table_id = 0
table_ids = {}
class StateTransitionTable:
def __init__(self, mapping):
self.mapping = mapping
def as_bytes(self):
lst = dict_to_list(self.mapping)
return bytes(pair.id for pair in lst)
def __getitem__(self, key):
return self.mapping[key]
@property
def id(self):
global next_table_id
b = self.as_bytes()
if b not in table_ids:
table_ids[b] = next_table_id
next_table_id += 1
return table_ids[b]
def dict_to_list(mapping):
assert isinstance(mapping, dict)
result = []
for key, value in mapping.items():
while key.id >= len(result):
result.append(None)
result[key.id] = value
return result
#Each character is one of id-start, id-continuation or other. Represent "other" as ERROR for all non-ascii characters.
#See https://www.python.org/dev/peps/pep-3131 for an explanation of what is an identifier.
OTHER_START = {0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C}
OTHER_CONTINUE = {0x00B7, 0x0387, 0x19DA}
OTHER_CONTINUE.update(range(0x1369, 0x1372))
ID_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONT_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}
CHUNK_SIZE = 64
class IdentifierTable:
def __init__(self):
classes = []
for i in range(0x110000):
try:
c = chr(i)
except ValueError:
continue
cat = unicodedata.category(c)
if cat in ID_CATEGORIES or i in OTHER_START:
cls = machine.IDENTIFIER_CLASS.id
elif cat in CONT_CATEGORIES or i in OTHER_CONTINUE:
cls = machine.IDENTIFIER_CONTINUE_CLASS.id
else:
cls = machine.ERROR_CLASS.id
assert cls in (0,1,2,3)
classes.append(cls)
result = []
for i, cls in enumerate(classes):
byte, bits = i>>2, cls<<((i&3)*2)
while byte >= len(result):
result.append(0)
result[byte] |= bits
while result[-1] == 0:
result.pop()
while len(result) % CHUNK_SIZE:
result.append(0)
self.table = result
def as_bytes(self):
return bytes(self.table)
def as_two_level_table(self):
index = []
chunks = {}
next_id = 0
the_bytes = self.as_bytes()
for n in range(0, len(the_bytes), CHUNK_SIZE):
chunk = the_bytes[n:n+CHUNK_SIZE]
if chunk in chunks:
index.append(chunks[chunk])
else:
index.append(next_id)
chunks[chunk] = next_id
next_id += 1
chunks = [ chunk for (i, chunk) in sorted((i, chunk) for chunk, i in chunks.items())]
return chunks, index

View File

@@ -0,0 +1,225 @@
'''
Generate a state-machine based tokenizer from a state transition description and a template.
Parses the state transition description to compute a set of transition tables.
Each table maps (state, character-class) pairs to (state, action) pairs.
During tokenization each input character is converted to a class, then a new state and action is
looked up using the current state and character-class.
The generated tables are:
CLASS_TABLE:
Maps ASCII code points to character class.
ID_TABLE:
Maps all unicode points to one of Identifier, Identifier-continuation, or other.
The transition tables:
Each table maps each state to a per-class transition table.
Each per-class transition table maps each character-class to an index in the action table.
ACTION_TABLE:
Embedded in code as `action_table`; maps each index to a (state, action) pair.
Since the number of character-classes, states, and (state, action) pairs is small, everything is represented as
a byte, and tables as `bytes` objects for Python 3 or `array.array` objects for Python 2.
'''
from .parser import parse
from . import machine
from .compiled import StateActionListPair, IdentifierTable
def emit_id_bytes(id_table):
chunks, index = id_table.as_two_level_table()
print("# %d entries in ID index" % len(index))
index_bytes = bytes(index)
print("ID_INDEX = toarray(")
for n in range(0, len(index_bytes), 32):
print(" %r" % index_bytes[n:n+32])
print(")")
print("ID_CHUNKS = (")
for chunk in chunks:
print(" toarray(%r)," % chunk)
print(")")
def emit_transition_table(table, verbose=False):
print("%s = (" % table.name.upper(), end="")
for trans in table.as_list_of_transitions():
print("B%02d," % trans.id, end=" ")
print(")")
emitted_rows = set()
def emit_rows(table):
for trans in table.as_list_of_transitions():
id = trans.id
if id in emitted_rows:
continue
emitted_rows.add(id)
print("B%02d = toarray(%r)" % (id, trans.as_bytes()))
action_names = {}
next_action_id = 0
def get_action_id(action):
global next_action_id
assert action is not None
if action in action_names:
return action_names[action]
result = next_action_id
next_action_id += 1
action_names[action] = result
return result
def emit_actions(table, indent=""):
for pair in table:
if pair.actionlist is None:
continue
action = pair.actionlist
get_action_function(action, indent)
def generate_action_table(table, indent):
result = []
result.append(indent + "action_table = [\n " + indent)
for i, pair in enumerate(table):
if pair.actionlist is None:
result.append("(%d, None), " % pair.state.id)
else:
result.append("(%d, self.action_%s), " % (pair.state.id, pair.actionlist.id))
if (i & 3) == 3:
result.append("\n " + indent)
result.append("\n" + indent + "]")
return "".join(result)
action_functions = set()
def get_action_function(actionlist, indent=""):
if actionlist in action_functions:
return
action_functions.add(actionlist)
last = actionlist.actions[-1]
print(indent + "def action_%d(self):" % actionlist.id)
emit = False
for action in actionlist.actions:
if action is machine.PUSHBACK:
print(indent + " self.index -= 1")
continue
elif action is machine.POP:
print(indent + " self.super_state = self.state_stack.pop()")
elif isinstance(action, machine.Push):
print(indent + " self.state_stack.append(self.super_state)")
print(indent + " self.super_state = %s" % action.state.name.upper())
elif action is machine.MARK:
print(indent + " self.token_start_index = self.index")
print(indent + " self.token_start = self.line, self.index-self.line_start_index")
elif isinstance(action, machine.Emit):
emit = True
print(indent + " end = self.line, self.index-self.line_start_index+1")
if action.text is None:
print(indent + " result = [%s, self.text[self.token_start_index:self.index+1], self.token_start, end]" % action.kind)
else:
print(indent + " result = [%s, u%s, (self.line, self.index-self.line_start_index), end]" % (action.kind, action.text))
print(indent + " self.token_start = end")
print(indent + " self.token_start_index = self.index+1")
elif action is machine.NEWLINE:
print(indent + " self.line_start_index = self.index+1")
print(indent + " self.line += 1")
elif action is machine.EMIT_INDENT:
assert action is last
print(indent + " return self.emit_indent()")
print()
return
else:
assert False, "Unexpected action: %s" % action
print(indent + " self.index += 1")
if emit:
print(indent + " return result")
else:
print(indent + " return None")
print()
return
def emit_char_classes(char_classes, verbose=False):
for cls in sorted(set(char_classes.values()), key=lambda x : x.id):
print("#%d = %r" % (cls.id, cls))
table = [None] * 128
by_id = {
machine.IDENTIFIER_CLASS.id : machine.IDENTIFIER_CLASS,
machine.IDENTIFIER_CONTINUE_CLASS.id : machine.IDENTIFIER_CONTINUE_CLASS,
machine.ERROR_CLASS.id : machine.ERROR_CLASS
}
for c, cls in char_classes.items():
by_id[cls.id] = cls
if c is machine.IDENTIFIER or c is machine.IDENTIFIER_CONTINUE:
continue
table[ord(c)] = cls.id
by_id[cls.id] = cls
for i in range(128):
assert table[i] is not None
bytes_table = bytes(table)
if verbose:
print("# Class Table")
for i in range(len(bytes_table)):
b = bytes_table[i]
print("# %r -> %s" % (chr(i), by_id[b]))
print("CLASS_TABLE = toarray(%r)" % bytes_table)
PREFACE = """
import codecs
import re
import sys
from blib2to3.pgen2.token import *
if sys.version < '3':
from array import array
def toarray(b):
return array('B', b)
else:
def toarray(b):
return b
"""
def main():
verbose = False
import sys
if len(sys.argv) != 3:
print("Usage: %s DESCRIPTION TEMPLATE" % sys.argv[0])
sys.exit(1)
descriptionfile = sys.argv[1]
with open(descriptionfile) as fd:
m = machine.Machine.load(fd.read())
templatefile = sys.argv[2]
with open(templatefile) as fd:
template = fd.read()
print("# This file is AUTO-GENERATED. DO NOT MODIFY")
print('# To regenerate: run "python3 -m tokenizer_generator.gen_state_machine %s %s"' % (descriptionfile, templatefile))
print(PREFACE)
print("IDENTIFIER_CLASS = %d" % machine.IDENTIFIER_CLASS.id)
print("IDENTIFIER_CONTINUE_CLASS = %d" % machine.IDENTIFIER_CONTINUE_CLASS.id)
print("ERROR_CLASS = %d" % machine.ERROR_CLASS.id)
emit_id_bytes(IdentifierTable())
char_classes = m.get_classes()
emit_char_classes(char_classes, verbose)
print()
tables = [state.compile(char_classes) for state in m.states.values() ]
for table in tables:
emit_rows(table)
print()
for table in tables:
#pprint(table)
emit_transition_table(table, verbose)
print()
print("TRANSITION_STATE_NAMES = {")
for state in m.states.values():
print(" id(%s): '%s'," % (state.name.upper(), state.name))
print("}")
print("START_SUPER_STATE = %s" % m.start.name.upper())
prefix, suffix = template.split("#ACTIONS-HERE")
print(prefix)
actions = StateActionListPair.listall()
emit_actions(actions, " ")
action_table = generate_action_table(actions, " ")
print(suffix.replace("#ACTION_TABLE_HERE", action_table))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,485 @@
import ast
from .parser import parse
from collections import defaultdict
from .compiled import SuperState, StateTransitionTable, StateActionListPair
class Transition:
def __init__(self, from_state, to_state, what, do):
assert isinstance(from_state, State)
assert isinstance(to_state, State)
self.from_state = from_state
self.what = what
if not do:
do = None
else:
assert isinstance(do, list)
for item in do:
assert isinstance(item, Action)
do = tuple(do)
self.action = StateActionListPair.get(to_state, do)
def dump(self):
if self.action.actionlist:
return "%s -> %s for %s do %s" % (
self.from_state,
self.action.state,
self.what,
"; ".join(str(do) for do in self.action.actionlist.actions)
)
else:
return "%s -> %s for %s" % (
self.from_state,
self.action.state,
self.what
)
next_state_id = 1
states = {}
class State:
def __init__(self, name):
global next_state_id
if name.isdigit():
assert name == "0"
self.id = 0
self.name = "START"
else:
self.name = name
self.id = next_state_id
next_state_id += 1
@staticmethod
def get(name):
if name not in states:
states[name] = State(name)
return states[name]
@staticmethod
def count():
return len(states)
def __repr__(self):
return "state_%s(%s)" % (self.id, self.name)
@staticmethod
def from_id(id):
for state in states.values():
if state.id == id:
return state
raise ValueError(id)
State.get("0")
ERROR_ACTION = StateActionListPair.get(State.get("error"), None)
next_super_state_id = 0
super_states = {}
class TransitionTable:
def __init__(self, name):
global next_super_state_id
self.name = name
self.id = next_super_state_id
next_super_state_id += 1
self.parent = None
self.transitions = []
self._table = None
def add_transition(self, trans):
self.transitions.append(trans)
def dump(self):
if self.parent:
lines = [ "TransitionTable %s(%s extends %s)" % (self.id, self.name, self.parent.name) ]
else:
lines = [ "TransitionTable %s(%s):" % (self.id, self.name) ]
lines.extend(" " + t.dump() for t in self.transitions)
return "\n".join(lines)
@staticmethod
def get(name, parent=None):
if name not in super_states:
super_states[name] = TransitionTable(name)
return super_states[name]
@staticmethod
def count():
return len(super_states)
def get_table(self, character_classes):
'''Returns the transition table for all states in this super-state'''
if self._table is None:
from_transitions = defaultdict(list)
for t in self.transitions:
from_transitions[t.from_state].append(t)
self._table = { state: self.get_transition_table(state, from_transitions.get(state, ()), character_classes) for state in states.values() }
return self._table
def get_transition_table(self, state, transition_list, character_classes):
table = {}
if self.parent:
parent_table = self.parent.get_table(character_classes)
else:
parent_table = None
default = None
for t in transition_list:
assert state == t.from_state
if isinstance(t.what, Any):
default = t.action
continue
action = t.action
classes = set(character_classes[c] for c in t.what)
for cls in classes:
if cls in table:
raise ValueError("Duplicate transition from %s on %s" % (state, cls))
else:
table[cls] = action
on_identifier = table.get(IDENTIFIER_CLASS, None)
for cls in character_classes.values():
if cls in table:
continue
if on_identifier and cls.is_identifier:
table[cls] = on_identifier
elif default:
table[cls] = default
elif parent_table and state in parent_table:
table[cls] = parent_table[state][cls]
else:
table[cls] = ERROR_ACTION
return StateTransitionTable(table)
def compile(self, character_classes):
return SuperState(self.name, self.get_table(character_classes))
class Any:
def __repr__(self):
return "*"
class Action:
def __repr__(self):
return self.__class__.__name__.lower()
class Emit(Action):
def __init__(self, kind, text):
assert isinstance(kind, str)
assert kind.upper() == kind
self.kind = kind
self.text = text
def __repr__(self):
if self.text is None:
return "emit(" + self.kind + ")"
else:
return "emit(%s, %r)" % (self.kind, self.text)
def __eq__(self, other):
return type(other) is Emit and other.kind == self.kind and other.text == self.text
def __hash__(self):
return 353 ^ hash(self.kind) ^ hash(self.text)
class Push(Action):
def __init__(self, state):
assert isinstance(state, TransitionTable)
self.state = state
def __repr__(self):
return "push(%s)" % self.state.name
def __eq__(self, other):
return type(other) is Push and other.state == self.state
def __hash__(self):
return 59 ^ hash(self.state)
class EmitIndent(Action):
pass
EMIT_INDENT = EmitIndent()
class Pop(Action):
pass
POP = Pop()
class Pushback(Action):
pass
PUSHBACK = Pushback()
class Mark(Action):
pass
MARK = Mark()
class Newline(Action):
pass
NEWLINE = Newline()
class Identifier:
def __repr__(self):
return "UnicodeIdentifiers()"
IDENTIFIER = Identifier()
class IdentifierContinue:
def __repr__(self):
return "IdentifierContinue()"
IDENTIFIER_CONTINUE = IdentifierContinue()
next_char_class_id = 0
class CharacterClass:
def __init__(self, chars, is_identifier = None):
global next_char_class_id
self.chars = chars
self.id = next_char_class_id
next_char_class_id += 1
if is_identifier is None:
self.is_identifier = chars.copy().pop().isidentifier()
else:
self.is_identifier = is_identifier
def __repr__(self):
if self == IDENTIFIER_CLASS:
return "IDENTIFIER_CLASS(%d)" % self.id
elif self == ERROR_CLASS:
return "ERROR_CLASS(%d)" % self.id
else:
return "CharacterClass %s %r" % (self.id, sorted(self.chars))
ERROR_CLASS = CharacterClass(set(), False)
assert ERROR_CLASS.id == 0
IDENTIFIER_CLASS = CharacterClass(set(), True)
IDENTIFIER_CONTINUE_CLASS = CharacterClass(set(), False)
class Machine:
def __init__(self):
self.aliases = {}
self.states = {}
self.aliases["IDENTIFIER"] = IDENTIFIER
self.aliases["IDENTIFIER_CONTINUE"] = IDENTIFIER_CONTINUE
self.aliases['SPACE'] = {' '}
self.start = None
def add_state(self, name):
assert name not in self.states
result = TransitionTable.get(name)
self.states[name] = result
return result
def add_alias(self, name, choices):
assert name not in self.aliases
assert isinstance(choices, set), choices
self.aliases[name] = choices
def dump(self):
r = []
a = r.append
a("Starting super-state: %s" % self.start.name)
a("")
a("Aliases:")
for name_alias in self.aliases.items():
a(" %s = %r" % name_alias)
a("")
for name, state in self.states.items():
a(state.dump())
return "\n".join(r)
@staticmethod
def load(src):
tree = parse(src)
m = Machine()
w = Walker(m)
w.visit(tree)
return m
def get_classes(self):
'''Get the character classes for this machine'''
#There are two predefined classes: Unicode identifiers, and ERROR.
#A character class is a set of characters, such that the transitions
#and actions of the machine are identical for all characters in that class.
char_to_transitions = defaultdict(set)
for s in self.states.values():
for t in s.transitions:
w = t.what
if isinstance(w, Any):
continue
for c in w:
if c is IDENTIFIER or c is IDENTIFIER_CONTINUE:
continue
char_to_transitions[c].add((s, t.from_state, t.action))
equivalence_sets = defaultdict(set)
for c, transition_set in sorted(char_to_transitions.items()):
equivalence_sets[frozenset(transition_set)].add(c)
classes = {}
for char_set in sorted(equivalence_sets.values()):
charcls = CharacterClass(char_set)
for c in char_set:
classes[c] = charcls
classes[IDENTIFIER] = IDENTIFIER_CLASS
classes[IDENTIFIER_CONTINUE] = IDENTIFIER_CONTINUE_CLASS
for i in range(128):
c = chr(i)
if c not in classes:
if c.isidentifier():
classes[c] = IDENTIFIER_CLASS
elif c in "0123456789":
classes[c] = IDENTIFIER_CONTINUE_CLASS
else:
classes[c] = ERROR_CLASS
for cls in classes.values():
if cls is IDENTIFIER_CLASS or cls is IDENTIFIER_CONTINUE_CLASS or cls is ERROR_CLASS:
continue
assert { c for c in cls.chars if c.isidentifier() } == cls.chars or not { c for c in cls.chars if c.isidentifier() }
return classes
class Walker:
def __init__(self, machine):
self.machine = machine
def visit(self, node):
if hasattr(node, "type"):
tag = node.type
else:
tag = node.data
meth = getattr(self, "visit_" + tag, None)
if meth is None:
self.fail(node, tag)
else:
return meth(node)
def fail(self, node, tag):
print(node)
raise NotImplementedError(tag)
def visit_first_child(self, node):
assert len(node.children) == 1
return self.visit(node.children[0])
def visit_children(self, node):
return [ self.visit(child) for child in node.children ]
visit_start = visit_first_child
visit_machine = visit_children
visit_declaration = visit_first_child
def visit_alias_decl(self, node):
assert len(node.children) == 2
choice = self.visit(node.children[1])
self.machine.add_alias(node.children[0].value, choice)
def visit_alias(self, node):
return self.machine.aliases[node.children[0].value]
def visit_char(self, node):
c = ast.literal_eval(node.children[0].value)
assert isinstance(c, str), c
assert len(c) == 1, c
return c
def visit_choice(self, node):
#Convert choices into a set of characters
result = set()
for child in node.children:
item = self.visit(child)
if isinstance(item, set):
result.update(item)
else:
result.add(item)
return result
visit_item = visit_first_child
def visit_table_decl(self, node):
self.current_state = self.visit(node.children[0])
for transition in node.children[1:]:
self.visit(transition)
def visit_table_header(self, node):
name = node.children[0].value
state = self.machine.add_state(name)
if len(node.children) > 1:
base = TransitionTable.get(node.children[1].value)
state.parent = base
return state
def visit_transition(self, node):
# state_choice "->" state "for" (choice | "*") action_list?
from_states = self.visit(node.children[0])
to_state = self.visit(node.children[1])
what = self.visit(node.children[2])
if len(node.children) > 3:
do = self.visit(node.children[3])
else:
do = []
for state in from_states:
trans = Transition(state, to_state, what, do)
self.current_state.add_transition(trans)
visit_state_choice = visit_children
def visit_state(self, node):
return State.get(node.children[0].value)
def visit_any(self, node):
return Any()
visit_action_list = visit_children
visit_action = visit_first_child
def visit_emit(self, node):
if len(node.children) == 2:
return Emit(node.children[0].value, self.visit(node.children[1]))
else:
return Emit(node.children[0].value, None)
def visit_optional_text(self, node):
return node.children[0].value
def visit_push(self, node):
state = TransitionTable.get(node.children[0].value)
return Push(state)
def visit_emit_indent(self, node):
return EMIT_INDENT
def visit_pushback(self, node):
return PUSHBACK
def visit_pop(self, node):
return POP
def visit_mark(self, node):
return MARK
def visit_newline(self, node):
return NEWLINE
def visit_start_decl(self, node):
self.machine.start = TransitionTable.get(node.children[0].value)
def main():
import sys
file = sys.argv[1]
with open(file) as fd:
tree = parse(fd.read())
m = Machine()
w = Walker(m)
w.visit(tree)
print(m.dump())
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,92 @@
'''
Explanation of the syntax
start_decl: The starting transition table
alias_decl: Declare a shorthand, e.g. digits = '0' or '1' or ...
table_decl: Declare transition table: name and list of transitions.
transition: Transitions from one state to another. From is: state (or choice of states) -> new-state for possible-characters [ do action or actions; ]
action: Actions are:
"emit(kind [, text])": emits a token of kind using the given text or the text from the stream. The token starts at the last mark and ends at the current location.
"push(table)": pushes a transition table to the stack.
"pop" : pops a transition table from the stack.
"pushback": pushes the last character back to the stream.
"mark": marks the current location as the start of the next token.
"emit_indent": Emits zero or more INDENT or DEDENT tokens depending on current indentation.
"newline": Increments the line number and sets the column offset back to zero.
States:
All states are given names.
The state "0" is the start state and always exists.
All other states are implicitly defined when used (this is for Python after all :)
'*' means all states for which a transition is not explicitly defined.
So the transitions:
0 -> end for '\n'
0 -> other for *
0 -> a_b for 'a' or 'b'
mean that '0' will transition to 'other' for all characters other than 'a', 'b' and `\n`.
The order of transitions in the state machine description is irrelevant.
'''
grammar = r"""
start : machine
machine : declaration+
declaration : alias_decl | table_decl | start_decl
start_decl : "start" ":" IDENTIFIER
table_decl : table_header "{" transition+ "}"
table_header : "table" IDENTIFIER ( "(" IDENTIFIER ")" )?
alias_decl : IDENTIFIER "=" choice
choice : item ( "or" item)*
item : alias | char
alias : IDENTIFIER
char : LITERAL
transition : state_choice "->" state "for" (choice | any) action_list?
any : "*"
state_choice : state ( "or" state)*
state : IDENTIFIER | DIGIT
action_list : "do" action ";" (action ";")*
action : emit | pop | push | pushback | mark | emit_indent | newline
emit : "emit" "(" IDENTIFIER optional_text? ")"
optional_text : "," LITERAL
pop : "pop"
push : "push" "(" IDENTIFIER ")"
pushback : "pushback"
mark : "mark"
emit_indent : "emit_indent"
newline : "newline"
LITERAL : ("\"" /[^"]/* "\"") | ("'" /[^']/* "'")
IDENTIFIER : LETTER ( LETTER | DIGIT | "_" )*
LETTER : "A".."Z" | "a".."z"
DIGIT : "0".."9"
WHITESPACE : (" " | "\t" | "\r" | "\n")+
%import common.NEWLINE
COMMENT : "#" /(.)*/ NEWLINE
%ignore WHITESPACE
%ignore COMMENT
"""
from lark import Lark
class Parser(Lark):
def __init__(self):
Lark.__init__(self, grammar, parser="earley", lexer="standard")
def parse(src):
parser = Parser()
return parser.parse(src)
def main():
import sys
file = sys.argv[1]
with open(file) as fd:
tree = parse(fd.read())
print(tree.pretty())
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,385 @@
# State machine specification for unified Python tokenizer
# Handles all tokens for all versions of Python, including partial string tokens for handling f-strings.
# Starting transition table is "default" and starting state is "0"
#
#
#declarations
prefix_chars = 'u' or 'U' or 'b' or 'B' or 'r' or 'R'
one_to_nine = '1' or '2' or '3' or '4' or '5' or '6' or '7' or '8' or '9'
digits = '0' or one_to_nine
oct_digits = '0' or '1' or '2' or '3' or '4' or '5' or '6' or '7'
hex_digits = digits or 'a' or 'A' or 'b' or 'B' or 'c' or 'C' or 'd' or 'D' or 'e' or 'E' or 'f' or 'F'
feed = '\n' or '\r'
#tables
table default {
# 0 is starting state
0 -> whitespace_line for * do pushback;
#String prefix states
# When we encounter a prefix character, we are faced with the possibility
# that it is either the beginning of a string or of an identifier. With a
# single character of lookahead available, we therefore have to be in an
# intermediate state until we are able to determine which case we're in.
code -> maybe_string1 for prefix_chars do mark;
maybe_string1 -> maybe_string2 for prefix_chars
maybe_string1 or maybe_string2 -> quote_s for "'"
maybe_string1 or maybe_string2 -> quote_d for '"'
code -> quote_s for "'" do mark;
code -> quote_d for '"' do mark;
maybe_string1 or maybe_string2 -> in_identifier for * do pushback;
# In the following, `_s` means one single quote, `_ss` means two in a row,
# etc. Likewise `_d` indicates double quotes.
quote_s -> quote_ss for "'"
quote_d -> quote_dd for '"'
quote_s -> instring for * do pushback ; push(string_s);
quote_ss -> instring for "'" do push(string_sss);
quote_ss -> code for * do pushback ; emit(STRING);
quote_d -> instring for * do pushback ; push(string_d);
quote_dd -> instring for '"' do push(string_ddd);
quote_dd -> code for * do pushback ; emit(STRING);
#F-string prefix states
# The prefixes `u` and `b` are specific to Python 2, and f-strings are only
# valid for Python 3. Thus, the only potential prefixes are permutations of
# `f` and `fr` (upper/lowercase notwithstanding).
code -> maybe_fstring1 for 'f' or 'F' do mark;
maybe_string1 -> maybe_fstring2 for 'f' or 'F'
maybe_fstring1 -> maybe_fstring2 for 'r' or 'R'
maybe_fstring1 or maybe_fstring2 -> fquote_s for "'"
maybe_fstring1 or maybe_fstring2 -> fquote_d for '"'
maybe_fstring1 or maybe_fstring2 -> in_identifier for * do pushback;
fquote_s -> fquote_ss for "'"
fquote_d -> fquote_dd for '"'
fquote_s -> instring for * do pushback ; push(fstring_start_s);
fquote_ss -> instring for "'" do push(fstring_start_sss);
fquote_ss -> code for * do pushback ; emit(STRING);
fquote_d -> instring for * do pushback ; push(fstring_start_d);
fquote_dd -> instring for '"' do push(fstring_start_ddd);
fquote_dd -> code for * do pushback ; emit(STRING);
#String states
instring -> instring for *
instring -> unicode_or_escape for '\\'
unicode_or_escape -> unicode_or_raw for 'N'
unicode_or_raw -> unicode for '{'
unicode_or_raw -> instring for *
unicode -> instring for '}'
unicode -> unicode for *
unicode_or_escape -> escape for * do pushback;
escape -> instring for feed do newline;
escape -> instring for *
# When inside a parenthesized expression, newlines indicate the continuation
# of the expression, and not a return to a context where statements may
# appear. This is captured using the `paren` table.
code -> code for '(' do emit(LPAR, "("); push(paren);
code -> code for '[' do emit(LSQB, "["); push(paren);
code -> code for '{' do emit(LBRACE, "{"); push(paren);
code -> code for ')' do emit(RPAR, ")");
code -> code for ']' do emit(RSQB, "]");
code -> code for '}' do emit(RBRACE, "}");
code -> code for '`' do emit(BACKQUOTE, '`');
# Operators
code -> assign for '=' do mark;
code -> le for '<' do mark;
code -> ge for '>' do mark;
code -> bang for '!' do mark;
le -> binop for '<'
le -> code for '>' do emit(OP);
ge -> binop for '>'
bang or le or ge or assign -> code for '=' do emit(OP);
le or ge or assign -> code for * do pushback; emit(OP);
bang -> code for 'r' or 'a' or 's' or 'd' do emit(CONVERSION);
code -> colon for ':'
colon -> code for '=' do emit(COLONEQUAL, ":=");
colon -> code for * do pushback; emit(COLON, ":");
code -> code for ',' do emit(COMMA, ",");
code -> code for ';' do emit(SEMI, ";");
code -> at for '@' do mark;
at -> code for '=' do emit(OP);
at -> code for * do pushback; emit(AT, "@");
code -> dot for '.' do mark;
dot -> float for digits
dot -> code for * do pushback; emit(DOT, ".");
binop or slash or star or dash -> code for '=' do emit(OP);
binop or slash or star or dash -> code for * do pushback; emit(OP);
code -> star for '*' do mark;
star -> binop for '*'
code -> slash for '/' do mark;
slash -> binop for '/'
code -> dash for '-' do mark;
dash -> code for '>' do emit(RARROW);
code -> binop for '+' or '%' or '&' or '|' or '^' do mark;
code -> code for '~' do emit(OP, '~');
# Numeric literals
# Python admits a large variety of numeric literals, and the handling of
# various constructs is a bit inconsistent. For instance, prefixed zeroes are
# not allowed in front of integer numerals (unless all digits are between 0
# and 7, in which case it is treated as an octal number), but _are_ allowed if
# there is some other context that makes it a float or complex number. Thus,
# `09` is invalid, but `09.` and `09j` are valid. This means we have to be
# very careful in what we commit to in our tokenization, hence the rather
# complicated construction below.
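# As an illustration: for `09.`, '0' marks the token and enters `zero`, '9'
# moves to `zero_int`, and '.' moves to `int_dot`, so any trailing digits are
# consumed as a float and the whole literal is emitted as one NUMBER. A bare
# `09` instead leaves `zero_int` through its catch-all rule, also as a NUMBER;
# rejecting it is presumably left to a later stage.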
code -> int for one_to_nine do mark;
int -> int for digits
zero or zero_int or binary or octal or int or hex -> code for 'l' or 'L' do emit(NUMBER);
int -> int_sep for '_'
int_sep -> int for digits
int_sep -> error for * do emit(ERRORTOKEN);
code -> zero for '0' do mark;
zero -> zero_int for digits
zero -> zero_int_sep for '_'
zero_int -> zero_int for digits
zero_int -> zero_int_sep for '_'
zero_int_sep -> zero_int for digits
zero_int_sep -> error for * do emit(ERRORTOKEN);
zero -> octal for 'o' or 'O'
octal -> octal for oct_digits
octal -> octal_sep for '_'
octal_sep -> octal for oct_digits
octal_sep -> error for * do emit(ERRORTOKEN);
zero or octal or hex or binary -> code for * do pushback; emit(NUMBER);
zero -> binary for 'b' or 'B'
binary -> binary for '0' or '1'
binary -> binary_sep for '_'
binary_sep -> binary for '0' or '1'
binary_sep -> error for * do emit(ERRORTOKEN);
zero -> hex for 'x' or 'X'
hex -> hex for hex_digits
hex -> hex_sep for '_'
hex_sep -> hex for hex_digits
hex_sep -> error for * do emit(ERRORTOKEN);
zero or zero_int or int -> int_dot for '.'
zero_int or int -> code for * do pushback; emit(NUMBER);
int_dot or float -> float for digits
float -> float_sep for '_'
float_sep -> float for digits
float_sep -> error for * do emit(ERRORTOKEN);
int_dot -> code for * do pushback; emit(NUMBER);
float or zero or zero_int or int or int_dot -> float_e for 'e'
float or zero or zero_int or int or int_dot -> float_E for 'E'
# `1 if 1else 0` is valid syntax, so we cannot assume 'e' always indicates a float.
float_e -> code for 'l' do pushback; pushback; emit(NUMBER);
float_e or float_E -> float_E for '+' or '-'
float_e or float_E or float_x -> float_x for digits
float_x -> float_x_sep for '_'
float_x_sep -> float_x for digits
float_x_sep -> error for * do emit(ERRORTOKEN);
float or float_x -> code for * do pushback; emit(NUMBER);
# Identifiers (e.g. names and keywords)
code -> in_identifier for IDENTIFIER do mark;
in_identifier -> in_identifier for IDENTIFIER or digits or IDENTIFIER_CONTINUE
code -> dollar_name for '$' do mark;
dollar_name -> dollar_name for IDENTIFIER or digits or IDENTIFIER_CONTINUE
code -> in_identifier for '_' do mark;
in_identifier -> in_identifier for '_'
in_identifier -> code for * do pushback; emit(NAME);
dollar_name -> code for * do pushback; emit(DOLLARNAME);
# Comments
code -> line_end_comment for '#' do mark;
line_end_comment -> code for feed do pushback; emit(COMMENT);
line_end_comment -> line_end_comment for *
comment -> whitespace_line for feed do pushback; emit(COMMENT);
comment -> comment for *
code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
whitespace_line -> whitespace_line for SPACE or '\t' or '\f'
whitespace_line -> whitespace_line for feed do newline;
whitespace_line -> code for * do emit_indent;
whitespace_line -> comment for '#' do mark;
code -> code for SPACE or '\t'
# Line continuations and error states.
code or float_e or float_E -> error for * do emit(ERRORTOKEN);
code -> pending_continuation for '\\'
pending_continuation -> line_continuation for feed do newline;
line_continuation -> code for * do pushback; mark;
pending_continuation -> error for * do emit(ERRORTOKEN);
error -> code for * do pushback;
code -> code for * do mark; emit(ERRORTOKEN);
zero or int_dot or zero_int or int or float or float_x -> code for 'j' or 'J' do emit(NUMBER);
}
table paren(default) {
code -> code for feed do mark; newline;
code -> code for ')' do emit(RPAR, ")"); pop;
code -> code for ']' do emit(RSQB, "]"); pop;
code -> code for '}' do emit(RBRACE, "}"); pop;
}
#String starting with '
table string_s(default) {
instring -> code for "'" do pop; emit(STRING);
instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}
#String starting with "
table string_d(default) {
instring -> code for '"' do pop; emit(STRING);
instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}
#String starting with '''
table string_sss(default) {
instring -> string_x for "'"
instring -> instring for feed do newline;
string_x -> string_xx for "'"
string_x -> instring for feed do newline;
string_x -> instring for * do pushback;
string_xx -> code for "'" do pop; emit(STRING);
string_xx -> instring for feed do newline;
string_xx -> instring for * do pushback;
}
#String starting with """
table string_ddd(default) {
instring -> string_x for '"'
instring -> instring for feed do newline;
string_x -> string_xx for '"'
string_x -> instring for feed do newline;
string_x -> instring for * do pushback;
string_xx -> code for '"' do pop; emit(STRING);
string_xx -> instring for feed do newline;
string_xx -> instring for * do pushback;
}
#F-string part common to all fstrings
table fstring_sdsssddd(default) {
instring -> brace for '{'
escape -> brace for '{'
brace -> instring for '{'
brace -> code for * do pushback ; emit(FSTRING_MID); push(fstring_expr);
}
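# Consequently, in any f-string table inheriting from the one above, a doubled
# `{{` stays part of the literal text (brace -> instring for '{'), while `{`
# followed by anything else terminates the literal part (FSTRING_MID here; the
# start tables below override this to emit FSTRING_START) and pushes
# `fstring_expr` to tokenize the interpolated expression.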
#F-string part common to ' and "
table fstring_sd(fstring_sdsssddd) {
instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}
#F-string start for string starting with '
table fstring_start_s(fstring_sd) {
instring -> code for "'" do pop; emit(STRING);
# If this rule is removed or moved to a higher table, the QL tests start failing for unclear reasons.
# It's identical to a rule in default.
brace -> instring for '{'
brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_s); push(fstring_expr);
}
#F-string part for string starting with '
table fstring_s(fstring_sd) {
instring -> code for "'" do pop; emit(FSTRING_END);
}
#F-string start for string starting with "
table fstring_start_d(fstring_sd) {
instring -> code for '"' do pop; emit(STRING);
# If this rule is removed or moved to a higher table, the QL tests start failing for unclear reasons.
# It's identical to a rule in fstring_sdsssddd.
brace -> instring for '{'
brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_d); push(fstring_expr);
}
#F-string part for string starting with "
table fstring_d(fstring_sd) {
instring -> code for '"' do pop; emit(FSTRING_END);
}
#F-string part common to ''' and """
table fstring_sssddd(fstring_sdsssddd) {
instring -> instring for feed do newline;
string_x -> instring for feed do newline;
string_x -> instring for * do pushback;
string_xx -> instring for feed do newline;
string_xx -> instring for * do pushback;
}
#F-string start for string starting with '''
table fstring_start_sss(fstring_sssddd) {
instring -> string_x for "'"
string_x -> string_xx for "'"
string_xx -> code for "'" do pop; emit(STRING);
brace -> instring for '{'
brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_sss); push(fstring_expr);
}
#F-string part for string starting with '''
table fstring_sss(fstring_sssddd) {
instring -> string_x for "'"
string_x -> string_xx for "'"
string_xx -> code for "'" do pop; emit(FSTRING_END);
}
#F-string start for string starting with """
table fstring_start_ddd(fstring_sssddd) {
instring -> string_x for '"'
string_x -> string_xx for '"'
string_xx -> code for '"' do pop; emit(STRING);
brace -> instring for '{'
brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_ddd); push(fstring_expr);
}
#F-string part for string starting with """
table fstring_ddd(fstring_sssddd) {
instring -> string_x for '"'
string_x -> string_xx for '"'
string_xx -> code for '"' do pop; emit(FSTRING_END);
}
#Expression within an f-string
table fstring_expr(paren) {
code -> instring for '}' do pop; mark;
code -> instring for ':' do emit(COLON); push(format_specifier);
instring -> instring for '}' do pop; mark;
}
fspec_type = 'b' or 'c' or 'd' or 'e' or 'E' or 'f' or 'F' or 'g' or 'G' or 'n' or 'o' or 's' or 'x' or 'X' or '%'
fspec_align = '<' or '>' or '=' or '^'
fspec_sign = '+' or '-' or ' '
table format_specifier(default) {
instring -> code for '{' do emit(FSTRING_SPEC);
instring -> instring for '}' do pushback; emit(FSTRING_SPEC); pop;
code -> instring for '}' do mark;
}
#Special state for when dedents are pending.
table pending_dedent(default) {
code -> code for * do pop; emit_indent;
}
start: default


@@ -0,0 +1,25 @@
from . import test_tokenizer
import sys
from blib2to3.pgen2.token import tok_name
def printtoken(type, token, start, end): # for testing
token_range = "%d,%d-%d,%d:" % (start + end)
print("%-20s%-15s%r" %
(token_range, tok_name[type], token)
)
def main():
verbose = sys.argv[1] == "-v"
if verbose:
inputfile = sys.argv[2]
else:
inputfile = sys.argv[1]
with open(inputfile, "r") as input:
t = test_tokenizer.Tokenizer(input.read()+"\n")
for tkn in t.tokens(verbose):
printtoken(*tkn)
if __name__ == "__main__":
main()
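# Expected invocation (module and file names hypothetical; run from the
# package root, since the import above is relative):
#   python -m <package>.<this_module> [-v] some_file.py
# With -v, the tokenizer also prints each character, state, and transition
# taken as it runs.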


@@ -0,0 +1,172 @@
'''
Lookup table based tokenizer with state popping and pushing capabilities.
The ability to push and pop state is required for handling parenthesised expressions,
indentation, and f-strings. We also use it for handling the different quotation mark types,
but it is not essential for that, merely convenient.
'''
import codecs
import re
class Tokenizer(object):
def __init__(self, text):
self.text = text
self.index = 0
self.line_start_index = 0
self.token_start_index = 0
self.token_start = 1, 0
self.line = 1
self.super_state = START_SUPER_STATE
self.state_stack = []
self.indents = [0]
#ACTIONS-HERE
def tokens(self, debug=False):
text = self.text
cls_table = CLASS_TABLE
id_index = ID_INDEX
id_chunks = ID_CHUNKS
max_id = len(id_index)*256
#ACTION_TABLE_HERE
state = 0
try:
if debug:
while True:
c = ord(text[self.index])
if c < 128:
cls = cls_table[c]
elif c >= max_id:
cls = ERROR_CLASS
else:
b = id_chunks[id_index[c>>8]][(c>>2)&63]
cls = (b>>((c&3)*2))&3
prev_state = state
print("char = '%s', state=%d, cls=%d" % (text[self.index], state, cls))
state, transition = action_table[self.super_state[state][cls]]
print ("%s -> %s on %r in %s" % (prev_state, state, text[self.index], TRANSITION_STATE_NAMES[id(self.super_state)]))
if transition:
tkn = transition()
if tkn:
yield tkn
else:
self.index += 1
else:
while True:
c = ord(text[self.index])
if c < 128:
cls = cls_table[c]
elif c >= max_id:
cls = ERROR_CLASS
else:
b = id_chunks[id_index[c>>8]][(c>>2)&63]
cls = (b>>((c&3)*2))&3
state, transition = action_table[self.super_state[state][cls]]
if transition:
tkn = transition()
if tkn:
yield tkn
else:
self.index += 1
except IndexError as ex:
if self.index != len(text):
#Reraise index error
cls = cls_table[c]
trans = self.super_state[state]
action_index = trans[cls]
action_table[action_index]
# Not raised? Must have been raised in transition function.
raise ex
tkn = self.emit_indent()
while tkn is not None:
yield tkn
tkn = self.emit_indent()
end = self.line, self.index-self.line_start_index
yield ENDMARKER, u"", self.token_start, end
return
def emit_indent(self):
indent = 0
index = self.line_start_index
current = self.index
here = self.line, current-self.line_start_index
while index < current:
if self.text[index] == ' ':
indent += 1
elif self.text[index] == '\t':
indent = (indent+8) & -8
elif self.text[index] == '\f':
indent = 0
else:
#Unexpected character in the indentation. Emit an error token.
while len(self.indents) > 1:
self.indents.pop()
result = ERRORTOKEN, self.text[self.token_start_index:self.index+1], self.token_start, here
self.token_start = here
self.line_start_index = self.index
return result
index += 1
if indent == self.indents[-1]:
self.token_start = here
self.token_start_index = self.index
return None
elif indent > self.indents[-1]:
self.indents.append(indent)
start = self.line, 0
result = INDENT, self.text[self.line_start_index:current], start, here
self.token_start = here
self.token_start_index = current
return result
else:
self.indents.pop()
if indent > self.indents[-1]:
#Illegal indent
result = ILLEGALINDENT, u"", here, here
else:
result = DEDENT, u"", here, here
if indent < self.indents[-1]:
#More dedents to do
self.state_stack.append(self.super_state)
self.super_state = PENDING_DEDENT
self.token_start = here
self.token_start_index = self.index
return result
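# Standalone sketch (not used by the extractor itself) of the column
# arithmetic in emit_indent above, assuming the same 8-column tab stops:
def _indent_width(prefix):
    """Indentation width of a line prefix, mirroring the loop in emit_indent."""
    indent = 0
    for ch in prefix:
        if ch == ' ':
            indent += 1
        elif ch == '\t':
            indent = (indent + 8) & -8  # advance to the next multiple of 8
        elif ch == '\f':
            indent = 0  # a form feed resets the column
    return indent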
ENCODING_RE = re.compile(br'.*coding[:=]\s*([-\w.]+).*')
NEWLINE_BYTES = b'\n'
def encoding_from_source(source):
'Returns the encoding of source (bytes), plus the source stripped of any BOM marker.'
#Check for BOM
if source.startswith(codecs.BOM_UTF8):
return 'utf8', source[len(codecs.BOM_UTF8):]
if source.startswith(codecs.BOM_UTF16_BE):
return 'utf-16be', source[len(codecs.BOM_UTF16_BE):]
if source.startswith(codecs.BOM_UTF16_LE):
return 'utf-16le', source[len(codecs.BOM_UTF16_LE):]
try:
first_new_line = source.find(NEWLINE_BYTES)
first_line = source[:first_new_line]
second_new_line = source.find(NEWLINE_BYTES, first_new_line+1)
second_line = source[first_new_line+1:second_new_line]
match = ENCODING_RE.match(first_line) or ENCODING_RE.match(second_line)
if match:
ascii_encoding = match.groups()[0]
encoding = ascii_encoding.decode("ascii")
# Handle non-standard encodings that are recognised by the interpreter.
if encoding.startswith("utf-8-"):
encoding = "utf-8"
elif encoding == "iso-latin-1":
encoding = "iso-8859-1"
elif encoding.startswith("latin-1-"):
encoding = "iso-8859-1"
elif encoding.startswith("iso-8859-1-"):
encoding = "iso-8859-1"
elif encoding.startswith("iso-latin-1-"):
encoding = "iso-8859-1"
return encoding, source
except Exception as ex:
print(ex)
#Failed to determine encoding -- Just treat as default.
pass
return 'utf-8', source
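# Standalone sketch: the same cookie pattern as ENCODING_RE above (duplicated
# here so the example is self-contained) recognises a typical emacs-style
# PEP 263 declaration.
if __name__ == "__main__":
    import re as _re
    _COOKIE_RE = _re.compile(br'.*coding[:=]\s*([-\w.]+).*')
    _m = _COOKIE_RE.match(b"# -*- coding: latin-1 -*-")
    assert _m is not None and _m.group(1) == b"latin-1"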