mirror of
https://github.com/github/codeql.git
synced 2026-04-27 17:55:19 +02:00
Python: Copy Python extractor to codeql repo
This commit is contained in:
172
python/extractor/tokenizer_generator/README.md
Normal file
@@ -0,0 +1,172 @@
# The Python tokenizer

This file describes the syntax and operational semantics of the state machine
that underlies our tokenizer.

## The state machine syntax

The state machine is described in a declarative fashion in the
`state_transition.txt` file. This file contains a sequence of declarations, as
described in the following subsections.

Additionally, lines may contain comments indicated using the `#` character, as
in Python itself.

In the remainder of the document, "identifier" means any sequence of characters
starting with a letter (`a-z` or `A-Z`) and followed by a sequence of letters,
digits, and/or underscores.

### Start declarations
This has the form `start: ` followed by the name of a table. It is used to
indicate which table is used as the starting point for the tokenization.

There should be exactly one of these declarations in the file.

### Alias declarations
These have the form
```
identifier = id_or_char or id_or_char or ...
```
where `id_or_char` is either a single character surrounded by single quotes
(e.g. `'a'`) or an identifier defined in another alias declaration.

Thus, aliases define _sets_ of characters: single-quoted characters representing
singleton sets, and `or` being set union.
> Note: A few character classes are predefined:
> - `ERROR` representing the error state of the state machine,
> - `IDENTIFIER` representing characters that can appear at the start of
>   a Unicode identifier, and
> - `IDENTIFIER_CONTINUE` representing characters that can appear
>   within a Unicode identifier.

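To make the syntax concrete, here is a small, invented example of alias declarations (the names are illustrative and do not come from `state_transition.txt`):

```
zero    = '0'
nonzero = '1' or '2' or '3' or '4' or '5' or '6' or '7' or '8' or '9'
digit   = zero or nonzero
```

Here `zero` and `nonzero` are singleton and nine-element sets respectively, and `digit` is their union.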
### Table declarations
These have the form

```
table header {
    state_transition
    state_transition
    ...
}
```
where `header` is either an identifier or an identifier followed by another
identifier surrounded by parentheses. The latter implements a form of
"inheritance" between tables, and is explained in a later section.

The format of `state_transition`s is described in the next subsection.

### State transitions
Each state transition has the following form:
```
set_of_before_states -> after_state for set_of_characters optional_actions
```
Here, `set_of_before_states` is either a single identifier or a list of identifiers
with `or`s interspersed (mimicking the way sets of characters are specified) and
`after_state` is an identifier. These identifiers do not have to be declared
separately; they are implicitly declared when used.
> Note: A special state `0` (in the table indicated with the `start: `
> declaration) represents the starting state for the entire tokenization.

The `set_of_characters` can either be
- the identifier corresponding to an alias,
- a single character (e.g. `'a'`),
- a list of sets of characters with `or`s interspersed, or
- an asterisk `*` representing _all_ characters that do not already have a
  transition defined for the set of "before" states.

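For instance, a single hypothetical rule combining several of these elements (the state names and the `digit` alias are invented for illustration) might read:

```
# Both foo and bar move to baz when a digit or an underscore is seen.
foo or bar -> baz for digit or '_'
```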
After the state transition is an optional list of actions, described next.

### Actions
Actions are specified using the keyword `do`. After this keyword, one or more
actions may be specified, each terminated with `;`, e.g.
```
foo -> bar for 'a' do action1; action2;
```
As the actions are very operational in nature, they will be described when we go
into the operational semantics of the state machine.

## Informal operational semantics
> Note: What follows is not based on a reading of the source code, but on
> experience from working with and modifying the state machine. There may be
> significant inaccuracies.

At a high level, the purpose of the tokenizer is to partition the given input
into a sequence of strings representing tokens. The decision of where to put the
boundaries between these strings is made on a character-by-character basis. To
mark the start of a token, the action `mark` is used. Note that the mark is
placed _before_ the character that caused the action to be executed. That is, in
the following transition rule
```
foo -> bar for 'a' do mark;
```
the mark is placed _before_ the `a`.

Once the end of a token has been reached, the `emit` action is used. This
creates a token from the part of the input spanning from the most recent `mark`
up to (and including) the character that caused the transition to which the
`emit` action is attached.

As an example, consider the following state machine that splits a sequence of
zeroes and ones into tokens consisting of (maximal) runs of each character:

```
start: default
table default {

    # This is essentially just an unconditional state transition.
    0 -> zero_or_one for * do pushback;

    zero_or_one -> zeros for '0' do mark;
    zero_or_one -> ones for '1' do mark;

    zeros -> zeros for '0'
    zeros -> zero_or_one for * do pushback; emit(ZEROS);

    ones -> zero_or_one for * do pushback; emit(ONES);
    ones -> ones for '1'
}
```
The `pushback` action has the effect of "pushing back" the current character.
(In reality, all this does is move the pointer to the current character one step
back. It is thus not a problem to have several pushbacks in a row.)

> Note: The order in which the transition rules for a state are specified does
> not matter. Even if the `*` transition is listed first, as with `ones` above, it
> does not take precedence over other more specific character sets.

After tokenizing a string with the above grammar, the result will be a sequence
of `ZEROS` and `ONES` tokens. Each of these will have three pieces of data
associated with it: the starting point (line and column), the end point (also
line and column), and the characters that make up the token. Note that `emit`
also accepts an optional second argument (which must be a string). For example, the
transition for code when reaching a newline is:
```
feed = '\r' or '\n'
...
code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
```
This has the effect of normalizing end-of-line characters to be `\n`.

> Note: The replacement text may have a different length than the distance to the
> most recent `mark`. This may not be desirable.

The above snippet introduces another action: `newline`. This has the effect of
resetting the column counter to zero and incrementing the line counter.

> Note: There are some peculiarities about newlines, and the tokenizer will get
> confused if they are not handled through the `newline` action.

The last two actions have to do with maintaining a stack of parsing tables. At
all points, the behavior of the tokenizer is governed by the table that is on
top of the stack. The `push` action pushes the specified table (given as an
argument) on top of this stack. Naturally, the `pop` action does the opposite,
discarding the top element.

This leaves the final point of interest: what decides which transitions are
"active" at a given point?

The way this functions is essentially like method dispatch in Python (though
thankfully there is no multiple inheritance). Thus, given the current state and
the current character, we first look in the table on top of the stack. If this
table does not have a transition for the given state and character, we next look
at the table it inherits from, and so forth.

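This lookup rule can be sketched as follows. This is an illustrative model, not the actual implementation; the table and state names are invented for the example:

```python
# Sketch of transition lookup: only the table on top of the stack is active,
# and a missing entry falls back along the (single) inheritance chain.

class Table:
    def __init__(self, name, transitions, parent=None):
        self.name = name
        self.transitions = transitions  # (state, char) -> next_state
        self.parent = parent            # single inheritance only

def lookup(table_stack, state, char):
    table = table_stack[-1]             # only the top of the stack is consulted
    while table is not None:
        if (state, char) in table.transitions:
            return table.transitions[(state, char)]
        table = table.parent            # fall back, like attribute lookup
    raise KeyError((state, char))

base = Table("base", {("code", "a"): "code"})
strings = Table("strings", {("code", '"'): "end_string"}, parent=base)

stack = [base, strings]                 # `strings` has been pushed
print(lookup(stack, "code", '"'))       # found in `strings` itself
print(lookup(stack, "code", "a"))       # inherited from `base`
```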
155
python/extractor/tokenizer_generator/compiled.py
Normal file
@@ -0,0 +1,155 @@

import unicodedata
from . import machine


class SuperState:

    def __init__(self, name, mapping):
        self.name = name
        self.mapping = mapping

    def as_list_of_bytes(self):
        lst = dict_to_list(self.mapping)
        return [ table.as_bytes() for table in lst ]

    def as_list_of_transitions(self):
        return dict_to_list(self.mapping)

action_id = 0
all_actions = {}

class ActionList:

    def __init__(self, actions, id):
        self.actions = actions
        self.id = id

    @staticmethod
    def get(actions):
        global action_id
        assert isinstance(actions, tuple)
        if actions not in all_actions:
            all_actions[actions] = ActionList(actions, action_id)
            action_id += 1
        return all_actions[actions]

    @staticmethod
    def listall():
        return sorted(all_actions.values(), key = lambda al: al.id)

next_pair_id = 0
pairs = {}

class StateActionListPair:

    def __init__(self, state, actionlist, id):
        self.state = state
        self.actionlist = actionlist
        self.id = id

    @staticmethod
    def get(state, actionlist):
        global next_pair_id
        if actionlist is not None and not isinstance(actionlist, ActionList):
            actionlist = ActionList.get(actionlist)
        if (state, actionlist) not in pairs:
            pairs[(state, actionlist)] = StateActionListPair(state, actionlist, next_pair_id)
            next_pair_id += 1
        return pairs[(state, actionlist)]

    @staticmethod
    def listall():
        return sorted(pairs.values(), key = lambda pair: pair.id)

next_table_id = 0
table_ids = {}

class StateTransitionTable:

    def __init__(self, mapping):
        self.mapping = mapping

    def as_bytes(self):
        lst = dict_to_list(self.mapping)
        return bytes(pair.id for pair in lst)

    def __getitem__(self, key):
        return self.mapping[key]

    @property
    def id(self):
        global next_table_id
        b = self.as_bytes()
        if b not in table_ids:
            table_ids[b] = next_table_id
            next_table_id += 1
        return table_ids[b]

def dict_to_list(mapping):
    assert isinstance(mapping, dict)
    result = []
    for key, value in mapping.items():
        while key.id >= len(result):
            result.append(None)
        result[key.id] = value
    return result


#Each character is one of id-start, id-continuation or other. Represent "other" as ERROR for all non-ascii characters.
#See https://www.python.org/dev/peps/pep-3131 for an explanation of what is an identifier.
OTHER_START = {0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C}
OTHER_CONTINUE = {0x00B7, 0x0387, 0x19DA}
OTHER_CONTINUE.update(range(0x1369, 0x1372))
ID_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONT_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}

CHUNK_SIZE = 64

class IdentifierTable:

    def __init__(self):
        classes = []
        for i in range(0x110000):
            try:
                c = chr(i)
            except:
                continue
            cat = unicodedata.category(c)
            if cat in ID_CATEGORIES or i in OTHER_START:
                cls = machine.IDENTIFIER_CLASS.id
            elif cat in CONT_CATEGORIES or i in OTHER_CONTINUE:
                cls = machine.IDENTIFIER_CONTINUE_CLASS.id
            else:
                cls = machine.ERROR_CLASS.id
            assert cls in (0,1,2,3)
            classes.append(cls)
        result = []
        for i, cls in enumerate(classes):
            byte, bits = i>>2, cls<<((i&3)*2)
            while byte >= len(result):
                result.append(0)
            result[byte] |= bits
        while result[-1] == 0:
            result.pop()
        while len(result) % CHUNK_SIZE:
            result.append(0)
        self.table = result

    def as_bytes(self):
        return bytes(self.table)

    def as_two_level_table(self):
        index = []
        chunks = {}
        next_id = 0
        the_bytes = self.as_bytes()
        for n in range(0, len(the_bytes), CHUNK_SIZE):
            chunk = the_bytes[n:n+CHUNK_SIZE]
            if chunk in chunks:
                index.append(chunks[chunk])
            else:
                index.append(next_id)
                chunks[chunk] = next_id
                next_id += 1
        chunks = [ chunk for (i, chunk) in sorted((i, chunk) for chunk, i in chunks.items())]
        return chunks, index
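The two-level table built by `as_two_level_table` can be consulted as sketched below. This lookup routine is illustrative, not code from this package; it assumes the packing used in `IdentifierTable` (four 2-bit classes per byte, `CHUNK_SIZE`-byte chunks, trailing all-zero bytes trimmed, with class 0 being the ERROR class):

```python
# Illustrative lookup into a two-level identifier table.

CHUNK_SIZE = 64

def classify(chunks, index, codepoint):
    byte_pos = codepoint >> 2                       # which packed byte
    chunk_no, offset = divmod(byte_pos, CHUNK_SIZE)
    if chunk_no >= len(index):                      # beyond the trimmed table
        return 0                                    # class 0 == ERROR
    byte = chunks[index[chunk_no]][offset]
    return (byte >> ((codepoint & 3) * 2)) & 3      # extract the 2-bit class

# Tiny hand-built table: code point 5 gets class 2, everything else class 0.
chunk = bytearray(CHUNK_SIZE)
chunk[5 >> 2] = 2 << ((5 & 3) * 2)
chunks, index = [bytes(chunk)], [0]
print(classify(chunks, index, 5))   # -> 2
print(classify(chunks, index, 6))   # -> 0
```

Sharing identical chunks through the index is what keeps the full-Unicode table small.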
225
python/extractor/tokenizer_generator/gen_state_machine.py
Normal file
@@ -0,0 +1,225 @@
'''
Generate a state-machine based tokenizer from a state transition description and a template.

Parses the state transition description to compute a set of transition tables.
Each table maps (state, character-class) pairs to (state, action) pairs.
During tokenization each input character is converted to a class, then a new state and action is
looked up using the current state and character-class.

The generated tables are:
CLASS_TABLE:
    Maps ASCII code points to character class.
ID_TABLE:
    Maps all unicode points to one of Identifier, Identifier-continuation, or other.
The transition tables:
    Each table maps each state to a per-class transition table.
    Each per-class transition table maps each character-class to an index in the action table.
ACTION_TABLE:
    Embedded in code as `action_table`; maps each index to a (state, action) pair.

Since the number of character-classes, states and (state, action) pairs is small, everything is represented as
a byte, and tables as `bytes` objects for Python 3, or `array.array` objects for Python 2.
'''


from .parser import parse
from . import machine
from .compiled import StateActionListPair, IdentifierTable

def emit_id_bytes(id_table):
    chunks, index = id_table.as_two_level_table()
    print("# %d entries in ID index" % len(index))
    index_bytes = bytes(index)
    print("ID_INDEX = toarray(")
    for n in range(0, len(index_bytes), 32):
        print("    %r" % index_bytes[n:n+32])
    print(")")
    print("ID_CHUNKS = (")
    for chunk in chunks:
        print("    toarray(%r)," % chunk)
    print(")")

def emit_transition_table(table, verbose=False):
    print("%s = (" % table.name.upper(), end="")
    for trans in table.as_list_of_transitions():
        print("B%02d," % trans.id, end=" ")
    print(")")

emitted_rows = set()

def emit_rows(table):
    for trans in table.as_list_of_transitions():
        id = trans.id
        if id in emitted_rows:
            continue
        emitted_rows.add(id)
        print("B%02d = toarray(%r)" % (id, trans.as_bytes()))

action_names = {}
next_action_id = 0

def get_action_id(action):
    global next_action_id
    assert action is not None
    if action in action_names:
        return action_names[action]
    result = next_action_id
    next_action_id += 1
    action_names[action] = result
    return result

def emit_actions(table, indent=""):
    for pair in table:
        if pair.actionlist is None:
            continue
        action = pair.actionlist
        get_action_function(action, indent)

def generate_action_table(table, indent):
    result = []
    result.append(indent + "action_table = [\n    " + indent)
    for i, pair in enumerate(table):
        if pair.actionlist is None:
            result.append("(%d, None), " % pair.state.id)
        else:
            result.append("(%d, self.action_%s), " % (pair.state.id, pair.actionlist.id))
        if (i & 3) == 3:
            result.append("\n    " + indent)
    result.append("\n" + indent + "]")
    return "".join(result)

action_functions = set()

def get_action_function(actionlist, indent=""):
    if actionlist in action_functions:
        return
    action_functions.add(actionlist)
    last = actionlist.actions[-1]
    print(indent + "def action_%d(self):" % actionlist.id)
    emit = False
    for action in actionlist.actions:
        if action is machine.PUSHBACK:
            print(indent + "    self.index -= 1")
            continue
        elif action is machine.POP:
            print(indent + "    self.super_state = self.state_stack.pop()")
        elif isinstance(action, machine.Push):
            print(indent + "    self.state_stack.append(self.super_state)")
            print(indent + "    self.super_state = %s" % action.state.name.upper())
        elif action is machine.MARK:
            print(indent + "    self.token_start_index = self.index")
            print(indent + "    self.token_start = self.line, self.index-self.line_start_index")
        elif isinstance(action, machine.Emit):
            emit = True
            print(indent + "    end = self.line, self.index-self.line_start_index+1")
            if action.text is None:
                print(indent + "    result = [%s, self.text[self.token_start_index:self.index+1], self.token_start, end]" % action.kind)
            else:
                print(indent + "    result = [%s, u%s, (self.line, self.index-self.line_start_index), end]" % (action.kind, action.text))
            print(indent + "    self.token_start = end")
            print(indent + "    self.token_start_index = self.index+1")
        elif action is machine.NEWLINE:
            print(indent + "    self.line_start_index = self.index+1")
            print(indent + "    self.line += 1")
        elif action is machine.EMIT_INDENT:
            assert action is last
            print(indent + "    return self.emit_indent()")
            print()
            return
        else:
            assert False, "Unexpected action: %s" % action
    print(indent + "    self.index += 1")
    if emit:
        print(indent + "    return result")
    else:
        print(indent + "    return None")
    print()
    return

def emit_char_classes(char_classes, verbose=False):
    for cls in sorted(set(char_classes.values()), key=lambda x : x.id):
        print("#%d = %r" % (cls.id, cls))
    table = [None] * 128
    by_id = {
        machine.IDENTIFIER_CLASS.id : machine.IDENTIFIER_CLASS,
        machine.IDENTIFIER_CONTINUE_CLASS.id : machine.IDENTIFIER_CONTINUE_CLASS,
        machine.ERROR_CLASS.id : machine.ERROR_CLASS
    }
    for c, cls in char_classes.items():
        by_id[cls.id] = cls
        if c is machine.IDENTIFIER or c is machine.IDENTIFIER_CONTINUE:
            continue
        table[ord(c)] = cls.id
        by_id[cls.id] = cls
    for i in range(128):
        assert table[i] is not None
    bytes_table = bytes(table)
    if verbose:
        print("# Class Table")
        for i in range(len(bytes_table)):
            b = bytes_table[i]
            print("# %r -> %s" % (chr(i), by_id[b]))
    print("CLASS_TABLE = toarray(%r)" % bytes_table)



PREFACE = """
import codecs
import re
import sys

from blib2to3.pgen2.token import *

if sys.version < '3':
    from array import array
    def toarray(b):
        return array('B', b)
else:
    def toarray(b):
        return b
"""

def main():
    verbose = False
    import sys
    if len(sys.argv) != 3:
        print("Usage %s DESCRIPTION TEMPLATE" % sys.argv[0])
        sys.exit(1)
    descriptionfile = sys.argv[1]
    with open(descriptionfile) as fd:
        m = machine.Machine.load(fd.read())
    templatefile = sys.argv[2]
    with open(templatefile) as fd:
        template = fd.read()
    print("# This file is AUTO-GENERATED. DO NOT MODIFY")
    print('# To regenerate: run "python3 -m tokenizer_generator.gen_state_machine %s %s"' % (descriptionfile, templatefile))
    print(PREFACE)
    print("IDENTIFIER_CLASS = %d" % machine.IDENTIFIER_CLASS.id)
    print("IDENTIFIER_CONTINUE_CLASS = %d" % machine.IDENTIFIER_CONTINUE_CLASS.id)
    print("ERROR_CLASS = %d" % machine.ERROR_CLASS.id)
    emit_id_bytes(IdentifierTable())
    char_classes = m.get_classes()
    emit_char_classes(char_classes, verbose)
    print()
    tables = [state.compile(char_classes) for state in m.states.values() ]
    for table in tables:
        emit_rows(table)
    print()
    for table in tables:
        #pprint(table)
        emit_transition_table(table, verbose)
    print()
    print("TRANSITION_STATE_NAMES = {")
    for state in m.states.values():
        print("    id(%s): '%s'," % (state.name.upper(), state.name))
    print("}")
    print("START_SUPER_STATE = %s" % m.start.name.upper())
    prefix, suffix = template.split("#ACTIONS-HERE")
    print(prefix)
    actions = StateActionListPair.listall()
    emit_actions(actions, "    ")
    action_table = generate_action_table(actions, "    ")
    print(suffix.replace("#ACTION_TABLE_HERE", action_table))

if __name__ == "__main__":
    main()
485
python/extractor/tokenizer_generator/machine.py
Normal file
@@ -0,0 +1,485 @@

import ast

from .parser import parse
from collections import defaultdict
from .compiled import SuperState, StateTransitionTable, StateActionListPair


class Transition:

    def __init__(self, from_state, to_state, what, do):
        assert isinstance(from_state, State)
        assert isinstance(to_state, State)
        self.from_state = from_state
        self.what = what
        if not do:
            do = None
        else:
            assert isinstance(do, list)
            for item in do:
                assert isinstance(item, Action)
            do = tuple(do)
        self.action = StateActionListPair.get(to_state, do)

    def dump(self):
        if self.action.actionlist:
            return "%s -> %s for %s do %s" % (
                self.from_state,
                self.action.state,
                self.what,
                "; ".join(str(do) for do in self.action.actionlist.actions)
            )
        else:
            return "%s -> %s for %s" % (
                self.from_state,
                self.action.state,
                self.what
            )

next_state_id = 1
states = {}

class State:

    def __init__(self, name):
        global next_state_id
        if name.isdigit():
            assert name == "0"
            self.id = 0
            self.name = "START"
        else:
            self.name = name
            self.id = next_state_id
            next_state_id += 1

    @staticmethod
    def get(name):
        if name not in states:
            states[name] = State(name)
        return states[name]

    @staticmethod
    def count():
        return len(states)

    def __repr__(self):
        return "state_%s(%s)" % (self.id, self.name)

    @staticmethod
    def from_id(id):
        for state in states.values():
            if state.id == id:
                return state
        raise ValueError(id)

State.get("0")
ERROR_ACTION = StateActionListPair.get(State.get("error"), None)

next_super_state_id = 0
super_states = {}

class TransitionTable:

    def __init__(self, name):
        global next_super_state_id
        self.name = name
        self.id = next_super_state_id
        next_super_state_id += 1
        self.parent = None
        self.transitions = []
        self._table = None

    def add_transition(self, trans):
        self.transitions.append(trans)

    def dump(self):
        if self.parent:
            lines = [ "TransitionTable %s(%s extends %s)" % (self.id, self.name, self.parent.name) ]
        else:
            lines = [ "TransitionTable %s(%s):" % (self.id, self.name) ]
        lines.extend("    " + t.dump() for t in self.transitions)
        return "\n".join(lines)

    @staticmethod
    def get(name, parent=None):
        if name not in super_states:
            super_states[name] = TransitionTable(name)
        return super_states[name]

    @staticmethod
    def count():
        return len(super_states)

    def get_table(self, character_classes):
        '''Returns the transition table for all states in this super-state'''
        if self._table is None:
            from_transitions = defaultdict(list)
            for t in self.transitions:
                from_transitions[t.from_state].append(t)
            self._table = { state: self.get_transition_table(state, from_transitions.get(state, ()), character_classes) for state in states.values() }
        return self._table

    def get_transition_table(self, state, transition_list, character_classes):
        table = {}
        if self.parent:
            parent_table = self.parent.get_table(character_classes)
        else:
            parent_table = None
        default = None
        for t in transition_list:
            assert state == t.from_state
            if isinstance(t.what, Any):
                default = t.action
                continue
            action = t.action
            classes = set(character_classes[c] for c in t.what)
            for cls in classes:
                if cls in table:
                    raise ValueError("Duplicate transition from %s on %s" % (state, cls))
                else:
                    table[cls] = action
        on_identifier = table.get(IDENTIFIER_CLASS, None)
        for cls in character_classes.values():
            if cls in table:
                continue
            if on_identifier and cls.is_identifier:
                table[cls] = on_identifier
            elif default:
                table[cls] = default
            elif parent_table and state in parent_table:
                table[cls] = parent_table[state][cls]
            else:
                table[cls] = ERROR_ACTION
        return StateTransitionTable(table)

    def compile(self, character_classes):
        return SuperState(self.name, self.get_table(character_classes))

class Any:

    def __repr__(self):
        return "*"

class Action:

    def __repr__(self):
        return self.__class__.__name__.lower()

class Emit(Action):

    def __init__(self, kind, text):
        assert isinstance(kind, str)
        assert kind.upper() == kind
        self.kind = kind
        self.text = text

    def __repr__(self):
        if self.text is None:
            return "emit(" + self.kind + ")"
        else:
            return "emit(%s, %r)" % (self.kind, self.text)

    def __eq__(self, other):
        return type(other) is Emit and other.kind == self.kind and other.text == self.text

    def __hash__(self):
        return 353 ^ hash(self.kind) ^ hash(self.text)

class Push(Action):

    def __init__(self, state):
        assert isinstance(state, TransitionTable)
        self.state = state

    def __repr__(self):
        return "push(%s)" % self.state.name

    def __eq__(self, other):
        return type(other) is Push and other.state == self.state

    def __hash__(self):
        return 59 ^ hash(self.state)

class EmitIndent(Action):
    pass
EMIT_INDENT = EmitIndent()

class Pop(Action):
    pass
POP = Pop()

class Pushback(Action):
    pass
PUSHBACK = Pushback()

class Mark(Action):
    pass
MARK = Mark()

class Newline(Action):
    pass
NEWLINE = Newline()

class Identifier:

    def __repr__(self):
        return "UnicodeIdentifiers()"

IDENTIFIER = Identifier()

class IdentifierContinue:

    def __repr__(self):
        return "IdentifierContinue()"

IDENTIFIER_CONTINUE = IdentifierContinue()

next_char_class_id = 0

class CharacterClass:

    def __init__(self, chars, is_identifier = None):
        global next_char_class_id
        self.chars = chars
        self.id = next_char_class_id
        next_char_class_id += 1
        if is_identifier is None:
            self.is_identifier = chars.copy().pop().isidentifier()
        else:
            self.is_identifier = is_identifier

    def __repr__(self):
        if self == IDENTIFIER_CLASS:
            return "IDENTIFIER_CLASS(%d)" % self.id
        elif self == ERROR_CLASS:
            return "ERROR_CLASS(%d)" % self.id
        else:
            return "CharacterClass %s %r" % (self.id, sorted(self.chars))

ERROR_CLASS = CharacterClass(set(), False)
assert ERROR_CLASS.id == 0
IDENTIFIER_CLASS = CharacterClass(set(), True)
IDENTIFIER_CONTINUE_CLASS = CharacterClass(set(), False)

class Machine:

    def __init__(self):
        self.aliases = {}
        self.states = {}
        self.aliases["IDENTIFIER"] = IDENTIFIER
        self.aliases["IDENTIFIER_CONTINUE"] = IDENTIFIER_CONTINUE
        self.aliases['SPACE'] = {' '}
        self.start = None

    def add_state(self, name):
        assert name not in self.states
        result = TransitionTable.get(name)
        self.states[name] = result
        return result

    def add_alias(self, name, choices):
        assert name not in self.aliases
        assert isinstance(choices, set), choices
        self.aliases[name] = choices

    def dump(self):
        r = []
        a = r.append
        a("Starting super-state: %s" % self.start.name)
        a("")
        a("Aliases:")
        for name_alias in self.aliases.items():
            a("    %s = %r" % name_alias)
        a("")
        for name, state in self.states.items():
            a(state.dump())
        return "\n".join(r)

    @staticmethod
    def load(src):
        tree = parse(src)
        m = Machine()
        w = Walker(m)
        w.visit(tree)
        return m

    def get_classes(self):
        '''Get the character classes for this machine'''
        #There are two predefined classes: Unicode identifiers, and ERROR.
        #A character class is a set of characters, such that the transitions
        #and actions of the machine are identical for all characters in that class.
        char_to_transitions = defaultdict(set)
        for s in self.states.values():
            for t in s.transitions:
                w = t.what
                if isinstance(w, Any):
                    continue
                for c in w:
                    if c is IDENTIFIER or c is IDENTIFIER_CONTINUE:
                        continue
                    char_to_transitions[c].add((s, t.from_state, t.action))
        equivalence_sets = defaultdict(set)
        for c, transition_set in sorted(char_to_transitions.items()):
            equivalence_sets[frozenset(transition_set)].add(c)
        classes = {}
        for char_set in sorted(equivalence_sets.values()):
            charcls = CharacterClass(char_set)
            for c in char_set:
                classes[c] = charcls
        classes[IDENTIFIER] = IDENTIFIER_CLASS
        classes[IDENTIFIER_CONTINUE] = IDENTIFIER_CONTINUE_CLASS
        for i in range(128):
            c = chr(i)
            if c not in classes:
                if c.isidentifier():
                    classes[c] = IDENTIFIER_CLASS
                elif c in "0123456789":
                    classes[c] = IDENTIFIER_CONTINUE_CLASS
                else:
                    classes[c] = ERROR_CLASS
        for cls in classes.values():
            if cls is IDENTIFIER_CLASS or cls is IDENTIFIER_CONTINUE_CLASS or cls is ERROR_CLASS:
                continue
            assert { c for c in cls.chars if c.isidentifier() } == cls.chars or not { c for c in cls.chars if c.isidentifier() }
        return classes

class Walker:

    def __init__(self, machine):
        self.machine = machine

    def visit(self, node):
        if hasattr(node, "type"):
            tag = node.type
        else:
            tag = node.data
        meth = getattr(self, "visit_" + tag, None)
        if meth is None:
            self.fail(node, tag)
        else:
            return meth(node)

    def fail(self, node, tag):
        print(node)
        raise NotImplementedError(tag)

    def visit_first_child(self, node):
        assert len(node.children) == 1
        return self.visit(node.children[0])

    def visit_children(self, node):
        return [ self.visit(child) for child in node.children ]

    visit_start = visit_first_child
    visit_machine = visit_children
    visit_declaration = visit_first_child

    def visit_alias_decl(self, node):
        assert len(node.children) == 2
        choice = self.visit(node.children[1])
        self.machine.add_alias(node.children[0].value, choice)

    def visit_alias(self, node):
        return self.machine.aliases[node.children[0].value]

    def visit_char(self, node):
        c = ast.literal_eval(node.children[0].value)
        assert isinstance(c, str), c
        assert len(c) == 1, c
        return c

    def visit_choice(self, node):
        #Convert choices into a set of characters
        result = set()
        for child in node.children:
            item = self.visit(child)
            if isinstance(item, set):
                result.update(item)
            else:
                result.add(item)
        return result

    visit_item = visit_first_child

    def visit_table_decl(self, node):
        self.current_state = self.visit(node.children[0])
        for transition in node.children[1:]:
            self.visit(transition)

    def visit_table_header(self, node):
|
||||
name = node.children[0].value
|
||||
state = self.machine.add_state(name)
|
||||
if len(node.children) > 1:
|
||||
base = TransitionTable.get(node.children[1].value)
|
||||
state.parent = base
|
||||
return state
|
||||
|
||||
def visit_transition(self, node):
|
||||
# state_choice "->" state "for" (choice | "*") action_list?
|
||||
from_states = self.visit(node.children[0])
|
||||
to_state = self.visit(node.children[1])
|
||||
what = self.visit(node.children[2])
|
||||
if len(node.children) > 3:
|
||||
do = self.visit(node.children[3])
|
||||
else:
|
||||
do = []
|
||||
for state in from_states:
|
||||
trans = Transition(state, to_state, what, do)
|
||||
self.current_state.add_transition(trans)
|
||||
|
||||
visit_state_choice = visit_children
|
||||
|
||||
def visit_state(self, node):
|
||||
return State.get(node.children[0].value)
|
||||
|
||||
def visit_any(self, node):
|
||||
return Any()
|
||||
|
||||
visit_action_list = visit_children
|
||||
visit_action = visit_first_child
|
||||
|
||||
def visit_emit(self, node):
|
||||
if len(node.children) == 2:
|
||||
return Emit(node.children[0].value, self.visit(node.children[1]))
|
||||
else:
|
||||
return Emit(node.children[0].value, None)
|
||||
|
||||
def visit_optional_text(self, node):
|
||||
return node.children[0].value
|
||||
|
||||
def visit_push(self, node):
|
||||
state = TransitionTable.get(node.children[0].value)
|
||||
return Push(state)
|
||||
|
||||
def visit_emit_indent(self, node):
|
||||
return EMIT_INDENT
|
||||
|
||||
def visit_pushback(self, node):
|
||||
return PUSHBACK
|
||||
|
||||
def visit_pop(self, node):
|
||||
return POP
|
||||
|
||||
def visit_mark(self, node):
|
||||
return MARK
|
||||
|
||||
def visit_newline(self, node):
|
||||
return NEWLINE
|
||||
|
||||
def visit_start_decl(self, node):
|
||||
self.machine.start = TransitionTable.get(node.children[0].value)
|
||||
|
||||
|
||||
def main():
|
||||
import sys
|
||||
file = sys.argv[1]
|
||||
with open(file) as fd:
|
||||
tree = parse(fd.read())
|
||||
m = Machine()
|
||||
w = Walker(m)
|
||||
w.visit(tree)
|
||||
print(m.dump())
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
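The equivalence-class computation in `get_classes` above can be illustrated in isolation: characters are grouped by the set of transitions they trigger, so the generated tables need only one column per class rather than one per character. A minimal stdlib-only sketch (the transition data here is invented for illustration, not taken from the real machine):

```python
from collections import defaultdict

# Hypothetical per-character transitions: char -> set of (state, target) pairs.
transitions = {
    "a": {("0", "ident")}, "b": {("0", "ident")},  # behave identically
    "0": {("0", "int")},   "1": {("0", "int")},    # behave identically
    "+": {("0", "op")},
}

# Group characters whose transition sets are identical into one class,
# mirroring the frozenset-keyed equivalence_sets dict in get_classes.
equivalence = defaultdict(set)
for char, trans in transitions.items():
    equivalence[frozenset(trans)].add(char)

classes = sorted(sorted(chars) for chars in equivalence.values())
print(classes)  # → [['+'], ['0', '1'], ['a', 'b']]
```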
92
python/extractor/tokenizer_generator/parser.py
Normal file
@@ -0,0 +1,92 @@
'''
Explanation of the syntax

start_decl: The starting transition table
alias_decl: Declare a shorthand, e.g. digits = '0' or '1' or ...
table_decl: Declare a transition table: name and list of transitions.
transition: Transitions from one state to another. The form is: state (or choice of states) -> new-state for possible-characters [ do action or actions; ]
action: Actions are:
    "emit(kind [, text])": emits a token of kind using the given text or text from the stream. The token starts at the last mark and ends at the current location.
    "push(table)": pushes a transition table to the stack.
    "pop": pops a transition table from the stack.
    "pushback": pushes the last character back to the stream.
    "mark": marks the current location as the start of the next token.
    "emit_indent": Emits zero or more INDENT or DEDENT tokens depending on current indentation.
    "newline": Increments the line number and sets the column offset back to zero.

States:
All states are given names.
The state "0" is the start state and always exists.
All other states are implicitly defined when used (this is for Python after all :)
'*' means all states for which a transition is not explicitly defined.
So the transitions:
    0 -> end for '\n'
    0 -> other for *
    0 -> a_b for 'a' or 'b'
mean that '0' will transition to 'other' for all characters other than 'a', 'b' and '\n'.
The order of transitions in the state machine description is irrelevant.
'''


grammar = r"""
start : machine
machine : declaration+
declaration : alias_decl | table_decl | start_decl
start_decl : "start" ":" IDENTIFIER
table_decl : table_header "{" transition+ "}"
table_header : "table" IDENTIFIER ( "(" IDENTIFIER ")" )?
alias_decl : IDENTIFIER "=" choice
choice : item ( "or" item)*
item : alias | char
alias : IDENTIFIER
char : LITERAL
transition : state_choice "->" state "for" (choice | any) action_list?
any : "*"
state_choice : state ( "or" state)*
state : IDENTIFIER | DIGIT
action_list : "do" action ";" (action ";")*
action : emit | pop | push | pushback | mark | emit_indent | newline
emit : "emit" "(" IDENTIFIER optional_text? ")"
optional_text : "," LITERAL
pop : "pop"
push : "push" "(" IDENTIFIER ")"
pushback : "pushback"
mark : "mark"
emit_indent : "emit_indent"
newline : "newline"

LITERAL : ("\"" /[^"]/* "\"") | ("'" /[^']/* "'")
IDENTIFIER : LETTER ( LETTER | DIGIT | "_" )*
LETTER : "A".."Z" | "a".."z"
DIGIT : "0".."9"
WHITESPACE : (" " | "\t" | "\r" | "\n")+

%import common.NEWLINE
COMMENT : "#" /(.)*/ NEWLINE
%ignore WHITESPACE
%ignore COMMENT
"""


from lark import Lark


class Parser(Lark):

    def __init__(self):
        Lark.__init__(self, grammar, parser="earley", lexer="standard")


def parse(src):
    parser = Parser()
    return parser.parse(src)


def main():
    import sys
    file = sys.argv[1]
    with open(file) as fd:
        tree = parse(fd.read())
    print(tree.pretty())


if __name__ == "__main__":
    main()
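The `Walker` class in the generator dispatches on a node's tag via `getattr`, so each grammar rule gets its own `visit_<rule>` method. The pattern can be sketched standalone (the `Node` shape here is a stand-in for illustration, not lark's actual tree type):

```python
class Node:
    def __init__(self, data, children=()):
        self.data = data
        self.children = list(children)

class MiniWalker:
    # Dispatch to visit_<tag>, mirroring Walker.visit in the generator.
    def visit(self, node):
        meth = getattr(self, "visit_" + node.data, None)
        if meth is None:
            raise NotImplementedError(node.data)
        return meth(node)

    def visit_choice(self, node):
        # Union the character sets of all child items.
        result = set()
        for child in node.children:
            result.update(self.visit(child))
        return result

    def visit_char(self, node):
        return {node.children[0]}

tree = Node("choice", [Node("char", ["a"]), Node("char", ["b"])])
print(sorted(MiniWalker().visit(tree)))  # → ['a', 'b']
```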
385
python/extractor/tokenizer_generator/state_transition.txt
Normal file
@@ -0,0 +1,385 @@
# State machine specification for unified Python tokenizer
# Handles all tokens for all versions of Python, including partial string tokens for handling f-strings.
# Starting transition table is "default" and starting state is "0"
#
#


#declarations
prefix_chars = 'u' or 'U' or 'b' or 'B' or 'r' or 'R'
one_to_nine = '1' or '2' or '3' or '4' or '5' or '6' or '7' or '8' or '9'
digits = '0' or one_to_nine
oct_digits = '0' or '1' or '2' or '3' or '4' or '5' or '6' or '7'
hex_digits = digits or 'a' or 'A' or 'b' or 'B' or 'c' or 'C' or 'd' or 'D' or 'e' or 'E' or 'f' or 'F'
feed = '\n' or '\r'

#tables
table default {
    # 0 is starting state
    0 -> whitespace_line for * do pushback;

    #String prefix states

    # When we encounter a prefix character, we are faced with the possibility
    # that it is either the beginning of a string or of an identifier. With a
    # single character of lookahead available, we therefore have to be in an
    # intermediate state until we are able to determine which case we're in.

    code -> maybe_string1 for prefix_chars do mark;
    maybe_string1 -> maybe_string2 for prefix_chars
    maybe_string1 or maybe_string2 -> quote_s for "'"
    maybe_string1 or maybe_string2 -> quote_d for '"'
    code -> quote_s for "'" do mark;
    code -> quote_d for '"' do mark;
    maybe_string1 or maybe_string2 -> in_identifier for * do pushback;

    # In the following, `_s` means one single quote, `_ss` means two in a row,
    # etc. Likewise `_d` indicates double quotes.

    quote_s -> quote_ss for "'"
    quote_d -> quote_dd for '"'
    quote_s -> instring for * do pushback ; push(string_s);
    quote_ss -> instring for "'" do push(string_sss);
    quote_ss -> code for * do pushback ; emit(STRING);
    quote_d -> instring for * do pushback ; push(string_d);
    quote_dd -> instring for '"' do push(string_ddd);
    quote_dd -> code for * do pushback ; emit(STRING);

    #F-string prefix states

    # The prefixes `u` and `b` are specific to Python 2, and f-strings are only
    # valid for Python 3. Thus, the only potential prefixes are permutations of
    # `f` and `fr` (upper/lowercase notwithstanding).

    code -> maybe_fstring1 for 'f' or 'F' do mark;
    maybe_string1 -> maybe_fstring2 for 'f' or 'F'
    maybe_fstring1 -> maybe_fstring2 for 'r' or 'R'
    maybe_fstring1 or maybe_fstring2 -> fquote_s for "'"
    maybe_fstring1 or maybe_fstring2 -> fquote_d for '"'
    maybe_fstring1 or maybe_fstring2 -> in_identifier for * do pushback;
    fquote_s -> fquote_ss for "'"
    fquote_d -> fquote_dd for '"'
    fquote_s -> instring for * do pushback ; push(fstring_start_s);
    fquote_ss -> instring for "'" do push(fstring_start_sss);
    fquote_ss -> code for * do pushback ; emit(STRING);
    fquote_d -> instring for * do pushback ; push(fstring_start_d);
    fquote_dd -> instring for '"' do push(fstring_start_ddd);
    fquote_dd -> code for * do pushback ; emit(STRING);

    #String states
    instring -> instring for *
    instring -> unicode_or_escape for '\\'
    unicode_or_escape -> unicode_or_raw for 'N'
    unicode_or_raw -> unicode for '{'
    unicode_or_raw -> instring for *
    unicode -> instring for '}'
    unicode -> unicode for *
    unicode_or_escape -> escape for * do pushback;

    escape -> instring for feed do newline;
    escape -> instring for *

    # When inside a parenthesized expression, newlines indicate the continuation
    # of the expression, and not a return to a context where statements may
    # appear. This is captured using the `paren` table.

    code -> code for '(' do emit(LPAR, "("); push(paren);
    code -> code for '[' do emit(LSQB, "["); push(paren);
    code -> code for '{' do emit(LBRACE, "{"); push(paren);
    code -> code for ')' do emit(RPAR, ")");
    code -> code for ']' do emit(RSQB, "]");
    code -> code for '}' do emit(RBRACE, "}");
    code -> code for '`' do emit(BACKQUOTE, '`');

    # Operators
    code -> assign for '=' do mark;
    code -> le for '<' do mark;
    code -> ge for '>' do mark;
    code -> bang for '!' do mark;
    le -> binop for '<'
    le -> code for '>' do emit(OP);
    ge -> binop for '>'
    bang or le or ge or assign -> code for '=' do emit(OP);
    le or ge or assign -> code for * do pushback; emit(OP);
    bang -> code for 'r' or 'a' or 's' or 'd' do emit(CONVERSION);
    code -> colon for ':'
    colon -> code for '=' do emit(COLONEQUAL, ":=");
    colon -> code for * do pushback; emit(COLON, ":");
    code -> code for ',' do emit(COMMA, ",");
    code -> code for ';' do emit(SEMI, ";");
    code -> at for '@' do mark;
    at -> code for '=' do emit(OP);
    at -> code for * do pushback; emit(AT, "@");
    code -> dot for '.' do mark;
    dot -> float for digits
    dot -> code for * do pushback; emit(DOT, ".");
    binop or slash or star or dash -> code for '=' do emit(OP);
    binop or slash or star or dash -> code for * do pushback; emit(OP);
    code -> star for '*' do mark;
    star -> binop for '*'
    code -> slash for '/' do mark;
    slash -> binop for '/'
    code -> dash for '-' do mark;
    dash -> code for '>' do emit(RARROW);
    code -> binop for '+' or '%' or '&' or '|' or '^' do mark;
    code -> code for '~' do emit(OP, '~');

    # Numeric literals

    # Python admits a large variety of numeric literals, and the handling of
    # various constructs is a bit inconsistent. For instance, prefixed zeroes are
    # not allowed in front of integer numerals (unless all digits are between 0
    # and 7, in which case it is treated as an octal number), but _are_ allowed if
    # there is some other context that makes it a float or complex number. Thus,
    # `09` is invalid, but `09.` and `09j` are valid. This means we have to be
    # very careful in what we commit to in our tokenization, hence the rather
    # complicated construction below.

    code -> int for one_to_nine do mark;
    int -> int for digits
    zero or zero_int or binary or octal or int or hex -> code for 'l' or 'L' do emit(NUMBER);
    int -> int_sep for '_'
    int_sep -> int for digits
    int_sep -> error for * do emit(ERRORTOKEN);
    code -> zero for '0' do mark;
    zero -> zero_int for digits
    zero -> zero_int_sep for '_'
    zero_int -> zero_int for digits
    zero_int -> zero_int_sep for '_'
    zero_int_sep -> zero_int for digits
    zero_int_sep -> error for * do emit(ERRORTOKEN);
    zero -> octal for 'o' or 'O'
    octal -> octal for oct_digits
    octal -> octal_sep for '_'
    octal_sep -> octal for oct_digits
    octal_sep -> error for * do emit(ERRORTOKEN);
    zero or octal or hex or binary -> code for * do pushback; emit(NUMBER);
    zero -> binary for 'b' or 'B'
    binary -> binary for '0' or '1'
    binary -> binary_sep for '_'
    binary_sep -> binary for '0' or '1'
    binary_sep -> error for * do emit(ERRORTOKEN);
    zero -> hex for 'x' or 'X'
    hex -> hex for hex_digits
    hex -> hex_sep for '_'
    hex_sep -> hex for hex_digits
    hex_sep -> error for * do emit(ERRORTOKEN);
    zero or zero_int or int -> int_dot for '.'
    zero_int or int -> code for * do pushback; emit(NUMBER);
    int_dot or float -> float for digits
    float -> float_sep for '_'
    float_sep -> float for digits
    float_sep -> error for * do emit(ERRORTOKEN);
    int_dot -> code for * do pushback; emit(NUMBER);
    float or zero or zero_int or int or int_dot -> float_e for 'e'
    float or zero or zero_int or int or int_dot -> float_E for 'E'
    # `1 if 1else 0` is valid syntax, so we cannot assume 'e' always indicates a float.
    float_e -> code for 'l' do pushback; pushback; emit(NUMBER);
    float_e or float_E -> float_E for '+' or '-'
    float_e or float_E or float_x -> float_x for digits
    float_x -> float_x_sep for '_'
    float_x_sep -> float_x for digits
    float_x_sep -> error for * do emit(ERRORTOKEN);
    float or float_x -> code for * do pushback; emit(NUMBER);

    # Identifiers (e.g. names and keywords)
    code -> in_identifier for IDENTIFIER do mark;
    in_identifier -> in_identifier for IDENTIFIER or digits or IDENTIFIER_CONTINUE
    code -> dollar_name for '$' do mark;
    dollar_name -> dollar_name for IDENTIFIER or digits or IDENTIFIER_CONTINUE
    code -> in_identifier for '_' do mark;
    in_identifier -> in_identifier for '_'
    in_identifier -> code for * do pushback; emit(NAME);
    dollar_name -> code for * do pushback; emit(DOLLARNAME);

    # Comments
    code -> line_end_comment for '#' do mark;
    line_end_comment -> code for feed do pushback; emit(COMMENT);
    line_end_comment -> line_end_comment for *
    comment -> whitespace_line for feed do pushback; emit(COMMENT);
    comment -> comment for *
    code -> whitespace_line for feed do emit(NEWLINE, "\n"); newline;
    whitespace_line -> whitespace_line for SPACE or '\t' or '\f'
    whitespace_line -> whitespace_line for feed do newline;
    whitespace_line -> code for * do emit_indent;
    whitespace_line -> comment for '#' do mark;
    code -> code for SPACE or '\t'

    # Line continuations and error states.
    code or float_e or float_E -> error for * do emit(ERRORTOKEN);
    code -> pending_continuation for '\\'
    pending_continuation -> line_continuation for feed do newline;
    line_continuation -> code for * do pushback; mark;
    pending_continuation -> error for * do emit(ERRORTOKEN);
    error -> code for * do pushback;
    code -> code for * do mark; emit(ERRORTOKEN);
    zero or int_dot or zero_int or int or float or float_x -> code for 'j' or 'J' do emit(NUMBER);
}

table paren(default) {
    code -> code for feed do mark; newline;
    code -> code for ')' do emit(RPAR, ")"); pop;
    code -> code for ']' do emit(RSQB, "]"); pop;
    code -> code for '}' do emit(RBRACE, "}"); pop;
}

#String starting with '
table string_s(default) {
    instring -> code for "'" do pop; emit(STRING);
    instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}

#String starting with "
table string_d(default) {
    instring -> code for '"' do pop; emit(STRING);
    instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}

#String starting with '''
table string_sss(default) {
    instring -> string_x for "'"
    instring -> instring for feed do newline;
    string_x -> string_xx for "'"
    string_x -> instring for feed do newline;
    string_x -> instring for * do pushback;
    string_xx -> code for "'" do pop; emit(STRING);
    string_xx -> instring for feed do newline;
    string_xx -> instring for * do pushback;
}

#String starting with """
table string_ddd(default) {
    instring -> string_x for '"'
    instring -> instring for feed do newline;
    string_x -> string_xx for '"'
    string_x -> instring for feed do newline;
    string_x -> instring for * do pushback;
    string_xx -> code for '"' do pop; emit(STRING);
    string_xx -> instring for feed do newline;
    string_xx -> instring for * do pushback;
}

#F-string part common to all fstrings
table fstring_sdsssddd(default) {
    instring -> brace for '{'

    escape -> brace for '{'

    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_MID); push(fstring_expr);
}

#F-string part common to ' and "
table fstring_sd(fstring_sdsssddd) {
    instring -> error for feed do pop; emit(ERRORTOKEN); newline;
}

#F-string start for string starting with '
table fstring_start_s(fstring_sd) {
    instring -> code for "'" do pop; emit(STRING);

    # If this rule is removed or moved to a higher table, the QL tests start failing for unclear reasons.
    # It's identical to a rule in default.
    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_s); push(fstring_expr);
}

#F-string part for string starting with '
table fstring_s(fstring_sd) {
    instring -> code for "'" do pop; emit(FSTRING_END);
}

#F-string start for string starting with "
table fstring_start_d(fstring_sd) {
    instring -> code for '"' do pop; emit(STRING);

    # If this rule is removed or moved to a higher table, the QL tests start failing for unclear reasons.
    # It's identical to a rule in fstring_sdsssddd.
    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_d); push(fstring_expr);
}

#F-string part for string starting with "
table fstring_d(fstring_sd) {
    instring -> code for '"' do pop; emit(FSTRING_END);
}

#F-string part common to ''' and """
table fstring_sssddd(fstring_sdsssddd) {
    instring -> instring for feed do newline;

    string_x -> instring for feed do newline;
    string_x -> instring for * do pushback;

    string_xx -> instring for feed do newline;
    string_xx -> instring for * do pushback;
}

#F-string start for string starting with '''
table fstring_start_sss(fstring_sssddd) {
    instring -> string_x for "'"

    string_x -> string_xx for "'"

    string_xx -> code for "'" do pop; emit(STRING);

    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_sss); push(fstring_expr);
}

#F-string part for string starting with '''
table fstring_sss(fstring_sssddd) {
    instring -> string_x for "'"

    string_x -> string_xx for "'"

    string_xx -> code for "'" do pop; emit(FSTRING_END);
}

#F-string start for string starting with """
table fstring_start_ddd(fstring_sssddd) {
    instring -> string_x for '"'

    string_x -> string_xx for '"'

    string_xx -> code for '"' do pop; emit(STRING);

    brace -> instring for '{'
    brace -> code for * do pushback ; emit(FSTRING_START); pop; push(fstring_ddd); push(fstring_expr);
}

#F-string part for string starting with """
table fstring_ddd(fstring_sssddd) {
    instring -> string_x for '"'

    string_x -> string_xx for '"'

    string_xx -> code for '"' do pop; emit(FSTRING_END);
}

#Expression within an f-string
table fstring_expr(paren) {
    code -> instring for '}' do pop; mark;
    code -> instring for ':' do emit(COLON); push(format_specifier);
    instring -> instring for '}' do pop; mark;
}

fspec_type = 'b' or 'c' or 'd' or 'e' or 'E' or 'f' or 'F' or 'g' or 'G' or 'n' or 'o' or 's' or 'x' or 'X' or '%'
fspec_align = '<' or '>' or '=' or '^'
fspec_sign = '+' or '-' or ' '

table format_specifier(default) {
    instring -> code for '{' do emit(FSTRING_SPEC);
    instring -> instring for '}' do pushback; emit(FSTRING_SPEC); pop;

    code -> instring for '}' do mark;
}


#Special state for when dedents are pending.
table pending_dedent(default) {
    code -> code for * do pop; emit_indent;
}

start: default
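The numeric-literal tables above defer committing to a token kind until a suffix is seen: per the file's own comment, `09` is invalid but `09j` is a valid complex literal. A toy dict-based DFA, with states loosely modeled on the `zero`/`zero_int` states (hand-written for illustration, not generated from the file):

```python
def kind(ch):
    # Character class, analogous to the generated class table.
    if ch == "0":
        return "0"
    if ch.isdigit():
        return "digit"
    if ch in "jJ":
        return "j"
    return "other"

# (state, class) -> next state; anything else falls into "error".
TABLE = {
    ("start", "0"): "zero",
    ("zero", "0"): "zero_int",
    ("zero", "digit"): "zero_int",
    ("zero_int", "0"): "zero_int",
    ("zero_int", "digit"): "zero_int",
    ("zero", "j"): "number",
    ("zero_int", "j"): "number",
}

def scan_number(text):
    """Classify a zero-prefixed numeral: NUMBER if legal, else ERRORTOKEN."""
    state = "start"
    for ch in text:
        state = TABLE.get((state, kind(ch)), "error")
    # "zero" alone is the literal 0; a bare zero-prefixed run is an error.
    return "NUMBER" if state in ("number", "zero") else "ERRORTOKEN"

print(scan_number("09j"), scan_number("09"))  # → NUMBER ERRORTOKEN
```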
25
python/extractor/tokenizer_generator/test.py
Normal file
@@ -0,0 +1,25 @@
from . import test_tokenizer
import sys
from blib2to3.pgen2.token import tok_name

def printtoken(type, token, start, end): # for testing
    token_range = "%d,%d-%d,%d:" % (start + end)
    print("%-20s%-15s%r" %
        (token_range, tok_name[type], token)
    )


def main():
    verbose = sys.argv[1] == "-v"
    if verbose:
        inputfile = sys.argv[2]
    else:
        inputfile = sys.argv[1]
    with open(inputfile, "r") as input:
        t = test_tokenizer.Tokenizer(input.read()+"\n")
        for tkn in t.tokens(verbose):
            printtoken(*tkn)

if __name__ == "__main__":
    main()
172
python/extractor/tokenizer_generator/tokenizer_template.py
Normal file
@@ -0,0 +1,172 @@
'''
Lookup table based tokenizer with state popping and pushing capabilities.
The ability to push and pop state is required for handling parenthesised expressions,
indentation, and f-strings. We also use it for handling the different quotation mark types,
but it is not essential for that, merely convenient.
'''

# Used by encoding_from_source below.
import codecs
import re


class Tokenizer(object):

    def __init__(self, text):
        self.text = text
        self.index = 0
        self.line_start_index = 0
        self.token_start_index = 0
        self.token_start = 1, 0
        self.line = 1
        self.super_state = START_SUPER_STATE
        self.state_stack = []
        self.indents = [0]

    #ACTIONS-HERE

    def tokens(self, debug=False):
        text = self.text
        cls_table = CLASS_TABLE
        id_index = ID_INDEX
        id_chunks = ID_CHUNKS
        max_id = len(id_index)*256
        #ACTION_TABLE_HERE
        state = 0
        try:
            if debug:
                while True:
                    c = ord(text[self.index])
                    if c < 128:
                        cls = cls_table[c]
                    elif c >= max_id:
                        cls = ERROR_CLASS
                    else:
                        b = id_chunks[id_index[c>>8]][(c>>2)&63]
                        cls = (b>>((c&3)*2))&3
                    prev_state = state
                    print("char = '%s', state=%d, cls=%d" % (text[self.index], state, cls))
                    state, transition = action_table[self.super_state[state][cls]]
                    print("%s -> %s on %r in %s" % (prev_state, state, text[self.index], TRANSITION_STATE_NAMES[id(self.super_state)]))
                    if transition:
                        tkn = transition()
                        if tkn:
                            yield tkn
                    else:
                        self.index += 1
            else:
                while True:
                    c = ord(text[self.index])
                    if c < 128:
                        cls = cls_table[c]
                    elif c >= max_id:
                        cls = ERROR_CLASS
                    else:
                        b = id_chunks[id_index[c>>8]][(c>>2)&63]
                        cls = (b>>((c&3)*2))&3
                    state, transition = action_table[self.super_state[state][cls]]
                    if transition:
                        tkn = transition()
                        if tkn:
                            yield tkn
                    else:
                        self.index += 1
        except IndexError as ex:
            if self.index != len(text):
                #Reraise index error
                cls = cls_table[c]
                trans = self.super_state[state]
                action_index = trans[cls]
                action_table[action_index]
                # Not raised? Must have been raised in transition function.
                raise ex
            tkn = self.emit_indent()
            while tkn is not None:
                yield tkn
                tkn = self.emit_indent()
            end = self.line, self.index-self.line_start_index
            yield ENDMARKER, u"", self.token_start, end
            return

    def emit_indent(self):
        indent = 0
        index = self.line_start_index
        current = self.index
        here = self.line, current-self.line_start_index
        while index < current:
            if self.text[index] == ' ':
                indent += 1
            elif self.text[index] == '\t':
                indent = (indent+8) & -8
            elif self.text[index] == '\f':
                indent = 0
            else:
                #Unexpected state. Emit error token
                while len(self.indents) > 1:
                    self.indents.pop()
                result = ERRORTOKEN, self.text[self.token_start_index:self.index+1], self.token_start, here
                self.token_start = here
                self.line_start_index = self.index
                return result
            index += 1
        if indent == self.indents[-1]:
            self.token_start = here
            self.token_start_index = self.index
            return None
        elif indent > self.indents[-1]:
            self.indents.append(indent)
            start = self.line, 0
            result = INDENT, self.text[self.line_start_index:current], start, here
            self.token_start = here
            self.token_start_index = current
            return result
        else:
            self.indents.pop()
            if indent > self.indents[-1]:
                #Illegal indent
                result = ILLEGALINDENT, u"", here, here
            else:
                result = DEDENT, u"", here, here
            if indent < self.indents[-1]:
                #More dedents to do
                self.state_stack.append(self.super_state)
                self.super_state = PENDING_DEDENT
            self.token_start = here
            self.token_start_index = self.index
            return result


ENCODING_RE = re.compile(br'.*coding[:=]\s*([-\w.]+).*')
NEWLINE_BYTES = b'\n'

def encoding_from_source(source):
    'Returns encoding of source (bytes), plus source stripped of any BOM markers.'
    #Check for BOM
    if source.startswith(codecs.BOM_UTF8):
        return 'utf8', source[len(codecs.BOM_UTF8):]
    if source.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16be', source[len(codecs.BOM_UTF16_BE):]
    if source.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16le', source[len(codecs.BOM_UTF16_LE):]
    try:
        first_new_line = source.find(NEWLINE_BYTES)
        first_line = source[:first_new_line]
        second_new_line = source.find(NEWLINE_BYTES, first_new_line+1)
        second_line = source[first_new_line+1:second_new_line]
        match = ENCODING_RE.match(first_line) or ENCODING_RE.match(second_line)
        if match:
            ascii_encoding = match.groups()[0]
            encoding = ascii_encoding.decode("ascii")
            # Handle non-standard encodings that are recognised by the interpreter.
            if encoding.startswith("utf-8-"):
                encoding = "utf-8"
            elif encoding == "iso-latin-1":
                encoding = "iso-8859-1"
            elif encoding.startswith("latin-1-"):
                encoding = "iso-8859-1"
            elif encoding.startswith("iso-8859-1-"):
                encoding = "iso-8859-1"
            elif encoding.startswith("iso-latin-1-"):
                encoding = "iso-8859-1"
            return encoding, source
    except Exception as ex:
        print(ex)
        #Failed to determine encoding -- Just treat as default.
        pass
    return 'utf-8', source