Source code for typped.lexer

# -*- coding: utf-8 -*-
"""

A `Lexer` class instance is a general lexer/scanner/tokenizer.  It was designed
to be used by the `PrattParser` class, but it can also be used for other
lexical scanning applications.

The general purpose of the `Lexer` is to take a string of text and produce a
corresponding sequence of tokens from the text.  The set of possible tokens is
defined by the user, with a string label and a regex pattern that is searched
for in the program text.  Once initialized with text, a `Lexer` instance
sequentially produces tokens with its `next` method.  It is also an iterator,
so it can be used directly in loops, etc.

With some lexers the order in which tokens are defined is significant: they
match regexes from a list of regexes, taking the first match without regard to
the length of the match.  The `Lexer` class was designed to function
independently of the order in which tokens are defined.  The longest match is
always returned, with ties broken by an explicit priority mechanism.  This
allows token definitions to be organized however the programmer wants: they
can all be defined in one place or spread around in the code in any order.

Defining tokens
===============

This section describes the low-level definition of tokens when using `Lexer` as
a standalone application.  To use tokens with a `PrattParser` instance, though,
you need to use the corresponding `def_token` method of the `PrattParser`
class.  That class adds extra attributes, methods, etc., to the tokens.  The
interface is generally the same.

A token for a left parenthesis would be defined like this::

    lex = typped.Lexer()
    lex.def_token("k_lpar", r"lpar")

The string `k_lpar` is a label for the token.  The use of the string prefix
"`k_`" is a naming convention for token labels.  An identifier token could be
defined like this::

    lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)

Notice that in the definition of the identifier the keyword argument `on_ties`
is set to -1.  The lexer will by default always choose the longest string which
matches a defined regex pattern for a token.  If there is a tie then, by
default, an exception will be raised.  The `on_ties` value is used to break
ties; strings of the same length are sorted by that value and the
highest-priority string is chosen.  The default `on_ties` value is `0`.  Suppose
you also wanted a token for the string `mod`, and defined it as::

    lex.def_token("k_mod", r"mod")

Since the `k_mod` token has the default `on_ties` value of `0`, which is higher
than the identifier token's `-1`, it always takes precedence over the identifier
token when both patterns match text of the same length.
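
For example, here is a minimal sketch of how the tie-breaking plays out when
scanning text (the `def_default_whitespace` helper and the
`default_begin_end_tokens` flag are described later in this docstring)::

    lex = typped.Lexer(default_begin_end_tokens=True)
    lex.def_default_whitespace()
    lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
    lex.def_token("k_mod", r"mod")

    lex.set_text("mod modulus")
    print(lex.next().token_label)  # k_mod: same length as the identifier match,
                                   # but wins the tie on its higher on_ties value.
    print(lex.next().token_label)  # k_identifier: "modulus" is the longer match.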

Begin and end tokens
====================

The lexer uses sentinel begin- and end-tokens to mark the beginning and the end
of the token sequence for the text.  These tokens must be explicitly defined
(i.e., given string labels) either by calling `def_begin_end_tokens`::

    lex.def_begin_end_tokens("k_begin", "k_end")

or else by setting the `default_begin_end_tokens` flag to `True` when
initializing the lexer::

    lex = Lexer(default_begin_end_tokens=True)

The default tokens have the labels `k_begin` and `k_end`.

The begin-token is never explicitly returned.  After the call to `set_text` to
define the text to tokenize, and before any calls to `next`, the begin-token is
the current token `lex.token`.  So `lex.token` and `lex.peek(0)` would both
return the begin-token.

After the end of the text the `next` method explicitly returns one end-token.
Calling `next` again raises `StopIteration` and halts the lexing of the
currently-set text.  All peeks beyond the end of the text are reported as
end-tokens.
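
As a short sketch (assuming a lexer `lex` with begin- and end-tokens and an
identifier token already defined), the behavior at the two ends of the token
stream looks like this::

    lex.set_text("x")
    assert lex.token.is_begin_token()    # Begin-token is current before any next().
    assert lex.peek(0).is_begin_token()  # peek(0) also shows the current token.

    lex.next()                           # Returns the token for "x".
    assert lex.next().is_end_token()     # Exactly one end-token is returned.
    try:
        lex.next()                       # Any further next() raises StopIteration.
    except StopIteration:
        pass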

Using the lexer
===============

This is a simple example of using the lexer.  Notice that multiple token
definitions can be combined using the `def_multi_tokens` method (it is usually
more convenient, however, to simply define a shorter alias for the `def_token`
call)::

    lex = Lexer()

    lex.def_begin_end_tokens("k_begin", "k_end")
    lex.def_token("k_space", r"[ \\t]+", ignore=True) # note + NOT *
    lex.def_token("k_newline", r"[\\n\\f\\r\\v]+", ignore=True) # note + NOT *
    tokens = [
        ("k_identifier", r"[a-zA-Z_](?:\w*)"),
        ("k_plus", r"\+"),
        ]
    lex.def_multi_tokens(tokens)

    lex.set_text("x  + y")

    for t in lex:
        print(t)

The result is as follows:
::

    <k_identifier,'x'>
    <k_plus,'+'>
    <k_identifier,'y'>
    <k_end,None>

Notice that the end-token is actually returned, but the begin-token is not.
The method `def_default_whitespace` could alternatively be used to define the
whitespace tokens.
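
For instance, this sketch sets up the same tokens using `def_default_whitespace`
(which defines ignored `k_space` and `k_newline` tokens) and the
`default_begin_end_tokens` initialization flag::

    lex = Lexer(default_begin_end_tokens=True)
    lex.def_default_whitespace()
    lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)")
    lex.def_token("k_plus", r"\+")

    lex.set_text("x  + y")
    print([t.token_label for t in lex])
    # ['k_identifier', 'k_plus', 'k_identifier', 'k_end']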

User-accessible methods and attributes of `Lexer`
=================================================

The lexer class has many utility methods and user-accessible attributes.  Some
of the main ones are listed here, and a short usage sketch follows the lists
below.  One of the most commonly-accessed attributes of a lexer `lex` is the
current token, `lex.token`.

General methods:

* `next` --- return the next token
* `peek` --- peek at the next token without consuming it
* `go_back` --- go back in the text stream by some number of tokens

Helper methods:

* `match_next` --- matches the specified token, with various options
* `in_ignored_tokens` --- test if some particular token was ignored before the current one
* `no_ignored_after` --- true if no ignored tokens immediately follow current token
* `no_ignored_before` --- true if no ignored tokens immediately precede the current token

Some boolean-valued informational methods:

* `curr_token_is_first` --- true if the current token is the first returned
* `text_is_set` --- true only when text is currently set for scanning

Other attributes:

* `token` --- the current token (the most recent one returned by `next`)
* `all_token_count` --- number of tokens read since the text was set (begin- and end-tokens not counted)
* `non_ignored_token_count` --- number of non-ignored tokens read since the text was set
* `default_helper_exception` --- the default exception for helpers like `match_next`
* `text_is_set` --- whether or not text has been set for the lexer
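
Here is a short usage sketch combining some of these methods (assuming a lexer
`lex` set up with the identifier and plus tokens and ignored whitespace, as in
the example above)::

    lex.set_text("x + y")

    print(lex.token.is_begin_token())   # True: nothing has been consumed yet.
    print(lex.peek().token_label)       # k_identifier; peek does not consume.

    lex.next()                          # Consume the identifier token for "x".
    if lex.match_next("k_plus"):        # Consumes the "+" token if it matches.
        print("matched a plus")

    lex.go_back(1)                      # Undo the match; "+" is upcoming again.
    print(lex.non_ignored_token_count)  # 1: counts include the buffer, and only
                                        # the token for "x" has been re-scanned.
    print(lex.peek().token_label)       # k_plus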

TODO, list more, and why not make some of these methods of `TokenNode` instead?

User-accessible attributes of tokens
====================================

The tokens returned by the lexer are instances of a subclass of the class
`TokenNode` (named that since the parser combines them into the nodes of a
parse tree).  The subclasses themselves represent the general kind of token,
for example if `k_identifier` was defined as a token label then a particular
subclass of `TokenNode` would be created to represent identifiers in general.
The particular instances of identifiers, found in the lexed text with their
actual string values, are represented by instances of the general class for
identifiers.
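
As a small sketch (using the lexer from the earlier example, with `TokenNode`
imported from `typped.lexer`), two identifier instances share the same
generated subclass while tokens of different kinds do not::

    lex.set_text("x + y")
    x_tok = lex.next()
    plus_tok = lex.next()
    y_tok = lex.next()

    print(type(x_tok) is type(y_tok))     # True: same kind of token, same subclass.
    print(type(x_tok) is type(plus_tok))  # False: different token labels.
    print(isinstance(x_tok, TokenNode))   # True: every token subclasses TokenNode.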

User-accessible methods of tokens:

* `is_begin_token` --- true when the token is the begin-token
* `is_end_token` --- true when the token is the end-token
* `is_begin_or_end_token` --- true when the token is either the begin- or end-token
* `ignored_before_labels` --- the labels of the tokens ignored just before this one

For a token named `t`, these attributes are available (a short access sketch
follows the list):

* `t.token_label` --- the string label of the token (which was defined with it)
* `t.value` --- the string value for the token, found in the lexed text
* `t.is_first` --- true iff this is the first non-begin token in the text
* `t.is_first_on_line` --- true iff this is the first token returned for a line
* `t.parent` --- can be set to the parent in a tree; set by the lexer to `None`
* `t.children` --- can be set to a list of children; set by the lexer to `[]`
* `t.original_matched_string` --- the original text that was consumed for this token
* `t.line_and_char` --- tuple of line number and character where the token started
* `t.char_index_in_program` --- the index of this token into the text set via `set_text`
* `t.ignored_before` --- a tuple of all tokens ignored immediately before this one
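
A small sketch of accessing some of these attributes (again assuming the
whitespace-ignoring lexer from the earlier example)::

    lex.set_text("x  + y")
    for t in lex:
        if t.is_end_token():
            break
        print(t.token_label, repr(t.value), t.line_and_char,
              t.ignored_before_labels())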

TODO, list other methods, too.

Initialization options
======================

There are several options that can be set on initialization, including the
level of token lookahead that is supported.
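
As a minimal sketch, here are some of the keyword options accepted by the
initializer (they correspond to the parameters of `Lexer.__init__` in the code
below)::

    lex = Lexer(max_peek_tokens=2,      # Limit peek lookahead to two tokens.
                max_deque_size=20,      # Bound the token buffer; limits go_back.
                default_begin_end_tokens=True)  # Use default k_begin/k_end labels.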

TODO

Code
====

"""

# TODO: implement move_back, the equivalent of push_back except you think of it
# as moving in the token buffer rather than pushing back a token (it is still a
# peek token).

from __future__ import print_function, division, absolute_import

# Run tests when invoked as a script.
if __name__ == "__main__":
    import pytest_helper
    pytest_helper.script_run(["../../test/test_pratt_parser.py",
                              "../../test/test_matcher.py",
                              "../../test/test_lexer.py",
                              ],
                              pytest_args="-v")

import collections
from .shared_settings_and_exceptions import LexerException, is_subclass_of
from .matcher import Matcher

#
# TokenNode
#

class TokenNode(object):
    """The base class for token objects.  Each different kind of token is
    represented by a subclass of this class.  Instances of the tokens in the
    program text are represented by instances of the subclass for that kind
    of token.

    The attribute `token_label` is the string token label for the kind of
    token represented by an instance.  The attribute `value` is set to the
    actual string value in the lexed text which matched the regex of the
    token.  The attribute `ignored_before` is a tuple of all tokens ignored
    just before the lexer got this token.

    The attribute `children` is a list of the child nodes, and `parent` is the
    parent.  Indexing a `TokenNode` class also returns the corresponding child
    node, i.e. `t_node[0]` would be the leftmost child."""

    token_label = None # A label for subclasses representing kinds of tokens.
    original_matched_string = "" # Default, for begin- and end-tokens.

    def __init__(self):
        """Initialize the TokenNode."""
        self.ignored_before = [] # Values ignored by lexer just before.
        self.value = None # The actual parsed text string for the token.
        self.children = [] # Args to functions are their children in parse tree.
        self.parent = None # The parent in a tree of nodes.
    def original_text(self):
        """Return the original text that was read in lexing the token,
        including any ignored text."""
        #ignored_strings = [ s.value for s in self.ignored_before ]
        #joined = "".join(ignored_strings) + self.value
        #assert joined == self.original_matched_string # A debugging test.
        return self.original_matched_string

    def ignored_before_labels(self):
        """Return the list of token labels of tokens which were ignored just
        before this token."""
        return [t.token_label for t in self.ignored_before]

    def append_children(self, *token_nodes):
        """Append all the arguments as children, also setting their parent to
        self."""
        for t in token_nodes:
            self.children.append(t)
            t.parent = self

    def __getitem__(self, index):
        """Use indexing to access the children.  No check is made, however,
        to see if the correct number of children exist."""
        return self.children[index]
    def convert_to_AST(self, convert_TokenNode_to_AST_node_fun):
        """Call this on the root node.  Converts the token tree to an abstract
        syntax tree.  This basically converts the nodes one-to-one to a more
        convenient type of node for the AST of a given application.  The
        function `convert_TokenNode_to_AST_node_fun` should take one argument,
        a `TokenNode` instance, and return an AST node instance for the
        corresponding AST node.  The only requirement for the AST nodes is
        that they have a method called `append_children`.  The `ast_data`
        attribute of a node can be used to save information useful in the
        transformation."""
        ast_node = convert_TokenNode_to_AST_node_fun(self)
        for child in self.children:
            ast_node.append_children(
                    child.convert_to_AST(convert_TokenNode_to_AST_node_fun))
        return ast_node
    #
    # Informational methods.
    #

    def is_begin_token(self):
        """Test whether this token is the begin-token."""
        return self.token_label == self.token_table.begin_token_label

    def is_end_token(self):
        """Test whether this token is the end-token."""
        return self.token_label == self.token_table.end_token_label

    def is_begin_or_end_token(self):
        """Test whether this token is either the begin- or end-token."""
        return self.is_begin_token() or self.is_end_token()
    #
    # Various representations.  Note a token can be considered a node or a subtree.
    # Coming straight from the Lexer, though, they do not yet have any children.
    #

    def traditional_repr(self):
        """Representation as a string that looks like class initialization."""
        return "TokenNode()"

    def value_repr(self):
        """Token representation as its value."""
        return str(self.value)

    def label_repr(self):
        """Token representation as its token label."""
        return str(self.token_label)
    def summary_repr(self):
        """Token representation as a summarizing string containing both the
        label and the value."""
        value_str = str(self.value)
        if isinstance(self.value, str):
            value_str = "'" + value_str + "'"
        return "<{0},{1}>".format(self.token_label, value_str)

    def tree_repr(self, indent=0):
        """Token representation as the root of a parse subtree, with formatting.
        The optional `indent` parameter can be either an indent string or else
        an integer for the number of spaces to indent."""
        try:
            num_indent = int(indent)
        except ValueError:
            pass
        else:
            indent = " " * num_indent
        string = indent + self.summary_repr() + "\n"
        for c in self.children:
            string += c.tree_repr(indent=indent+" "*4)
        return string

    def string_tree_repr(self, only_vals=False, only_labels=False):
        """Token representation as the root of a parse subtree, in a string format.
        This is the default representation, used for `__repr__`."""
        string = self.summary_repr()
        if only_vals:
            string = self.value_repr()
        if only_labels:
            string = self.label_repr()
        if self.children:
            string += "("
            string += ",".join(c.string_tree_repr() for c in self.children)
            string += ")"
        return string

    def old_repr(self):
        """This old representation is kept *only* because it is used in some tests."""
        if self.token_label == "k_number":
            return "[literal {0}]".format(self.value)
        if self.token_label == "k_lpar":
            if self.children:
                return "[k_lpar {0} k_rpar]".format(self.children[0].old_repr())
            else:
                return "[literal k_lpar]"
        else:
            str_val = "[" + str(self.value)
            for a in self.children:
                str_val += " " + a.old_repr()
            str_val += "]"
            return str_val

    __repr__ = string_tree_repr
def basic_token_subclass_factory():
    """Create and return a new token subclass representing tokens with label
    `token_label`.  This function is called from the `_create_token_subclass`
    method of `TokenTable` when it needs to create a new one to start with.
    This function **should not be called directly**, since additional attributes
    (such as the token label and a new subclass name) also need to be added to
    the generated subclass.

    This function is the default argument to the `token_subclassing_fun`
    keyword argument of the initializer for `TokenTable`.  Users can define
    their own such function in order to add methods to token objects which are
    particular to their own application (the `PrattParser` class does this, for
    example).

    Note that using a separate subclass for each token label allows for
    attributes and methods specific to a kind of token to be pasted onto the
    class itself without conflicts.  For example, the `PrattParser` subclass
    adds head handler and tail handler methods which are specific to a given
    token label."""
    # If we instead used a metaclass to generate the token subclasses instead
    # of a factory then it would be possible to define a __repr__ that controls
    # how the token-representing classes themselves are printed (they are ugly
    # now).  How much would this complicate things for users who wanted to
    # create their own factory?  Kind of an advanced topic for many people.  If
    # they could simply declare a metaclass that would be OK, but might need
    # args passed, etc.  See version in PrattParser module.
    class TokenSubclass(TokenNode):
        """This is the class returned by the factory function."""
        def __init__(self, value):
            super(TokenSubclass, self).__init__() # Call base class __init__.
            self.value = value # Passed in for instances by the Lexer token generator.
    return TokenSubclass

#
# Token table.
#
[docs]class TokenTable(object): """A symbol table holding subclasses of the `TokenNode` class for each token label defined in a `Lexer` instance. Also has methods for operating on tokens. Each `Lexer` instance contains an instance of this class to save the subclasses for the kinds of tokens which have been defined for it.""" def __init__(self, token_subclass_factory_fun=basic_token_subclass_factory, pattern_matcher_instance=None): """Initialize the token table. The parameter `token_subclass_factory_fun` can be passed a function to be used to generate token subclasses, taking a token label as an argument. The default is `basic_token_subclass_factory`. The parameter `pattern_matcher_instance` can be passed an empty pattern matcher instance, which will be used instead of the default one. In this way users can define their own matchers, or pass in whatever options they choose to the initializer of the default one.""" self.token_subclassing_fun = token_subclass_factory_fun self.token_subclass_dict = {} self.lex = None # The lexer currently associated with this token table. self.begin_token_label = None self.begin_token_subclass = None self.end_token_label = None self.end_token_subclass = None if pattern_matcher_instance is None: pattern_matcher_instance = Matcher() self.pattern_matcher = pattern_matcher_instance def __contains__(self, token_label): """Test whether a token subclass for `token_label` has been stored.""" return token_label in self.token_subclass_dict def __getitem__(self, token_label): """Look up the subclasses of base class `TokenNode` corresponding to `token_label` in the token table and return it. Raises a `LexerException` if no subclass is found for the token label.""" if token_label in self.token_subclass_dict: TokenSubclass = self.token_subclass_dict[token_label] else: raise LexerException("No token with label '{0}' is in the token table." .format(token_label)) return TokenSubclass def _create_token_subclass(self, token_label, store_in_dict=True): """Create a subclass for tokens with label `token_label` and store it in the token table. Return the new subclass. Raises a `LexerException` if a subclass for `token_label` has already been created. If `store_in_dict` is `False` then the token is not stored.""" if token_label in self.token_subclass_dict: raise LexerException("In `_create_token_subclass`, already created the" " token subclass for token_label '{0}'.".format(token_label)) # Create a new token subclass for token_label and add some attributes. TokenSubclass = self.token_subclassing_fun() TokenSubclass.token_label = token_label TokenSubclass.__name__ = "TokenClass_" + token_label # For debugging. # Store the newly-created subclass in the token_dict. if store_in_dict: self.token_subclass_dict[token_label] = TokenSubclass return TokenSubclass
[docs] def undef_token_subclass(self, token_label): """Un-define the token with label token_label. The `TokenNode` subclass previously associated with that label is removed from the dictionary.""" try: del self.token_subclass_dict[token_label] except KeyError: return # Not saved in dict, ignore.
[docs] def undef_token(self, token_label): """Undefine the token corresponding to `token_label`.""" # Remove from the list of defined tokens and from the token table. self.undef_token_subclass(token_label) self.pattern_matcher.undef_pattern(token_label)
[docs] def def_token(self, token_label, regex_string, on_ties=0, ignore=False, matcher_options=None): """Define a token and the regex to recognize it. Returns the new token subclass. The label `token_label` is the label for the kind of token. The label `regex_string` is a Python regular expression defining the text strings which match for the token. If `regex_string` is set to `None` then a dummy token will be created which is never searched for in the lexed text. To better catch errors it does not have a default value, so setting it to `None` must be done explicitly. Setting `ignore=True` will cause all such tokens to be ignored (except that they will be placed on the `ignored_before` list of the non-ignored token that they precede). In case of ties for the longest match in scanning, the integer `on_ties` values are used to break the ties. If any two are still equal an exception will be raised. The `option` parameter takes a string value, which is then passed to the `insert_pattern` method of whatever matcher is being used.""" if token_label in self: raise LexerException("A token with label '{0}' is already defined. It " "must be undefined before it can be redefined.".format(token_label)) if regex_string is not None: self.pattern_matcher.insert_pattern(token_label, regex_string, on_ties, ignore=ignore, matcher_options=matcher_options) # Initialize and return a bare-bones, default token_subclass. tok = self._create_token_subclass(token_label) tok.token_table = self return tok
[docs] def def_begin_token(self, begin_token_label): """Define the begin-token. The lexer's `def_begin_end_tokens` method should usually be called instead.""" tok = self.def_token(begin_token_label, None) self.begin_token_label = begin_token_label self.begin_token_subclass = tok return tok
[docs] def def_end_token(self, end_token_label): """Define the end-token. The `def_begin_end_tokens` method should usually be called instead.""" tok = self.def_token(end_token_label, None) self.end_token_label = end_token_label self.end_token_subclass = tok return tok
[docs] def get_next_token_label_and_value(self, program, prog_unprocessed_indices, ERROR_MSG_TEXT_SNIPPET_SIZE): """Return the next token label for the start of the current program text, as in the string `program` and indexed by the numbers in the ordered-pair tuple `prog_unprocessed`.""" return self.pattern_matcher.get_next_token_label_and_value( program, prog_unprocessed_indices, ERROR_MSG_TEXT_SNIPPET_SIZE)
[docs] def ignored_tokens(self): """Return the set of ignored tokens.""" return self.pattern_matcher.ignore_tokens
#
# Lexer
#
[docs]class TokenBuffer(object): """An abstraction of the token buffer. This is used internally by the `Lexer` class and should not usually be accessed by users. It is basically a nice wrapper over an underlying deque, but this is complicated by the need to save persistent state pointers into the buffer even in fixed-size buffers when tokens at the front get dropped. Previous tokens are stored in the same deque as the current token and any lookahead tokens. The default indexing is relative to the current token, at `current_offset`, which is zero for the current token. (The current offset is itself relative to a reference point, but users do not need to know that detail).""" def __init__(self, token_getter_fun, max_peek=-1, max_deque_size=-1,): """Initialize the buffer. If `max_deque_size` equals -1 then the size is unlimited (`None` is not used so that Cython has a single type). Similarly for `max_peek`, the maximum peek lookahead allowed.""" self.token_getter_fun = token_getter_fun self.max_deque_size = max_deque_size self.max_peek = max_peek # Any popleft operations are done explicitly, so maxlen=None. self.token_buffer = collections.deque(maxlen=None) # Indices are relative to current_offset, and current_offset is # relative to reference_point. This is because a fixed-size deque can # drop elements. The current_offset is not relative to 0 because then # a saved offset would become invalid when items are popped off the # left of the deque. The reference point is decremented for every # deque element popped off the left, and at no other time. That way, # the current offset can be saved and remain valid until the # `TokenBuffer` is reset (though the referenced item may get deleted). self.current_offset = 0 self.reference_point = 0
[docs] def reset(self, begin_token): """Initialize the token buffer, or clear an reset it. Any saved offsets are no longer valid, but no check is made for that.""" self.current_offset = 0 self.reference_point = 0 self.token_buffer.clear() self._append(begin_token)
[docs] def state_to_offset(self, state): """Return the offset into the current deque that corresponds to what was the offset (absolute index to the current token) at the time when the state was saved.""" return state - self.reference_point
[docs] def get_state(self): """Return a buffer state indicator that can be returned to later. The `go_back` or `push_back` methods of the lexer use this.""" return self._offset_to_absolute(self.current_offset)
def __getitem__(self, index): """Index the buffer relative to the current offset. Zero is the current token. Negative indices go back in the buffer. They **do not** index from the end of the buffer, as with ordinary Python indexing.""" # Slices are CURRENTLY NOT allowed. What should they return? How # do you know where the zero point is? #if isinstance(index, slice): # Handle slices recursively. # start = index.start # stop = index.stop # step = index.step # # The indices call returns the above three in a tuple. # return [self[i] for i in range(*index.indices(len(self)))] if isinstance(index, int): # The ordinary case. if self.max_peek != -1 and index > self.max_peek: raise LexerException("User-set maximum peeking level of {0} was" " exceeded.".format(self.max_peek)) while self._index_to_absolute(index) >= len(self.token_buffer): if self.token_buffer[-1].is_end_token(): return self.token_buffer[-1] self._append() return self.token_buffer[self._index_to_absolute(index)] else: raise TypeError("Invalid argument type in __getitem__ of TokenBuffer.")
[docs] def num_saved_previous_tokens(self): """Return the number of tokens before the current token that are saved.""" return self._offset_to_absolute(self.current_offset)
[docs] def num_tokens_after_current(self): """An informational method. Returns the number of tokens from the current token to the end of the token buffer. Some may have been read past the position of the current token due to peeks or pushbacks.""" begin_point = self._offset_to_absolute(self.current_offset) + 1 return len(self.token_buffer) - begin_point
[docs] def move_forward(self, num_toks=1): """Move the current token (i.e., the offset) forward by one. This is the token buffer's equivalent of `next`, except that it returns previously-buffered tokens if possible. The `Lexer` method `next` should always be called by users of that class, because it also handles some other things. Attempts to move past the first end-token leave the current offset at the first end-token. No new tokens are added to the buffer. The end-token is returned.""" for _ in range(num_toks): if self[0].is_end_token(): break self.current_offset += 1 self._fill_to_current_offset() return self[0]
[docs] def move_back(self, num_toks=1): """Move the current token (i.e., offset) back `num_toks` tokens. Will always stop at the begin-token. Users should check the condition if it matters. If the move attempts to move back to before the currently-saved tokens, but the begin-token is no longer saved, then a `LexerException` is raised.""" self.current_offset -= num_toks absolute_index = self._offset_to_absolute(self.current_offset) if absolute_index < 0: self.current_offset += abs(absolute_index) curr_token = self[0] if not curr_token.is_begin_token(): raise LexerException("Not enough saved tokens to move back to the" " begin-token in the `move_back` method.") else: curr_token = self[0] return curr_token
# # Internal utility methods below. # def _index_to_absolute(self, index): """Convert an index into an absolute index into the current deque. Note that any changes to the current offset or to the reference point (the latter via _append) will invalidate the absolute reference. In those cases it will need to be re-calculated.""" return index + self.current_offset + self.reference_point def _offset_to_absolute(self, offset): """Convert an offset into an absolute index into the current deque. Note that calls to `_append` can modify the reference point and invalidate the absolute index. Needs to be re-calculated after such a call.""" return offset + self.reference_point def _fill_to_current_offset(self): """If the current offset points past the end of the token buffer then get tokens and append them until it is a valid index. Note this calls `_append`. If the current offset is past the first end token in the text then the current offset point is reset to the first end token. Only one end token is ever stored in the buffer.""" while self._offset_to_absolute(self.current_offset) >= len(self.token_buffer): self._append() # Note this call can change self.reference_point. def _pop(self): """Users should not call. Pop off the rightmost item and return it. Moves the current token backward if necessary.""" retval = self.token_buffer.pop() if self._offset_to_absolute(self.current_offset) >= len(self.token_buffer): self.current_offset -= 1 return retval def _append(self, tok=None): """Append to buffer and fix current index if necessary. Users should not call. If `tok` is not set then the token to append is obtained from the `token_getter_fun` function. Note that this is the **only** method that ever gets tokens directly from the token getter function.""" # TODO: should this fill with end tokens, or stop at the first end token? if tok is None: tok = self.token_getter_fun() # TODO: below causes a FAIL with go_back hanging... probably __getitem__ # calling but not compensating for this behavior... if tok.is_end_token() and self.token_buffer[-1].is_end_token(): return tok self.token_buffer.append(tok) if (self.max_deque_size != -1 and len(self.token_buffer) > self.max_deque_size): self.reference_point -= 1 self.token_buffer.popleft() # Do an explicit popleft. if self._offset_to_absolute(self.current_offset) < 0: raise LexerException("Error in TokenBuffer:" " Maximum buffer size is too small for the amount of peeking." " Current token was deleted.") assert len(self.token_buffer) == self.max_deque_size
class GenTokenState(object):
    """The state of the token_generator program execution."""
    ordinary = 1
    end = 2
    uninitialized = 3

# The beginnings of a state tuple for the Lexer.  NOT YET USED AT ALL, but it would
# be more elegant than the current ad hoc state restoration approach.
LexerState = collections.namedtuple("LexerState", [
                           "x",
                           "y",
                           ])

# TODO: Consider if it is a good idea to have a method like `next_raw` which
# would just return tokens for raw characters.  This would be useful in
# parsing, say, C-style comments using the parser rather than a complicated
# regex.  Would need a special token kind to return for it.  This effectively
# modifies the token set scanned for, and would require flushing the buffer
# with go_back.  Other than that it should work.  Not especially efficient to
# create a token for each char, but still linear in text size.
[docs]class Lexer(object): """Scans text and returns tokens, represented by instances of `TokenNode` subclass instances. There is one subclass for each kind of token, i.e., for each token label. These subclasses themselves are assumed to have been created before any scanning operation, via the `def_token` method. Token sequences are assumed to have both a begin-token and an end-token sentinel, defined via the `def_begin_end_tokens` method. Exactly one end-token will be returned by `next`; any further calls to `next` raise `StopIteration`. The scanning is independent of the order in which tokens are defined. The longest match over all token patterns will always be the one selected. In case of ties the `on_ties` value (passed to `def_token`) is used to break it. If that fails a `LexerException` is raised. If no token table is passed into `__init__` the `Lexer` will create its own empty one.""" ERROR_MSG_TEXT_SNIPPET_SIZE = 40 # Number of characters to show for context. DEFAULT_BEGIN = "k_begin" # Default label for begin-token. DEFAULT_END = "k_end" # Default label for end-token. # # Initialization methods # def __init__(self, token_table=None, max_peek_tokens=None, max_deque_size=None, default_begin_end_tokens=False, final_mod_function=None): """Initialize the Lexer. Optional arguments set the `TokenTable` to be used (default creates a new one), the maximum number of lookahead tokens (default is no fixed maximum), and the maximum deque size, which determines how far `go_back` operations will work (the default is unlimited). If `default_begin_end_tokens` is true then begin- and end-tokens will be defined using the default token labels. By default, though, the user must call the `def_begin_end_tokens` method to define the begin and end tokens (using whatever labels are desired). If `final_mod_function` is passed a function taking a two arguments then any time a token instance is created by the lexer that function will be called with the parser and the token itself as the two arguments. It should return the modified token.""" self.reset(token_table=token_table, max_peek_tokens=max_peek_tokens, max_deque_size=max_deque_size, default_begin_end_tokens=default_begin_end_tokens, final_mod_function=final_mod_function)
[docs] def reset(self, token_table=None, max_peek_tokens=None, max_deque_size=None, default_begin_end_tokens=False, final_mod_function=None): """Return the lexer to the initial state. Takes the same arguments as the initializer.""" self.text_is_set = False self.token = None #self.all_token_count = None if token_table is None: token_table = TokenTable() self.set_token_table(token_table) if max_deque_size is None: max_deque_size = -1 if max_peek_tokens is None: max_peek_tokens = -1 self.token_buffer = TokenBuffer(self._unbuffered_token_getter, max_peek=max_peek_tokens, max_deque_size=max_deque_size) # These line and char numbers are for raw, unprocessed tokens, not the # buffered ones. Use the values set with tokens as the token attribute # line_and_char (such as for the current token) for that info. self.raw_linenumber = 1 # The line number currently being read. self.upcoming_raw_charnumber = 1 # Char number of first char of upcoming token. self.upcoming_raw_total_chars = 1 # Like above, but total num., not on line. self.all_token_count = 0 self.token_generator_state = GenTokenState.uninitialized if default_begin_end_tokens: self.def_begin_end_tokens(self.DEFAULT_BEGIN, self.DEFAULT_END) self.final_mod_function = final_mod_function # The default exception raised by methods like `match_next`. self.default_helper_exception = LexerException
[docs] def set_token_table(self, token_table, go_back=0): """Sets the current `TokenTable` instance for the lexer to `token_table`. This is called on initialization, but can also be called at any time. If text is being scanned at the time then it flushes the current and lookahead tokens and re-scans the current token. When set with this method the token table is always given the attribute `lex`, which points to the lexer instance that this method was called from. This attribute is used by tokens (which know their fixed symbol table) so they can find the current lexer (to call `next`, etc.)""" self.token_table = token_table token_table.lex = self if self.text_is_set: self.go_back(0) # Re-scan the current token.
[docs] def set_text(self, program, reset_linenumber=True, reset_charnumber=True): # TODO: redefine to take a TextStream. Be sure to also pass back position # info with the returned text so that tokens have have their line/position # of origin pasted onto them..... or at least keep track in generating # tokens. """Users should call this method to pass in the program text (or other text) which is to be lexically scanned. The parameter `program` should be a string.""" if not (self.token_table.begin_token_label and self.token_table.end_token_label): raise LexerException("Begin and end tokens must be defined by calling" " `def_begin_end_tokens` before set_text can be called.") self.already_returned_end_token = False self._curr_token_is_first = False # Is curr token first non-ignored in text? self._returned_first_token = False # Reset line, character, and token counts. All counts include the buffer. if reset_linenumber: self.raw_linenumber = 1 if reset_charnumber: self.upcoming_raw_charnumber = 1 self.upcoming_raw_total_chars = 1 self.all_token_count = 0 # Count all actual tokens (not begin and end). self.non_ignored_token_count = 0 # Count non-ignored actual tokens. self.program = program # The program text currently being scanned/lexed. # The prog_unprocessed list holds slice indices for the unprocessed part # of the program text. The go_back routine can modify this. self.prog_unprocessed = [0, len(self.program)] # The unprocessed slice. self.token_generator_state = GenTokenState.ordinary # Set up the token buffer. self._initialize_token_buffer() self.token = self.token_buffer[0] # Last token returned; begin-token here. self.text_is_set = True
def _initialize_token_buffer(self): """A utility routine to initialize (fill) the token buffer. The `token_buffer[0]` slot is the current token. The current token will be set to the begin-token after this routine runs (since no tokens have yet been read with `next`). Any tokens in the buffer past the first end-token are also set to end-tokens. The size of the token buffer is `self.NUM_LOOKAHEAD_TOKENS` plus one for the current token. For two-token lookahead the buffer deque has the form: [<current_token>, <peek1>, <peek2>] """ begin_tok = self.token_table.begin_token_subclass(None) # Get instance. if self.final_mod_function: begin_tok = self.final_mod_function(self, begin_tok) tb = self.token_buffer tb.reset(begin_tok) # Begin token set as current; first next() returns it. self.token = tb[0] assert tb[0].is_begin_token() # DEBUG check, remove later # # Next and peek related methods #
[docs] def next(self, num=1): """Return the next token, consuming from the token stream. Also sets `self.token` to the return value. Returns one end-token and raises `StopIteration` on a `next` after that end-token. If `num` is greater than one a list of the tokens is returned. This list is cut short if the first end-token is encountered, so this kind of `next` call will never generate `StopIteration`.""" if not self.text_is_set: raise LexerException( "Attempt to call lexer's next method when no text is set.") if self.already_returned_end_token: self.text_is_set = False raise StopIteration # Handle num > 1 case with recursion. if num > 1: ret_list = [] for _ in range(num): if not self.token.is_end_token(): ret_list.append(self.next()) else: break return ret_list # Handle ordinary case. tb = self.token_buffer self.token = tb.move_forward() if self.token.is_end_token(): self.already_returned_end_token = True return self.token
#__next__ = next # For Python 3. def __next__(self): # NOTE: Cython needs this wrapper (instead of assignment) or else return self.next() # it doesn't think Lexer is an iterator. But extra overhead. def __iter__(self): return self # Class provides its own __next__ method.
[docs] def peek(self, num_toks=1): """Peek ahead in the token stream without consuming any tokens. The argument `num_toks` is the number of tokens ahead to peek. The default peek of `num_toks=1` peeks at the token just beyond the current token. Peeking zero shows the current token. Negative peeks are allowed, and look back at the previous tokens (up to the number saved in the token buffer). Tokens are read into the buffer on-demand to satisfy any requested peek. If `max_peek_tokens` is set then an exception will be raised on attempts to peek farther than that.""" if not self.text_is_set: raise LexerException( "Attempt to call lexer's peek method when no text is set.") try: retval = self.token_buffer[num_toks] except IndexError: # Shouldn't happen. raise BufferIndexError return retval
[docs] def move_back(self, num_toks=1, num_is_raw=False): """NOT YET IMPLEMENTED Move the current token back in the token stream. This method is similar to methods commonly called `push_back`. It is similar to `go_back` except that tokens are not rescanned. The current position in the token buffer is just moved back. This is more efficient than `go_back` but it assumes that there have been no modifications, additions, or deletions to the token definitions. If the parser is guaranteed to be static with respect to the defined tokens then this is the routine to use. Otherwise, use `go_back`. The optional parameter `num_toks` is the number of tokens to move back. Negative numbers move forward, consuming more tokens if necessary. Moving forward will always stop before consuming a second end-token (which would raise `StopIteration` if done in `next`).""" # TODO: Implement. Shouldn't be too hard, but it might need to modify # some of the Lexer attributes. Remember, though, that the line and # char number in the lexer class are for the latest *unbuffered* token. # # TODO use the token buffer's move_forward and move_back methods; # finish implementing them... raise NotImplementedError if not self.text_is_set: raise LexerException( "Attempt to call lexer's move_back method when no text is set.") if self.token_generator_state == GenTokenState.uninitialized: raise LexerException("The token generator has not been initialized " "or has reached `StopIteration` by reading past the end-token.") token_buffer = self.token_buffer
def _pop_tokens(self, n): """Pop `n` tokens from the token buffer, resetting the slice indices in `self.prog_unprocessed` and other state variables. Used by `go_back`.""" popped_to_begin_token = False current_token_is_first = False for _ in range(n): token_buffer = self.token_buffer if token_buffer[0].is_begin_token(): popped_to_begin_token = True self.token = self.token_buffer[0] return popped_to_begin_token, current_token_is_first popped = token_buffer._pop() if popped.is_end_token(): continue # No actual text was read for end tokens. self.non_ignored_token_count -= 1 self.all_token_count -= (1 + len(popped.ignored_before)) # Reset the line number information. if popped.ignored_before: line_and_char = popped.ignored_before[0].line_and_char char_index_in_program = popped.ignored_before[0].char_index_in_program else: line_and_char = popped.line_and_char char_index_in_program = popped.char_index_in_program self.raw_linenumber, self.upcoming_raw_charnumber = line_and_char self.upcoming_raw_total_chars = char_index_in_program # Reset the slice indices into the program text. self.prog_unprocessed[0] -= len(popped.original_matched_string) if token_buffer[-1].is_begin_token(): current_token_is_first = True self.token = self.token_buffer[0] return popped_to_begin_token, current_token_is_first
[docs] def go_back(self, num_toks=1, num_is_raw=False): """This method allows the lexer to go back in the token buffer by `num_toks` tokens. The call `go_back(n)` will undo the effects of the last `n` calls to `next`. This operation is different from the usual pushback operations because the program text will be re-scanned for the current token and later tokens (rather than simply backing up to already-scanned tokens and saving the most-recent as lookahead tokens, like with `move_back`). Going back one with `go_back(1)` or just `go_back()` results in the current token being set back to the previous token and also re-scanned from the original text. Calling `go_back(0)` just re-scans the current token (and flushes any tokens in the lookahead buffer). Values greater than one go farther back in the token stream. Attempts to go back before the beginning of the program text go back to the beginning and stop there. This method returns the current token after any re-scanning. Negative numbers of tokens can be specified. When `num_toks <= 0` the operation only applies to saved loohahead tokens (if there are any). The call `go_back(-1)` flushes all lookahead tokens saved in the buffer except the one immediately following the current token. The current offset in the token buffer never moves forward when this method is called; only can only go back or stay the same. If `num_is_raw` is true then `num_toks` is interpreted as the actual number of tokens to go back, including any in the buffer (which are otherwise handled automatically). This can be useful when looking at `lex.all_token_count` to determine how far to go back and undo something. Going back with re-scanning can be necessary when token definitions themselves change dynamically, such as by semantic actions. For example, a declaration of the string "my_fun" as a variable might dynamically add a token for that new variable, which would then stop it from matching a general identifier with a lower on_ties value (set to, say, -1). This kind of thing is also needed when swapping token tables, such as in parsing a sublanguage with a different parser. Since the sublanguage has a different collection of tokens the lookahead buffer must be re-scanned based on those tokens.""" if not self.text_is_set: raise LexerException( "Attempt to call lexer's go_back method when no text is set.") if self.token_generator_state == GenTokenState.uninitialized: raise LexerException("The token generator has not been initialized " "or has reached `StopIteration` by reading past the end-token.") token_buffer = self.token_buffer # Shorter alias. # For negative values just pop the required number off the end of token_buffer. if num_toks < 0: peekahead_num = abs(num_toks) print("the peekahead num is:", peekahead_num) self._pop_tokens(token_buffer.num_tokens_after_current() - peekahead_num) print("returning token with value:", self.token.value) return self.token # We will re-scan at least one token, so reset `already_returned_end_token`. self.already_returned_end_token = False num_buffered_after_current = token_buffer.num_tokens_after_current() num_to_pop = num_toks + num_buffered_after_current + 1 # new curr is rescanned if num_is_raw: # Works with lex.all_token_count in production_rules, but why +2? # Setting max_peek_tokens doesn't affect it. Clean up code. num_to_pop = num_toks + 2 # The added number doesn't matter except when it does... popped_to_begin_token, current_token_is_first = self._pop_tokens(num_to_pop) # Re-scan to get the new current token. 
if not popped_to_begin_token: self.next() # Reset some state variables. if popped_to_begin_token: self.peek().is_first = True self._returned_first_token = False self._curr_token_is_first = False elif current_token_is_first: self.token.is_first = True self._returned_first_token = True self._curr_token_is_first = True self.already_returned_end_token = self.token.is_end_token() return self.token
[docs] def get_current_state(self): """Get a lexer state that can be returned to with `go_back_to_state`. States become invalid after the text is reset, but no check is made.""" return self.token_buffer.get_state()
[docs] def go_back_to_state(self, state): """Return the lexer to the state `state` saved from a previous call to `get_current_state`.""" index_to_go_back_to = self.token_buffer.state_to_offset(state) num_to_go_back = self.token_buffer.current_offset - index_to_go_back_to if num_to_go_back < 0: self.next(-num_to_go_back) else: self.go_back(num_to_go_back) return self.token
    #
    # Informational methods
    #

    def curr_token_is_begin(self):
        """True if `self.token` (the last one returned by the `next` method)
        is the begin-token."""
        return self.token.is_begin_token()

    def curr_token_is_first(self):
        """True if `self.token` (the last one returned by the `next` function)
        is the first actual token in the currently-set program text.  Resetting
        the text resets this.  This value is also set as the attribute `is_first`
        on all returned tokens.  This is useful, for example, for finding
        indentation levels (along with `ignored_before_curr`)."""
        return self.token.is_first
    def ignored_before_curr(self):
        """Return the list of all tokens ignored just before `self.token` (the
        last token returned by the `next` function).  Useful for enforcing
        things like syntactic whitespace requirements, along with
        `curr_token_is_first`.  This list is also set as the attribute
        `ignored_before` on all returned tokens."""
        return self.token.ignored_before
    def curr_token_is_end(self):
        """True if `self.token` (the last one returned by the `next` method)
        is the end-token."""
        return self.token.is_end_token()
    def is_defined_token_label(self, token_label):
        """Return true if `token_label` is currently defined as a token label."""
        return token_label in self.token_table
[docs] def last_n_tokens_original_text(self, n): """Returns the original text parsed by the last `n` tokens (back from and including the current token). This routine is mainly used to make error messages more helpful. It uses the token attribute `original_matched_string` and the saved tokens in the token buffer. (which must be large enough for `n`).""" # TODO: Test this, code updated to use token_buffer class. # Could also print line numbers and stuff.... n = min(n, self.token_buffer.num_saved_previous_tokens() + 1) prev_tokens = [self.token_buffer[t] for t in range(-n+1, 1)] string_list = [s.original_matched_string for s in prev_tokens] full_string = "".join(string_list) return full_string
[docs] def get_unprocessed_text(self, peek=1): """Return all the text that is set but not yet processed. Returns `None` if no text is currently set. The current token is assumed to have been processed. By default this is relative to the token at a peek of `1`, but the `peek` number can be set to a previous or later one if available in the buffer.""" if not self.text_is_set: return None text = self.program[self.peek(peek).char_index_in_program:] return text
[docs] def get_processed_text(self, peek=1): """Return all the text that is set and has been processed. Returns `None` if no text is currently set. The current token is assumed to have been processed. By default this is relative to the current peek token, but the `peek` number can be set to a previous or later one if available in the buffer.""" if not self.text_is_set: return None text = self.program[:self.peek(peek).char_index_in_program] return text
    #
    # Methods to define and undefine tokens
    #
[docs] def def_token(self, token_label, regex_string, on_ties=0, ignore=False, matcher_options=None): """A convenience method to define a token. It calls the corresponding `def_token` method of the current `TokenTable` instance associated with the lexer, and does nothing else.""" new_subclass = self.token_table.def_token(token_label, regex_string, on_ties=on_ties, ignore=ignore, matcher_options=matcher_options) return new_subclass
[docs] def undef_token(self, token_label): """A convenience function to call the corresponding `undef_token` of the current `TokenTable` instance associated with the Lexer.""" self.token_table.undef_token(token_label)
[docs] def def_ignored_token(self, token_label, regex_string, on_ties=0, matcher_options=None): """A convenience function to define an ignored token without setting `ignore=True`. This just calls `def_token` with the value set.""" return self.def_token(token_label, regex_string, on_ties=on_ties, ignore=True, matcher_options=matcher_options)
[docs] def def_multi_tokens(self, tuple_list, **kwargs): """A convenience function, to define multiple tokens at once. Each element of the passed-in list should be a tuple containing the arguments to the ordinary `def_token` method. Called in the same order as the list. Any keyword arguments are passed on to `def_token`. Returns a tuple of the defined tokens.""" return multi_funcall(self.def_token, tuple_list, **kwargs)
[docs] def def_multi_ignored_tokens(self, tuple_list, **kwargs): """A convenience function, to define multiple tokens at once with `ignore=True` set. Each element of the passed-in list should be a tuple containing the arguments to the ordinary `def_token` method. Called in the same order as the list. Any keyword arguments are passed on to `def_token`. Returns a tuple of the defined tokens.""" return multi_funcall(self.def_ignored_token, tuple_list, **kwargs)
[docs] def def_begin_end_tokens(self, begin_token_label, end_token_label): """Define the sentinel tokens at the beginning and end of the token stream. This method must be called before using the Lexer. It will automatically be called using default token label values unless `default_begin_end_tokens` was set false on initialization. Returns a tuple of the new begin- and end-token subclasses. These tokens do not need to be defined with `def_token` because they are never actually scanned and recognized in the program text (which would also require a regex pattern).""" # TODO: consider if begin and end tokens should be created by a # token_table method. Probably they should, but then in Lexer need # to change all the self.begin_token_label to have self.token_table # prefix. begin_tok = self.token_table.def_begin_token(begin_token_label) end_tok = self.token_table.def_end_token(end_token_label) return begin_tok, end_tok
[docs] def def_default_whitespace(self, space_label="k_space", space_regex=r"[ \t]+", newline_label="k_newline", newline_regex=r"[\n\f\r\v]+", matcher_options=None): """Define the standard whitespace tokens for space and newline, setting them as ignored tokens.""" tok = self.def_ignored_token tok(space_label, space_regex, matcher_options=matcher_options) tok(newline_label, newline_regex, matcher_options=matcher_options)
    #
    # Some helper functions when using the Lexer class.
    #
[docs] def match_next(self, token_label_to_match, peeklevel=1, consume=True, raise_on_fail=False, raise_on_success=False, err_msg_tokens=3): # TODO: Consider a way for users to define custom error strings for # better error-reporting. """A utility function that tests whether the value of the next token label equals a given token label. This method consumes a token from the lexer if and only if there is a match. Either way, a boolean is returned indicating the match status. If `consume` is false then no tokens will ever be consumed. Otherwise, and by default, a token will be consumed if and only if it matches. The parameter `peeklevel` is passed to the peek function for how far ahead to look; the default is one. If `raise_on_fail` set true then a `LexerException` will be raised by default if the match fails. The default can be changed by setting the lexer instance attribute `default_helper_exception`. Similarly, `raise_on_success` raises an exception when a match is found. Either one can be set to a subclass of `Exception` instead of a boolean, and then that exception will be called. The parameter `err_msg_tokens` can be set to change how many tokens worth of text back the error messages report (as debugging information) when an exception is raised. (The count does not include whitespace, but it is printed, too.)""" retval = False if token_label_to_match == self.peek(peeklevel).token_label: retval = True if consume and retval: self.next() # Eat the token that was matched. if retval and raise_on_success: exception = return_first_exception(raise_on_success, self.default_helper_exception) raise exception( "Function match_next (with peeklevel={0}) found unexpected " "token {1}. The text of the {2} tokens up to " "the error is: {3}" # TODO fix below, fails with parser .format(peeklevel, str(self.peek(peeklevel)), err_msg_tokens, self.last_n_tokens_original_text(err_msg_tokens))) if not retval and raise_on_fail: exception = return_first_exception(raise_on_fail, self.default_helper_exception) raise exception( "Function match_next (with peeklevel={0}) expected token " "with label '{1}' but found token {2}. The text parsed " "from the tokens up to the error is: {3}" # TODO fix below, fails .format(peeklevel, token_label_to_match, str(self.peek(peeklevel)), self.last_n_tokens_original_text(err_msg_tokens))) return retval
# TODO document these utilities......
[docs] def in_ignored_tokens(self, token_label_to_match, raise_on_fail=False, raise_on_success=False): """A utility function to test if a particular token label is among the tokens ignored before the current token. Returns a boolean value. Like `match_next`, this method can be set to raise an exception on success or failure.""" retval = False ignored_token_labels = [t.token_label for t in self.peek().ignored_before] if token_label_to_match in ignored_token_labels: retval = True if retval and raise_on_success: exception = return_first_exception(raise_on_success, self.default_helper_exception) raise exception( "Function in_ignored_tokens found unexpected token with " "label '{0}' before the current token {1}." .format(token_label_to_match, str(self.token))) if not retval and raise_on_fail: exception = return_first_exception(raise_on_fail, self.default_helper_exception) raise exception( "Function in_ignored_tokens expected token with label " "'{0}' before the current token {1}, but it was not found." .format(token_label_to_match, str(self.token))) return retval
[docs]    def no_ignored_after(self, raise_on_fail=False, raise_on_success=False):
        """A boolean utility function to test if any tokens were ignored between
        the current token and the lookahead token.

        Like `match_next`, this method can be set to raise an exception on
        success or failure."""
        retval = True
        if self.peek().ignored_before:
            retval = False

        if retval and raise_on_success:
            exception = return_first_exception(raise_on_success,
                                               self.default_helper_exception)
            raise exception(
                    "Function no_ignored_after expected ignored tokens between "
                    "the current token {0} and the following token {1}, but "
                    "there were none."
                    .format(str(self.token), str(self.peek())))
        if not retval and raise_on_fail:
            exception = return_first_exception(raise_on_fail,
                                               self.default_helper_exception)
            raise exception(
                    "Function no_ignored_after expected nothing between the "
                    "current token {0} and the following token {1}, but there "
                    "were ignored tokens."
                    .format(str(self.token), str(self.peek())))
        return retval
[docs]    def no_ignored_before(self, raise_on_fail=False, raise_on_success=False):
        """A boolean utility function to test if any tokens were ignored between
        the previous token and the current token.

        Like `match_next`, this method can be set to raise an exception on
        success or failure."""
        retval = True
        if self.token.ignored_before:
            retval = False

        if retval and raise_on_success:
            exception = return_first_exception(raise_on_success,
                                               self.default_helper_exception)
            raise exception(
                    "Function no_ignored_before expected ignored tokens before "
                    "the current token {0}, but none were found."
                    .format(str(self.token)))
        if not retval and raise_on_fail:
            exception = return_first_exception(raise_on_fail,
                                               self.default_helper_exception)
            raise exception(
                    "Function no_ignored_before expected no ignored tokens "
                    "before the current token {0}, but at least one was found."
                    .format(str(self.token)))
        return retval
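    # Editor's sketch covering `no_ignored_after` and `no_ignored_before`; the
    # helper below is hypothetical and not part of the original class.  It uses
    # adjacency tests to detect whitespace between a name and a parenthesis.
    @staticmethod
    def _example_no_ignored_usage():
        lex = Lexer()
        lex.def_begin_end_tokens("k_begin", "k_end")
        lex.def_default_whitespace()
        lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)")
        lex.def_token("k_lpar", r"\(")
        lex.def_token("k_rpar", r"\)")
        lex.set_text("f (x)")
        lex.next()                         # Current token is the identifier "f".
        if not lex.no_ignored_after():     # A space separates "f" from the "(".
            print("the identifier is not immediately followed by the next token")
        lex.next()                         # Current token is now the "(".
        if not lex.no_ignored_before():    # The same space, seen from the "(" side.
            print("ignored tokens appeared just before the current token")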
    #
    # Lower-level routine for token generation
    #

    def _unbuffered_token_getter(self):
        """This routine generates tokens from the program text in the attribute
        `self.program`.  It does not modify the program itself, but keeps slice
        indices in a list `self.prog_unprocessed` indexing the unprocessed part.
        That slice can be externally modified (the `go_back` routine does this).

        This is a lower-level function used by `next` to do the real work.  All
        the token subclasses should have been defined and stored in the
        `TokenTable`.  Regexes defined for tokens are repeatedly matched at the
        beginning of the unprocessed slice of the string `program`.  When a
        match is found it is stripped off the beginning of the unprocessed
        slice.  For each match the token subclass is looked up in the
        `TokenTable` object and an instance of that subclass is returned to
        represent the token.  Every token processed is represented by a unique
        new instance of the appropriate subclass of `TokenNode`.

        This routine has two states which can be set instance-globally to alter
        its behavior.  The states are `GenTokenState.ordinary` for ordinary
        scanning execution, and `GenTokenState.end` when all the tokens have
        been read.  In the `GenTokenState.end` state the method returns nothing
        but end tokens.  The end state is normally entered when the program
        text becomes empty.  If the text is later set to be non-empty again the
        state switches back to ordinary.  (The lexer's `next` routine handles
        any raising of `StopIteration`.)"""
        ignored_before_labels = []
        ignored_before_tokens = []
        original_matched_string = ""
        token_table = self.token_table

        while True:
            self._curr_token_is_first = not self._returned_first_token
            self._returned_first_token = True

            if self.prog_unprocessed[0] == self.prog_unprocessed[1]:
                self.token_generator_state = GenTokenState.end
            else:
                self.token_generator_state = GenTokenState.ordinary
            first_after_newline = False

            # =======================================================================
            # === Ordinary execution state ==========================================
            # =======================================================================
            if self.token_generator_state == GenTokenState.ordinary:
                # Find the token_label and token_value of the matching prefix
                # which is longest (with ties broken by the on_ties values).
                label_and_value = token_table.get_next_token_label_and_value(
                                      self.program, self.prog_unprocessed,
                                      self.ERROR_MSG_TEXT_SNIPPET_SIZE)
                token_label, token_value = label_and_value

                # Remove the matched prefix from the self.prog_unprocessed slice
                # after saving the matched prefix string.
                original_matched_string += self.program[self.prog_unprocessed[0]:
                                              self.prog_unprocessed[0]+len(token_value)]
                self.prog_unprocessed[0] += len(token_value)

                # Look up the subclass to represent the matched token_label.
                try:
                    token_subclass_for_label = token_table[token_label]
                except LexerException:
                    raise LexerException("Undefined key in token table for "
                                         "token_label '{0}'.".format(token_label))

                # Make an instance of the class to return (or at least to save
                # in the token's ignored_before if ignored).
                token_instance = token_subclass_for_label(token_value)
                if self.final_mod_function:
                    token_instance = self.final_mod_function(self, token_instance)

                self.all_token_count += 1

                # Save the line and char counts for the beginning text of the
                # token with the token from the Lexer attributes.  Then update
                # the Lexer attributes.  Remember that the Lexer class versions
                # always refer to the beginning of the next token to be read
                # (into the buffer, not as the current token).  The versions
                # stored with the tokens themselves hold the beginning of text
                # when this routine scanned that token (including any ignored
                # text before it).
                #
                # Remember that we are looping and getting tokens which may turn
                # out to be ignored tokens (tested just below this block).
                token_instance.line_and_char = (
                        self.raw_linenumber, self.upcoming_raw_charnumber)
                token_instance.char_index_in_program = self.upcoming_raw_total_chars

                num_newlines = token_value.count("\n")
                self.raw_linenumber += num_newlines
                if num_newlines == 0:
                    self.upcoming_raw_charnumber += len(token_value)
                else:
                    first_after_newline = True
                    last_newline = token_value.rfind("\n")
                    self.upcoming_raw_charnumber = (
                            len(token_value) - (last_newline + 1) + 1)
                self.upcoming_raw_total_chars += len(token_value)

                # ------------------------------------------------------------------
                # Go to the top of the loop and get another if the token is ignored.
                # ------------------------------------------------------------------
                if token_label in token_table.ignored_tokens():
                    ignored_before_labels.append(token_label)
                    ignored_before_tokens.append(token_instance)
                    continue
                self.non_ignored_token_count += 1

            # =======================================================================
            # === Return only end-tokens state ======================================
            # =======================================================================
            elif self.token_generator_state == GenTokenState.end:
                token_subclass_for_end = token_table[token_table.end_token_label]
                token_instance = token_subclass_for_end(None)
                if self.final_mod_function:
                    token_instance = self.final_mod_function(self, token_instance)
                token_instance.line_and_char = (self.raw_linenumber,
                                                self.upcoming_raw_charnumber)
                token_instance.char_index_in_program = self.upcoming_raw_total_chars

            # Got a token to return.  Set some attributes and return it.
            # Note that the attributes below are not set on ignored tokens!
            token_instance.original_matched_string = original_matched_string
            token_instance.ignored_before = tuple(ignored_before_tokens)
            token_instance.all_token_count = self.all_token_count
            token_instance.non_ignored_token_count = self.non_ignored_token_count
            token_instance.is_first = self._curr_token_is_first
            token_instance.is_first_on_line = token_instance.is_first or first_after_newline
            return token_instance
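# Editor's sketch (hypothetical function, not part of the original module):
# the attributes that `_unbuffered_token_getter` sets on each returned token,
# such as `line_and_char`, `ignored_before`, `is_first`, and
# `original_matched_string`, can be inspected on the tokens produced by the
# lexer's `next` method.
def _example_token_attributes():
    lex = Lexer()
    lex.def_begin_end_tokens("k_begin", "k_end")
    lex.def_default_whitespace()
    lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)")
    lex.set_text("alpha\nbeta")
    first = lex.next()
    second = lex.next()
    print(first.is_first)                    # True; "alpha" is the first token returned.
    print(first.line_and_char)               # Line and char where "alpha" was scanned.
    print([t.token_label for t in second.ignored_before])  # Contains the ignored k_newline.
    print(repr(second.original_matched_string))            # The ignored newline plus "beta".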
[docs]def multi_funcall(function, tuple_list, **kwargs):
    """A convenience function that takes a function (or method) and a list of
    tuples and calls `function` with the values in each tuple as its arguments.

    Any unrecognized keyword arguments are passed on to `function` as keyword
    arguments.  If the `exception_to_raise` keyword argument is provided with
    an exception class then that exception will be raised whenever a `TypeError`
    results from the attempt to call `function` (it defaults to
    `LexerException`)."""
    retval_list = []
    exception_to_raise = kwargs.pop("exception_to_raise", LexerException)
    for t in tuple_list:
        try:
            retval_list.append(function(*t, **kwargs))
        except TypeError:
            raise exception_to_raise(
                    "Bad multi-definition of {0}: Omitted required arguments or bad "
                    "keyword arguments passed in.  Error on this tuple:\n{1}\nwith "
                    "keyword arguments\n{2}".format(function.__name__, t, kwargs))
    return tuple(retval_list)
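# Editor's sketch (hypothetical function, not part of the original module):
# `multi_funcall` can define several tokens in one call by passing a
# token-defining method along with a list of (label, regex) tuples.
def _example_multi_funcall():
    lex = Lexer()
    lex.def_begin_end_tokens("k_begin", "k_end")
    results = multi_funcall(lex.def_token, [("k_plus", r"\+"),
                                            ("k_minus", r"-"),
                                            ("k_number", r"[0-9]+")])
    print(len(results))   # 3: one return value (whatever `def_token` returns) per tuple.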
[docs]def return_first_exception(*args):
    """Go down the argument list and return the first object that is a subclass
    of the `Exception` class.  The arguments do not all need to be classes.
    Returns `None` if none of them qualifies.

    This is used to allow an optional exception class to be passed to a
    function instead of `True`, with a default exception used if the passed-in
    value is not itself an exception class."""
    for item in args:
        if is_subclass_of(item, Exception):
            return item
    return None
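# Editor's sketch (hypothetical function, not part of the original module):
# how `return_first_exception` resolves the `raise_on_fail` and
# `raise_on_success` arguments used by the helper methods above.  A `True`
# value is not an exception class, so the default is used; a passed-in
# exception class takes precedence over the default.
def _example_return_first_exception():
    print(return_first_exception(True, LexerException))        # -> LexerException
    print(return_first_exception(ValueError, LexerException))  # -> ValueError
    print(return_first_exception(True, False))                 # -> None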
#
# Exceptions
#
[docs]class BufferIndexError(LexerException):
    """Raised on attempts to read past the beginning or the end of the buffer
    (such as in `peek` methods)."""
    pass