10.9. typped.lexer

A Lexer class instance is a general lexer/scanner/tokenizer. It was designed to be used by the PrattParser class, but it can also be used for other lexical scanning applications.

The general purpose of the Lexer is to take a string of text and produce a corresponding sequence of tokens from the text. The set of possible tokens is defined by the user, with a string label and a regex pattern that is searched for in the program text. Once initialized with text a Lexer instance is a generator which sequentially produces tokens with its next method. It is also an iterator, so it can be used in loops, etc.

With some lexers the order in which tokens are defined is significant. They match regexes from a list of regexes, taking the first match without regard to the length of the match. The Lexer class was designed to function independently of the order in which tokens are defined. The longest match is always returned, with ties broken by an explicit priority mechanism. This allows token definitions to be organized in various ways. They can all be done in one place or spread around in the code in any order, however the programmer wants to do it.

10.9.1. Defining tokens

This section describes the low-level definition of tokens when using Lexer as a standalone application. To use tokens with a PrattParser instance, though, you need to use the corresponding def_token method of the PrattParser class. That class adds extra attributes, methods, etc., to the tokens. The interface is generally the same.

A token for a left parenthesis would be defined like this:

lex = typped.Lexer()
lex.def_token("k_lpar", r"lpar")

The string k_lpar is a label for the token. The use of the string prefix “k_” is a naming convention for token labels. An identifier token could be defined like this:

lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)

Notice that in the definition of the identifier the keyword argument on_ties is set to -1. The lexer will by default always choose the longest string which matches a defined regex pattern for a token. If there is a tie then, by default, an exception will be raised. The on_ties value is used to break ties; strings of the same length are sorted by that value and the highest-priority string is chosen. The default on_ties value is 0. Suppose you also wanted a token for the string mod, and defined it as:

lex.def_token("k_mod", r"mod")

Since this token is defined with the default on_ties value of 0, which is higher than the identifier's -1, it will always take precedence over the identifier token whenever both match strings of the same length.
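
As a concrete illustration, the minimal sketch below (the token definitions follow the ones above; the begin/end tokens and the ignored-space token are described in the following sections) shows the longest match winning for an ordinary identifier, while the equal-length tie on "mod" goes to k_mod because of its higher on_ties value:

import typped

lex = typped.Lexer(default_begin_end_tokens=True)  # begin/end tokens, see next section
lex.def_token("k_space", r"[ \t]+", ignore=True)   # skip spaces
lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
lex.def_token("k_mod", r"mod")

lex.set_text("modulus mod x")
print([t.token_label for t in lex])
# ['k_identifier', 'k_mod', 'k_identifier', 'k_end']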

10.9.2. Begin and end tokens

The lexer uses sentinel begin and end tokens to mark the beginning and the end of the token sequence for the text. These tokens must be explicitly defined (i.e., given string labels), either by calling def_begin_end_tokens:

lex.def_begin_end_tokens("k_begin", "k_end")

or else by setting the default_begin_end_tokens flag to True when initializing the lexer:

lex = Lexer(default_begin_end_tokens=True)

The default tokens have the labels k_begin and k_end.

The begin-token is never explicitly returned. After the call to set_text to define the text to tokenize, and before any calls to next, the begin-token is the current token lex.token. So lex.token and lex.peek(0) would both return the begin-token.

After the end of the text the next method explicitly returns one end-token. Calling next again raises StopIteration and halts the lexing of the currently-set text. All peeks beyond the end of the text are reported as end-tokens.
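
This behavior can be seen in a short sketch (assuming a lexer lex with an identifier token defined as above):

lex.set_text("x")

assert lex.token.is_begin_token()        # before any next the begin-token is current
assert lex.peek(0).is_begin_token()      # peek(0) shows the same token

t = lex.next()                           # the k_identifier token for "x"
end_tok = lex.next()                     # exactly one end-token is returned
assert end_tok.is_end_token()
assert lex.peek().is_end_token()         # peeks past the end also report end-tokens

try:
    lex.next()                           # one more next raises StopIteration
except StopIteration:
    pass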

10.9.3. Using the lexer

This is a simple example of using the lexer. Notice that multiple token definitions can be combined using the def_multi_tokens method. In practice it is usually more convenient to define a shorter alias for that method call, however:

lex = Lexer()

lex.def_begin_end_tokens("k_begin", "k_end")
lex.def_token("k_space", r"[ \t]+", ignore=True) # note + NOT *
lex.def_token("k_newline", r"[\n\f\r\v]+", ignore=True) # note + NOT *
tokens = [
    ("k_identifier", r"[a-zA-Z_](?:\w*)"),
    ("k_plus", r"\+"),
    ]
lex.def_multi_tokens(tokens)

lex.set_text("x  + y")

for t in lex:
    print(t)

The result is as follows:

<k_identifier,x>
<k_plus,+>
<k_identifier,y>
<k_end,None>

Notice that the end-token is actually returned, but the begin-token is not. The method def_default_whitespace could alternatively be used to define the whitespace tokens, as in the sketch below.
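
For example, the two explicit whitespace definitions above could be replaced with a single call using the defaults (a sketch based on the def_default_whitespace signature documented later in this section):

lex = Lexer()
lex.def_begin_end_tokens("k_begin", "k_end")
lex.def_default_whitespace()   # ignored k_space and k_newline tokens with the default regexes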

10.9.4. User-accessible methods and attributes of Lexer

The Lexer class has many utility methods and user-accessible attributes. Some of the main ones are listed here. One of the most commonly-accessed attributes of a lexer lex is the current token, lex.token.

General methods:

  • next — return the next token
  • peek — peek at the next token without consuming it
  • go_back — go back in the text stream by some number of tokens

Helper methods:

  • match_next — matches the specified token, with various options
  • in_ignored_tokens — test if some particular token was ignored before the current one
  • no_ignored_after — true if no ignored tokens immediately follow current token
  • no_ignored_before — true if no ignored tokens immediately precede current token

Some boolean-valued informational methods:

  • curr_token_is_first — true if the current token is the first returned
  • text_is_set — true only when text is currently set for scanning

Other attributes:

  • token — the current token (the most recent one returned by next)
  • all_token_count — number of tokens since the text was set (begin- and end-tokens not counted)
  • non_ignored_token_count — number of non-ignored tokens since the text was set
  • default_helper_exception — the default exception for helpers like match_next
  • text_is_set — whether or not text has been set for the lexer
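
A brief sketch of how some of these methods and attributes might be used together (assuming the lexer and tokens from the example in the previous section):

lex.set_text("x + y")

first = lex.next()                  # consume "x"
print(lex.token is first)           # lex.token is the current token, so True
print(lex.peek().token_label)       # look ahead without consuming: "k_plus"

if lex.match_next("k_plus"):        # consumes the "+" token only if it matches
    rhs = lex.next()                # consume "y"

lex.go_back(1)                      # undo the last next; "y" will be re-scanned
print(lex.token.token_label)        # back on "k_plus"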

TODO, list more, and why not make some of these methods of TokenNode instead?

10.9.5. User-accessible attributes of tokens

The tokens returned by the lexer are instances of a subclass of the class TokenNode (named that since the parser combines them into the nodes of a parse tree). The subclasses themselves represent the general kind of token, for example if k_identifier was defined as a token label then a particular subclass of TokenNode would be created to represent identifiers in general. The particular instances of identifiers, found in the lexed text with their actual string values, are represented by instances of the general class for identifiers.

User-accessible methods of tokens:

  • is_begin_token — true when the token is the begin-token
  • is_end_token — true when the token is the end-token
  • is_begin_or_end_token — true when the token is either the begin- or the end-token
  • ignored_before_labels — just the token labels of the tokens ignored before

For a token named t, these attributes are available:

  • t.token_label — the string label of the token (which was defined with it)
  • t.value — the string value for the token, found in the lexed text
  • t.is_first — true iff this is the first non-begin token in the text
  • t.is_first_on_line — true iff this is the first token returned for a line
  • t.parent — can be set to the parent in a tree; set by the lexer to None
  • t.children — can be set to a list of children; set by the lexer to []
  • t.original_matched_string — the original text that was consumed for this token
  • t.line_and_char — tuple of line number and character where the token started
  • t.char_index_in_program — the index of this token into the text set via set_text
  • t.ignored_before — a tuple of all tokens ignored immediately before this one
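
A sketch of inspecting these attributes on the tokens from the earlier example (re-using the lexer defined there):

lex.set_text("x  + y")
for t in lex:
    print(t.token_label, repr(t.value), t.line_and_char, t.ignored_before_labels())
# The k_plus token reports the skipped spaces via its ignored_before tuple
# (label "k_space" in ignored_before_labels()).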

TODO, list other methods, too.

10.9.6. Initialization options

There are several options that can be set on initialization, including the level of token lookahead that is supported.

TODO

10.9.7. Code

class typped.lexer.TokenNode[source]

Bases: object

The base class for token objects. Each different kind of token is represented by a subclass of this class. Instances of the tokens in the program text are represented by instances of the subclass for that kind of token.

The attribute token_label is the string token label for the kind of token represented by an instance. The attribute value is set to the actual string value in the lexed text which matched the regex of the token. The attribute ignored_before is a tuple of all tokens ignored just before the lexer got this token.

The attribute children is a list of the child nodes, and parent is the parent. Indexing a TokenNode instance also returns the corresponding child node, i.e. t_node[0] would be the leftmost child.

token_label = None
original_matched_string = ''
original_text()[source]

Return the original text that was read in lexing the token, including any ignored text.

ignored_before_labels()[source]

Return the list of token labels of tokens which were ignored just before this token.

append_children(*token_nodes)[source]

Append all the arguments as children, also setting their parent to self.

convert_to_AST(convert_TokenNode_to_AST_node_fun)[source]

Call this on the root node. Converts the token tree to an abstract syntax tree. This basically converts the nodes one-to-one to a more convenient type of node for the AST of a given application. The function convert_TokenNode_to_AST_node_fun should take one argument, a TokenNode instance, and return an AST node instance for the corresponding AST node. The only requirement for the AST nodes is that they have a method called append_children. The ast_data attribute of a node can be used to save information useful in the transformation.
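
A minimal sketch of such a conversion; the AstNode class below is hypothetical and only provides the append_children method that convert_to_AST requires:

class AstNode(object):
    """Hypothetical application AST node; only append_children is required."""
    def __init__(self, label, value):
        self.label = label
        self.value = value
        self.children = []
    def append_children(self, *nodes):
        self.children.extend(nodes)

def token_to_ast_node(token_node):
    # One-to-one conversion; anything stashed in token_node.ast_data could
    # also be copied over here.
    return AstNode(token_node.token_label, token_node.value)

# Assuming root_token is the root TokenNode of a parse tree:
# ast_root = root_token.convert_to_AST(token_to_ast_node)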

is_begin_token()[source]

Test whether this token is the begin-token.

is_end_token()[source]

Test whether this token is the end-token.

is_begin_or_end_token()[source]

Test whether this token is either the begin- or end-token.

traditional_repr()[source]

Representation as a string that looks like class initialization.

value_repr()[source]

Token representation as its value.

label_repr()[source]

Token representation as its token label.

summary_repr()[source]

Token representation as a summarizing string containing both the label and the value.

tree_repr(indent=0)[source]

Token representation as the root of a parse subtree, with formatting. The optional indent parameter can be either an indent string or else an integer for the number of spaces to indent.

string_tree_repr(only_vals=False, only_labels=False)[source]

Token representation as the root of a parse subtree, in a string format. This is the default representation, used for __repr__.

old_repr()[source]

This old representation is kept only because it is used in some tests.

typped.lexer.basic_token_subclass_factory()[source]

Create and return a new token subclass representing tokens with label token_label. This function is called from the _create_token_subclass method of TokenTable when it needs to create a new one to start with. This function should not be called directly, since additional attributes (such as the token label and a new subclass name) also need to be added to the generated subclass.

This function is the default argument to the token_subclass_factory_fun keyword argument of the initializer for TokenTable. Users can define their own such function in order to add methods to token objects which are particular to their own application (the PrattParser class does this, for example).

Note that using a separate subclass for each token label allows for attributes and methods specific to a kind of token to be pasted onto the class itself without conflicts. For example, the PrattParser subclass adds head handler and tail handler methods which are specific to a given token label.
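
A rough sketch of a custom factory, assuming (as with the default factory) that it takes no arguments and returns a fresh TokenNode subclass to which the token table later adds the label and other attributes:

from typped.lexer import TokenNode, TokenTable

def my_token_subclass_factory():
    class MyTokenSubclass(TokenNode):
        def describe(self):
            # Application-specific helper available on every token.
            return "{0}: {1!r}".format(self.token_label, self.value)
    return MyTokenSubclass

# table = TokenTable(token_subclass_factory_fun=my_token_subclass_factory)
# lex = Lexer(token_table=table)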

class typped.lexer.TokenTable(token_subclass_factory_fun=<function basic_token_subclass_factory>, pattern_matcher_instance=None)[source]

Bases: object

A symbol table holding subclasses of the TokenNode class for each token label defined in a Lexer instance. Also has methods for operating on tokens. Each Lexer instance contains an instance of this class to save the subclasses for the kinds of tokens which have been defined for it.

undef_token_subclass(token_label)[source]

Un-define the token with label token_label. The TokenNode subclass previously associated with that label is removed from the dictionary.

undef_token(token_label)[source]

Undefine the token corresponding to token_label.

def_token(token_label, regex_string, on_ties=0, ignore=False, matcher_options=None)[source]

Define a token and the regex to recognize it. Returns the new token subclass.

The label token_label is the label for the kind of token.

The parameter regex_string is a Python regular expression defining the text strings which match for the token. If regex_string is set to None then a dummy token will be created which is never searched for in the lexed text. To better catch errors it does not have a default value, so setting it to None must be done explicitly.

Setting ignore=True will cause all such tokens to be ignored (except that they will be placed on the ignored_before list of the non-ignored token that they precede).

In case of ties for the longest match in scanning, the integer on_ties values are used to break the ties. If any two are still equal an exception will be raised.

The matcher_options parameter takes a string value, which is then passed to the insert_pattern method of whatever matcher is being used.

def_begin_token(begin_token_label)[source]

Define the begin-token. The lexer’s def_begin_end_tokens method should usually be called instead.

def_end_token(end_token_label)[source]

Define the end-token. The def_begin_end_tokens method should usually be called instead.

get_next_token_label_and_value(program, prog_unprocessed_indices, ERROR_MSG_TEXT_SNIPPET_SIZE)[source]

Return the next token label for the start of the current program text, as in the string program and indexed by the numbers in the ordered-pair tuple prog_unprocessed_indices.

ignored_tokens()[source]

Return the set of ignored tokens.

class typped.lexer.TokenBuffer(token_getter_fun, max_peek=-1, max_deque_size=-1)[source]

Bases: object

An abstraction of the token buffer. This is used internally by the Lexer class and should not usually be accessed by users. It is basically a nice wrapper over an underlying deque, but this is complicated by the need to save persistent state pointers into the buffer even in fixed-size buffers when tokens at the front get dropped.

Previous tokens are stored in the same deque as the current token and any lookahead tokens. The default indexing is relative to the current token, at current_offset, which is zero for the current token. (The current offset is itself relative to a reference point, but users do not need to know that detail).

reset(begin_token)[source]

Initialize the token buffer, or clear and reset it. Any saved offsets are no longer valid, but no check is made for that.

state_to_offset(state)[source]

Return the offset into the current deque that corresponds to what was the offset (absolute index to the current token) at the time when the state was saved.

get_state()[source]

Return a buffer state indicator that can be returned to later. The go_back or push_back methods of the lexer use this.

num_saved_previous_tokens()[source]

Return the number of tokens before the current token that are saved.

num_tokens_after_current()[source]

An informational method. Returns the number of tokens from the current token to the end of the token buffer. Some may have been read past the position of the current token due to peeks or pushbacks.

move_forward(num_toks=1)[source]

Move the current token (i.e., the offset) forward by num_toks tokens. This is the token buffer’s equivalent of next, except that it returns previously-buffered tokens if possible. The Lexer method next should always be called by users of that class, because it also handles some other things.

Attempts to move past the first end-token leave the current offset at the first end-token. No new tokens are added to the buffer. The end-token is returned.

move_back(num_toks=1)[source]

Move the current token (i.e., offset) back num_toks tokens. Will always stop at the begin-token. Users should check the condition if it matters. If the move attempts to move back to before the currently-saved tokens, but the begin-token is no longer saved, then a LexerException is raised.

class typped.lexer.GenTokenState[source]

Bases: object

The state of the token_generator program execution.

ordinary = 1
end = 2
uninitialized = 3
class typped.lexer.LexerState(x, y)

Bases: tuple

x

Alias for field number 0

y

Alias for field number 1

class typped.lexer.Lexer(token_table=None, max_peek_tokens=None, max_deque_size=None, default_begin_end_tokens=False, final_mod_function=None)[source]

Bases: object

Scans text and returns tokens, represented by instances of subclasses of TokenNode. There is one subclass for each kind of token, i.e., for each token label. These subclasses themselves are assumed to have been created before any scanning operation, via the def_token method.

Token sequences are assumed to have both a begin-token and an end-token sentinel, defined via the def_begin_end_tokens method. Exactly one end-token will be returned by next; any further calls to next raise StopIteration.

The scanning is independent of the order in which tokens are defined. The longest match over all token patterns will always be the one selected. In case of ties the on_ties value (passed to def_token) is used to break it. If that fails a LexerException is raised.

If no token table is passed into __init__ the Lexer will create its own empty one.

ERROR_MSG_TEXT_SNIPPET_SIZE = 40
DEFAULT_BEGIN = 'k_begin'
DEFAULT_END = 'k_end'
reset(token_table=None, max_peek_tokens=None, max_deque_size=None, default_begin_end_tokens=False, final_mod_function=None)[source]

Return the lexer to the initial state. Takes the same arguments as the initializer.

set_token_table(token_table, go_back=0)[source]

Sets the current TokenTable instance for the lexer to token_table. This is called on initialization, but can also be called at any time. If text is being scanned at the time then it flushes the current and lookahead tokens and re-scans the current token.

When set with this method the token table is always given the attribute lex, which points to the lexer instance that this method was called from. This attribute is used by tokens (which know their fixed symbol table) so they can find the current lexer (to call next, etc.)

set_text(program, reset_linenumber=True, reset_charnumber=True)[source]

Users should call this method to pass in the program text (or other text) which is to be lexically scanned. The parameter program should be a string.

next(num=1)[source]

Return the next token, consuming from the token stream. Also sets self.token to the return value. Returns one end-token and raises StopIteration on a next after that end-token.

If num is greater than one a list of the tokens is returned. This list is cut short if the first end-token is encountered, so this kind of next call will never generate StopIteration.

peek(num_toks=1)[source]

Peek ahead in the token stream without consuming any tokens. The argument num_toks is the number of tokens ahead to peek. The default peek of num_toks=1 peeks at the token just beyond the current token. Peeking zero shows the current token. Negative peeks are allowed, and look back at the previous tokens (up to the number saved in the token buffer).

Tokens are read into the buffer on-demand to satisfy any requested peek. If max_peek_tokens is set then an exception will be raised on attempts to peek farther than that.
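
A short sketch of peeking at various offsets (assuming text has been set and at least one token has been consumed):

current = lex.peek(0)     # the current token, same as lex.token
ahead = lex.peek()        # the token just after the current one
ahead2 = lex.peek(2)      # two tokens ahead
behind = lex.peek(-1)     # the previous token, if still saved in the buffer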

move_back(num_toks=1, num_is_raw=False)[source]

NOT YET IMPLEMENTED

Move the current token back in the token stream. This method is similar to methods commonly called push_back. It is similar to go_back except that tokens are not rescanned. The current position in the token buffer is just moved back. This is more efficient than go_back but it assumes that there have been no modifications, additions, or deletions to the token definitions. If the parser is guaranteed to be static with respect to the defined tokens then this is the routine to use. Otherwise, use go_back.

The optional parameter num_toks is the number of tokens to move back. Negative numbers move forward, consuming more tokens if necessary. Moving forward will always stop before consuming a second end-token (which would raise StopIteration if done in next).

go_back(num_toks=1, num_is_raw=False)[source]

This method allows the lexer to go back in the token buffer by num_toks tokens. The call go_back(n) will undo the effects of the last n calls to next. This operation is different from the usual pushback operations because the program text will be re-scanned for the current token and later tokens (rather than simply backing up to already-scanned tokens and saving the most-recent as lookahead tokens, like with move_back).

Going back one with go_back(1) or just go_back() results in the current token being set back to the previous token and also re-scanned from the original text. Calling go_back(0) just re-scans the current token (and flushes any tokens in the lookahead buffer). Values greater than one go farther back in the token stream. Attempts to go back before the beginning of the program text go back to the beginning and stop there.

This method returns the current token after any re-scanning.

Negative numbers of tokens can be specified. When num_toks <= 0 the operation only applies to saved lookahead tokens (if there are any). The call go_back(-1) flushes all lookahead tokens saved in the buffer except the one immediately following the current token. The current offset in the token buffer never moves forward when this method is called; it can only go back or stay the same.

If num_is_raw is true then num_toks is interpreted as the actual number of tokens to go back, including any in the buffer (which are otherwise handled automatically). This can be useful when looking at lex.all_token_count to determine how far to go back and undo something.

Going back with re-scanning can be necessary when token definitions themselves change dynamically, such as by semantic actions. For example, a declaration of the string “my_fun” as a variable might dynamically add a token for that new variable, which would then stop it from matching a general identifier with a lower on_ties value (set to, say, -1). This kind of thing is also needed when swapping token tables, such as in parsing a sublanguage with a different parser. Since the sublanguage has a different collection of tokens the lookahead buffer must be re-scanned based on those tokens.
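
A sketch of the dynamic-redefinition case described above (assuming an identifier token defined with on_ties=-1, as earlier):

lex.set_text("my_fun x")

t = lex.next()
print(t.token_label)              # "k_identifier", since no more specific token exists

# Dynamically define a token for the new name, then re-scan the same text.
lex.def_token("k_my_fun", r"my_fun")
t = lex.go_back(0)                # re-scan the current token with the new definitions
print(t.token_label)              # now "k_my_fun", which wins the tie via on_ties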

get_current_state()[source]

Get a lexer state that can be returned to with go_back_to_state. States become invalid after the text is reset, but no check is made.

go_back_to_state(state)[source]

Return the lexer to the state state saved from a previous call to get_current_state.
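
A sketch of saving and later restoring the lexer state, for example during speculative scanning:

saved_state = lex.get_current_state()

lex.next()                          # tentatively consume some tokens
lex.next()

lex.go_back_to_state(saved_state)   # return to the saved position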

curr_token_is_begin()[source]

True if self.token (the last one returned by the next method) is the begin-token.

curr_token_is_first()[source]

True if self.token (the last one returned by the next function) is the first actual token in the currently-set program text. Resetting the text resets this. This value is also set as the attribute is_first on all returned tokens. This is useful, for example, for finding indentation levels (along with ignored_before_curr).

ignored_before_curr()[source]

Return the list of all tokens ignored just before self.token (the last token returned by the next function). Useful for enforcing things like syntactic whitespace requirements, along with curr_token_is_first. This collection is also set as the attribute ignored_before on all returned tokens.

curr_token_is_end()[source]

True if self.token (the last one returned by the next method) is the end-token.

is_defined_token_label(token_label)[source]

Return true if token_label is currently defined as a token label.

last_n_tokens_original_text(n)[source]

Returns the original text parsed by the last n tokens (back from and including the current token). This routine is mainly used to make error messages more helpful. It uses the token attribute original_matched_string and the tokens saved in the token buffer (which must be large enough to hold n tokens).

get_unprocessed_text(peek=1)[source]

Return all the text that is set but not yet processed. Returns None if no text is currently set. The current token is assumed to have been processed.

By default this is relative to the token at a peek of 1, but the peek number can be set to a previous or later one if available in the buffer.

get_processed_text(peek=1)[source]

Return all the text that is set and has been processed. Returns None if no text is currently set. The current token is assumed to have been processed.

By default this is relative to the current peek token, but the peek number can be set to a previous or later one if available in the buffer.

def_token(token_label, regex_string, on_ties=0, ignore=False, matcher_options=None)[source]

A convenience method to define a token. It calls the corresponding def_token method of the current TokenTable instance associated with the lexer, and does nothing else.

undef_token(token_label)[source]

A convenience function to call the corresponding undef_token of the current TokenTable instance associated with the Lexer.

def_ignored_token(token_label, regex_string, on_ties=0, matcher_options=None)[source]

A convenience function to define an ignored token without having to explicitly pass ignore=True. It just calls def_token with that value set.

def_multi_tokens(tuple_list, **kwargs)[source]

A convenience function, to define multiple tokens at once. Each element of the passed-in list should be a tuple containing the arguments to the ordinary def_token method. Called in the same order as the list. Any keyword arguments are passed on to def_token. Returns a tuple of the defined tokens.

def_multi_ignored_tokens(tuple_list, **kwargs)[source]

A convenience function, to define multiple tokens at once with ignore=True set. Each element of the passed-in list should be a tuple containing the arguments to the ordinary def_token method. Called in the same order as the list. Any keyword arguments are passed on to def_token. Returns a tuple of the defined tokens.

def_begin_end_tokens(begin_token_label, end_token_label)[source]

Define the sentinel tokens at the beginning and end of the token stream. This method must be called before using the Lexer; it will automatically be called with the default token labels if default_begin_end_tokens was set true on initialization. Returns a tuple of the new begin- and end-token subclasses. These tokens do not need to be defined with def_token because they are never actually scanned and recognized in the program text (which would also require a regex pattern).

def_default_whitespace(space_label='k_space', space_regex='[ \\t]+', newline_label='k_newline', newline_regex='[\\n\\f\\r\\v]+', matcher_options=None)[source]

Define the standard whitespace tokens for space and newline, setting them as ignored tokens.

match_next(token_label_to_match, peeklevel=1, consume=True, raise_on_fail=False, raise_on_success=False, err_msg_tokens=3)[source]

A utility function that tests whether the label of the next token equals a given token label.

This method consumes a token from the lexer if and only if there is a match. Either way, a boolean is returned indicating the match status.

If consume is false then no tokens will ever be consumed. Otherwise, and by default, a token will be consumed if and only if it matches.

The parameter peeklevel is passed to the peek function for how far ahead to look; the default is one.

If raise_on_fail is set true then a LexerException will be raised by default if the match fails. The default exception can be changed by setting the lexer instance attribute default_helper_exception. Similarly, raise_on_success raises an exception when a match is found. Either one can be set to a subclass of Exception instead of a boolean, and then that exception will be raised instead.

The parameter err_msg_tokens can be set to change how many tokens worth of text back the error messages report (as debugging information) when an exception is raised. (The count does not include whitespace, but it is printed, too.)
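
A sketch of typical match_next usage, such as when scanning a parenthesized group (assuming a k_lpar token as defined earlier and a corresponding, hypothetical k_rpar token):

# Consume a "(" only if it is actually next; a boolean is returned either way.
if lex.match_next("k_lpar"):
    inside = lex.next()                              # whatever is inside the parens
    lex.match_next("k_rpar", raise_on_fail=True)     # raise if the ")" is missing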

in_ignored_tokens(token_label_to_match, raise_on_fail=False, raise_on_success=False)[source]

A utility function to test if a particular token label is among the tokens ignored before the current token. Returns a boolean value. Like match_next, this method can be set to raise an exception on success or failure.

no_ignored_after(raise_on_fail=False, raise_on_success=False)[source]

A boolean utility function to test if any tokens were ignored between the current token and the lookahead token. Like match_next, this method can be set to raise an exception on success or failure.

no_ignored_before(raise_on_fail=False, raise_on_success=False)[source]

A boolean utility function to test if any tokens were ignored between the previous token and the current token. Like match_next, this method can be set to raise an exception on success or failure.
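
A sketch of using these helpers to enforce whitespace-sensitive rules, such as requiring a function name to be immediately followed by its opening parenthesis (token labels as in the earlier examples):

t = lex.next()                               # say, an identifier used as a function name
if t.token_label == "k_identifier" and lex.peek().token_label == "k_lpar":
    lex.no_ignored_after(raise_on_fail=True) # forbid whitespace between the name and "("

# Test whether a newline was among the tokens ignored just before the current token.
starts_new_line = lex.in_ignored_tokens("k_newline")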

typped.lexer.multi_funcall(function, tuple_list, **kwargs)[source]

A convenience function that takes a function (or method) and a list of tuples and calls function with the values in those tuples as arguments. Any unrecognized keyword arguments are passed on to function as keyword arguments. If the exception_to_raise keyword argument is provided with an exception then that exception will be raised whenever a TypeError results from the attempt to call function (defaulting to LexerException).

typped.lexer.return_first_exception(*args)[source]

Go down the argument list and return the first object that is a subclass of the Exception class. Arguments do not need to all be classes. Returns None if all fail. Used to allow an optional exception class to be passed to a function instead of True, with a default exception used if the passed-in value is not an exception class.

exception typped.lexer.BufferIndexError[source]

Bases: typped.shared_settings_and_exceptions.LexerException

Raised on attempts to read past the beginning or the end of the buffer (such as in peek methods).