10.9. typped.lexer
A Lexer class instance is a general lexer/scanner/tokenizer. It was designed
to be used by the PrattParser class, but it can also be used for other
lexical scanning applications.
The general purpose of the Lexer is to take a string of text and produce a
corresponding sequence of tokens from the text. The set of possible tokens is
defined by the user, with a string label and a regex pattern that is searched
for in the program text. Once initialized with text, a Lexer instance acts as a generator, sequentially producing tokens via its next method. It is also an iterator, so it can be used in loops and other iteration contexts.
With some lexers the order in which tokens are defined is significant: they match against a fixed list of regexes and take the first match, without regard to the length of the match. The Lexer class was designed to function independently of the order in which tokens are defined. The longest match is always returned, with ties broken by an explicit priority mechanism. This allows token definitions to be organized however the programmer prefers: they can all be made in one place, or they can be spread throughout the code in any order.
10.9.1. Defining tokens
This section describes the low-level definition of tokens when using Lexer as
a standalone application. To use tokens with a PrattParser instance, though,
you need to use the corresponding def_token method of the PrattParser
class. That class adds extra attributes, methods, etc., to the tokens. The
interface is generally the same.
A token for a left parenthesis would be defined like this:
import typped

lex = typped.Lexer()
lex.def_token("k_lpar", r"\(")
The string k_lpar is a label for the token. The use of the string prefix
“k_” is a naming convention for token labels. An identifier token could be
defined like this:
lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
Notice that in the definition of the identifier the keyword argument on_ties
is set to -1. By default the lexer always chooses the longest string which
matches a defined regex pattern for a token. If there is a tie then, by
default, an exception is raised. The on_ties value is used to break such
ties: matches of the same length are ranked by that value, and the
highest-priority match is chosen. The default on_ties value is 0. Suppose
you also wanted a token for the string mod, defined as:
lex.def_token("k_mod", r"mod")
Since this token has a higher on_ties value (the default 0 versus the
identifier's -1), it always takes precedence over the identifier token, even
though both patterns match the string mod with the same length.
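For example, the following minimal sketch (using the default begin and end tokens and an ignored whitespace token, both described in the following sections) scans the text mod modulus; the first word is reported as k_mod, while the longer second word is still an identifier:
lex = typped.Lexer(default_begin_end_tokens=True)
lex.def_token("k_space", r"[ \t]+", ignore=True)
lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
lex.def_token("k_mod", r"mod")
lex.set_text("mod modulus")
print([t.token_label for t in lex])   # expected: ['k_mod', 'k_identifier', 'k_end']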
10.9.2. Begin and end tokens
The lexer uses sentinel begin-token and end-token objects to mark the beginning
and the end of the token sequence for the text. These tokens must be explicitly
defined (i.e., given string labels), either by calling def_begin_end_tokens:
lex.def_begin_end_tokens("k_begin", "k_end")
or else by setting the default_begin_end_tokens flag to True when
initializing the lexer:
lex = Lexer(default_begin_end_tokens=True)
The default tokens have the labels k_begin and k_end.
The begin-token is never explicitly returned by next. After the call to
set_text, which defines the text to tokenize, and before any calls to next, the
begin-token is the current token lex.token. So lex.token and lex.peek(0) would
both return the begin-token.
After the end of the text the next method explicitly returns one end-token.
Calling next again raises StopIteration and halts the lexing of the
currently-set text. All peeks beyond the end of the text are reported as
end-tokens.
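A rough sketch of this behavior (the k_word token here is purely illustrative):
lex = typped.Lexer(default_begin_end_tokens=True)
lex.def_token("k_word", r"\w+")
lex.set_text("hello")
print(lex.token.is_begin_token())     # True; the begin-token is current before any next
print(lex.peek(0).is_begin_token())   # True; peek(0) also shows the current token
tok = lex.next()                      # the k_word token for "hello"
tok = lex.next()                      # the single end-token is explicitly returned
print(tok.is_end_token())             # True; a further call to lex.next() raises StopIteration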
10.9.3. Using the lexer
This is a simple example of using the lexer. Notice that multiple token definitions
can be combined using the def_multi_tokens method. It is usually better, however,
to define a shorter alias for the function call:
lex = Lexer()
lex.def_begin_end_tokens("k_begin", "k_end")
lex.def_token("k_space", r"[ \t]+", ignore=True) # note + NOT *
lex.def_token("k_newline", r"[\n\f\r\v]+", ignore=True) # note + NOT *
tokens = [
    ("k_identifier", r"[a-zA-Z_](?:\w*)"),
    ("k_plus", r"\+"),
    ]
lex.def_multi_tokens(tokens)
lex.set_text("x + y")
for t in lex:
    print(t)
The result is as follows:
<k_identifier,x>
<k_plus,+>
<k_identifier,y>
<k_end,None>
Notice that the end-token is actually returned, but the begin-token is not.
The method def_default_whitespace could alternatively be used to define the
whitespace tokens.
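A sketch of that alternative (the defaults define ignored k_space and k_newline tokens):
lex = Lexer(default_begin_end_tokens=True)
lex.def_default_whitespace()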
10.9.4. User-accessible methods and attributes of Lexer
The lexer class has many utility methods and user-accessible attributes. Some
of the main ones are listed here. One of the most commonly-accessed attributes
of a lexer lex is the current token, lex.token.
General methods:
next — return the next token
peek — peek at the next token without consuming it
go_back — go back in the text stream by some number of tokens
Helper methods:
match_next — matches the specified token, with various options
in_ignored_tokens — test if some particular token was ignored before the current one
no_ignored_after — true if no ignored tokens immediately follow the current token
no_ignored_before — true if no ignored tokens immediately precede the current token
Some boolean-valued informational methods:
curr_token_is_first — true if the current token is the first returned
text_is_set — true only when text is currently set for scanning
Other attributes:
token — the current token (the most recent one returned by next)
all_token_count — the number of tokens since the text was set (begin- and end-tokens not counted)
non_ignored_token_count — the number of non-ignored tokens since the text was set
default_helper_exception — the default exception for helpers like match_next
text_is_set — whether or not text has been set for the lexer
TODO, list more, and why not make some of these methods of TokenNode instead?
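A rough sketch using a few of these, continuing the lexer from the example above (which defines the k_identifier and k_plus tokens):
lex.set_text("x + y")
t = lex.next()                   # the k_identifier token for "x"
print(lex.peek().token_label)    # "k_plus"; lookahead only, nothing is consumed
if lex.match_next("k_plus"):     # consumes the "+" token only because it matches
    rhs = lex.next()             # the k_identifier token for "y"
lex.go_back(2)                   # undo the last two token reads; "x" is current again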
10.9.5. User-accessible attributes of tokens
The tokens returned by the lexer are instances of a subclass of the class
TokenNode (named that since the parser combines them into the nodes of a
parse tree). The subclasses themselves represent the general kind of token.
For example, if k_identifier was defined as a token label then a particular
subclass of TokenNode is created to represent identifiers in general. The
particular identifiers found in the lexed text, with their actual string
values, are represented by instances of that subclass.
User-accessible methods of tokens:
is_begin_token — true when the token is a begin-token
is_end_token — true when the token is an end-token
is_begin_or_end_token — true when the token is either a begin-token or an end-token
ignored_before_labels — just the token labels of the tokens ignored before this one
For a token named t, these attributes are available:
t.token_label — the string label of the token (which was defined with it)
t.value — the string value for the token, found in the lexed text
t.is_first — true iff this is the first non-begin token in the text
t.is_first_on_line — true iff this is the first token returned for a line
t.parent — can be set to the parent in a tree; set by the lexer to None
t.children — can be set to a list of children; set by the lexer to []
t.original_matched_string — the original text that was consumed for this token
t.line_and_char — tuple of the line number and character where the token started
t.char_index_in_program — the index of this token into the text set via set_text
t.ignored_before — a tuple of all tokens ignored immediately before this one
TODO, list other methods, too.
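For instance, a small sketch printing a few of these attributes while scanning (reusing the lexer from the earlier example):
lex.set_text("x + y")
for t in lex:
    print(t.token_label, repr(t.value), t.line_and_char)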
10.9.6. Initialization options
There are several options that can be set on initialization, including the level of token lookahead that is supported.
TODO
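As a sketch based on the initializer signature documented below (both keyword arguments are from that signature):
lex = Lexer(max_peek_tokens=2,              # limit lookahead to two tokens
            default_begin_end_tokens=True)  # use the default k_begin and k_end labels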
10.9.7. Code
class typped.lexer.TokenNode
Bases: object

The base class for token objects. Each different kind of token is represented by a subclass of this class. Instances of the tokens in the program text are represented by instances of the subclass for that kind of token.

The attribute token_label is the string token label for the kind of token represented by an instance. The attribute value is set to the actual string value in the lexed text which matched the regex of the token. The attribute ignored_before is a tuple of all tokens ignored just before the lexer got this token.

The attribute children is a list of the child nodes, and parent is the parent. Indexing a TokenNode instance also returns the corresponding child node, i.e. t_node[0] would be the leftmost child.

token_label = None

original_matched_string = ''

original_text()
Return the original text that was read in lexing the token, including any ignored text.

ignored_before_labels()
Return the list of token labels of tokens which were ignored just before this token.

append_children(*token_nodes)
Append all the arguments as children, also setting their parent to self.

convert_to_AST(convert_TokenNode_to_AST_node_fun)
Call this on the root node. Converts the token tree to an abstract syntax tree. This basically converts the nodes one-to-one to a more convenient type of node for the AST of a given application. The function convert_TokenNode_to_AST_node_fun should take one argument, a TokenNode instance, and return an AST node instance for the corresponding AST node. The only requirement for the AST nodes is that they have a method called append_children. The ast_data attribute of a node can be used to save information useful in the transformation.

summary_repr()
Token representation as a summarizing string containing both the label and the value.

tree_repr(indent=0)
Token representation as the root of a parse subtree, with formatting. The optional indent parameter can be either an indent string or else an integer for the number of spaces to indent.
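A rough sketch of using convert_to_AST with a hypothetical minimal AST node class (the only requirement on the AST nodes is an append_children method):
class MyASTNode(object):
    """A hypothetical, application-specific AST node class."""
    def __init__(self, label, value):
        self.label = label
        self.value = value
        self.children = []
    def append_children(self, *nodes):
        self.children.extend(nodes)

def token_to_ast_node(token_node):
    # Map a single TokenNode to a single AST node; convert_to_AST wires up the children.
    return MyASTNode(token_node.token_label, token_node.value)

# Assuming root_token is the root TokenNode of a parse tree:
# ast_root = root_token.convert_to_AST(token_to_ast_node)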
typped.lexer.basic_token_subclass_factory()
Create and return a new token subclass representing tokens with label token_label. This function is called from the _create_token_subclass method of TokenTable when it needs to create a new one to start with. This function should not be called directly, since additional attributes (such as the token label and a new subclass name) also need to be added to the generated subclass.

This function is the default argument to the token_subclassing_fun keyword argument of the initializer for TokenTable. Users can define their own such function in order to add methods to token objects which are particular to their own application (the PrattParser class does this, for example).

Note that using a separate subclass for each token label allows for attributes and methods specific to a kind of token to be pasted onto the class itself without conflicts. For example, the PrattParser subclass adds head handler and tail handler methods which are specific to a given token label.
class typped.lexer.TokenTable(token_subclass_factory_fun=<function basic_token_subclass_factory>, pattern_matcher_instance=None)
Bases: object

A symbol table holding subclasses of the TokenNode class for each token label defined in a Lexer instance. Also has methods for operating on tokens. Each Lexer instance contains an instance of this class to save the subclasses for the kinds of tokens which have been defined for it.

undef_token_subclass(token_label)
Un-define the token with label token_label. The TokenNode subclass previously associated with that label is removed from the dictionary.

def_token(token_label, regex_string, on_ties=0, ignore=False, matcher_options=None)
Define a token and the regex to recognize it. Returns the new token subclass.

The label token_label is the label for the kind of token.

The string regex_string is a Python regular expression defining the text strings which match for the token. If regex_string is set to None then a dummy token will be created which is never searched for in the lexed text. To better catch errors it does not have a default value, so setting it to None must be done explicitly.

Setting ignore=True will cause all such tokens to be ignored (except that they will be placed on the ignored_before list of the non-ignored token that they precede).

In case of ties for the longest match in scanning, the integer on_ties values are used to break the ties. If any two are still equal an exception will be raised.

The matcher_options parameter takes a string value, which is then passed to the insert_pattern method of whatever matcher is being used.

def_begin_token(begin_token_label)
Define the begin-token. The lexer's def_begin_end_tokens method should usually be called instead.

def_end_token(end_token_label)
Define the end-token. The def_begin_end_tokens method should usually be called instead.
class typped.lexer.TokenBuffer(token_getter_fun, max_peek=-1, max_deque_size=-1)
Bases: object

An abstraction of the token buffer. This is used internally by the Lexer class and should not usually be accessed by users. It is basically a nice wrapper over an underlying deque, but this is complicated by the need to save persistent state pointers into the buffer even in fixed-size buffers when tokens at the front get dropped.

Previous tokens are stored in the same deque as the current token and any lookahead tokens. The default indexing is relative to the current token, at current_offset, which is zero for the current token. (The current offset is itself relative to a reference point, but users do not need to know that detail.)

reset(begin_token)
Initialize the token buffer, or clear and reset it. Any saved offsets are no longer valid, but no check is made for that.

state_to_offset(state)
Return the offset into the current deque that corresponds to what was the offset (absolute index to the current token) at the time when the state was saved.

get_state()
Return a buffer state indicator that can be returned to later. The go_back or push_back methods of the lexer use this.

num_saved_previous_tokens()
Return the number of tokens before the current token that are saved.

num_tokens_after_current()
An informational method. Returns the number of tokens from the current token to the end of the token buffer. Some may have been read past the position of the current token due to peeks or pushbacks.

move_forward(num_toks=1)
Move the current token (i.e., the offset) forward by one. This is the token buffer's equivalent of next, except that it returns previously-buffered tokens if possible. The Lexer method next should always be called by users of that class, because it also handles some other things.

Attempts to move past the first end-token leave the current offset at the first end-token. No new tokens are added to the buffer. The end-token is returned.

move_back(num_toks=1)
Move the current token (i.e., the offset) back num_toks tokens. Will always stop at the begin-token. Users should check the condition if it matters. If the move attempts to go back to before the currently-saved tokens, but the begin-token is no longer saved, then a LexerException is raised.
class typped.lexer.GenTokenState
Bases: object

The state of the token_generator program execution.

ordinary = 1

end = 2

uninitialized = 3
class typped.lexer.LexerState(x, y)
Bases: tuple

x
Alias for field number 0

y
Alias for field number 1
class typped.lexer.Lexer(token_table=None, max_peek_tokens=None, max_deque_size=None, default_begin_end_tokens=False, final_mod_function=None)
Bases: object

Scans text and returns tokens, represented by instances of TokenNode subclasses. There is one subclass for each kind of token, i.e., for each token label. These subclasses themselves are assumed to have been created before any scanning operation, via the def_token method.

Token sequences are assumed to have both a begin-token and an end-token sentinel, defined via the def_begin_end_tokens method. Exactly one end-token will be returned by next; any further calls to next raise StopIteration.

The scanning is independent of the order in which tokens are defined. The longest match over all token patterns will always be the one selected. In case of ties the on_ties value (passed to def_token) is used to break them. If that fails a LexerException is raised.

If no token table is passed into __init__ the Lexer will create its own empty one.

ERROR_MSG_TEXT_SNIPPET_SIZE = 40

DEFAULT_BEGIN = 'k_begin'

DEFAULT_END = 'k_end'
reset(token_table=None, max_peek_tokens=None, max_deque_size=None, default_begin_end_tokens=False, final_mod_function=None)
Return the lexer to the initial state. Takes the same arguments as the initializer.

set_token_table(token_table, go_back=0)
Sets the current TokenTable instance for the lexer to token_table. This is called on initialization, but can also be called at any time. If text is being scanned at the time then it flushes the current and lookahead tokens and re-scans the current token.

When set with this method the token table is always given the attribute lex, which points to the lexer instance that this method was called from. This attribute is used by tokens (which know their fixed symbol table) so they can find the current lexer (to call next, etc.).

set_text(program, reset_linenumber=True, reset_charnumber=True)
Users should call this method to pass in the program text (or other text) which is to be lexically scanned. The parameter program should be a string.

next(num=1)
Return the next token, consuming from the token stream. Also sets self.token to the return value. Returns one end-token and raises StopIteration on a next call after that end-token.

If num is greater than one a list of the tokens is returned. This list is cut short if the first end-token is encountered, so this kind of next call will never generate StopIteration.
peek(num_toks=1)
Peek ahead in the token stream without consuming any tokens. The argument num_toks is the number of tokens ahead to peek. The default peek of num_toks=1 peeks at the token just beyond the current token. Peeking zero shows the current token. Negative peeks are allowed, and look back at the previous tokens (up to the number saved in the token buffer).

Tokens are read into the buffer on demand to satisfy any requested peek. If max_peek_tokens is set then an exception will be raised on attempts to peek farther than that.

move_back(num_toks=1, num_is_raw=False)
NOT YET IMPLEMENTED

Move the current token back in the token stream. This method is similar to methods commonly called push_back. It is similar to go_back except that tokens are not re-scanned. The current position in the token buffer is just moved back. This is more efficient than go_back but it assumes that there have been no modifications, additions, or deletions to the token definitions. If the parser is guaranteed to be static with respect to the defined tokens then this is the routine to use. Otherwise, use go_back.

The optional parameter num_toks is the number of tokens to move back. Negative numbers move forward, consuming more tokens if necessary. Moving forward will always stop before consuming a second end-token (which would raise StopIteration if done in next).

go_back(num_toks=1, num_is_raw=False)
This method allows the lexer to go back in the token buffer by num_toks tokens. The call go_back(n) will undo the effects of the last n calls to next. This operation is different from the usual pushback operations because the program text will be re-scanned for the current token and later tokens (rather than simply backing up to already-scanned tokens and saving the most-recent as lookahead tokens, as with move_back).

Going back one with go_back(1) or just go_back() results in the current token being set back to the previous token and also re-scanned from the original text. Calling go_back(0) just re-scans the current token (and flushes any tokens in the lookahead buffer). Values greater than one go farther back in the token stream. Attempts to go back before the beginning of the program text go back to the beginning and stop there.

This method returns the current token after any re-scanning.

Negative numbers of tokens can be specified. When num_toks <= 0 the operation only applies to saved lookahead tokens (if there are any). The call go_back(-1) flushes all lookahead tokens saved in the buffer except the one immediately following the current token. The current offset in the token buffer never moves forward when this method is called; it can only go back or stay the same.

If num_is_raw is true then num_toks is interpreted as the actual number of tokens to go back, including any in the buffer (which are otherwise handled automatically). This can be useful when looking at lex.all_token_count to determine how far to go back and undo something.

Going back with re-scanning can be necessary when token definitions themselves change dynamically, such as by semantic actions. For example, a declaration of the string "my_fun" as a variable might dynamically add a token for that new variable, which would then stop it from matching a general identifier with a lower on_ties value (set to, say, -1). This kind of thing is also needed when swapping token tables, such as when parsing a sublanguage with a different parser. Since the sublanguage has a different collection of tokens, the lookahead buffer must be re-scanned based on those tokens.
get_current_state()
Get a lexer state that can be returned to with go_back_to_state. States become invalid after the text is reset, but no check is made.

go_back_to_state(state)
Return the lexer to the state state saved from a previous call to get_current_state.
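A rough sketch of saving and restoring the lexer position with these two methods (useful, for example, for speculative scanning):
saved_state = lex.get_current_state()
t1 = lex.next()                       # read ahead a couple of tokens
t2 = lex.next()
lex.go_back_to_state(saved_state)     # the current token is back where the state was saved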
curr_token_is_begin()
True if self.token (the last one returned by the next method) is the begin-token.

curr_token_is_first()
True if self.token (the last one returned by the next method) is the first actual token in the currently-set program text. Resetting the text resets this. This value is also set as the attribute is_first on all returned tokens. This is useful, for example, for finding indentation levels (along with ignored_before_curr).

ignored_before_curr()
Return the list of all tokens ignored just before self.token (the last token returned by the next method). Useful for enforcing things like syntactic whitespace requirements, along with curr_token_is_first. This list is also set as the attribute ignored_before_tokens on all returned tokens.
curr_token_is_end()
True if self.token (the last one returned by the next method) is the end-token.

is_defined_token_label(token_label)
Return true if token_label is currently defined as a token label.

last_n_tokens_original_text(n)
Returns the original text parsed by the last n tokens (back from and including the current token). This routine is mainly used to make error messages more helpful. It uses the token attribute original_matched_string and the saved tokens in the token buffer (which must be large enough for n).

get_unprocessed_text(peek=1)
Return all the text that is set but not yet processed. Returns None if no text is currently set. The current token is assumed to have been processed.

By default this is relative to the token at a peek of 1, but the peek number can be set to a previous or later one if available in the buffer.

get_processed_text(peek=1)
Return all the text that is set and has been processed. Returns None if no text is currently set. The current token is assumed to have been processed.

By default this is relative to the current peek token, but the peek number can be set to a previous or later one if available in the buffer.
def_token(token_label, regex_string, on_ties=0, ignore=False, matcher_options=None)
A convenience method to define a token. It calls the corresponding def_token method of the current TokenTable instance associated with the lexer, and does nothing else.

undef_token(token_label)
A convenience method to call the corresponding undef_token of the current TokenTable instance associated with the lexer.

def_ignored_token(token_label, regex_string, on_ties=0, matcher_options=None)
A convenience method to define an ignored token without having to explicitly set ignore=True. This just calls def_token with that value set.

def_multi_tokens(tuple_list, **kwargs)
A convenience method to define multiple tokens at once. Each element of the passed-in list should be a tuple containing the arguments to the ordinary def_token method. The calls are made in the same order as the list. Any keyword arguments are passed on to def_token. Returns a tuple of the defined tokens.

def_multi_ignored_tokens(tuple_list, **kwargs)
A convenience method to define multiple tokens at once with ignore=True set. Each element of the passed-in list should be a tuple containing the arguments to the ordinary def_token method. The calls are made in the same order as the list. Any keyword arguments are passed on to def_token. Returns a tuple of the defined tokens.

def_begin_end_tokens(begin_token_label, end_token_label)
Define the sentinel tokens at the beginning and end of the token stream. This method must be called before using the Lexer. It will be called automatically, using the default token label values, if default_begin_end_tokens was set true on initialization. Returns a tuple of the new begin- and end-token subclasses. These tokens do not need to be defined with def_token because they are never actually scanned and recognized in the program text (which would also require a regex pattern).

def_default_whitespace(space_label='k_space', space_regex='[ \\t]+', newline_label='k_newline', newline_regex='[\\n\\f\\r\\v]+', matcher_options=None)
Define the standard whitespace tokens for space and newline, setting them as ignored tokens.
match_next(token_label_to_match, peeklevel=1, consume=True, raise_on_fail=False, raise_on_success=False, err_msg_tokens=3)
A utility method that tests whether the label of the next token equals a given token label.

This method consumes a token from the lexer if and only if there is a match. Either way, a boolean is returned indicating the match status.

If consume is false then no tokens will ever be consumed. Otherwise, and by default, a token will be consumed if and only if it matches.

The parameter peeklevel is passed to the peek function for how far ahead to look; the default is one.

If raise_on_fail is set true then a LexerException will be raised by default if the match fails. The default exception can be changed by setting the lexer instance attribute default_helper_exception. Similarly, raise_on_success raises an exception when a match is found. Either one can be set to a subclass of Exception instead of a boolean, and then that exception will be raised.

The parameter err_msg_tokens can be set to change how many tokens' worth of text the error messages report back (as debugging information) when an exception is raised. (The count does not include whitespace, but whitespace is printed, too.)

in_ignored_tokens(token_label_to_match, raise_on_fail=False, raise_on_success=False)
A utility method to test if a particular token label is among the tokens ignored before the current token. Returns a boolean value. Like match_next, this method can be set to raise an exception on success or failure.
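A rough sketch of these two helpers, reusing the token labels from the earlier examples:
lex.set_text("x + y")
lex.next()                                    # the k_identifier token for "x"
if lex.match_next("k_plus"):                  # consumes the "+" token only on a match
    rhs = lex.next()                          # the k_identifier token for "y"
    if lex.in_ignored_tokens("k_space"):      # was a space ignored just before "y"?
        print("the operand was preceded by whitespace")
lex.match_next("k_end", raise_on_fail=True)   # require the end of the text, or raise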
typped.lexer.multi_funcall(function, tuple_list, **kwargs)
A convenience function that takes a function (or method) and a list of tuples and calls function with the values in each tuple as arguments. Any unrecognized keyword arguments are passed on to the function function as keyword arguments. If the exception_to_raise keyword argument is provided with an exception then that exception will be raised whenever a TypeError results from the attempt to call function (defaulting to LexerException).
typped.lexer.return_first_exception(*args)
Go down the argument list and return the first object that is a subclass of the Exception class. Arguments do not need to all be classes. Returns None if all fail. Used to allow an optional exception class to be passed to a function instead of true, with a default called if the passed-in value is not an exception.
exception typped.lexer.BufferIndexError
Bases: typped.shared_settings_and_exceptions.LexerException

Raised on attempts to read past the beginning or the end of the buffer (such as in peek methods).