10.9. typped.lexer¶
A Lexer class instance is a general lexer/scanner/tokenizer. It was designed to be used by the PrattParser class, but it can also be used for other lexical scanning applications.
The general purpose of the Lexer is to take a string of text and produce a corresponding sequence of tokens from the text. The set of possible tokens is defined by the user, with a string label and a regex pattern that is searched for in the program text. Once initialized with text, a Lexer instance is a generator which sequentially produces tokens with its next method. It is also an iterator, so it can be used in loops, etc.
With some lexers the order in which tokens are defined is significant: they match regexes from a list of regexes, taking the first match without regard to the length of the match. The Lexer class was designed to function independently of the order in which tokens are defined. The longest match is always returned, with ties broken by an explicit priority mechanism. This allows token definitions to be organized in various ways: they can all be made in one place or spread around in the code, in whatever order the programmer prefers.
10.9.1. Defining tokens¶
This section describes the low-level definition of tokens when using Lexer as a standalone application. To use tokens with a PrattParser instance, though, you need to use the corresponding def_token method of the PrattParser class. That class adds extra attributes, methods, etc., to the tokens. The interface is generally the same.
A token for a left parenthesis would be defined like this:
lex = typped.Lexer()
lex.def_token("k_lpar", r"lpar")
The string k_lpar is a label for the token. The use of the string prefix "k_" is a naming convention for token labels. An identifier token could be defined like this:
lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
Notice that in the definition of the identifier the keyword argument on_ties is set to -1. The lexer will by default always choose the longest string which matches a defined regex pattern for a token. If there is a tie then, by default, an exception will be raised. The on_ties value is used to break ties: strings of the same length are sorted by that value and the highest-priority string is chosen. The default on_ties value is 0. Suppose you also wanted a token for the string mod, and defined it as:
lex.def_token("k_mod", r"mod")
Since this token keeps the default on_ties value of 0, which is higher than the identifier token's -1, it will always take precedence over the identifier token, even though both match and have the same length.
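The tie-breaking can be seen by scanning a short string that contains both the bare word mod and a longer identifier. This is a minimal sketch assuming the definitions above, plus an ignored whitespace token and the default begin/end tokens:
import typped

lex = typped.Lexer(default_begin_end_tokens=True)
lex.def_token("k_space", r"[ \t]+", ignore=True)                # ignored whitespace
lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
lex.def_token("k_mod", r"mod")                                  # default on_ties=0

lex.set_text("x mod modulus")
print([t.token_label for t in lex])
# The bare "mod" ties with the identifier pattern at length 3 and the higher
# on_ties value wins; "modulus" is a longer match for the identifier pattern.
# Expected: ['k_identifier', 'k_mod', 'k_identifier', 'k_end']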
10.9.2. Begin and end tokens¶
The lexer uses sentinel begin-token and end-token tokens for the beginning and the end of the token sequence for text. These tokens must be explicitly defined (i.e., given string labels) either by calling def_begin_end_tokens:
lex.def_begin_end_tokens("k_begin", "k_end")
or else by setting the default_begin_end_tokens flag to True when initializing the lexer:
lex = Lexer(default_begin_end_tokens=True)
The default tokens have the labels k_begin and k_end.
The begin-token is never explicitly returned. After the call to set_text to define the text to tokenize, and before any calls to next, the begin-token is the current token lexer.token. So lex.token and lex.peek(0) would both return the begin-token.
After the end of the text the next method explicitly returns one end-token. Calling next again raises StopIteration and halts the lexing of the currently-set text. All peeks beyond the end of the text are reported as end-tokens.
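A quick sketch of this sentinel behavior, assuming only a single identifier token is defined:
import typped

lex = typped.Lexer(default_begin_end_tokens=True)
lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)")

lex.set_text("abc")
print(lex.token.is_begin_token())    # True: begin-token is current before any next()
print(lex.peek(0).is_begin_token())  # True: peek(0) shows the current token

print(lex.next().token_label)        # 'k_identifier'
print(lex.peek().is_end_token())     # True: peeks beyond the text report end-tokens
print(lex.next().is_end_token())     # True: exactly one end-token is returned
# A further call to lex.next() would raise StopIteration.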
10.9.3. Using the lexer¶
This is a simple example of using the lexer. Notice that multiple token definitions can be combined using the def_multi_tokens method. It is usually better to define a shorter alias for the function call, however:
lex = Lexer()
lex.def_begin_end_tokens("k_begin", "k_end")
lex.def_token("k_space", r"[ \t]+", ignore=True)         # note + NOT *
lex.def_token("k_newline", r"[\n\f\r\v]+", ignore=True)  # note + NOT *
tokens = [
        ("k_identifier", r"[a-zA-Z_](?:\w*)"),
        ("k_plus", r"\+"),
        ]
lex.def_multi_tokens(tokens)
lex.set_text("x + y")
for t in lex:
    print(t)
The result is as follows:
<k_identifier,x>
<k_plus,+>
<k_identifier,y>
<k_end,None>
Notice that the end-token is actually returned, but the begin-token is not.
The method def_default_whitespace could alternately be used to define the whitespace tokens.
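For example, the two ignored whitespace definitions above could be replaced by a single call, using the documented default labels and regexes (a sketch):
lex = Lexer()
lex.def_begin_end_tokens("k_begin", "k_end")
lex.def_default_whitespace()   # defines k_space and k_newline as ignored tokens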
10.9.4. User-accessible methods and attributes of Lexer¶
The lexer class has many utility methods and user-accessible attributes. Some of the main ones are listed here. One of the most commonly-accessed attributes of a lexer lex is the current token, lex.token.
General methods:
- next — return the next token
- peek — peek at the next token without consuming it
- go_back — go back in the text stream by some number of tokens
Helper methods:
- match_next — matches the specified token, with various options
- in_ignored_tokens — test if some particular token was ignored before the current one
- no_ignored_after — true if no ignored tokens immediately follow the current token
- no_ignored_before — true if no ignored tokens immediately precede the current token
Some boolean-valued informational methods:
- curr_token_is_first — true if the current token is the first returned
- text_is_set — true only when text is currently set for scanning
Other attributes:
- token — the current token (the most recent one returned by next)
- all_token_count — number of tokens since the text was set (begin and end not counted)
- non_ignored_token_count — number of non-ignored tokens since the text was set
- default_helper_exception — the default exception for helpers like match_next
- text_is_set — whether or not text has been set for the lexer
TODO, list more, and why not make some of these methods of TokenNode instead?
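These helpers compose naturally with next and peek in parsing code. The following is a small sketch showing peeking, matching, and backing up (the token definitions reuse those from the example above):
lex = Lexer()
lex.def_begin_end_tokens("k_begin", "k_end")
lex.def_default_whitespace()
lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)")
lex.def_token("k_plus", r"\+")

lex.set_text("x + y")
print(lex.next().value)            # 'x'   -- consume the first token
print(lex.peek().token_label)      # 'k_plus', peeked at but not consumed
print(lex.match_next("k_plus"))    # True, and the '+' token is consumed
print(lex.match_next("k_plus"))    # False, 'y' does not match; nothing is consumed
lex.go_back(1)                     # undo the last consumed token ('+')
print(lex.next().token_label)      # 'k_plus' again, re-scanned from the text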
10.9.5. User-accessible attributes of tokens¶
The tokens returned by the lexer are instances of a subclass of the class TokenNode (named that since the parser combines them into the nodes of a parse tree). The subclasses themselves represent the general kind of token; for example, if k_identifier was defined as a token label then a particular subclass of TokenNode would be created to represent identifiers in general. The particular instances of identifiers, found in the lexed text with their actual string values, are represented by instances of the general class for identifiers.
User-accessible methods of tokens:
- is_begin_token — true when the token is a begin-token
- is_end_token — true when the token is an end-token
- is_begin_or_end_token — true when the token is either a begin-token or an end-token
- ignored_before_labels — just the token labels of the tokens ignored before
For a token named t, these attributes are available:
- t.token_label — the string label of the token (which was defined with it)
- t.value — the string value for the token, found in the lexed text
- t.is_first — true iff this is the first non-begin token in the text
- t.is_first_on_line — true iff this is the first token returned for a line
- t.parent — can be set to the parent in a tree; set by the lexer to None
- t.children — can be set to a list of children; set by the lexer to []
- t.original_matched_string — the original text that was consumed for this token
- t.line_and_char — tuple of line number and character where the token started
- t.char_index_in_program — the index of this token into the text set via set_text
- t.ignored_before — a tuple of all tokens ignored immediately before this one
TODO, list other methods, too.
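A brief sketch of reading a few of these attributes from scanned tokens (assuming the whitespace and identifier tokens from the earlier examples):
lex = Lexer(default_begin_end_tokens=True)
lex.def_default_whitespace()
lex.def_token("k_identifier", r"[a-zA-Z_](?:\w*)")

lex.set_text("alpha beta")
t = lex.next()
print(t.token_label)     # 'k_identifier'
print(t.value)           # 'alpha'
print(t.is_first)        # True: first non-begin token in the text
print(t.line_and_char)   # (line number, character) where the token started

t2 = lex.next()
print([tok.token_label for tok in t2.ignored_before])   # ['k_space']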
10.9.6. Initialization options¶
There are several options that can be set on initialization, including the level of token lookahead that is supported.
TODO
10.9.7. Code¶
class typped.lexer.TokenNode
Bases: object
The base class for token objects. Each different kind of token is represented by a subclass of this class. Instances of the tokens in the program text are represented by instances of the subclass for that kind of token.
The attribute token_label is the string token label for the kind of token represented by an instance. The attribute value is set to the actual string value in the lexed text which matched the regex of the token. The attribute ignored_before is a tuple of all tokens ignored just before the lexer got this token.
The attribute children is a list of the child nodes, and parent is the parent. Indexing a TokenNode class also returns the corresponding child node, i.e. t_node[0] would be the leftmost child.

token_label = None

original_matched_string = ''

original_text()
Return the original text that was read in lexing the token, including any ignored text.

ignored_before_labels()
Return the list of token labels of tokens which were ignored just before this token.

append_children(*token_nodes)
Append all the arguments as children, also setting their parent to self.

convert_to_AST(convert_TokenNode_to_AST_node_fun)
Call this on the root node. Converts the token tree to an abstract syntax tree. This basically converts the nodes one-to-one to a more convenient type of node for the AST of a given application. The function convert_TokenNode_to_AST_node_fun should take one argument, a TokenNode instance, and return an AST node instance for the corresponding AST node. The only requirement for the AST nodes is that they have a method called append_children. The ast_data attribute of a node can be used to save information useful in the transformation.
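A minimal sketch of such a conversion function; the MyASTNode class is hypothetical, and any class with an append_children method would work:
class MyASTNode(object):
    # A hypothetical application AST node type.
    def __init__(self, label, value):
        self.label = label
        self.value = value
        self.children = []
    def append_children(self, *nodes):
        self.children.extend(nodes)

def token_to_ast_node(token_node):
    # One-to-one mapping from a TokenNode to an application AST node.
    return MyASTNode(token_node.token_label, token_node.value)

# Called on the root of a token tree built by a parser:
# ast_root = root_token.convert_to_AST(token_to_ast_node)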
summary_repr()
Token representation as a summarizing string containing both the label and the value.

tree_repr(indent=0)
Token representation as the root of a parse subtree, with formatting. The optional indent parameter can be either an indent string or else an integer for the number of spaces to indent.
typped.lexer.basic_token_subclass_factory()
Create and return a new token subclass representing tokens with label token_label. This function is called from the _create_token_subclass method of TokenTable when it needs to create a new one to start with. This function should not be called directly, since additional attributes (such as the token label and a new subclass name) also need to be added to the generated subclass.
This function is the default argument to the token_subclass_factory_fun keyword argument of the initializer for TokenTable. Users can define their own such function in order to add methods to token objects which are particular to their own application (the PrattParser class does this, for example).
Note that using a separate subclass for each token label allows for attributes and methods specific to a kind of token to be pasted onto the class itself without conflicts. For example, the PrattParser subclass adds head handler and tail handler methods which are specific to a given token label.
class typped.lexer.TokenTable(token_subclass_factory_fun=<function basic_token_subclass_factory>, pattern_matcher_instance=None)
Bases: object
A symbol table holding subclasses of the TokenNode class for each token label defined in a Lexer instance. Also has methods for operating on tokens. Each Lexer instance contains an instance of this class to save the subclasses for the kinds of tokens which have been defined for it.

undef_token_subclass(token_label)
Un-define the token with label token_label. The TokenNode subclass previously associated with that label is removed from the dictionary.

def_token(token_label, regex_string, on_ties=0, ignore=False, matcher_options=None)
Define a token and the regex to recognize it. Returns the new token subclass.
The string token_label is the label for the kind of token.
The string regex_string is a Python regular expression defining the text strings which match for the token. If regex_string is set to None then a dummy token will be created which is never searched for in the lexed text. To better catch errors it does not have a default value, so setting it to None must be done explicitly.
Setting ignore=True will cause all such tokens to be ignored (except that they will be placed on the ignored_before list of the non-ignored token that they precede).
In case of ties for the longest match in scanning, the integer on_ties values are used to break the ties. If any two are still equal an exception will be raised.
The matcher_options parameter takes a string value, which is then passed to the insert_pattern method of whatever matcher is being used.

def_begin_token(begin_token_label)
Define the begin-token. The lexer's def_begin_end_tokens method should usually be called instead.

def_end_token(end_token_label)
Define the end-token. The def_begin_end_tokens method should usually be called instead.
class typped.lexer.TokenBuffer(token_getter_fun, max_peek=-1, max_deque_size=-1)
Bases: object
An abstraction of the token buffer. This is used internally by the Lexer class and should not usually be accessed by users. It is basically a nice wrapper over an underlying deque, but this is complicated by the need to save persistent state pointers into the buffer even in fixed-size buffers when tokens at the front get dropped.
Previous tokens are stored in the same deque as the current token and any lookahead tokens. The default indexing is relative to the current token, at current_offset, which is zero for the current token. (The current offset is itself relative to a reference point, but users do not need to know that detail.)

reset(begin_token)
Initialize the token buffer, or clear and reset it. Any saved offsets are no longer valid, but no check is made for that.

state_to_offset(state)
Return the offset into the current deque that corresponds to what was the offset (absolute index to the current token) at the time when the state was saved.

get_state()
Return a buffer state indicator that can be returned to later. The go_back or push_back methods of the lexer use this.

num_saved_previous_tokens()
Return the number of tokens before the current token that are saved.

num_tokens_after_current()
An informational method. Returns the number of tokens from the current token to the end of the token buffer. Some may have been read past the position of the current token due to peeks or pushbacks.

move_forward(num_toks=1)
Move the current token (i.e., the offset) forward by one. This is the token buffer's equivalent of next, except that it returns previously-buffered tokens if possible. The Lexer method next should always be called by users of that class, because it also handles some other things.
Attempts to move past the first end-token leave the current offset at the first end-token. No new tokens are added to the buffer. The end-token is returned.

move_back(num_toks=1)
Move the current token (i.e., offset) back num_toks tokens. Will always stop at the begin-token. Users should check the condition if it matters. If the move attempts to move back to before the currently-saved tokens, but the begin-token is no longer saved, then a LexerException is raised.
class typped.lexer.GenTokenState
Bases: object
The state of the token_generator program execution.

ordinary = 1

end = 2

uninitialized = 3
class typped.lexer.LexerState(x, y)
Bases: tuple

x
Alias for field number 0

y
Alias for field number 1
class typped.lexer.Lexer(token_table=None, max_peek_tokens=None, max_deque_size=None, default_begin_end_tokens=False, final_mod_function=None)
Bases: object
Scans text and returns tokens, represented by instances of TokenNode subclasses. There is one subclass for each kind of token, i.e., for each token label. These subclasses themselves are assumed to have been created before any scanning operation, via the def_token method.
Token sequences are assumed to have both a begin-token and an end-token sentinel, defined via the def_begin_end_tokens method. Exactly one end-token will be returned by next; any further calls to next raise StopIteration.
The scanning is independent of the order in which tokens are defined. The longest match over all token patterns will always be the one selected. In case of ties the on_ties value (passed to def_token) is used to break it. If that fails a LexerException is raised.
If no token table is passed into __init__ the Lexer will create its own empty one.

ERROR_MSG_TEXT_SNIPPET_SIZE = 40

DEFAULT_BEGIN = 'k_begin'

DEFAULT_END = 'k_end'
reset(token_table=None, max_peek_tokens=None, max_deque_size=None, default_begin_end_tokens=False, final_mod_function=None)
Return the lexer to the initial state. Takes the same arguments as the initializer.

set_token_table(token_table, go_back=0)
Sets the current TokenTable instance for the lexer to token_table. This is called on initialization, but can also be called at any time. If text is being scanned at the time then it flushes the current and lookahead tokens and re-scans the current token.
When set with this method the token table is always given the attribute lex, which points to the lexer instance that this method was called from. This attribute is used by tokens (which know their fixed symbol table) so they can find the current lexer (to call next, etc.)

set_text(program, reset_linenumber=True, reset_charnumber=True)
Users should call this method to pass in the program text (or other text) which is to be lexically scanned. The parameter program should be a string.

next(num=1)
Return the next token, consuming from the token stream. Also sets self.token to the return value. Returns one end-token and raises StopIteration on a next after that end-token.
If num is greater than one a list of the tokens is returned. This list is cut short if the first end-token is encountered, so this kind of next call will never generate StopIteration.
peek(num_toks=1)
Peek ahead in the token stream without consuming any tokens. The argument num_toks is the number of tokens ahead to peek. The default peek of num_toks=1 peeks at the token just beyond the current token. Peeking zero shows the current token. Negative peeks are allowed, and look back at the previous tokens (up to the number saved in the token buffer).
Tokens are read into the buffer on-demand to satisfy any requested peek. If max_peek_tokens is set then an exception will be raised on attempts to peek farther than that.
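A sketch of the different peek directions, assuming identifier and whitespace tokens are defined as in the earlier examples:
lex.set_text("a b c")
lex.next()                  # current token is 'a'
print(lex.peek().value)     # 'b'  -- the token just beyond the current one
print(lex.peek(2).value)    # 'c'
print(lex.peek(0).value)    # 'a'  -- peeking zero shows the current token
print(lex.peek(-1))         # the previous token (here the begin-token)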
move_back(num_toks=1, num_is_raw=False)
NOT YET IMPLEMENTED
Move the current token back in the token stream. This method is similar to methods commonly called push_back. It is similar to go_back except that tokens are not rescanned. The current position in the token buffer is just moved back. This is more efficient than go_back but it assumes that there have been no modifications, additions, or deletions to the token definitions. If the parser is guaranteed to be static with respect to the defined tokens then this is the routine to use. Otherwise, use go_back.
The optional parameter num_toks is the number of tokens to move back. Negative numbers move forward, consuming more tokens if necessary. Moving forward will always stop before consuming a second end-token (which would raise StopIteration if done in next).
go_back(num_toks=1, num_is_raw=False)
This method allows the lexer to go back in the token buffer by num_toks tokens. The call go_back(n) will undo the effects of the last n calls to next. This operation is different from the usual pushback operations because the program text will be re-scanned for the current token and later tokens (rather than simply backing up to already-scanned tokens and saving the most-recent as lookahead tokens, like with move_back).
Going back one with go_back(1) or just go_back() results in the current token being set back to the previous token and also re-scanned from the original text. Calling go_back(0) just re-scans the current token (and flushes any tokens in the lookahead buffer). Values greater than one go farther back in the token stream. Attempts to go back before the beginning of the program text go back to the beginning and stop there.
This method returns the current token after any re-scanning.
Negative numbers of tokens can be specified. When num_toks <= 0 the operation only applies to saved lookahead tokens (if there are any). The call go_back(-1) flushes all lookahead tokens saved in the buffer except the one immediately following the current token. The current offset in the token buffer never moves forward when this method is called; it can only go back or stay the same.
If num_is_raw is true then num_toks is interpreted as the actual number of tokens to go back, including any in the buffer (which are otherwise handled automatically). This can be useful when looking at lex.all_token_count to determine how far to go back and undo something.
Going back with re-scanning can be necessary when token definitions themselves change dynamically, such as by semantic actions. For example, a declaration of the string "my_fun" as a variable might dynamically add a token for that new variable, which would then stop it from matching a general identifier with a lower on_ties value (set to, say, -1). This kind of thing is also needed when swapping token tables, such as in parsing a sublanguage with a different parser. Since the sublanguage has a different collection of tokens the lookahead buffer must be re-scanned based on those tokens.
get_current_state()
Get a lexer state that can be returned to with go_back_to_state. States become invalid after the text is reset, but no check is made.
go_back_to_state(state)
Return the lexer to the state state saved from a previous call to get_current_state.
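A small sketch of saving and restoring a scan position with these two methods:
state = lex.get_current_state()
lex.next()                      # consume some tokens while trying something
lex.next()
lex.go_back_to_state(state)     # return to the saved position in the token stream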
curr_token_is_begin()
True if self.token (the last one returned by the next method) is the begin-token.

curr_token_is_first()
True if self.token (the last one returned by the next function) is the first actual token in the currently-set program text. Resetting the text resets this. This value is also set as the attribute is_first on all returned tokens. This is useful, for example, for finding indentation levels (along with ignored_before_curr).

ignored_before_curr()
Return the list of all tokens ignored just before self.token (the last token returned by the next function). Useful for enforcing things like syntactic whitespace requirements, along with curr_token_is_first. This list is also set as the attribute ignored_before_tokens on all returned tokens.

curr_token_is_end()
True if self.token (the last one returned by the next method) is the end-token.

is_defined_token_label(token_label)
Return true if token_label is currently defined as a token label.

last_n_tokens_original_text(n)
Returns the original text parsed by the last n tokens (back from and including the current token). This routine is mainly used to make error messages more helpful. It uses the token attribute original_matched_string and the saved tokens in the token buffer (which must be large enough for n).

get_unprocessed_text(peek=1)
Return all the text that is set but not yet processed. Returns None if no text is currently set. The current token is assumed to have been processed.
By default this is relative to the token at a peek of 1, but the peek number can be set to a previous or later one if available in the buffer.

get_processed_text(peek=1)
Return all the text that is set and has been processed. Returns None if no text is currently set. The current token is assumed to have been processed.
By default this is relative to the current peek token, but the peek number can be set to a previous or later one if available in the buffer.
def_token(token_label, regex_string, on_ties=0, ignore=False, matcher_options=None)
A convenience method to define a token. It calls the corresponding def_token method of the current TokenTable instance associated with the lexer, and does nothing else.

undef_token(token_label)
A convenience function to call the corresponding undef_token of the current TokenTable instance associated with the Lexer.

def_ignored_token(token_label, regex_string, on_ties=0, matcher_options=None)
A convenience function to define an ignored token without having to set ignore=True explicitly. This just calls def_token with that value set.

def_multi_tokens(tuple_list, **kwargs)
A convenience function, to define multiple tokens at once. Each element of the passed-in list should be a tuple containing the arguments to the ordinary def_token method. Called in the same order as the list. Any keyword arguments are passed on to def_token. Returns a tuple of the defined tokens.

def_multi_ignored_tokens(tuple_list, **kwargs)
A convenience function, to define multiple tokens at once with ignore=True set. Each element of the passed-in list should be a tuple containing the arguments to the ordinary def_token method. Called in the same order as the list. Any keyword arguments are passed on to def_token. Returns a tuple of the defined tokens.

def_begin_end_tokens(begin_token_label, end_token_label)
Define the sentinel tokens at the beginning and end of the token stream. This method must be called before using the Lexer. It will automatically be called using default token label values unless default_begin_end_tokens was set false on initialization. Returns a tuple of the new begin- and end-token subclasses. These tokens do not need to be defined with def_token because they are never actually scanned and recognized in the program text (which would also require a regex pattern).

def_default_whitespace(space_label='k_space', space_regex='[ \\t]+', newline_label='k_newline', newline_regex='[\\n\\f\\r\\v]+', matcher_options=None)
Define the standard whitespace tokens for space and newline, setting them as ignored tokens.
match_next(token_label_to_match, peeklevel=1, consume=True, raise_on_fail=False, raise_on_success=False, err_msg_tokens=3)
A utility function that tests whether the label of the next token equals a given token label.
This method consumes a token from the lexer if and only if there is a match. Either way, a boolean is returned indicating the match status.
If consume is false then no tokens will ever be consumed. Otherwise, and by default, a token will be consumed if and only if it matches.
The parameter peeklevel is passed to the peek function for how far ahead to look; the default is one.
If raise_on_fail is set true then a LexerException will be raised by default if the match fails. The default can be changed by setting the lexer instance attribute default_helper_exception. Similarly, raise_on_success raises an exception when a match is found. Either one can be set to a subclass of Exception instead of a boolean, and then that exception will be called.
The parameter err_msg_tokens can be set to change how many tokens worth of text back the error messages report (as debugging information) when an exception is raised. (The count does not include whitespace, but it is printed, too.)
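A small sketch of the consume and raise_on_fail options, using the k_lpar token from the earlier definition example:
lex.match_next("k_lpar", consume=False)        # just test the next token, never consume
lex.match_next("k_lpar", raise_on_fail=True)   # consume '(' or raise the default helper exception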
in_ignored_tokens(token_label_to_match, raise_on_fail=False, raise_on_success=False)
A utility function to test if a particular token label is among the tokens ignored before the current token. Returns a boolean value. Like match_next, this method can be set to raise an exception on success or failure.
typped.lexer.multi_funcall(function, tuple_list, **kwargs)
A convenience function that takes a function (or method) and a list of tuples and calls function with the values in those tuples as arguments. Any unrecognized keyword arguments are passed on to the function function as keyword arguments. If the exception_to_raise keyword argument is provided with an exception then that exception will be called whenever a TypeError results from the attempt to call function (defaulting to LexerException).
typped.lexer.return_first_exception(*args)
Go down the argument list and return the first object that is a subclass of the Exception class. Arguments do not need to all be classes. Returns None if all fail. Used to allow an optional exception class to be passed to a function instead of true, with a default called if the passed-in value is not an exception.
exception typped.lexer.BufferIndexError
Bases: typped.shared_settings_and_exceptions.LexerException
Raised on attempts to read past the beginning or the end of the buffer (such as in peek methods).