1. Basic usage¶
This section gives an overview of the basic usage of the Typped package. Later sections go into many more details. Executable code for all the examples on this page can be found in the file examples/basic_usage_examples.py.
The PrattParser
class is the main class provided by the Typped package. It
is the only class that is needed for basic usage of the parser.
1.1. Example 1: Parsing a simple expression¶
This example parses simple expressions such as x * 4 + 4
using the builtin
methods of the PrattParser
class. The builtin parsing methods that can be
called for a PrattParser
instance are documented here:
typped.builtin_parse_methods
. Only a few are used in this
example. Example 2 will parse the same language without using any of the
builtin parsing methods. Instead, it will directly use Pratt parsing
techniques.
The language consists of integers, identifiers representing integers, addition and multiplication operators, and parentheses. Types are not used in this example. Strings of the language are parsed to an expression tree.
Create an instance of the
PrattParser
class:import typped as pp parser = pp.PrattParser()
Define each token that will appear in the language, including a string label and a regex to recognize that kind of token. If necessary, the appropriate
on_ties
values can be set to break ties in case of equal match lengths. The lexer will always take the longest match over all the defined tokens, with ties broken by anyon_ties
values (which by default equal zero). The order in which definitions are made does not affect the lexical analysis.parser.def_default_whitespace() tok = parser.def_token tok("k_number", r"\d+") tok("k_lpar", r"\(") tok("k_rpar", r"\)") tok("k_ast", r"\*") tok("k_plus", r"\+") tok("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
In these examples the string
k_
is used by convention as a prefix for token labels. The prefixt_
will be used for types.The call to
def_default_whitespace
above sets up default whitespace. It is equivalent to the following code:parser.def_ignored_token("k_space", r"[ \t]+") parser.def_ignored_token("k_newline", r"[\n\f\r\v]+")
Whitespace is defined as space, tab, and newline characters and is set to be ignored by the lexer. The space and newline sequences are made into separate tokens; both are ignored by the lexer. (Ignored tokens can be accessed via the
ignored_before
attribute of a token.)Define the syntactical elements of the language being parsed. Any necessary token labels must have already been defined (as in the previous step). The predefined syntax-definition methods of
PrattParser
take as arguments token labels, type information, etc.Note that token literals must still be defined as syntactical elements of the grammar being parsed, after being defined as tokens. The token definition just creates a kind of token to be scanned and returned by the lexer. If typing is being used then any type information should also be set for the token literals since they are the leaves of the expression trees.
parser.def_literal("k_number") parser.def_literal("k_identifier")
Infix operators are defined below, with a precedence of 10 for addition and 20 for multiplication. Both are left-associative. Parentheses for grouping are also defined.
parser.def_infix_op("k_plus", 10, "left") parser.def_infix_op("k_ast", 20, "left") parser.def_bracket_pair("k_lpar", "k_rpar")
Pass the parser instance a string of text to parse, and save the returned expression tree. The returned tree has token instances as its nodes.
result_tree = parser.parse("x + (4 + 3)*5") print(result_tree.tree_repr())
The result of running the above code is:
<k_plus,'+'> <k_identifier,'x'> <k_ast,'*'> <k_lpar,'('> <k_plus,'+'> <k_number,'4'> <k_number,'3'> <k_number,'5'>
See “Example: Implementing a simple calculator” for a similar example with many more features and which also evaluates the expressions.
1.2. Example 2: Parsing a simple expression without using builtins¶
This example parses the same language as Example 1, but none of the builtin
parsing routines of the PrattParser
class are used. Raw Pratt parsing is
used, explicitly defining the head and tail handler functions and registering
them with the parser as a construct. (Head and tail handler functions are
called null denotations and left denotations, respectively, in traditional
Pratt Parser terminology.)
See the section “Introduction to Pratt parsing and its terminology” for background information. A construct here can be thought of as simply a container class that holds a token label and a corresponding head or tail handler function which is triggered by that kind of token (though actually constructs are more general than that). See the section “Constructs and preconditioned dispatching” for more information on constructs.
The definitions of the initial parser instance and the tokens are exactly the same in this example as in the previous example, so that portion of the code above is not repeated. The below discussion starts at Step 3 above, assuming the code for steps 1 and 2 has already been run.
First we define the token literals, which are tokens that represent themselves in the final expression tree. The head handler function for such a token simply returns the token itself. Such a head-handler function is registered with the parser for each kind of token which should be a token literal:
def literal_head_handler(tok, lex):
return tok
parser.def_construct(pp.HEAD, literal_head_handler, "k_number")
parser.def_construct(pp.HEAD, literal_head_handler, "k_identifier")
Next, we define the infix operators, starting with addition. First, we need a tail handler function:
def infix_op_tail_handler_10(tok, lex, left):
tok.append_children(left, tok.recursive_parse(10)) # Use 9 for right assoc.
return tok
This handler function has a hardcoded left-association precedence value of 10
(for right-association 9 would be used instead). When called, the tok
parameter will hold the token for the "k_plus"
operator which triggers this
particular handler function. The function simply sets the left child of
tok
to the passed-in left
argument (which holds the expression to the
left that was already processed). It sets the right child to the result of the
recursive_parse
function, which parses the next expression. So the left
and right operands are both set to expressions.
The def_construct
method is now used to register the handler with the
parser as a head-handler triggered by "k_plus"
tokens:
parser.def_construct(pp.TAIL, infix_op_tail_handler_10, "k_plus", prec=10)
Notice that the precedence value of 10 is also passed to def_construct
.
The construct for parsing +
operators has now been defined for the
language. The code for multiplication is similar, except that a precedence of
20 is hardcoded:
def infix_op_tail_handler_20(tok, lex, left):
tok.append_children(left, tok.recursive_parse(20)) # Use 19 for right assoc.
return tok
parser.def_construct(pp.TAIL, infix_op_tail_handler_20, "k_ast", prec=20)
Finally, we need to define the construct for parsing parentheses. This is done
by defining a head-handler for the left parenthesis token. The handler just
calls recursive_parse
to get the expression inside the parentheses,
consumes the closing parenthesis, and returns the expression inside:
def paren_head_handler(tok, lex):
expr = tok.recursive_parse(0)
lex.match_next("k_rpar", raise_on_fail=True)
return expr # Do not include the parens themselves, just the arg.
parser.def_construct(pp.HEAD, paren_head_handler, "k_lpar")
This finishes the definition of the parser for the simple language, without using any of the builtin parsing methods. Now this code can be run:
result_tree = parser.parse("x + (4 + 3)*5")
print(result_tree.tree_repr())
The result is shown here:
<k_plus,'+'>
<k_identifier,'x'>
<k_ast,'*'>
<k_plus,'+'>
<k_number,'4'>
<k_number,'3'>
<k_number,'5'>
Notice that the expression tree created using the def_bracket_pair
builtin
in Example 1 included the k_lpar
token in the tree. This handler function
does not; it simply returns the expression inside the parentheses. To get that
kind of behavior with def_bracket_pair
you can set the keyword in_tree
to false.
The builtin methods of PrattParser
are basically just wrapper functions
that do things like defining and registering handler functions behind the
scenes. They are written for much more generality than the above code, and
they tend to have various options. If you need to write your own wrapper
functions it can be useful to look at the code for the builtin parse routines
in the file builtin_parse_methods.py
documented in
typped.builtin_parse_methods
.
1.3. Example 3 : A simple untyped string and number language¶
This three examples are of a simple language that operates on both quoted strings and integers. The only allowed operations are addition and multiplication. The operations on integers give the usual results. The operations on strings are like in Python: addition concatenates and multiplication of a string by an integer (on the left or right) repeats it that many times. Addition of strings and integers is a syntax error. Identifier variables can also be defined and assigned values.
This example does not use typing at all. The later two examples use dynamic typing and static typing, respectively. These examples illustrate how to define evaluation functions to interpret the parsed expression trees. They also show how to use the basic type mechanism.
The example code in the examples directory runs the language in a read-evaluate-print loop (REPL).
def setup_string_language_parser_no_typing():
"""A simple, untyped language that uses `+` to add integers and
concatenate strings. Multiplication of a number by a string repeats the
string. Multiplication of a string by a string is not defined. It also
has simple variables which can represent either numbers or strings. Any
errors are caught during the Python evaluation.
Since type-checking is disabled the overloading must be implemented
by general handler functions that check the arguments themselves."""
parser = pp.PrattParser(skip_type_checking=True)
# Define the tokens.
parser.def_default_whitespace()
tok = parser.def_token
tok("k_int", r"-?\d+")
tok("k_lpar", r"\(")
tok("k_rpar", r"\)")
tok("k_ast", r"\*")
tok("k_plus", r"\+")
tok("k_equals", r"=")
tok("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
tok("k_string", r"(\"(.|[\r\n])*?\")")
# Define the syntax of the language, supplying evaluation functions.
parser.def_literal("k_int", eval_fun=lambda t: int(t.value))
parser.def_literal("k_string", eval_fun=lambda t: t.value)
parser.def_bracket_pair("k_lpar", "k_rpar", eval_fun=lambda t: t[0].eval_subtree())
def plus_op(x, y):
"""Generic addition operation that works on all the defined cases. Note strings
are stored with quotes."""
if isinstance(x, int) or x[0] != "\"":
return int(x) + int(y)
return x[:-1] + y[1:]
def mult_op(x, y):
"""Generic multiplication operation that works on all the defined cases.
Note strings are stored with quotes."""
print("eval: x, y:", x, y)
if isinstance(x, int) and isinstance(y, int):
return x * y
if isinstance(x, int) or x[0] != '"':
return '"' + int(x) * y[1:-1] + '"'
elif isinstance(y, int) or y[0] != '"':
return '"' + int(y) * x[1:-1] + '"'
return int(x) * int(y)
infix = parser.def_infix_op
infix("k_plus", 10, "left",
eval_fun=lambda t: plus_op(t[0].eval_subtree(), t[1].eval_subtree()))
infix("k_ast", 20, "left",
eval_fun=lambda t:mult_op(t[0].eval_subtree(), t[1].eval_subtree()))
# Define assignment as an infix equals operator.
parser.def_assignment_op_untyped("k_equals", 5, "right", "k_identifier",
create_eval_fun=True)
# Define identifier literals with a lookup if needed. Parser attribute
# symbol_value_dict was initialized by the def_assignment_op_untyped call.
symbol_dict = parser.symbol_value_dict
default_identifier_eval_value = 0
def eval_literal_identifier(tok):
if tok.value in symbol_dict:
return symbol_dict[tok.value]
else:
return default_identifier_eval_value
parser.def_literal("k_identifier", eval_fun=eval_literal_identifier)
return parser
1.4. Example 4 : A simple string and number language with evaluation and dynamic typing¶
This next example is the same simple language that operates on both quoted strings and integers as in Example 3. The only allowed operations are addition and multiplication. The operations on integers give the usual results. The operations on strings are like in Python: addition concatenates and multiplication of a string by an integer (on the left or right) repeats it that many times. Addition of strings and integers is a syntax error. Identifier variables can also be defined and assigned values.
This example uses dynamic typing, like an interpreted language. Type errors
are reported at parse-time, based on the types implicitly defined by the
previously-executed coded. For example, assigning x = "house"
implicitly
defines x
as a string.
This example shows one way to use the Typped type mechanism. The example code in the examples directory runs the language in a read-evaluate-print loop (REPL).
def setup_string_language_parser_dynamic_typing():
"""A simple dynamically-typed language that uses `+` to add integers and
concatenate strings. Multiplication of a number by a string repeats the
string. Multiplication of a string by a string is not defined. It also
has simple variables which can represent either numbers or strings."""
parser = pp.PrattParser()
# Define the tokens.
parser.def_default_whitespace()
tok = parser.def_token
tok("k_int", r"-?\d+")
tok("k_lpar", r"\(")
tok("k_rpar", r"\)")
tok("k_ast", r"\*")
tok("k_plus", r"\+")
tok("k_equals", r"=")
tok("k_identifier", r"[a-zA-Z_](?:\w*)", on_ties=-1)
tok("k_string", r"(\"(.|[\r\n])*?\")")
# Define the types.
t_int = parser.def_type("t_int") # Integer type.
t_str = parser.def_type("t_str") # String type.
# Define the syntax of the language, supplying evaluation functions.
parser.def_literal("k_int", val_type=t_int, eval_fun=lambda t: int(t.value))
parser.def_literal("k_string", val_type=t_str, eval_fun=lambda t: t.value)
parser.def_literal_typed_from_dict("k_identifier", create_eval_fun=True,
default_type=t_int, default_eval_value=0)
parser.def_bracket_pair("k_lpar", "k_rpar", eval_fun=lambda t: t[0].eval_subtree())
# Overload the + operator twice.
infix = parser.def_infix_op
infix_plus_construct = infix("k_plus", 10, "left",
val_type=t_int, arg_types=[t_int, t_int],
eval_fun=lambda t: t[0].eval_subtree() + t[1].eval_subtree())
infix_plus_construct.overload(
val_type=t_str, arg_types=[t_str, t_str],
eval_fun=lambda t: t[0].eval_subtree()[:-1] + t[1].eval_subtree()[1:])
# Overload the * operator three times.
infix_mult_construct = infix("k_ast", 20, "left",
val_type=t_int, arg_types=[t_int, t_int],
eval_fun=lambda t: t[0].eval_subtree() * t[1].eval_subtree())
infix_mult_construct.overload(
val_type=t_str, arg_types=[t_str, t_int],
eval_fun=lambda t: (
'"' + (t[0].eval_subtree()[1:-1] * t[1].eval_subtree()) + '"'))
infix_mult_construct.overload(
val_type=t_str, arg_types=[t_int, t_str],
eval_fun=lambda t: (
'"' + (t[1].eval_subtree()[1:-1] * t[0].eval_subtree()) + '"'))
# Define assignment as an infix equals operator.
parser.def_assignment_op_dynamic("k_equals", 5, "right", "k_identifier",
val_type=None, allowed_types=[t_int, t_str],
create_eval_fun=True)
return parser
1.5. Example 5: A simple string and number language with evaluation and static typing¶
The language being parsed in this example is basically the same as the previous two except that the language is statically typed rather than dynamically typed. This is like parsing a statically-typed compiled language. Type errors are caught at parse-time, before any interpretation or translation into machine code. This language translates the simple string-number language into Python code, which is then evaluated.
Static typing in a language requires some mechanism for declaring types (either implicitly or explicitly). This language has a C-like type declaration syntax. None of the builtin parse methods work for this special-purpose construct, so a new construct is defined for type declarations. Builtin methods are used for the rest of the parsing.
This example illustrates how to use static typing and how to define custom
parsing functions when the builtin methods are not sufficient. The example
code in the examples directory
runs the language in a read-evaluate-print loop (REPL). It prints out the
parsed expression tree, the translation to Python, and the result of evaluating
the Python code with Python’s eval
.
The definitions of the parser instance, tokens, and types are basically the same as in Example 3.
1.5.1. Defining a new construct for type definitions¶
We want to allow C-style typed variable declarations like the following in the language:
int x
str y
Since the builtin parsing methods do not cover this, we need to define a new construct and register it with the parser. Later sections cover this in more detail, so readers can skim this subsection for now if necessary.
There are several ways to do this parsing. We could define new tokens for the
keywords int
and str
(or a token for all such keywords) with a higher
on_ties
value than general identifiers. Then we would have the handler
functions for those tokens do the corresponding parsing. Instead, we use
preconditioned dispatching on identifier tokens. In this way, a different
head-handler function can be dispatched to handle a type name identifier versus
a general identifier. This makes it easy to add new type names later (since
they are stored in a dict).
In this example we have used a precondition on a head-handler function instead
of using a tail-handler function defined for identifiers. A tail-handler could
have been used, but in that case int x = 4
would be more difficult to
parse. The recursive_parse
routine would consume the x
as an operator,
with a left operand of int
. The remaining part = 4
could not then be
evaluated in the usual way as an infix operation without going back one token
in the lexer (such as with the go_back
method). The handler function would
have to use peek and next calls.
def setup_string_language_parser_static_typing():
"""A simple statically-typed language that uses `+` to add integers and
concatenate strings. Multiplication of a number by a string repeats the
string. Multiplication of a string by a string is not defined. It also
has simple variables which can represent either numbers or strings."""
parser = pp.PrattParser()
# Define the tokens.
parser.def_default_whitespace()
tok = parser.def_token
tok("k_int", r"-?\d+")
tok("k_lpar", r"\(")
tok("k_rpar", r"\)")
tok("k_ast", r"\*")
tok("k_plus", r"\+")
tok("k_equals", r"=")
tok("k_identifier", r"[a-zA-Z_](?:\w*)")
tok("k_string", r"(\"(.|[\r\n])*?\")")
# Define the types.
t_int = parser.def_type("t_int") # Integer type.
t_str = parser.def_type("t_str") # String type.
# Define a new construct for type definitions in the language. The
# `typped_type_dict` is used to map type names in the implemented language
# (the strings "int" and "str") to the corresponding Typped types
# (the `TypeObject` instances t_int and t_str).
parser.typped_type_dict = {"int": t_int,
"str": t_str}
def typedecl_precond_fun(tok, lex):
"""This construct will only be triggered for identifiers stored as keys in
the dict `parser.typped_type_dict`."""
return (lex.token.token_label == "k_identifier" and
lex.token.value in parser.typped_type_dict)
def typedecl_head_handler(tok, lex):
"""Handler function for the construct that parses type declarations."""
if not lex.match_next("k_identifier", consume=False):
raise pp.ParserException("Type declaration not followed by an identifier.")
# Note the identifier is set as a key in symbol_type_dict before recursive_parse.
# Otherwise the definition-checking in def_literal_typed_from_dict would fail.
parser.symbol_type_dict[lex.peek().value] = parser.typped_type_dict[tok.value]
type_decl_expr = tok.recursive_parse(0)
if type_decl_expr.children and type_decl_expr.token_label != "k_equals":
raise pp.ParserException("Only identifiers or assignment expressions are"
" allowed in type declarations.")
tok.append_children(type_decl_expr)
return tok
def typedecl_eval_fun(tok):
"""Evaluate a type declaration when interpreting the language."""
if tok[0].token_label == "k_equals":
return tok[0].eval_subtree() + "; " + tok[0][0].value
else:
return "None"
parser.def_construct(pp.HEAD, typedecl_head_handler, "k_identifier",
construct_label="p_type_declaration",
precond_fun=typedecl_precond_fun, precond_priority=10,
val_type=t_int, eval_fun= typedecl_eval_fun)
# Now define the syntax of the language.
literal = parser.def_literal
literal("k_int", val_type=t_int, eval_fun=lambda t: t.value)
literal("k_string", val_type=t_str, eval_fun=lambda t: t.value)
parser.def_literal_typed_from_dict("k_identifier",
eval_fun=lambda t: t.value,
default_type=t_int, default_eval_value=0,
raise_if_undefined=True)
parser.def_bracket_pair("k_lpar", "k_rpar",
eval_fun=lambda t: "(" + t[0].eval_subtree() + ")")
def operator_eval_fun(tok):
"""A general eval fun for operators. It just reconstructs a string."""
if tok.children:
return tok[0].eval_subtree() + " " + tok.value + " " + tok[1].eval_subtree()
else:
return tok.value
# Overload the + operator twice.
infix = parser.def_infix_op
infix_plus_construct = infix("k_plus", 10, "left", val_type=t_int,
arg_types=[t_int, t_int], eval_fun=operator_eval_fun)
infix_plus_construct.overload(val_type=t_str, arg_types=[t_str, t_str],
eval_fun=operator_eval_fun)
# Overload the * operator three times.
infix_mult_construct = infix("k_ast", 20, "left", val_type=t_int,
arg_types=[t_int, t_int], eval_fun=operator_eval_fun)
infix_mult_construct.overload(val_type=t_str, arg_types=[t_str, t_int],
eval_fun=operator_eval_fun)
infix_mult_construct.overload(val_type=t_str, arg_types=[t_int, t_str],
eval_fun=operator_eval_fun)
# Define assignment as an infix equals operator.
parser.def_assignment_op_static("k_equals", 5, "left", "k_identifier",
eval_fun=operator_eval_fun)
return parser