3. Constructs and preconditioned dispatching

This section describes the use of preconditioned dispatching in the PrattParser class. Many Typped users will never need to explicitly use the techniques described here, since the Typped parser comes with various built-in methods which hide the use of precondition functions. The calulator example, for example, uses only built-in methods of the PrattParser class. This description is mainly for users who need to define their own, custom parsing methods or those who are simply interested in the details of how dispatching works.

3.1. What is preconditioned dispatching?

In a standard Pratt parser each token can have associated with it a single, fixed head handler function, a single, fixed tail handler function, or one of each. Preconditioned dispatching generalizes this: Each token can have multiple possible head and/or tail handler functions associated with it. The choice of which of the possible handler functions to use to process a token is made at at the time the token is parsed, based on the conditions (e.g., the parser and lexer state) at that time. The conditions which can be taken into account include, for example, the actual string value that the lexer matched in the program text and the kind of token that is one peek ahead in the lexer. Note that this generalized behavior in Typped is optional and can easily be ignored if one only wants to use standard Pratt parser techniques.

Since in Pratt parsing each head and tail handler essentially parses a different part of the grammar, the Typped packages uses an abstraction of handler functions called a syntactic construct, or simply a construct. A construct represents a particular kind of grammatical subexpression that is parsed and returned by a handler function. Since Pratt parsers are top-down these grammatical parts tend to correspond to subtrees of the final expression tree. A construct containing a head handler will be called a head construct and a construct containing a tail handler will be called a tail construct. Constructs also contain other attributes, as we will see.

In a standard Pratt parser a handler function is triggered whenever a particular kind of token is consumed from the lexer in the recursive_parse routine. Either the head handler or the tail handler function associated with that kind of token is run, depending on whether or not the token is the first token in the subexpression being parsed.

In a preconditioned-dispatching Pratt parser each handler function is associated with a unique construct, and each construct has associated with it the label of a triggering token which triggers that construct as potentially the one to use. A single kind of token can potentially trigger multiple constructs (either head constructs or tail constructs, depending on the token’s position).

When a token is consumed in recursive_parse it triggers a collection of constructs. The choice of which one to actually use is based on the evaluation of preconditions functions associated with the constructs. Preconditions functions are simply boolean-valued functions which are executed at the time when a handler function is required. A preconditions function which returns true when evaluated is said to match in the current state. The matching construct with the highest preconditions priority is selected, and its handler is run.

Constructs are implemented as instances of Construct objects. They contain the following attributes:

  • a kind of token which triggers the construct
  • a head or tail handler function
  • a preconditions function
  • a preconditions priority
  • a string label for the construct
  • other data, such as evaluation functions and type signatures

Constructs are registered with a parser instance in order to define a particular grammar on the tokens of the language (the tokens must be separately defined with def_token). The preconditions priority is a number which defaults to zero.

Whenever the recursive_parse routine consumes a particular kind of token from the lexer, in a head or tail position, it sequentially executes the preconditions functions for all the constructs triggered by that kind of token, in that position. The execution sequence is ordered by the preconditions priority values. The construct associated with the first matching preconditions function is selected. Its handler function is then dispatched as the one to be run by the recursive_parse routine. If there is no clear winner among the highest-priority matching preconditions functions (i.e., more than one match has the highest priority) then an exception is raised.

This algorithm clearly reduces to ordinary Pratt parsing in the case where there is at most one head construct and one tail construct per kind of token and the preconditions functions always evaluate as true.

3.2. Using dispatching

The PrattParser method which registers a construct with the parser is called def_construct. It is used, for example, inside the builtin methods after defining the handler functions and any preconditions functions. To define custom constructs it needs to be explicitly called.

One of the optional arguments to def_construct is precond_fun, which can be passed a function taking two parameters, tok and lex. It should return True or False (or the equivalent). When a token is read in recursive_parse all the preconditions functions for all the constructs triggered by that kind of token are run in priority ordering until one is true. The associated construct is the “winner” and is dispatched to be called. When the preconditions function is called, tok is the triggering token and lex is the lexer. (Usually tok == lex.token, except in the case of some “virtual” tokens like null-space tokens and jop-tokens which are both discussed in later sections.)

See the documentation for the def_construct method at typped.pratt_parser.PrattParser.def_construct(). The basic specification is:

def def_construct(head_or_tail, handler_fun, trigger_token_label,
                  prec=0, construct_label=None,
                  precond_fun=None, precond_priority=0,
                  val_type=None, arg_types=None,
                  eval_fun=None, ast_data=None,
                  token_value_key=None, dummy_handler=False):

For a simple example of defining a construct, see basic_usage:Example: Parsing a simple expression without using builtins. A more general example is given in the next section.

3.3. Example: Defining standard functions with lookahead

As an example of dispatching, consider the parsing of function evaluations such as f(x) in a Pratt parser. The “usual” way is to define a tail handler for the left-paren token. Then that symbol acts like an infix operator with the function name as its first argument and the function arguments and closing paren as its second argument. If parentheses are also used for grouping then a head-handler for left paren is defined for that use. The resolution between the two uses is based on whether the left paren is in a head or tail position in a subexpression. In the case of the function evaluation, the token for the function name f is the head of the subexpression.

This usual way of parsing function evaluations can lead to complications in more-complex grammars where left paren is used in various contexts. If a juxtaposition operator is being used, for example, then an expression like pi (x+y) can cause problems with the usual method. The name pi might be a constant or a function name. (At the least the left paren tail handler would need to be conditioned on a space occurring before it, but this example takes a different approach.)

By using a precondition that the lookahead token be a left paren with no intervening space the head handler for a standard function identifier can parse the whole subexpression rather than waiting to be picked up as the left operand of the infix left paren operator. A second, lower-priority default head handler can still be defined for all other identifiers. (Other preconditions can also be placed on other head handlers for identifiers). These two head handler definitions are largely independent, except via their respective priorities. They can occur in different sections of code, where the different constructs are defined. Both handlers are registered for the identifier token, and the rest is handled automatically.

The code for this example can be found in a runnable form in the file example_stdfun_lookahead.py.

In this example the PrattParser class is extended by creating a subclass with additional methods. In particular, a general method is added which parses standard functions. If a general method is not required then the code could instead just define the handler and preconditions function and call def_construct.

For a general parsing method it is not strictly necessary to create a subclass of PrattParser. An ordinary function can also be used. Just rename the self variable to something like parser and explicitly pass in a parser instance when calling it. Extending the class has the advantage that the newer methods are called in the same way as the built-in ones, and the parser instance’s namespace is convenient for accessing the function.

In this example the method def_stdfun_lookahead is added to the PrattParser. This is only an example, since the PrattParser class already has a def_stdfun method which uses lookahead and also incorporates types, etc. Before calling this method all of the tokens involved must have already been defined along with their labels (via the def_token method). Ignored whitespace tokens must also have been defined already. The lpar, rpar, and comma tokens must already have been defined as literal tokens (via the def_literal method).

Recall that the head-handler function will be called to process a subexpression starting from the beginning. That head-handler is then responsible for parsing the full subexpression – though it can itself call recursive_parse to parse sub-subexpressions. We are defining a head-handler that only matches a function name in the case where the peek token is an lpar with no intervening space.

def define_parser_subclass():

    class MyParser(pp.PrattParser):
        """Subclass and add a new method to the `PrattParser` class as an example."""

        def __init__(self, *args, **kwargs):
            """Call the superclass initializer."""
            super(MyParser, self).__init__(*args, **kwargs)

        def def_stdfun_lookahead(self, fname_token_label, lpar_token_label,
                                 rpar_token_label, comma_token_label, num_args,
                                 precond_priority=1):
            """Define a standard function with a fixed number of arguments."""

            # Define the preconditions function.
            def preconditions(tok, lex):
                peek_tok = lex.peek()
                if peek_tok.ignored_before: # No space allowed between name and lpar.
                    return False
                if peek_tok.token_label != lpar_token_label:
                    return False
                return True

            # Define the head-handler function.
            def head_handler(tok, lex):
                # Below match_next is for a precondition, so it will match and consume.
                lex.match_next(lpar_token_label, raise_on_fail=True)

                # Read comma-separated subexpressions as arguments.
                for i in range(num_args-1):
                    tok.append_children(tok.recursive_parse(0))
                    lex.match_next(comma_token_label, raise_on_fail=True)
                    lex.match_next(rpar_token_label, raise_on_success=True) # Error.
                if num_args != 0:
                    tok.append_children(tok.recursive_parse(0))
                # Consume closing paren.
                lex.match_next(rpar_token_label, raise_on_fail=True)
                return tok

            # Register the construct with the parser.
            construct_label = "function call using precondition on function name"
            self.def_construct(pp.HEAD, head_handler, fname_token_label, prec=0,
                               construct_label=construct_label,
                               precond_fun=preconditions,
                               precond_priority=precond_priority)
    return MyParser

In parsing the full function call the handler defined above uses both the helper function match_next as well as calls to the lexer and recursive_parse. Generally, tokens which will appear in the final parse tree, even literal tokens, should be retrieved with recursive_parse. That is because it performs some extra processing the nodes such as setting their actual types. Tokens which do not appear in the final parse tree, such as the final closing rpar token of the function arguments, can simply be consumed by match_next or an explicit call to lex.next() and discarded.

The function defined above could be called as follows:

def define_grammar(MyParser):
    parser = MyParser()
    parser.def_default_whitespace()

    tok = parser.def_token
    tok("k_number", r"\d+"),
    tok("k_lpar", r"\("),
    tok("k_rpar", r"\)"),
    tok("k_comma", r","),
    tok("k_add", r"add"),
    tok("k_sub", r"sub"),

    lit = parser.def_literal
    lit("k_number")
    lit("k_lpar")
    lit("k_rpar")

    parser.def_stdfun_lookahead("k_add", "k_lpar", "k_rpar", "k_comma", 2)
    parser.def_stdfun_lookahead("k_sub", "k_lpar", "k_rpar", "k_comma", 2)

    return parser

Now this code can be run:

MyParser = define_parser_subclass()
parser_instance = define_grammar(MyParser)
expr = "add(4, sub(5,6))"
expr_tree = parser_instance.parse(expr)
print(expr_tree.tree_repr(indent=3))

When run, the above code produces this output:

<k_add,'add'>
    <k_number,'4'>
    <k_sub,'sub'>
        <k_number,'5'>
        <k_number,'6'>

This example works, but is simplified from the actual def_stdfun method of the Pratt parser class. It assumes a fixed number of arguments and does not make use of type data. The function is still fairly general, though. Note that this function does not allow whitespace (ignored tokens) to occur between the function name and the left parenthesis. The preconditions function is defined as a nested function, but it could alternately be passed in as another argument to def_stdfun (along with its label).

Two ways to parse identifiers

The Typped parser and lexer are both dynamic and can be updated on-the-fly. This flexibility allows for a different style of defining identifiers than is traditionally used. Consider an example where function name identifiers are being parsed. Assume that the language being parsed has some sort of definition mechanism where function names must be defined before they are used. (The principle is more general, including cases where, say, functions and variables share the same namespace or for kinds of token other than identifiers.)

In the traditional parser design a generic function-name identifier is defined for the lexer and any further processing is done by the parser, based on the actual string value found in the program text. This allows a fixed lexer to be used. When the lexer is dynamic, though, it is possible to define a new token for each definition of an identifier.

Suppose we have functions add and exp. In the traditional approach the lexer would identify each as a function name identifier, and return that information along with the actual text string. In the dynamic-lexer approach you would define a new token for add at the time it is defined. Similarly for the exp function. The lexer would then return a unique token for each function, pushing some of the parsing down to the lexer level.

An advantage of the dynamic approach is that it can help to avoid ambiguities in parsing complex languages. The disadvantages are that it may take slightly more space to define the new tokens, it may be slower to scan with so many possible tokens, and the function names (and hence their tokens) must be defined before being used.

A disadvantage of using a common identifier token for all function names is evaluation functions then cannot be automatically associated with the tokens. To get around this the def_construct method takes a keyword argument value_key can be passed strings like add and exp. The evaluation functions are then keyed on those values, too. During lookup the actual text string for the token is used to look back up the evaluation function.

As far as the efficiency of defining many tokens, the Typped lexer is designed to very efficiently scan large numbers of tokens provided they have a simple pattern. The Matcher used by the lexer can use one of several hybrid approaches. For example, simple patterns (currently restricted to fixed strings for this speedup) can be automatically stored in a trie data structure and essentially all scanned in parallel by walking down the trie. Their insert and delete time is linear in the pattern length. So, while the Typped parser can be used in either way, the use of dynamic token definitions is worth considering.

3.4. Modifications to recursive_parse

In generalizing to preconditioned dispatching the recursive_parse routine is slightly modified from the one in the previous section. A simplified version is shown here:

def recursive_parse(lex, subexp_prec):
    curr_token = lex.next()
    head_handler = curr_token.dispatch_handler(HEAD, lex)
    processed_left = head_handler()

    while lex.peek().prec() > subexp_prec:
        curr_token = lex.next()
        tail_handler = curr_token.dispatch_handler(TAIL, lex, processed_left)
        processed_left = tail_handler()

Instead of directly calling a fixed head or tail handler for a token, the recursive_parse function instead calls a function dispatch_handler. This function takes an argument which specifies whether to fetch a head or a tail handler. This function selects a construct, as described above, and returns the handler function (which is actually a wrapper function that first runs the handler and then does type checking on the returned subtree). For convenience the arguments to the handler are bound, since they are already known.

Note

Preconditioned dispatching is only a slight generalization of the usual Pratt parser. A similar thing could be accomplished with ordinary head and tail functions via a case statement inside each one, performing different actions based on the conditions at the time and ordered in the case statement by priority. An advantage of using function dispatching instead is that it allows for modularity in defining the head and tail handlers for a particular kind of token.

With dispatching, what would otherwise be a case statement in a handler function is essentially split up into many separate functions, one for each case. So each case in such a case statement can be defined in the place where that syntactic construct is generally being defined, rather than having to be placed in one centralized and separate location. This makes it easier to create essentially independent functional interfaces for different syntactical constructs. The syntax of the language can be decomposed into subunits, with separately-defined handlers to parse them. For example, the PrattParser class comes with methods predefined to easily perform common syntax-related tasks such as defining an infix operator, defining a grouping operator, defining a standard function, etc. If one big case statement were being used in a single head or tail handler for each token then one of those case statement would have to be modified for each such method.

3.5. Uniqueness of constructs

Equality or non-equality of two constructs in the sense of being triggered by identical conditions is determined by equality of triples of the form:

(head_or_tail, trigger_token_label, precond_fun)

The preconditions priority is not included because it determines the interaction between different constructs match. If two constructs match in the above tuple but have different precond_priority values then one will always shadow the other. The shadowed construct will never run.

Unfortunately it is impractical to determine in general when two preconditions functions are identical in the sense that they compute the same thing.

Recall that function overloading based on argument types is used for syntactical constructs which parse the same (i.e., with the same preconditions and using the same handler function) but which are then resolved into different semantic objects based on the actual types of the arguments which are processed at parse-time. Overloading can also involve the type of the function’s return value.

Overloading must be explicitly specified, via a call to the overload method of a previously-defined construct instance. Because of the difficulty of determining equivalence of preconditions functions, described above, overloading cannot be done by simply calling def_construct again with the same arguments and a different type.

Overloading versus preconditions functions

An alternative way that Typped could have implemented overloading would have been to always use a unique construct label for each overload — perhaps by appending a string representation of the type to the label. But this would also complicate the resolution of constructs.

Constructs as currently implemented must be uniquely resolvable at parse-time. They then uniquely determine the handler function to call. If different preconditions labels are used for overloading then overloading will cause multiple constructs to match as a normal thing. These ties will not be uniquely resolvable by a priority system.

To resolve an overload the expression must first be parsed to find the actual types. Resolving the actual types requires a handler function, which is stored with a construct. This is circular if separate constructs are used for each overload. One approach might be to assume that if there are multiple constructs which match at the same priority then they all have the same handler function. You could then just pick one to call, but that could mask some error conditions. After the actual types are found a unique construct would still need to be determined from among the matches in order to access the associated evaluation function and AST data. It seems simpler to just to store all the overloaded signatures and their associated data with a single construct.