7. Recursive descent parsing in Typped

This section discusses recursive descent parsing and how it relates to Pratt parsing and the Typped package. Recursive descent is implemented in the Typped parser via the use of preconditioned dispatching and the introduction of a new kind of token.

The section begins with a comparison of the recursive descent parsing algorithm with the Pratt parser algorithm using preconditioned dispatching. A new kind of token, the null-string token, is then introduced; it bridges one important gap between the two methods. This leads to a discussion of how recursive descent is implemented in Typped.

The Typped package provides a higher-level EBNF-like grammar specification language, implemented by operator overloading, which makes it easy to implement grammar-based languages or partial languages. Readers who are only interested in using the EBNF-like language can skip to “Recursive descent with Typped’s EBNF-like grammar.”

It should be noted that recursive descent parsing is not all that difficult to implement by hand, especially with a good lexer. It is possible to implement a traditional recursive descent parser using the same lexer as a PrattParser instance. Then either the recursive descent parser can call the Pratt parser to handle subexpressions, or the handlers of the Pratt parser can call the recursive descent functions to parse sub-grammars.
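As an illustration, here is a minimal sketch of such a hand-written routine for a factor-like rule (identifier, number, or parenthesized expression), sharing the parser's lexer. It assumes a lexer with peek and next methods returning tokens that carry a token_label attribute, as Typped's lexer tokens do; the hand-off method parse_subexpression is hypothetical, standing in for whatever entry point is used to invoke the Pratt parser on a subexpression.

def parse_factor(lex, pratt_parser):
    """Hand-written parser for: factor ::= identifier | number | "(" expression ")"."""
    tok = lex.peek()
    if tok.token_label in ("k_identifier", "k_int"):
        return lex.next()                        # the token itself is the parse-tree leaf
    elif tok.token_label == "k_lpar":
        lex.next()                               # consume "("
        subtree = pratt_parser.parse_subexpression(lex)  # hypothetical hand-off to Pratt parsing
        assert lex.next().token_label == "k_rpar"        # consume ")"
        return subtree
    else:
        raise SyntaxError("Expected a factor at token " + str(tok))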

Note

Some of these features are still experimental. Some features are only partially implemented. Only basic BNF currently works in the grammar parser. Type-checking and Pratt-style precedences in the grammars are not yet implemented. The implementation still needs to be optimized.

7.1. Similarities and differences between the parsing methods

Pratt parsing is similar to recursive descent parsing in the sense that both are top-down recursive methods. Pratt parsing, however, is token-based while recursive descent parsing is based on production rules in a grammar. The ability to dispatch handlers based on preconditions makes the Typped Pratt parser even more similar to a recursive descent parser. The terminal symbols in a grammar are the token literals in a Pratt parser. The discussion below assumes this correspondence, so terminals are defined by the regular expressions in the definition of the literal tokens.

Suppose all the productions in a grammar begin with a terminal symbol (represented by the regular expression of some corresponding token). This is called a right-regular grammar. A Pratt parser with preconditioned dispatching can be used to directly implement recursive descent parsing on such a grammar. Since each rule in such a grammar begins with a terminal symbol, corresponding to a token literal, the head-handler triggered by each such literal token can be set up to process the rule. In the Typped dispatching terminology, the literal token can be used to trigger a syntactic construct (containing the handler function, a preconditions function, and related data) which will then process the full production rule.

In order to implement the recursive descent part you can keep a stack of production rule labels, updated to always have the current production rule on the top. In a PrattParser instance this stack is stored in a list attribute called pstate. The construct for handling a production rule is given a precondition that the top label in that stack is the label for the rule it processes. The head-handler functions in these constructs are also responsible for updating the pstate stack as necessary. Using this, along with lookahead to the upcoming tokens in the preconditions, gives the power to do recursive descent parsing of right-regular grammars.
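As a rough sketch of the pattern (the def_construct call and the precondition and handler signatures here are approximations of Typped's actual API, and parser is assumed to be a PrattParser instance defined as in the later example):

def precond_rule(lex, lookbehind):              # assumed precondition signature
    # Trigger only while this production rule is the current one.
    return parser.pstate[-1] == "expression"

def head_handler_rule(tok, lex):                # assumed head-handler signature
    # Process the rest of the production here, pushing and popping
    # entries of parser.pstate around recursive calls for nonterminals.
    return tok

parser.def_construct(pp.HEAD, head_handler_rule,
                     "k_lpar",                  # a literal token that starts the rule
                     precond_fun=precond_rule)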

These head-handler functions mimic the separate recursive functions for each production rule in recursive descent. They are triggered by the corresponding literal tokens when the preconditions match (or, in general, by elements of the rule's first-set). The pstate stack keeps track of the recursion in the tree defined by the grammar. Either a single handler can process all the cases of a nonterminal, or separate handlers can be triggered to handle the cases individually.

In extending this approach to general recursive descent, a problem arises when a production starts with a nonterminal symbol. Nonterminals do not correspond to tokens the way terminals do, so there is no token to trigger the construct for parsing the rule. To deal with this, the Typped package has an experimental feature called a null-string token. This is a special token which acts as though its regular expression “matches the null string” just before any upcoming text. Otherwise it is an ordinary token, with handlers, preconditions, etc.

This is implemented in the recursive_parse function. Before each call of next to get the next token from the lexer, recursive_parse first checks whether a null-string token is defined for the parser instance. If so, it checks whether the preconditions of any construct registered for the null-string token match. If one matches, the special null-string token is returned as the current token, and the handler function of the winning construct is called to process the next subexpression.
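In outline the extra step looks something like the following. This is illustrative pseudocode for the control flow only, not the package's actual source; the helper method names are made up.

def get_next_token_or_null_string(parser, lex):
    """Illustrative pseudocode for the dispatch step inside recursive_parse."""
    if parser.null_string_token_is_defined():                 # hypothetical check
        construct = parser.lookup_null_string_construct(lex)  # match preconditions
        if construct is not None:
            return parser.make_null_string_token()            # "matches" before upcoming text
    return lex.next()                                         # otherwise, the usual path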

The use of null-string tokens combined with keeping a stack of production rule state-labels is enough to implement general recursive descent parsing within the framework of a Pratt parser with preconditioned dispatching. The method is essentially the same as described earlier for right-regular grammars. You just use the null-string token to recognize the production rules that start with nonterminals (using the top of the pstate stack in preconditions).

7.2. Example

Consider a very simple expression grammar in EBNF (even though the expression parts of a grammar might be better handled with Pratt-style parsing). The identifier and number productions are assumed to be implemented as tokens from the lexer, defined by regular expressions. Here square brackets enclose optional parts, curly braces mean “zero or more” repetitions, and (x|y) means either x or y.

expression ::=  ["+"|"-"] term {("+"|"-") term}
term       ::=  factor {("*"|"/") factor}
factor     ::=  identifier | number | "(" expression ")"

Initially the pstate stack would only hold the string "expression". A head construct would be registered for the null-string token with the precondition that the expression state be at the top of the pstate stack. The head handler for the construct would first check whether the peek token is + or -. If so, it would consume it. Then the string "term" would be pushed onto the stack and recursive_parse would be called. The call to recursive_parse returns a processed subtree, which is incorporated into the expression tree. A loop would continue this way until the peek token is not + or -.
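A handler following that outline might look like the sketch below. The handler signature (tok, lex), the parser_instance back-reference, the append_children call, and the recursive_parse re-entry are modeled on Typped's conventions but should be treated as approximations, not the package's exact API.

def head_handler_expression(tok, lex):
    """Sketch: null-string handler for expression ::= ["+"|"-"] term {("+"|"-") term}."""
    parser = tok.parser_instance                      # assumed back-reference to the parser
    if lex.peek().token_label in ("k_plus", "k_minus"):
        tok.append_children(lex.next())               # optional leading sign
    parser.pstate.append("term")                      # descend into the term rule
    tok.append_children(parser.recursive_parse(lex))  # parse a term subtree (approximate call)
    parser.pstate.pop()
    while lex.peek().token_label in ("k_plus", "k_minus"):
        tok.append_children(lex.next())               # consume "+" or "-"
        parser.pstate.append("term")
        tok.append_children(parser.recursive_parse(lex))
        parser.pstate.pop()
    return tok                                        # subtree rooted at the null-string token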

Another head construct would be registered for the null-string token with the precondition that "term" be at the top of the stack. Its head-handler function would be responsible for parsing terms. It would work in the same general way as described above for expressions.

The factor production could be implemented either as a handler for the null-string token or by separate constructs for the identifier, number, and left-paren token types.
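In the latter case the identifier and number constructs would simply return their tokens as leaves, while a construct triggered by the left-paren token would parse a parenthesized expression, roughly as follows (again an approximation of the real calls, in the style of the sketch above):

def head_handler_lpar(tok, lex):
    """Sketch: the factor case '(' expression ')', triggered by the k_lpar literal."""
    parser = tok.parser_instance                  # assumed back-reference
    parser.pstate.append("expression")            # recurse into the expression rule
    subtree = parser.recursive_parse(lex)         # approximate re-entry call
    parser.pstate.pop()
    lex.match_next("k_rpar", raise_on_fail=True)  # consume the closing parenthesis
    return subtree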

A similar expression grammar in plain BNF is as follows:

expression ::=  term "+" expression | term "-" expression | term
term       ::=  factor "*" term | factor "/" term | factor
factor     ::=  constant | variable | "(" expression ")"
variable   ::=  identifier | "-" identifier
constant   ::=  number

In this case the identifier and number terminals would be token literals defined by the corresponding regexes passed to def_token calls.
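For example, definitions along these lines would do (the token labels and regexes here are illustrative):

parser.def_token("k_number", r"\d+")                        # the number terminal
parser.def_token("k_identifier", r"[a-zA-Z_][a-zA-Z_0-9]*") # the identifier terminal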

7.3. Recursive descent with Typped’s EBNF-like grammar

The Typped package comes with an EBNF-like grammar language definable via Python operator overloading. It essentially automates the procedure described above to map recursive descent onto a generalized Pratt parser. A grammar written in Python, while not as concise as a parsed EBNF string, is easy to work with and gets syntax highlighting in editors. It is also easy to define aliases for complicated components.
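For instance, a complicated component can be bound to a Python variable once and reused (an illustrative fragment, using the Rule and Tok wrappers as in the full example below):

paren_expr = pp.Tok("k_lpar") + pp.Rule("expression") + pp.Tok("k_rpar")
factor     = pp.Rule("constant") | pp.Rule("variable") | paren_expr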

When the grammar is “compiled” with respect to a PrattParser instance it produces a recursive descent parser for the grammar within the Pratt parser framework. The generated parsers currently use full backtracking search. (The use of first-sets is not fully implemented, but it fits nicely into the precondition-triggering model.)

This feature is still in development and experimental. The code is not optimized and parts are currently inefficient.

The EBNF language is currently bare-bones as far as what can be compiled into a parser instance: only basic BNF works. (The full EBNF language, defined via Python overloading, is mostly implemented, but its extended features are not yet compilable into a parser instance. For details of the current state of the Python EBNF language, see the docs for the module ebnf_classes_and_operators.py.)

Below is a simple running example which parses the BNF expression grammar above. (It also allows signed integers, but not signed variables, and full identifiers as variables.) See the file example_expression_grammar.py in the examples directory for the code.

import typped as pp
parser = pp.PrattParser()
parser.def_default_whitespace()
parser.def_default_single_char_tokens()
k_int = parser.def_default_int_token()
k_identifier = parser.def_default_identifier_token()

expression = ( pp.Rule("term") + pp.Tok("k_plus") + pp.Rule("expression")
             | pp.Rule("term") + pp.Tok("k_minus") + pp.Rule("expression")
             | pp.Rule("term"))
term       = ( pp.Rule("factor") + pp.Tok("k_ast") + pp.Rule("term")
             | pp.Rule("factor") + pp.Tok("k_slash") + pp.Rule("term")
             | pp.Rule("factor"))
factor     = ( pp.Rule("constant")
             | pp.Rule("variable")
             | pp.Tok("k_lpar") + pp.Rule("expression") + pp.Tok("k_rpar"))
variable   = pp.Tok(k_identifier) | pp.Tok("k_minus") + k_identifier
constant   = k_int

g = pp.Grammar("expression", parser, locals())
tree = parser.parse("4 + my_var * (3 - 1)", pstate="expression")
print(tree.tree_repr())

This example uses several of the parser's helper methods to quickly define common tokens. The tokens must all be defined, but they do not need to be explicitly made into token literals (at least not for grammar-based parsing alone). They are simply read in as tokens from the lexer, since the grammar specifies what to look for.

Notice that tokens can appear in the grammar as token literals in two ways. A token can be named by its token label inside a Tok call, as in Tok("k_plus"), or a token instance can appear directly, as in k_int. Token instances can also be passed to Tok calls, as in Tok(k_identifier).

The output from the above code is as follows:

<k_null-string,'expression'>
    <k_null-string,'term'>
        <k_null-string,'factor'>
            <k_null-string,'constant'>
                <k_int,'4'>
    <k_plus,'+'>
    <k_null-string,'expression'>
        <k_null-string,'term'>
            <k_null-string,'factor'>
                <k_null-string,'variable'>
                    <k_identifier,'my_var'>
            <k_ast,'*'>
            <k_null-string,'term'>
                <k_null-string,'factor'>
                    <k_lpar,'('>
                    <k_null-string,'expression'>
                        <k_null-string,'term'>
                            <k_null-string,'factor'>
                                <k_null-string,'constant'>
                                    <k_int,'3'>
                        <k_minus,'-'>
                        <k_null-string,'expression'>
                            <k_null-string,'term'>
                                <k_null-string,'factor'>
                                    <k_null-string,'constant'>
                                        <k_int,'1'>
                    <k_rpar,')'>

At some point the ability to suppress the null-string tokens representing nonterminals from appearing in the output tree will be added.