7. Recursive descent parsing in Typped¶
This section discusses recursive descent parsing and how it relates to Pratt parsing and the Typped package. Recursive descent is implemented in the Typped parser via the use of preconditioned dispatching and the introduction of a new kind of token.
The section begins with a comparison of the recursive descent parsing algorithm with the Pratt parser algorithm using preconditioned dispatching. A new kind of token is introduced (the null-string token) which bridges over one important gap between the two methods. This leads to a discussion of how recursive descent is implemented in Typped.
The Typped package provides a higher-level EBNF-like grammar specification language, implemented by operator overloading, which makes it easy to implement grammar-based languages or partial languages. Readers who are only interested in using the EBNF-like language can skip to “Recursive descent with Typped’s EBNF-like grammar.”
It should be noted that recursive descent parsing is not all that difficult to
implement by hand, especially with a good lexer. It is possible to implement a
traditional recursive descent parser using the same lexer as a PrattParser
instance. Then either the recursive descent parser can call the Pratt parser
to handle subexpressions, or the handlers of the Pratt parser can call the
recursive descent functions to parse sub-grammars.
Note
Some of these features are still experimental. Some features are only partially implemented. Only basic BNF currently works in the grammar parser. Type-checking and Pratt-style precedences in the grammars are not yet implemented. The implementation still needs to be optimized.
7.1. Similarities and differences between the parsing methods¶
Pratt parsing is similar to recursive descent parsing in the sense that both are top-down recursive methods. Pratt parsing, however, is token-based while recursive descent parsing is based on production rules in a grammar. The ability to dispatch handlers based on preconditions makes the Typped Pratt parser even more similar to a recursive descent parser. The terminal symbols in a grammar are the token literals in a Pratt parser. The discussion below assumes this correspondence, so terminals are defined by the regular expressions in the definition of the literal tokens.
Suppose all the productions in a grammar begin with a terminal symbol (represented by the regular expression of some corresponding token). This is called a right-regular grammar. A Pratt parser with preconditioned dispatching can be used to directly implement recursive descent parsing on such a grammar. Since each rule in such a grammar begins with a terminal symbol, corresponding to a token literal, the head-handler triggered by each such literal token can be set up to process the rule. In the Typped dispatching terminology, the literal token can be used to trigger a syntactic construct (containing the handler function, a preconditions function, and related data) which will then process the full production rule.
In order to implement the recursive descent part you can keep a stack of
production rule labels, updated to always have the current production rule on
the top. In a PrattParser
instance this stack is stored in a list
attribute called pstate
. The construct for handling a production rule is
given a precondition that the top label in that stack is the label for the rule
it processes. The head-handler functions in these constructs are also
responsible for updating the pstate
stack as necessary. Using this, along
with lookahead to the upcoming tokens in the preconditions, gives the power to
do recursive descent parsing of right regular grammars.
These head-handler functions mimic the separate recursive functions for each
production rule in recursive descent. They are triggered by the corresponding
literal tokens when the preconditions match (or in general by first-set
elements). The pstate
stack keeps track of the recursion in the tree
defined by the grammar. Either a single one can handle all the cases of the
nonterminal, or separate ones can be triggered to handle the cases separately.
In extending this approach to general recursive descent, a problem arises when a production starts with a nonterminal symbol. Nonterminals do not correspond to a tokens like they do with terminals. So there is no token to trigger the construct for parsing the rule. To deal with this, the Typped package has an experimental feature called a null-string token. This is a special token which acts as though its regular expression “matches the null string” just before any upcoming text. Other than that it is a regular token, with handlers, preconditions, etc.
This is implemented in the recursive_parse
function. Before each call to
next
to get the next token from the lexer it first checks to see if the
null-string token is defined for the parser instance. If so, it checks whether
the preconditions of any registered null-string-triggered constructs match. If
they do then the special null-string token is returned as the current token,
and the handler function of the winning construct is called to process the next
subexpression.
The use of null-string tokens combined with keeping a stack of production rule
state-labels is enough to implement general recursive descent parsing within
the framework of a Pratt parser with preconditioned dispatching. The method is
essentially the same as described earlier for right-regular grammars. You just
use the null-string token to recognize the production rules that start with
nonterminals (using the top of the pstate
stack in preconditions).
7.2. Example¶
Consider a very simple expression grammar in EBNF (even though the expression
parts of a grammar might be better evaluated with Pratt-style parsing). The
identifier
and number
productions are assumed to be implemented as
tokens from the lexer, defined by regular expressions. Here the square
brackets are optional parts, and curly braces mean “zero or more.” The
(x|y)
construct here means either x
or y
.
expression ::= ["+"|"-"] term {("+"|"-") term} term ::= factor {("*"|"/") factor} factor ::=identifier
|number
| "(" expression ")"
Initially the pstate
stack would only hold the string "expression"
. A
head construct would be registered for the null-string token with the
precondition that the expression
state be at the top of the pstate
stack. The head handler for the construct would first check whether the peek
token is +
or -
. If so, it would consume it. Then the string
"term"
would be pushed onto the stack and recursive_parse
would be
called. The call to recursive_parse
returns a processed subtree, which is
incorporated into the expression tree. A loop would continue this way until
the peek token is not +
or -
.
Another head construct would be registered for the null-string token with the
precondition that "term"
be at the top of the stack. Its head-handler
function would be responsible for parsing terms. It would work in the same
general way as described above for expressions.
The factor
production could be implemented either as a handler for the
null-string token or by separate constructs for the identifier, number, and
left-paren token types.
A similar expression grammar in plain BNF is as follows:
expression ::= term "+" expression | term "-" expression | term; term ::= factor "*" term | factor "/" term | factor; factor ::= constant | variable | "(" expression ")"; variable ::= identifier | "-" identifier constant ::= number
In this case the identifier
and number
nonterminals would be token
literals defined by the corresponding regex passed to the def_token
call.
7.3. Recursive descent with Typped’s EBNF-like grammar¶
The Typped package comes with an EBNF grammar definable via Python overloads. It essentially automates the procedure described above to map recursive descent to a generalized Pratt parser. A grammar in Python, while not as concise as a parsed EBNF string, is easy to work with and has syntax highlighting. It is easy to define aliases for complicated components.
When the grammar is “compiled” with respect to a PrattParser
instance it
produces a recursive descent parser for the grammar within the Pratt parser
framework. The generated parsers currently use full backtracking search
(The use of first-sets is not fully implemented, but fits nicely into the
precondition-triggering model.)
This feature is still in development and experimental. The code is not optimized and parts are currently inefficient.
The EBNF language is currently bare-bones as far as what can be compiled into a
parser instance. It does basic BNF. (The EBNF language itself, defined via
Python overloading, is mostly implemented but is not yet compilable into a
parser instance. For details of the current state of the Python EBNF language
see the docs for the module ebnf_classes_and_operators.py
.)
Below is a simple example running example which parses the BNF expression
grammar above. (It also allows signed integers, but not signed variables, and
full identifiers as variables.) See the file example_expression_grammar.py
in the examples directory for the code.
import typped as pp
parser = pp.PrattParser()
parser.def_default_whitespace()
parser.def_default_single_char_tokens()
k_int = parser.def_default_int_token()
k_identifier = parser.def_default_identifier_token()
expression = ( Rule("term") + Tok("k_plus") + Rule("expression")
| Rule("term") + Tok("k_minus") + Rule("expression")
| Rule("term"))
term = ( Rule("factor") + Tok("k_ast") + Rule("term")
| Rule("factor") + Tok("k_slash") + Rule("term")
| Rule("factor"))
factor = ( Rule("constant")
| Rule("variable")
| Tok("k_lpar") + Rule("expression") + Tok("k_rpar"))
variable = Tok(k_identifier) | Tok("k_minus") + k_identifier
constant = k_int
g = pp.Grammar("expression", parser, locals())
tree = parser.parse("4 + my_var * (3 - 1)", pstate="expression")
print(tree.tree_repr())
This example uses several of the helper methods functions to quickly define tokens. The tokens must all be defined, but they do not need to be explicitly made into token literals (at least not for grammar-based parsing alone). They are simply read in as tokens from the lexer because the grammar specifies what to look for.
Notice that token instances can appear directly in the grammar as token
literals. The token named by its token label appears as, for example,
Tok("k_plus")
. Token instances can also appear inside Tok
calls.
The output from the above code is as follows:
<k_null-string,'expression'>
<k_null-string,'term'>
<k_null-string,'factor'>
<k_null-string,'constant'>
<k_int,'4'>
<k_plus,'+'>
<k_null-string,'expression'>
<k_null-string,'term'>
<k_null-string,'factor'>
<k_null-string,'variable'>
<k_identifier,'my_var'>
<k_ast,'*'>
<k_null-string,'term'>
<k_null-string,'factor'>
<k_lpar,'('>
<k_null-string,'expression'>
<k_null-string,'term'>
<k_null-string,'factor'>
<k_null-string,'constant'>
<k_int,'3'>
<k_minus,'-'>
<k_null-string,'expression'>
<k_null-string,'term'>
<k_null-string,'factor'>
<k_null-string,'constant'>
<k_int,'1'>
<k_rpar,')'>
At some point the ability to suppress null-string tokens representing nonterminals from appearing in the tree will be added.