Recursive descent parsing in Typped
===================================

This section discusses recursive descent parsing and how it relates to
Pratt parsing and the Typped package. Recursive descent is implemented
in the Typped parser via the use of preconditioned dispatching and the
introduction of a new kind of token.

The section begins with a comparison of the recursive descent parsing
algorithm with the Pratt parser algorithm using preconditioned
dispatching. A new kind of token is introduced (the null-string token)
which bridges one important gap between the two methods. This leads to
a discussion of how recursive descent is implemented in Typped. The
Typped package provides a higher-level, EBNF-like grammar specification
language, implemented by operator overloading, which makes it easy to
implement grammar-based languages or partial languages. Readers who are
only interested in using the EBNF-like language can skip to
:ref:`EBNF`.

It should be noted that recursive descent parsing is not all that
difficult to implement by hand, especially with a good lexer. It is
possible to implement a traditional recursive descent parser using the
same lexer as a ``PrattParser`` instance. Then either the recursive
descent parser can call the Pratt parser to handle subexpressions, or
the handlers of the Pratt parser can call the recursive descent
functions to parse sub-grammars.

.. note::

   Some of these features are still experimental, and some are only
   partially implemented. Only basic BNF currently works in the grammar
   parser. Type-checking and Pratt-style precedences in the grammars
   are not yet implemented, and the implementation still needs to be
   optimized.

Similarities and differences between the parsing methods
--------------------------------------------------------

Pratt parsing is similar to recursive descent parsing in the sense that
both are top-down recursive methods. Pratt parsing, however, is
token-based, while recursive descent parsing is based on the production
rules of a grammar. The ability to dispatch handlers based on
preconditions makes the Typped Pratt parser even more similar to a
recursive descent parser.

The terminal symbols in a grammar correspond to the token literals in a
Pratt parser. The discussion below assumes this correspondence, so
terminals are defined by the regular expressions in the definitions of
the literal tokens.

Suppose all the productions in a grammar begin with a terminal symbol
(represented by the regular expression of some corresponding token), as
is the case, for example, in a grammar in **Greibach normal form**. A
Pratt parser with preconditioned dispatching can be used to directly
implement recursive descent parsing on such a grammar. Since each rule
begins with a terminal symbol, corresponding to a token literal, the
head-handler triggered by each such literal token can be set up to
process the rule. In the Typped dispatching terminology, the literal
token can be used to trigger a syntactic construct (containing the
handler function, a preconditions function, and related data) which
will then process the full production rule.

In order to implement the recursive descent part you can keep a stack
of production-rule labels, updated so that the current production rule
is always on top. In a ``PrattParser`` instance this stack is stored in
a list attribute called ``pstate``. The construct for handling a
production rule is given a precondition that the top label in the stack
is the label for the rule it processes. The head-handler functions in
these constructs are also responsible for updating the ``pstate`` stack
as necessary. Using this, along with lookahead to the upcoming tokens
in the preconditions, gives the power to do recursive descent parsing
of such grammars. The head-handler functions mimic the separate
recursive functions for the production rules in a recursive descent
parser. They are triggered by the corresponding literal tokens when the
preconditions match (or, more generally, by first-set elements). The
``pstate`` stack keeps track of the recursion in the tree defined by
the grammar. Either a single handler can process all the cases of a
nonterminal, or separate handlers can be triggered to handle the cases
separately.
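To make the mechanism concrete, here is a small runnable sketch of the
dispatching idea in plain Python. It is a toy illustration only, not
the Typped API: the lexer, the ``ToyDispatcher`` class, and all of the
token and rule labels are made-up names, and real Typped constructs
carry more data (such as priorities and type information).

.. code-block:: python

   import re

   def tokenize(text):
       """A minimal lexer: integers plus the operators "+" and "-"."""
       pairs = []
       for value in re.findall(r"\d+|[+-]", text):
           if value.isdigit():
               label = "k_int"
           else:
               label = "k_plus" if value == "+" else "k_minus"
           pairs.append((label, value))
       return pairs

   class ToyDispatcher:
       """Dispatch head handlers on (token label, precondition) pairs."""
       def __init__(self):
           self.constructs = []   # list of (token_label, precond, handler)
           self.pstate = []       # stack of production-rule labels

       def register(self, token_label, precond, handler):
           self.constructs.append((token_label, precond, handler))

       def parse(self, text, start_state):
           self.tokens = tokenize(text)
           self.pos = 0
           self.pstate = [start_state]
           return self.recursive_parse()

       def peek(self):
           return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

       def recursive_parse(self):
           label, value = self.tokens[self.pos]
           self.pos += 1
           for tok_label, precond, handler in self.constructs:
               if tok_label == label and precond(self):
                   return handler(self, value)   # dispatch the winning handler
           raise SyntaxError("no construct matches token " + label)

   # The rule  expression : number {("+"|"-") number}  starts with a
   # terminal (an integer token), so that token triggers its head handler.
   def expr_precond(p):
       return p.pstate[-1] == "expression"

   def expr_handler(p, value):
       result = int(value)
       while p.peek() in ("k_plus", "k_minus"):
           op_label, _ = p.tokens[p.pos]
           p.pos += 1
           p.pstate.append("number")      # expect a bare number next
           rhs = p.recursive_parse()      # dispatches number_handler
           p.pstate.pop()
           result = result + rhs if op_label == "k_plus" else result - rhs
       return result

   # A second construct on the same token, distinguished only by pstate.
   def number_precond(p):
       return p.pstate[-1] == "number"

   def number_handler(p, value):
       return int(value)

   parser = ToyDispatcher()
   parser.register("k_int", expr_precond, expr_handler)
   parser.register("k_int", number_precond, number_handler)
   print(parser.parse("10-3-2", "expression"))   # prints 5

Note that both constructs are triggered by the same integer token and
are distinguished only by their preconditions on the ``pstate`` stack,
which is exactly the dispatching pattern described above.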
In extending this approach to general recursive descent, a problem
arises when a production starts with a nonterminal symbol. Nonterminals
do not correspond to tokens the way terminals do, so there is no token
to trigger the construct for parsing the rule.

To deal with this, the Typped package has an experimental feature
called a **null-string token**. This is a special token which acts as
though its regular expression matches the null string just before any
upcoming text. Other than that it is a regular token, with handlers,
preconditions, etc. This is implemented in the ``recursive_parse``
function. Before each call to ``next`` to get the next token from the
lexer, it first checks whether a null-string token is defined for the
parser instance. If so, it checks whether the preconditions of any
registered null-string-triggered constructs match. If they do, the
special null-string token is returned as the current token and the
handler function of the winning construct is called to process the next
subexpression.

The use of null-string tokens, combined with keeping a stack of
production-rule state labels, is enough to implement general recursive
descent parsing within the framework of a Pratt parser with
preconditioned dispatching. The method is essentially the same as
described earlier for grammars whose rules all begin with terminals;
the null-string token is simply used to recognize the production rules
that start with nonterminals (using the top of the ``pstate`` stack in
the preconditions).
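The following toy sketch, again with hypothetical names rather than the
actual Typped internals, shows where the null-string check fits: before
consuming a real token, ``recursive_parse`` first offers the position
to any null-string construct whose precondition matches, so a rule that
begins with a nonterminal can still trigger a handler. The toy grammar
is ``expression : term {"+" term}`` with ``term : number``.

.. code-block:: python

   import re

   class ToyParser:
       """Toy parser whose recursive_parse checks null-string constructs."""
       def __init__(self):
           self.null_constructs = []   # list of (precond, handler) pairs
           self.pstate = []            # stack of production-rule labels

       def register_null(self, precond, handler):
           self.null_constructs.append((precond, handler))

       def parse(self, text, start_state):
           self.tokens = re.findall(r"\d+|\+", text)
           self.pos = 0
           self.pstate = [start_state]
           return self.recursive_parse()

       def peek(self):
           return self.tokens[self.pos] if self.pos < len(self.tokens) else None

       def recursive_parse(self):
           # The null-string check: a matching construct fires *without*
           # consuming any text, as if its regex matched the empty string.
           for precond, handler in self.null_constructs:
               if precond(self):
                   return handler(self)
           # Otherwise handle an ordinary token (only integers occur here).
           value = self.tokens[self.pos]
           self.pos += 1
           return int(value)

   def expression_precond(p):
       return p.pstate[-1] == "expression"

   def expression_handler(p):
       # expression : term {"+" term} -- this rule starts with the
       # nonterminal "term", so only the null-string token can trigger it.
       p.pstate.append("term")
       total = p.recursive_parse()         # parse the first term (a number)
       while p.peek() == "+":
           p.pos += 1                      # consume the "+"
           total += p.recursive_parse()    # parse the next term
       p.pstate.pop()
       return total

   parser = ToyParser()
   parser.register_null(expression_precond, expression_handler)
   print(parser.parse("1+2+3", "expression"))   # prints 6

In the real implementation the null-string token is an actual token
object with its own handlers and preconditions, and the winning
construct is chosen from all the registered null-string constructs by
the usual dispatching rules.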
Example
-------

Consider a very simple expression grammar in EBNF (even though the
expression parts of a grammar might be better evaluated with
Pratt-style parsing). The ``identifier`` and ``number`` productions are
assumed to be implemented as tokens from the lexer, defined by regular
expressions. Here square brackets enclose optional parts, curly braces
mean "zero or more," and ``(x|y)`` means either ``x`` or ``y``.

.. TODO: consider this, especially w.r.t. associativity:
   http://homepage.divms.uiowa.edu/~jones/compiler/spring13/notes/10.shtml

.. productionlist::
   expression : ["+"|"-"] term {("+"|"-") term}
   term       : factor {("*"|"/") factor}
   factor     : `identifier` | `number` | "(" expression ")"

Initially the ``pstate`` stack would hold only the string
``"expression"``. A head construct would be registered for the
null-string token with the precondition that the ``"expression"`` state
be at the top of the ``pstate`` stack. The head handler for the
construct would first check whether the peek token is ``+`` or ``-``
and, if so, consume it. Then the string ``"term"`` would be pushed onto
the stack and ``recursive_parse`` would be called. The call to
``recursive_parse`` returns a processed subtree, which is incorporated
into the expression tree. A loop would continue this way until the peek
token is not ``+`` or ``-``.

Another head construct would be registered for the null-string token
with the precondition that ``"term"`` be at the top of the stack. Its
head-handler function would be responsible for parsing terms, and it
would work in the same general way as described above for expressions.
The ``factor`` production could be implemented either as a handler for
the null-string token or by separate constructs for the identifier,
number, and left-paren token types.
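The walkthrough above translates almost line for line into ordinary
recursive descent functions, which also illustrates the introductory
remark that such a parser is not hard to write by hand over a token
stream. The sketch below is plain Python with made-up helper names; it
builds the tree as nested tuples rather than Typped token nodes.

.. code-block:: python

   import re

   class TokenStream:
       """A trivial stand-in for a lexer over this grammar's tokens."""
       def __init__(self, text):
           self.toks = re.findall(r"\d+|[A-Za-z_]\w*|[()+*/-]", text)
           self.pos = 0

       def peek(self):
           return self.toks[self.pos] if self.pos < len(self.toks) else None

       def next(self):
           tok = self.toks[self.pos]
           self.pos += 1
           return tok

   def parse_expression(ts):
       # expression : ["+"|"-"] term {("+"|"-") term}
       children = []
       if ts.peek() in ("+", "-"):
           children.append(ts.next())        # optional leading sign
       children.append(parse_term(ts))
       while ts.peek() in ("+", "-"):        # zero or more (+|-) term parts
           children.append(ts.next())
           children.append(parse_term(ts))
       return ("expression", children)

   def parse_term(ts):
       # term : factor {("*"|"/") factor}
       children = [parse_factor(ts)]
       while ts.peek() in ("*", "/"):
           children.append(ts.next())
           children.append(parse_factor(ts))
       return ("term", children)

   def parse_factor(ts):
       # factor : identifier | number | "(" expression ")"
       if ts.peek() == "(":
           ts.next()                         # consume "("
           subtree = parse_expression(ts)
           if ts.next() != ")":
               raise SyntaxError("expected closing parenthesis")
           return subtree
       return ("factor", [ts.next()])        # identifier or number token

   print(parse_expression(TokenStream("4 + my_var * (3 - 1)")))

Each ``parse_*`` function here plays the role of one
null-string-triggered head handler, and the Python call stack plays the
role of the ``pstate`` stack.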
A similar expression grammar in plain BNF is as follows:

.. productionlist::
   expression : term "+" expression | term "-" expression | term
   term       : factor "*" term | factor "/" term | factor
   factor     : constant | variable | "(" expression ")"
   variable   : identifier | "-" identifier
   constant   : number

In this case the ``identifier`` and ``number`` nonterminals would be
token literals defined by the corresponding regexes passed to the
``def_token`` calls.

.. _EBNF:

Recursive descent with Typped's EBNF-like grammar
-------------------------------------------------

The Typped package comes with an EBNF-style grammar definable via
Python operator overloading. It essentially automates the procedure
described above to map recursive descent onto a generalized Pratt
parser. A grammar written in Python, while not as concise as a parsed
EBNF string, is easy to work with, gets syntax highlighting, and makes
it easy to define aliases for complicated components. When the grammar
is "compiled" with respect to a ``PrattParser`` instance it produces a
recursive descent parser for the grammar within the Pratt parser
framework. The generated parsers currently use full backtracking
search. (The use of first-sets is not fully implemented, but it fits
nicely into the precondition-triggering model.)

This feature is still in development and experimental. The code is not
optimized and parts are currently inefficient. The EBNF language is
currently bare-bones as far as what can be compiled into a parser
instance: it does basic BNF. (The EBNF language itself, defined via
Python overloading, is mostly implemented but is not yet fully
compilable into a parser instance. For details of its current state see
the docs for the module ``ebnf_classes_and_operators.py``.)

Below is a simple running example which parses the BNF expression
grammar above. (It also allows signed integers, but not signed
variables, and full identifiers as variables.) See the file
``example_expression_grammar.py`` in the examples directory for the
code.

.. TODO: Keep this example synced with the test file.

.. code-block:: python

   import typped as pp
   from typped import Rule, Tok   # grammar operators (assumed exported)

   parser = pp.PrattParser()
   parser.def_default_whitespace()
   parser.def_default_single_char_tokens()
   k_int = parser.def_default_int_token()
   k_identifier = parser.def_default_identifier_token()

   expression = ( Rule("term") + Tok("k_plus") + Rule("expression")
                | Rule("term") + Tok("k_minus") + Rule("expression")
                | Rule("term"))

   term = ( Rule("factor") + Tok("k_ast") + Rule("term")
          | Rule("factor") + Tok("k_slash") + Rule("term")
          | Rule("factor"))

   factor = ( Rule("constant")
            | Rule("variable")
            | Tok("k_lpar") + Rule("expression") + Tok("k_rpar"))

   variable = Tok(k_identifier) | Tok("k_minus") + k_identifier
   constant = k_int

   g = pp.Grammar("expression", parser, locals())

   tree = parser.parse("4 + my_var * (3 - 1)", pstate="expression")
   print(tree.tree_repr())

This example uses several of the helper methods to quickly define the
tokens. The tokens must all be defined, but they do not need to be
explicitly made into token literals (at least not for grammar-based
parsing alone). They are simply read in as tokens from the lexer,
because the grammar specifies what to look for.

Notice that token instances such as ``k_int`` can appear directly in
the grammar as token literals. A token can also be named by its token
label, as in ``Tok("k_plus")``, and token instances can likewise appear
inside ``Tok`` calls.

The output of the above code is the printed expression tree. Note that
the tree currently contains the null-string tokens which represent the
nonterminals; at some point the ability to suppress these null-string
tokens from appearing in the tree will be added.
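Until that suppression is built in, a tree post-processing pass is one
possible workaround. The sketch below is hypothetical: it assumes
(which the current API does not promise) that tree nodes expose
``children`` and ``token_label`` attributes, and the default
``null_label`` value is only a placeholder for whatever label the
null-string token was actually given.

.. code-block:: python

   def prune_null_string_nodes(node, null_label="k_null-string"):
       """Return the tree with null-string nodes spliced out, replacing
       each one by its children.  Attribute names are assumptions."""
       new_children = []
       for child in node.children:
           pruned = prune_null_string_nodes(child, null_label)
           if pruned.token_label == null_label:
               new_children.extend(pruned.children)   # splice grandchildren in
           else:
               new_children.append(pruned)
       node.children = new_children
       return node

   # Usage (hypothetical): tree = prune_null_string_nodes(tree)

Note that a root node which is itself a null-string token is left in
place by this sketch; only interior null-string nodes are spliced out.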