10.7. typped.ebnf_classes_and_operators

This module implements a nice frontend for parsing grammars using EBNF-like Python expressions. The backend is the recursive-descent parsing capabilities of the PrattParser class, implemented using precondition functions and null-string token handlers.

Similar Python parsing packages.

These are a few (but not all) similar Python parsing packages for parsing from BNF- or EBNF-like grammars (though without the underlying Pratt parsing framework). Some use string representations of the grammar, while others use overloaded Python operators. Most do not automatically produce a parse tree.

For more information on the various Python parsers, see these summaries:

10.7.1. Terminology and notation

These terms are used in the description:

  • The individual rewrite rules such as <expression> ::= <term> in a BNF grammar are called production rules. They are also called productions or just rules. The symbols on the l.h.s. of productions (which can also appear in the r.h.s.) are called nonterminal symbols. The r.h.s of a production is called the parsing expression. The r.h.s. of productions can contain terminal symbols, nonterminal symbols and perhaps other symbols such as the special epsilon symbol which matches an empty string.

  • Production rules with the same l.h.s. nonterminal symbol will be referred to as different cases of the nonterminal symbol. A common notation is to define multiple cases in one production rule expression by using the “or” symbol |. This form of definition is currently required by this module. That is, all the cases for any nonterminal must occur in a single expression, using | if there are multiple cases. The ordered list of all the rule cases for a nonterminal will be called the caselist for the nonterminal. Order matters for resolving ambiguity.

  • The separate symbol elements within a case will be collectively called the items of that case. They include terminal symbols, nonterminal symbols, and possibly the epsilon symbol. In this module there are no explicit terminal symbols. Instead, terminals are either tokens (defined for the lexer and returned by it) or else consecutive sequences of tokens.

This is an example of a definition of a caselist with two cases in the Typped EBNF-like code:

arglist = ( Rule("arg") + Tok("k_comma") + Rule("arglist")
          | Rule("arg")
          )

The tokens involved must already be defined (and the token itself can be used in the rule instead of the label inside a call to Tok). The caselist for arg is not shown, but it could be defined before or after the arglist caselist. The order in which the caselists of production rules are written does not matter, so they can be written top-down, beginning with the start-state nonterminal, or in any other convenient way.

Nonterminals are written as strings passed to Rule precisely so they can be used in the r.h.s. of caselist definitions even when they have not yet been defined. They are resolved later when the compile method of the grammar is called (passed the start nonterminal and the locals dict). These r.h.s. strings for nonterminals must be identical to the l.h.s. Python variable names for the nonterminals (since they are looked up in the locals dict).

The order in which the cases of a nonterminal are defined within a caselist does matter, at least for ambiguous grammars and to avoid or minimize backtracking. The order of the cases is the order in which the algorithm will test the cases. The first successful parse is returned. In this sense the grammar is similar to a parsing expression grammar (PEG) rather than a context-free grammar (CFG), which can be ambiguous. See, e.g., https://en.wikipedia.org/wiki/Parsing_expression_grammar
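The ordered-choice behavior can be illustrated with a minimal, self-contained recursive-descent sketch. This is not the Typped implementation, just a toy parser over a list of token strings that tries the cases of the arglist rule above in order and backtracks on failure, PEG-style:

```python
# Minimal PEG-style ordered choice: try cases in order, return the first
# parse that succeeds, backtracking on failure.  Toy code, NOT Typped's.

def parse_arglist(tokens, pos=0):
    """arglist ::= arg "," arglist | arg   (cases tried in order)."""
    # Case 1: arg "," arglist
    result = parse_arg(tokens, pos)
    if result is not None:
        arg_tree, after_arg = result
        if after_arg < len(tokens) and tokens[after_arg] == ",":
            rest = parse_arglist(tokens, after_arg + 1)
            if rest is not None:
                rest_tree, after_rest = rest
                return ("arglist", arg_tree, rest_tree), after_rest
    # Case 2: arg   (backtrack to pos and try the shorter case)
    result = parse_arg(tokens, pos)
    if result is not None:
        arg_tree, after_arg = result
        return ("arglist", arg_tree), after_arg
    return None  # no case matched

def parse_arg(tokens, pos):
    """arg ::= any single token that is not a comma (for this demo)."""
    if pos < len(tokens) and tokens[pos] != ",":
        return ("arg", tokens[pos]), pos + 1
    return None

tree, end = parse_arglist(["x", ",", "y", ",", "z"])
```

Because the longer case is listed first, the parser consumes the whole comma-separated list; swapping the case order would make it stop after the first arg, which is why case order matters.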

10.7.2. Implementation

This module is a work in progress. As of now the syntactic Python interface mostly works, but not all of the features have been coded into the backend “compiling” algorithm. Currently it handles basic BNF-style production rules, but not many of the EBNF extensions. See the test file cases for examples. Here is a summary of what is implemented and what is not yet implemented.

Implemented:

  • Backtracking recursive descent search.
  • Rule
  • Tok

Not yet implemented:

  • Prec and precedences in productions
  • Sig type handling
  • Pratt calls to the Pratt parser
  • Opt
  • Repeated items (OneOrMore, ZeroOrMore, Between, etc.)
  • Not
  • OneOf
  • Hide
  • LL(1) optimization
  • Epsilon production handling.
  • Undo compile in the Grammar class.

10.7.3. Wrapper functions

Strings in the rule-defining expressions must be wrapped by some function call, even though allowing plain strings would be convenient (for rule labels, or for using token labels instead of the token objects). Plain strings would work in most cases, but since addition is already defined for Python strings, the approach fails when an expression begins with two strings in a row.
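The problem is plain Python semantics: + on two strings is string concatenation, so no grammar-building overload is ever invoked. A toy stand-in class (hypothetical, not Typped's Item) shows that a wrapped operand anywhere in the pair gives the overload a chance, while two bare strings give it none:

```python
# Why plain strings cannot start a rule expression: Python already
# defines + for strings, so no grammar-building hook is ever invoked.
result = "arg" + "k_comma"
# result is "argk_comma" -- ordinary string concatenation.

class Wrapped:                      # hypothetical stand-in, not Typped's Item
    def __init__(self, *labels):
        self.labels = list(labels)
    def __add__(self, other):       # Wrapped + str  or  Wrapped + Wrapped
        extra = other.labels if isinstance(other, Wrapped) else [other]
        return Wrapped(*(self.labels + extra))
    def __radd__(self, other):      # str + Wrapped: str.__add__ fails first
        return Wrapped(other, *self.labels)

a = Wrapped("arg") + "k_comma"      # works: Wrapped.__add__ fires
b = "arg" + Wrapped("k_comma")      # works: Wrapped.__radd__ fires
```

With "arg" + "k_comma" neither hook can fire, which is why every string in a rule expression must be wrapped by Rule, Tok, or a similar function.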

Wrapper functions:

Function     Arguments                    Shortcut        Python3 only
Rule         rule-label (a string)
Tok          token                        token
Root         item
Prec         item, prec                   item[prec]
Sig          item, type sig               item(sig)
Pratt        (optional) pstate, type sig
Opt          item (any number of args)
nExactly     int, item                    n * item
nOrMore      int, item                    (n,) * item     (n,…) * item
OneOrMore    item                         (1,) * item     (1,…) * item
ZeroOrMore   item                         (0,) * item     (0,…) * item
Between      int, int, item               (m,n) * item
Hide         item
Not          item
AnyOf        itemlist

Note that (n,m) and its variations are equivalent to [n,m] if that syntax looks clearer.

10.7.4. Overloaded operator API

The basic objects that make up rule definitions are Item objects, ItemList objects, and CaseList objects. The latter two are just list-like objects with most of the list operations overloaded. An ItemList only holds Item instances, and a CaseList only holds ItemList instances. These objects do not nest, though, and so there are some important differences from ordinary Python lists.

The ItemList and CaseList classes are basically designed for concatenation operations, since that is what is used to build up production rule expressions. They are “flat” in the sense that they only hold one kind of object and do everything they can to convert an object to that type before saving it on the list. An ItemList instance holds Item instances and a CaseList instance holds ItemList instances. Because of the latter property Case is defined as an alias for ItemList. Both of the classes take an arbitrary number of arguments in their constructors. All the arguments are converted to elements of the single type that they hold. The new instance then initially contains all those converted arguments as its elements. When an ItemList is passed another ItemList in its initializer argument list it just takes the elements within that list and extends its list with them. The CaseList class works similarly.

So the initializers basically form the concatenation of all the passed-in arguments, after converting each one to the type of object that the list-like object holds (each holds only one type of object). The addition and “or” operations are essentially shorthand for putting both operands on an initializer list of the appropriate return type. Appending to a list works the same except it gives the in-place result of appending that item after converting it.
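The flattening behavior can be sketched with toy stand-in classes (not the real Item and ItemList, which live in typped.ebnf_classes_and_operators):

```python
# Toy sketch of the "flat" list-like behavior described above.

class ToyItem:
    def __init__(self, label):
        self.label = label

class ToyItemList:
    """Holds only ToyItem instances; converts or flattens everything else."""
    def __init__(self, *args):
        self.data = []
        for arg in args:
            if isinstance(arg, ToyItemList):
                self.data.extend(arg.data)      # flatten, never nest
            elif isinstance(arg, ToyItem):
                self.data.append(arg)
            else:
                self.data.append(ToyItem(arg))  # convert (e.g. a label string)
    def __add__(self, other):
        # a + b is the same as ToyItemList(a, b).
        return ToyItemList(self, other)

inner = ToyItemList("a", "b")
outer = ToyItemList(inner, "c")        # flattened: three items, no nesting
combined = outer + ToyItemList("d")    # addition == initializer concatenation
```

Note that outer ends up holding three ToyItem elements rather than a nested list, which mirrors the flattening described above.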

Summary of the operations:

  • Addition: Two Item and/or ItemList instances can be combined with +, which always returns an ItemList. The operation is the same as if the operands had both been on the initializer list for ItemList. The += operator is also defined. The addition operator is not defined for CaseList objects in order to possibly catch some syntax errors in expressions (although there are ordinary add and iadd methods).

  • Case joining: The “or” operation | is defined for Item, ItemList, or CaseList instances. It always returns a CaseList. The operands are joined together as if they had been arguments to the initializer of a CaseList.

  • Tokens in expressions: The + and | operators are defined for tokens (in the PrattParser module) to behave in the same way as for Item instances. This allows the use of the tokens directly, without having to convert them into Item instances by wrapping them in the Tok function.

  • Other list-like operations: The methods append and insert are defined for these list-like classes. They convert their argument to the correct type for the container and then append it to the list in-place. Indexing of these list-like objects is also supported, including negative indices and slices. This allows them to be iterated over. They also have a len, so they can be tested for emptiness.

Note

Since the + operator has a higher precedence than the | operator, all the additions within a case will always be carried out before any “or” operations. So each argument to | will be either a single token, a single Item, or a single ItemList.

Note that after a full expression containing these objects and operators is evaluated the resulting r.h.s. object (which is set to the l.h.s. variable name for a production rule) can be 1) a single token, 2) a single Item, 3) a single ItemList, or 4) a CaseList. The compile method of a Grammar instance will always convert the value into a CaseList instance. (It would be possible to overload the <<= operator and use it instead of = to automatically do the conversion, but that does not seem worth the extra notation and boilerplate.)
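This grouping is just Python's own operator precedence at work; a toy sketch (hypothetical classes, not Typped's) shows that + binds before |, so each operand of | is already a complete sequence:

```python
# Sketch showing that + groups before | in a rule expression.

class Seq:                              # plays the role of Item/ItemList
    def __init__(self, *parts):
        self.parts = list(parts)
    def __add__(self, other):           # sequence concatenation
        return Seq(*(self.parts + other.parts))
    def __or__(self, other):            # case joining -> Cases
        return Cases(self, other)

class Cases:                            # plays the role of CaseList
    def __init__(self, *cases):
        self.cases = list(cases)
    def __or__(self, other):            # append another case
        return Cases(*self.cases, other)

# Mirrors:  arglist = Rule("arg") + Tok("k_comma") + Rule("arglist") | Rule("arg")
expr = Seq("arg") + Seq("k_comma") + Seq("arglist") | Seq("arg")
```

The additions run first and produce one three-element sequence, and only then does | join the two cases into a Cases object, just as a full Typped rule expression ends up as a CaseList.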

10.7.5. Modifiers for items

Items can have several begin/end modifiers to indicate when special processing starts or ends. These are stored in a list attribute called begin_end_modifiers. End modifiers are always the string “)”. Begin modifiers can be any of these (set by the corresponding function applied to the Item or ItemList):

  • "Opt("
  • "OneOrMore("
  • "ZeroOrMore("

10.7.6. Operator precedences expressed in grammar

This is not yet implemented, but will mimic the way a Pratt parser works.

In the EBNF-like grammar precedences are defined using index-style brackets:

Tok("k_plus")[30]

The above would give a precedence of 30 to the token.
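Index-style brackets like this can be implemented by overloading __getitem__; a minimal sketch of the idea (hypothetical code, not Typped's implementation):

```python
# Sketch of index-style precedence brackets: __getitem__ records the
# precedence on the item.  Hypothetical, not Typped's actual code.

class ToyTokenItem:
    def __init__(self, token_label):
        self.token_label = token_label
        self.prec = 0                   # default precedence
    def __getitem__(self, prec):
        self.prec = prec                # item[30] sets precedence 30
        return self                     # return the item so chaining works

plus_item = ToyTokenItem("k_plus")[30]
```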

10.7.7. Optimizing the grammar

This section discusses possible optimizations to the grammar. Predictive parsing is close to being fully implemented, but is not fully set up yet.

10.7.7.1. predictive parsing

In order to optimize the parsing of a recursive descent grammar, many grammars allow the use of predictive parsing, which requires no backtracking. Even when predictive parsing is not possible, often partial predictive parsing can make the algorithm more efficient (falling back to backtracking search only when necessary).

To use a predictive parse you need to generate a first set for each nonterminal (i.e., recursive rule call) in the grammar.

When epsilon productions are allowed a follow set acts similarly to a first set.
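As an illustration (toy code, not part of Typped), first sets for a small grammar can be computed with a fixed-point iteration, using the empty string "" to stand for epsilon:

```python
# Fixed-point computation of FIRST sets for a toy grammar, with ""
# standing for epsilon.  Illustrative only; not Typped's implementation.

def first_sets(grammar):
    """grammar: dict mapping nonterminal -> list of cases (lists of symbols).
    Symbols not in the dict are terminals.  Returns dict of FIRST sets."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                              # iterate to a fixed point
        changed = False
        for nt, cases in grammar.items():
            for case in cases:
                new = set()
                if not case:                    # epsilon production
                    new.add("")
                for sym in case:
                    if sym not in grammar:      # terminal: add it and stop
                        new.add(sym)
                        break
                    new |= first[sym] - {""}
                    if "" not in first[sym]:    # sym cannot vanish; stop
                        break
                else:
                    if case:                    # every symbol can vanish
                        new.add("")
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# arglist ::= arg "," arglist | arg ;   arg ::= "name" | epsilon
g = {"arglist": [["arg", ",", "arglist"], ["arg"]],
     "arg": [["name"], []]}
fs = first_sets(g)
```

At a choice point the parser can then commit to the single case whose first set contains the lookahead token, falling back to backtracking only when the first sets of two cases overlap.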

See, e.g.:

Maybe also consider packrat parsing:

10.7.7.2. grammar transformations

Not implemented. Just an idea for now, but you could do any number of grammar transformations on the rules of a Grammar object.

One possibility is to remove at least trivial left recursion by reordering any obviously left-recursive cases.

In a more complicated transformation you could do left factoring on the grammar to remove the left recursion, but that isn’t likely to happen.

Consider curtailment of left recursion, too: if a derivation is going to repeat, will it repeat within n levels, where n is the number of rules? What is the limit? See some of the articles above and consider implementing this if it is not too much work.

10.7.8. Code

class typped.ebnf_classes_and_operators.Grammar(*args, **kwargs)[source]

Bases: object

An object representing a context-free grammar. It is basically a dict of caselists indexed by nonterminal labels. Provides various methods for processing the caselists.

compile(start_nonterm_label, parser, locals_dict, register=True)[source]

Create the Pratt parser handlers in parser to parse the current grammar.

The start_nonterm_label is the starting nonterminal. Only rules which are reachable from the rule cases for this starting nonterminal will be processed.

The parser is a PrattParser instance.

The locals_dict should be passed locals=locals(). If you also need globals then you have to merge the locals() and globals() dicts (with locals overwriting) and pass that dict instead.
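Merging with locals overwriting is plain dict merging; a small sketch with stand-in dicts (the names globals_like and locals_like are just placeholders for the real globals() and locals() results):

```python
# Merging two namespace dicts with the second one overwriting, the way
# the locals()/globals() merge described above should behave.
globals_like = {"arg": "global_arg", "other_rule": "global_other"}
locals_like = {"arg": "local_arg"}

merged = dict(globals_like)
merged.update(locals_like)          # locals overwrite globals
# equivalently:  merged = {**globals_like, **locals_like}
```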

If register is true the rules are registered with the PrattParser instance parser to enable it to parse the grammar.

uncompile()[source]

Undo the effect of the compile command. Can be used for dynamic grammars, but NOT IMPLEMENTED YET.

print_grammar()[source]
class typped.ebnf_classes_and_operators.Item(value=None)[source]

Bases: object

Class representing the basic elements that make up the cases of the production rules.

class typped.ebnf_classes_and_operators.ItemList(*args)[source]

Bases: object

A list of Item instances.

append(item)[source]

Append an item to the list.

insert(index, item)[source]

Insert an item.

typped.ebnf_classes_and_operators.Case

alias of ItemList

class typped.ebnf_classes_and_operators.CaseList(*args)[source]

Bases: object

A list of Case objects. Note, though, that a single Item or ItemList can also be a case (when there are no “or” operations to form the case).

append(item)[source]

Append an item to the list.

insert(index, item)[source]

Insert an item.

add(other)[source]

Add a CaseList to some other object, which must be convertible to a CaseList. This method is purposely not overloaded with the operator + because that operator is used in the production rule strings for ItemList objects, but in that context is an error if applied to CaseList objects.

iadd(other)[source]

Add a CaseList to some other object, in place. Like add but in-place.

typped.ebnf_classes_and_operators.Rule(nonterm_label)[source]

Return an Item to represent the nonterminal with the string label nonterm_label.

typped.ebnf_classes_and_operators.Tok(token)[source]

Turn a token into an item. Used before the operator overloads are defined on tokens. Can be passed a token object or a string token label.

typped.ebnf_classes_and_operators.Root(item_init_arg, prec=None)[source]

A function to designate that the token for the item should be made into the root of the resulting parse subtree, if possible.

typped.ebnf_classes_and_operators.Prec(item_init_arg, prec)[source]

Set the operator precedence when called from a tail handler. Can only wrap an item.

typped.ebnf_classes_and_operators.Sig(item_init_arg, type_sig)[source]

Set the type signature for an item.

typped.ebnf_classes_and_operators.Pratt(pstate=None, type_sig=None)[source]

Use an ordinary Pratt parser recursive_parse to get a subexpression. The parameter pstate is a state that will be temporarily pushed on the pstate_stack during the parsing (which can be used as a precondition). The optional type signature can also be set to be checked.

typped.ebnf_classes_and_operators.Opt(*args)[source]

List of optional arguments, can match any one or none.

typped.ebnf_classes_and_operators.Repeat(range_spec, arg)[source]

Used to process overload of multiplication for repetition.

typped.ebnf_classes_and_operators.nExactly(n, arg)[source]
typped.ebnf_classes_and_operators.nOrMore(n, arg)[source]
typped.ebnf_classes_and_operators.Between(m, n, arg)[source]
typped.ebnf_classes_and_operators.OneOrMore(arg)[source]
typped.ebnf_classes_and_operators.ZeroOrMore(arg)[source]
typped.ebnf_classes_and_operators.Hide(itemlist)[source]

Do not show the items in the final tree. For example, parentheses can be ignored in function argument lists.

typped.ebnf_classes_and_operators.Not(token)[source]

The token cannot appear or the case fails.

typped.ebnf_classes_and_operators.OneOf(*args)[source]
typped.ebnf_classes_and_operators.raise_if_not(instanceof_list, issubclass_list, operand_or_arg, calling_instance, operator_or_method_string, kind='op')[source]

Error-checking routine called from overloaded operators. If operand_or_arg is not an instance of a class in instanceof_list or a subclass of a class in issubclass_list then raise a ParserGrammarRuleException with a helpful error message.

If kind is "op" the message is for an operator. If it is "method" then the message is for a method.

typped.ebnf_classes_and_operators.print_indented_caselist(string_before, caselist)[source]

Print routine for debugging.

exception typped.ebnf_classes_and_operators.ParserGrammarRuleException[source]

Bases: typped.shared_settings_and_exceptions.ParserException

Exception raised by grammar classes.