10.7. typped.ebnf_classes_and_operators

This module implements a frontend for parsing grammars written as EBNF-like Python expressions. The backend is the recursive-descent parsing capability of the PrattParser class, implemented using precondition functions and null-string token handlers.
Similar Python parsing packages

These are a few of the similar Python packages for parsing from BNF- or EBNF-like grammars (though without the underlying Pratt parsing framework). Some use string representations of the grammar, while others use overloaded Python operators. Most do not automatically produce a parse tree.
- pyparsing – Uses Python overloading to define the grammar, similar to this module. http://pyparsing.wikispaces.com/home
- Parsimonious – Passed a string containing the EBNF of the grammar and returns a parse tree. https://github.com/erikrose/parsimonious
- Parsley – Passed a string containing the EBNF of the grammar. https://github.com/python-parsley/parsley/
- yeanpypa – Uses Python overloading, similar to this module. https://github.com/DerNamenlose/yeanpypa
- Lark – Passed a string. Implements Earley & LALR(1) and returns a parse tree. https://github.com/erezsh/Lark
For more information on the various Python parsers, see the available online summaries comparing them.
10.7.1. Terminology and notation
These terms are used in the description:
- The individual rewrite rules in a BNF grammar, such as

      <expression> ::= <term>

  are called production rules. They are also called productions or just rules. The symbols on the l.h.s. of productions (which can also appear in the r.h.s.) are called nonterminal symbols. The r.h.s. of a production is called the parsing expression. The r.h.s. of productions can contain terminal symbols, nonterminal symbols, and perhaps other symbols such as the special epsilon symbol, which matches an empty string.
- Production rules with the same l.h.s. nonterminal symbol will be referred to as different cases of the nonterminal symbol. A common notation is to define multiple cases in one production rule expression by using the "or" symbol |. This form of definition is currently required by this module. That is, all the cases for any nonterminal must occur in a single expression, using | if there are multiple cases. The ordered list of all the rule cases for a nonterminal will be called the caselist for the nonterminal. Order matters for resolving ambiguity.
- The separate symbol elements within a case will be collectively called the items of that case. They include terminal symbols, nonterminal symbols, and possibly the epsilon symbol. In this module there are no explicit terminal symbols. Instead, terminals are either tokens (defined for the lexer and returned by it) or else consecutive sequences of tokens.
This is an example of a definition of a caselist with two cases in the Typped EBNF-like code:
arglist = ( Rule("arg") + Tok("k_comma") + Rule("arglist")
| Rule("arg")
)
The tokens involved must already be defined (and the token itself can be used in the rule instead of the label inside a call to Tok). The caselist for arg is not shown, but it could be defined before or after the arglist caselist. The order in which the caselists of production rules are written does not matter, so they can be written top-down, beginning with the start-state nonterminal, or in any other convenient way.
Nonterminals are written as strings passed to Rule precisely so they can be used in the r.h.s. of caselist definitions even when they have not yet been defined. They are resolved later, when the compile method of the grammar is called (passed the start nonterminal and the locals dict). These r.h.s. strings for nonterminals must be identical to the l.h.s. Python variable names for the nonterminals (since they are looked up in the locals dict).
The order in which the cases of a nonterminal are defined within a caselist does matter, at least for ambiguous grammars and to avoid or minimize backtracking. The order of the cases is the order in which the algorithm will test the cases. The first successful parse is returned. In this sense the grammar is similar to a parsing expression grammar (PEG) rather than a context-free grammar (CFG), which can be ambiguous. See, e.g., https://en.wikipedia.org/wiki/Parsing_expression_grammar
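The ordered, first-match-wins behavior described above can be illustrated with a minimal backtracking matcher. This is a standalone sketch, not the typped implementation; the rule format and token lists are made up for illustration:

```python
def parse_seq(tokens, pos, items, rules):
    """Match each item of a case in sequence; return the end position or None."""
    for item in items:
        if item in rules:                        # nonterminal: recurse
            pos = parse_rule(tokens, pos, item, rules)
            if pos is None:
                return None
        elif pos < len(tokens) and tokens[pos] == item:
            pos += 1                             # terminal: consume one token
        else:
            return None
    return pos

def parse_rule(tokens, pos, nonterm, rules):
    """Try each case in the listed order; the first full match wins (PEG-style)."""
    for case in rules[nonterm]:
        end = parse_seq(tokens, pos, case, rules)
        if end is not None:
            return end
    return None

# arglist ::= arg "," arglist | arg    (longer case listed first)
rules = {"arglist": [["arg", ",", "arglist"], ["arg"]],
         "arg": [["x"]]}
tokens = ["x", ",", "x", ",", "x"]
print(parse_rule(tokens, 0, "arglist", rules))       # 5: all tokens matched

# With the cases reversed the short case matches first and parsing stops early:
rules_bad = {"arglist": [["arg"], ["arg", ",", "arglist"]],
             "arg": [["x"]]}
print(parse_rule(tokens, 0, "arglist", rules_bad))   # 1: only the first "x"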
10.7.2. Implementation

This module is a work in progress. As of now the syntactic Python interface mostly works, but not all of the features have been coded into the backend "compiling" algorithm. Currently it handles basic BNF-style production rules, but not many of the EBNF extensions. See the cases in the test file for examples. Here is a summary of what is and is not yet implemented.
Implemented:

- Backtracking recursive-descent search.
- Rule
- Tok

Not yet implemented:

- Prec and precedences in productions
- Sig type handling
- Pratt calls to the Pratt parser
- Opt
- Repeated items (OneOrMore, ZeroOrMore, Between, etc.)
- Not
- OneOf
- Hide
- LL(1) optimization
- Epsilon production handling
- Undo compile in the Grammar class
10.7.3. Wrapper functions

Strings in the rule-defining expressions must be wrapped by some function call, even though allowing plain strings would be convenient for rule labels or for using token labels instead of the token objects. Plain strings would work in most cases, but since addition is defined for strings it would fail for two strings at the beginning of an expression.
Wrapper functions:

Function | Arguments | Shortcut | Python3 only
---|---|---|---
Rule | rule-label (a string) | |
Tok | token | token |
Root | item | |
Prec | item, prec | item[prec] |
Sig | item, type sig | item(sig) |
Pratt | (optional) pstate, type sig | |
Opt | item (any number of args) | |
nExactly | int, item | n * item |
nOrMore | int, item | (n,) * item | (n,…) * item
OneOrMore | item | (1,) * item | (1,…) * item
ZeroOrMore | item | (0,) * item | (0,…) * item
Between | int, int, item | (m,n) * item |
Hide | item | |
Not | item | |
AnyOf | itemlist | |
Note that (n,m) and its variations are equivalent to [n,m], if that syntax looks clearer.
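The tuple-multiplication shortcuts in the table rely on Python's reflected multiplication hook: since int and tuple do not know how to multiply an item object, the item's __rmul__ method is called instead. Here is a minimal sketch of how that can work; RepeatableItem is a hypothetical stand-in, not the typped Item class:

```python
class RepeatableItem:
    """Stand-in item class showing how n * item and (n, m) * item can work."""
    def __init__(self, label):
        self.label = label
        self.repeat_range = None       # (min, max); max None means unbounded

    def __rmul__(self, left):
        # int * item and tuple * item both fall through to this reflected
        # method, since int.__mul__ and tuple.__mul__ return NotImplemented
        # for an item operand.
        if isinstance(left, int):                  # n * item: exactly n times
            self.repeat_range = (left, left)
            return self
        if isinstance(left, tuple):                # (n,) or (n, m) or (n, ...)
            lo = left[0]
            hi = left[1] if len(left) > 1 and isinstance(left[1], int) else None
            self.repeat_range = (lo, hi)           # Ellipsis/absent: unbounded
            return self
        return NotImplemented

print((3 * RepeatableItem("x")).repeat_range)        # (3, 3)
print(((1,) * RepeatableItem("x")).repeat_range)     # (1, None) -> one or more
print(((2, 4) * RepeatableItem("x")).repeat_range)   # (2, 4)
```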
10.7.4. Overloaded operator API

The basic objects that make up rule definitions are Item objects, ItemList objects, and CaseList objects. The latter two are just list-like objects with most of the list operations overloaded. An ItemList only holds Item instances, and a CaseList only holds ItemList instances. These objects do not nest, though, and so there are some important differences from ordinary Python lists.
The ItemList and CaseList classes are basically designed for concatenation operations, since that is what is used to build up production rule expressions. They are "flat" in the sense that each holds only one kind of object and does everything it can to convert an object to that type before saving it on the list. An ItemList instance holds Item instances and a CaseList instance holds ItemList instances. Because of the latter property, Case is defined as an alias for ItemList. Both of the classes take an arbitrary number of arguments in their constructors. All the arguments are converted to elements of the single type that they hold. The new instance then initially contains all those converted arguments as its elements. When an ItemList is passed another ItemList in its initializer argument list it just takes the elements of that list and extends its own list with them. The CaseList class works similarly.
So the initializers basically form the concatenation of all the passed-in arguments, after converting each one to the type of object that the list-like object holds (each holds only one type of object). The addition and “or” operations are essentially shorthand for putting both operands on an initializer list of the appropriate return type. Appending to a list works the same except it gives the in-place result of appending that item after converting it.
Summary of the operations:

- Addition: Two Item and/or ItemList instances can be combined with +, which always returns an ItemList. The operation is the same as if the operands had both been on the initializer list for ItemList. The += operator is also defined. The addition operator is purposely not defined for CaseList objects, in order to catch some syntax errors in expressions (although there are ordinary add and iadd methods).
- Case joining: The "or" operation | is defined for Item, ItemList, and CaseList instances. It always returns a CaseList. The operands are joined together as if they had been arguments to the initializer of a CaseList.
- Tokens in expressions: The + and | operators are defined for tokens (in the PrattParser module) to behave in the same way as for Item instances. This allows tokens to be used directly, without having to convert them into Item instances by wrapping them in the Tok function.
- Other list-like operations: The methods append and insert are defined for these list-like classes. They convert their argument to the correct type for the container and then append it to the list in-place. Indexing of these list-like objects is also supported, including negative indices and slices. This allows them to be iterated over. They also have a len, so they can be tested for emptiness.
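The flat-list design and the operator shorthands described above can be sketched with simplified stand-in classes (these are illustrative only, not the typped classes, and omit the token conversions and error checking):

```python
class Item:
    """A single grammar element (stand-in for the typped Item class)."""
    def __init__(self, label):
        self.label = label
    def __add__(self, other):
        return ItemList(self, other)   # + always builds an ItemList
    def __or__(self, other):
        return CaseList(self, other)   # | always builds a CaseList

class ItemList:
    """Flat list holding only Item instances; never nests."""
    def __init__(self, *args):
        self.items = []
        for a in args:
            if isinstance(a, ItemList):
                self.items.extend(a.items)      # absorb elements, stay flat
            elif isinstance(a, Item):
                self.items.append(a)
            else:
                raise TypeError("cannot convert argument to an Item")
    def __add__(self, other):
        return ItemList(self, other)
    def __or__(self, other):
        return CaseList(self, other)

class CaseList:
    """Flat list holding only ItemList instances (one per case)."""
    def __init__(self, *args):
        self.cases = []
        for a in args:
            if isinstance(a, CaseList):
                self.cases.extend(a.cases)      # absorb cases, stay flat
            else:
                self.cases.append(ItemList(a))  # convert Item/ItemList to a case
    def __or__(self, other):
        return CaseList(self, other)

# + binds tighter than |, so this evaluates as (a + b) | c:
expr = Item("a") + Item("b") | Item("c")
print(len(expr.cases))                          # 2 cases
print([i.label for i in expr.cases[0].items])   # ['a', 'b']
print([i.label for i in expr.cases[1].items])   # ['c']
```

Note how the second print confirms the precedence behavior described in the note below: the addition runs first, so the | operator receives a finished ItemList as its left operand.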
Note

Since the + operator has a higher precedence than the | operator, all the additions within a case will always be carried out before any "or" operations. So each argument to | will be either a single token, a single Item, or a single ItemList.
Note that after a full expression containing these objects and operators is evaluated, the resulting r.h.s. object (which is set to the l.h.s. variable name for a production rule) can be 1) a single token, 2) a single Item, 3) a single ItemList, or 4) a CaseList. The compile method of a Grammar instance will always convert the value into a CaseList instance. (It would be possible to overload the <<= operator and use it instead of = to do the conversion automatically, but that does not seem worth the extra notation and boilerplate.)
10.7.5. Modifiers for items

Items can have several begin/end modifiers to indicate when special processing starts or ends. These are stored in a list attribute called begin_end_modifiers. End modifiers are always the string ")". Begin modifiers can be any of these (set by the corresponding function applied to the Item or ItemList):
"Opt("
"OneOrMore("
"ZeroOrMore("
10.7.6. Operator precedences expressed in grammar

This is not yet implemented, but it will mimic the way a Pratt parser works. In the EBNF-like grammar, precedences are defined using index-style brackets:
Tok("k_plus")[30]
The above would give a precedence of 30 to the token.
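Index-style brackets like this can be implemented by overloading __getitem__ so that subscripting records the precedence and returns the item. A minimal sketch, using a hypothetical stand-in class rather than the typped implementation:

```python
class PrecItem:
    """Stand-in item showing how item[prec] can set an operator precedence."""
    def __init__(self, label):
        self.label = label
        self.prec = 0              # default precedence

    def __getitem__(self, prec):
        # item[30] calls __getitem__(30); record it and return the item
        # so the result can still be used inside a rule expression.
        self.prec = prec
        return self

plus = PrecItem("k_plus")[30]
print(plus.prec)   # 30
```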
10.7.7. Optimizing the grammar

This section discusses possible optimizations to the grammar. Predictive parsing is close to being fully implemented, but it is not fully set up yet.
10.7.7.1. predictive parsing

To optimize the parsing of a recursive-descent grammar, many grammars allow the use of predictive parsing, which requires no backtracking. Even when fully predictive parsing is not possible, partial predictive parsing can often make the algorithm more efficient (falling back to backtracking search only when necessary).

To use a predictive parse you need to generate a first set for each nonterminal (i.e., each recursive rule call) in the grammar.
When epsilon productions are allowed, follow sets are also needed; they play a role similar to first sets.
See, e.g.:
- http://faculty.ycp.edu/~dhovemey/fall2010/cs340/lecture/lecture9.html
- http://www.csd.uwo.ca/~moreno//CS447/Lectures/Syntax.html/node12.html
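The first sets mentioned above can be computed by iterating to a fixed point. Here is a self-contained sketch over a toy grammar representation (a dict of nonterminal to list of cases, where an empty case stands for an epsilon production); this is not the typped internal algorithm:

```python
def first_sets(rules):
    """Compute the FIRST set of every nonterminal; "" denotes epsilon."""
    first = {nt: set() for nt in rules}
    changed = True
    while changed:                       # iterate until nothing new is added
        changed = False
        for nt, cases in rules.items():
            for case in cases:
                nullable_prefix = True   # whole prefix so far can match empty
                for item in case:
                    if item in rules:    # nonterminal: use its FIRST set
                        add = first[item] - {""}
                        nullable = "" in first[item]
                    else:                # terminal: just the symbol itself
                        add, nullable = {item}, False
                    if not add <= first[nt]:
                        first[nt] |= add
                        changed = True
                    if not nullable:
                        nullable_prefix = False
                        break            # this item can't vanish; stop scanning
                if nullable_prefix and "" not in first[nt]:
                    first[nt].add("")    # every item (or none) can be empty
                    changed = True
    return first

# arglist ::= arg "," arglist | arg ;  arg ::= "x" | "(" arglist ")"
rules = {"arglist": [["arg", ",", "arglist"], ["arg"]],
         "arg": [["x"], ["(", "arglist", ")"]]}
print(sorted(first_sets(rules)["arglist"]))   # ['(', 'x']
```

With these sets, a predictive parser can pick the right case of arglist by looking only at the next token, instead of trying cases and backtracking.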
Packrat parsing may also be worth considering.
10.7.7.2. grammar transformations

Not implemented; just an idea for now, but any number of grammar transformations could be performed on the rules of a Grammar object.

One possibility is to remove at least trivial left recursion, simply by changing the ordering of any obvious left-recursive cases.

A more complicated transformation would be to left-factor the grammar and eliminate the left recursion entirely, but that isn't likely to happen.

Curtailment of left recursion is also worth considering: if a rule is going to repeat, will it repeat within n levels, where n is the number of rules? What is the limit, etc.? See some of the articles on the subject and consider it if it is not too much work.
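For reference, the standard textbook transformation for direct left recursion rewrites A -> A a | b as A -> b A' with A' -> a A' | epsilon. A minimal sketch over the same toy grammar format used above (a dict of nonterminal to list of cases); this is not part of typped:

```python
def remove_direct_left_recursion(rules, nt):
    """Rewrite the cases of nt to remove direct left recursion, if any.

    Grammar format: dict mapping nonterminal -> list of cases, where each
    case is a list of symbols and an empty case [] is an epsilon production.
    """
    recursive = [c[1:] for c in rules[nt] if c and c[0] == nt]   # A -> A a
    base = [c for c in rules[nt] if not c or c[0] != nt]         # A -> b
    if not recursive:
        return rules                   # nothing to do
    new_nt = nt + "'"                  # fresh helper nonterminal A'
    rules[nt] = [c + [new_nt] for c in base]                     # A  -> b A'
    rules[new_nt] = [c + [new_nt] for c in recursive] + [[]]     # A' -> a A' | eps
    return rules

# expr -> expr "+" term | term
rules = {"expr": [["expr", "+", "term"], ["term"]]}
print(remove_direct_left_recursion(rules, "expr"))
# {'expr': [['term', "expr'"]], "expr'": [['+', 'term', "expr'"], []]}
```

Note that the result uses an epsilon production, so this transformation only pays off once epsilon handling is implemented in the backend.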
10.7.8. Code
class typped.ebnf_classes_and_operators.Grammar(*args, **kwargs)[source]

    Bases: object

    An object representing a context-free grammar. It is basically a dict of caselists indexed by nonterminal labels. Provides various methods for processing the caselists.

    compile(start_nonterm_label, parser, locals_dict, register=True)[source]

        Create the Pratt parser handlers in parser to parse the current grammar.

        The start_nonterm_label is the starting nonterminal. Only rules which are reachable from the rule cases for this starting nonterminal will be processed.

        The parser is a PrattParser instance.

        The locals_dict should be passed locals=locals(). If you also need globals then you have to merge the locals() and globals() dicts (with locals overwriting) and pass that dict instead.

        If register is true the rules are registered with the PrattParser instance parser to enable it to parse the grammar.
class typped.ebnf_classes_and_operators.Item(value=None)[source]

    Bases: object

    Class representing the basic elements that make up the cases of the production rules.

class typped.ebnf_classes_and_operators.ItemList(*args)[source]

    Bases: object

    A list of Item instances.
class typped.ebnf_classes_and_operators.CaseList(*args)[source]

    Bases: object

    A list of Case objects. Note, though, that a single Item or ItemList can also be a case (when there are no "or" operations to form the case).

    add(other)[source]

        Add a CaseList to some other object, which must be convertible to a CaseList. This method is purposely not overloaded with the operator + because that operator is used in the production rule strings for ItemList objects, but in that context it is an error when applied to CaseList objects.
typped.ebnf_classes_and_operators.Rule(nonterm_label)[source]

    Return an Item to represent the nonterminal with the string label nonterm_label.

typped.ebnf_classes_and_operators.Tok(token)[source]

    Turn a token into an item. Used before overloading is defined on tokens. Can be passed a token object or a string token label.

typped.ebnf_classes_and_operators.Root(item_init_arg, prec=None)[source]

    A function to designate that the token for the item should be made into the root of the resulting parse subtree, if possible.

typped.ebnf_classes_and_operators.Prec(item_init_arg, prec)[source]

    Set the operator precedence when called from a tail handler. Can only wrap an item.

typped.ebnf_classes_and_operators.Sig(item_init_arg, type_sig)[source]

    Set the type signature for an item.

typped.ebnf_classes_and_operators.Pratt(pstate=None, type_sig=None)[source]

    Use an ordinary Pratt parser recursive_parse to get a subexpression. The parameter pstate is a state that will be temporarily pushed on the pstate_stack during the parsing (which can be used as a precondition). The optional type signature can also be set to be checked.

typped.ebnf_classes_and_operators.Opt(*args)[source]

    List of optional arguments; can match any one of them or none.

typped.ebnf_classes_and_operators.Repeat(range_spec, arg)[source]

    Used to process the overload of multiplication for repetition.

typped.ebnf_classes_and_operators.Hide(itemlist)[source]

    Do not show the items in the final tree. For example, parentheses can be ignored in function argument lists.

typped.ebnf_classes_and_operators.raise_if_not(instanceof_list, issubclass_list, operand_or_arg, calling_instance, operator_or_method_string, kind='op')[source]

    Error-checking routine called from overloaded operators. If operand_or_arg is not an instance of a class in instanceof_list or a subclass of a class in issubclass_list then raise a ParserGrammarRuleException with a helpful error message.

    If kind is "op" the message is for an operator. If it is "method" then the message is for a method.

typped.ebnf_classes_and_operators.print_indented_caselist(string_before, caselist)[source]

    Print routine for debugging.

exception typped.ebnf_classes_and_operators.ParserGrammarRuleException[source]

    Bases: typped.shared_settings_and_exceptions.ParserException

    Exception raised by grammar classes.