11. Appendix A: Possible (but currently unimplemented) generalizations
The following subsections discuss some possible generalizations which are not currently implemented.
11.1. Modifiable token precedence values
Currently the PrattParser class has the restriction that every token must have a fixed precedence. Even under different preconditions the precedence must be the same. This is not a huge restriction, but there is the question of how it could be removed. A reasonable generalization is to define precedence values to be attributes of tail handler functions or, in the Typped case, attributes of constructs that hold tail handler functions. (Recall that precedences have no effect on head handler parsing.)
Assume each tail handler function can have an arbitrary precedence. In this case different preconditions can cause handler functions with different precedences to be dispatched.
The implementation problem arises in the recursive_parse function. In the loop for tail handlers there is always a lookahead to the precedence of the next token (which is compared to the subexpression precedence). So we need to find the precedence of the peek(1) token. Precedence can vary according to which tail handler (construct) would be dispatched, so we need to determine that. But which construct “wins” the dispatching can vary according to the conditions at the time when that peek token itself is processed.
There are ways to get a reasonable version of this, subject to some limitations. You essentially step the lexer ahead to approximate the time when the peek token would be processed, look up the handler/construct, get its precedence value, and then step the lexer back. This adds complexity and an extra go_back operation per subexpression for a feature that would probably be little-used. It is not currently implemented (but there is commented-out code that could be adapted to do it). Interactions with other features, such as jops, would also need to be considered.
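For concreteness, the step-ahead idea might look roughly like the following sketch. The mini-lexer and the dispatching function here are hypothetical stand-ins for Typped's actual lexer and construct lookup; only the step-ahead-and-go_back pattern itself is the point.

    class MiniLexer:
        """Toy lexer over a pre-tokenized list, with a go_back operation."""
        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0
        def next(self):
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok
        def peek(self, n=1):
            return self.tokens[self.pos + n - 1]
        def go_back(self, n=1):
            self.pos -= n

    def dispatch_tail_construct(lex, current_tok):
        """Stand-in for construct dispatching: the precedence of '-' here
        depends on a precondition (whether the next token is a number)."""
        if current_tok == "-" and lex.peek().isdigit():
            return 5      # binary minus wins under this precondition
        return 1          # some other construct/precedence wins

    def peek_precedence(lex):
        """Find the precedence the peek(1) token would have when actually
        processed, by stepping the lexer ahead and then back."""
        tok = lex.next()                         # step ahead to the peek token
        prec = dispatch_tail_construct(lex, tok)
        lex.go_back(1)                           # the extra go_back per subexpression
        return prec

    lex = MiniLexer(["-", "2", "+"])
    print(peek_precedence(lex))   # 5, and the lexer position is unchanged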
11.2. Subexpression lookahead
In this generalization you would be able to use lookahead to the next subexpression, not just the next token. The advantage of this kind of lookahead is that infix operator overloading can then depend on the types (or other properties) of both of the fully-resolved operands, not just the left operand and the raw lookahead tokens.
Preconditioning on a token lookahead just after a subexpression lookahead could be one way (not the best way) to resolve things like ternary operations where the first operator is also an operator by itself: x ? y versus x ? y : z. Similarly, an if-then with optional else could be resolved that way: if <test> then <action> versus if <test> then <action> else <other-action>.
The tail handler for processing the first operator can then be chosen depending on the token type two tokens ahead.
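As a small self-contained illustration of that choice (using a plain token list rather than Typped's lexer, and assuming the middle operand is a single token so the ':' is exactly two tokens ahead):

    def pick_question_handler(tokens, i):
        """Choose a tail handler for the '?' at index i by peeking two
        tokens ahead, where the ':' appears in the full ternary form."""
        two_ahead = tokens[i + 2] if i + 2 < len(tokens) else None
        return "ternary_handler" if two_ahead == ":" else "binary_handler"

    print(pick_question_handler(["x", "?", "y", ":", "z"], 1))  # ternary_handler
    print(pick_question_handler(["x", "?", "y"], 1))            # binary_handler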
This would be a useful feature, but it would obviously be more expensive since full subexpressions would have to be provisionally parsed. It has the potential to interact with other features, such as the jop feature. So the implementation would need to be carefully considered. Some limited form of this is likely to be implemented at some point, if only to allow for full overloading on types.
The required number of go_back lexer operations on failure could be limited by allowing preconditions functions to specify a subexpression lookahead, preferably applied only after the constraints based on already-available data are satisfied. You can already do something similar inside a precondition or handler function if you really must: save the state of the lexer, call recursive_parse to get a subexpression, and then restore the state of the lexer.
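A rough sketch of that save/parse/restore pattern is below. The toy lexer and the parse_atom function stand in for the real lexer state-saving and recursive_parse; only the shape of the pattern mirrors the description above.

    class ToyLexer:
        """Toy lexer whose entire state is just a position index."""
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0
        def next(self):
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok
        def save_state(self):
            return self.pos
        def restore_state(self, saved):
            self.pos = saved

    def parse_atom(lex):
        """Stand-in for recursive_parse: consumes one 'subexpression'."""
        return lex.next()

    def subexpression_lookahead(lex):
        """Provisionally parse the next subexpression, then rewind."""
        saved = lex.save_state()
        try:
            return parse_atom(lex)
        finally:
            lex.restore_state(saved)    # always rewind, even on error

    lex = ToyLexer(["2", "*", "3"])
    print(subexpression_lookahead(lex))  # '2'
    print(lex.next())                    # still '2': the lexer was rewound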
11.3. Constructs with multiple possible trigger tokens
The current implementation follows the Pratt parsing scheme of having handlers associated with triggering tokens, as well as with the head or tail position. In general, looking at the current kind of token could instead be implemented as part of the preconditions functions.
This would be less efficient in the sense that many more preconditions functions would potentially need to be run sequentially to find the handler for a given token. (The current implementation uses a tree to quickly narrow down to a smaller set of constructs, based on the head or tail property and the token label.) There are, however, optimizations which could be applied to make this more efficient. Preconditions functions would also become more complex, since every construct which is based on the current kind of token (formerly the triggering token) would need to explicitly specify which kinds of tokens trigger it in some way.
The current setup follows Pratt parsing more closely, and the equivalent to the above is to declare multiple constructs, one for each kind of triggering token. A null-string token could also be used to do something similar.
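As a self-contained sketch of the current scheme that this generalization would relax: constructs are indexed by the head-or-tail property and the token label, and only that small set of preconditions functions is tried in priority order. The dictionary and data below are illustrative stand-ins, not Typped's actual structures.

    # Index constructs by (head-or-tail, token label); each entry is an
    # ordered list of (preconditions function, construct name) pairs.
    constructs = {
        ("tail", "k_question"): [
            (lambda toks, i: ":" in toks[i:], "ternary_construct"),
            (lambda toks, i: True,            "binary_question_construct"),
        ],
    }

    def dispatch(kind, label, toks, i):
        """Run only the preconditions registered under (kind, label)."""
        for precond, name in constructs.get((kind, label), []):
            if precond(toks, i):
                return name
        raise LookupError("no matching construct")

    print(dispatch("tail", "k_question", ["x", "?", "y", ":", "z"], 1))
    # -> ternary_construct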
11.4. More complex types
Generally we might want:
- Hierarchies of types and subtypes, with reasonable notions of type equivalence defined (a minimal sketch follows this list).
- Unions.
- Automatic conversions.
- Parameterized types or templates.
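As a minimal sketch of the first item, under the assumption of a simple single-parent hierarchy (illustrative only, not a design the package specifies):

    class SimpleType:
        """A named type with an optional parent type."""
        def __init__(self, name, parent=None):
            self.name, self.parent = name, parent
        def is_subtype_of(self, other):
            """Walk up the parent chain to test the subtype relation."""
            t = self
            while t is not None:
                if t is other:
                    return True
                t = t.parent
            return False

    number = SimpleType("number")
    integer = SimpleType("integer", parent=number)
    print(integer.is_subtype_of(number))   # True: 'integer' can match 'number'
    print(number.is_subtype_of(integer))   # False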
11.5. Floating point precedence values
The builtin functions that set right associativity currently assume that the precedence values are ints. The examples also use ints. As explained below, this is not strictly required. Limited precision floating point values already work in the current framework, provided a different value is subtracted to get right associativity.
Notice that in the while loop of recursive_parse the comparison is:
while lex.peek().prec() > subexp_prec:
Assume that precedence values are ints. Consider a tail handler for, say, infix * with a precedence of 5 and left associativity. Say we process 2*2*2. The head handler for the full expression calls recursive_parse to process the first * (with 2 as the processed_left value).
For left associativity the subexpression precedence of 5 is passed to this recursive_parse call. When the loop in that call of recursive_parse peeks at the second * token, with a precedence of 5, it breaks and returns because 5 > 5 is false.
If instead the subexpression precedence had been 4, for right associativity, the peek would again see the second * token with a precedence of 5, but since 5 > 4 the loop would continue. It continues until it peeks a token whose precedence is not strictly greater than 4, and then it breaks.
Notice that in the latter case the behavior with respect to peeking a token with token precedence of 4 is still the same as in the first case. The subexpression precedence for right associativity just needs to be less than 5 and greater than or equal to the next lowest precedence value (which in this case is 4 because we assumed ints).
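The walkthrough condenses into a runnable toy version of the loop. This is an illustrative reimplementation with a single operator of precedence 5, not Typped's actual recursive_parse; the point is only how the subexpression precedence passed to the recursive call flips associativity.

    OP_PREC = 5   # precedence of the only operator, '*'

    def parse_expr(tokens, pos, subexp_prec, right_assoc=False):
        """Parse tokens like ['2', '*', '2', '*', '2'] into nested tuples."""
        left, pos = tokens[pos], pos + 1
        # Tail loop: continue while the peeked operator's precedence is
        # strictly greater than the subexpression precedence.
        while pos < len(tokens) and OP_PREC > subexp_prec:
            op, pos = tokens[pos], pos + 1
            # Left associativity passes the operator's own precedence down;
            # right associativity passes a slightly smaller value.
            next_subexp = OP_PREC - 1 if right_assoc else OP_PREC
            right, pos = parse_expr(tokens, pos, next_subexp, right_assoc)
            left = (op, left, right)
        return left, pos

    toks = ["2", "*", "2", "*", "2"]
    print(parse_expr(toks, 0, 0)[0])                    # ('*', ('*', '2', '2'), '2')
    print(parse_expr(toks, 0, 0, right_assoc=True)[0])  # ('*', '2', ('*', '2', '2'))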
Precedences are only used in comparisons, and the only arithmetic on precedences is subtracting from a precedence value to get a subexpression precedence that is smaller, but not too small. This means that we could equally well have used 5 - 0.1 as the subexpression precedence in the latter case of right associativity.
In general any kind of objects can be used for precedences, provided comparisons work correctly for them and there is a way to get a slightly smaller value that is still greater than or equal to the next smaller precedence value. In particular, precedences can be floating point numbers restricted in precision to some number of digits. If we restrict to three digits of precision then precedences like 4.333 and 2.111 are allowed. To get the slightly lower value for right associativity just subtract 0.00001 instead of 1.
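A quick check of the arithmetic with the three-digit values from the text:

    OP_PREC = 4.333
    RIGHT_SUBEXP = OP_PREC - 0.00001     # subtract 0.00001 instead of 1
    print(OP_PREC > RIGHT_SUBEXP)        # True: an equal-precedence operator recurses
    print(RIGHT_SUBEXP >= 4.332)         # True: still >= the next lower 3-digit value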
This kind of thing is easy to implement, and has been tested, but is it a good idea? As of now the Typped builtins that set right associativity assume precedences are ints. When tweaking precedences during development, float precedences might sometimes be useful. There is a slight loss of efficiency in the comparison operations when floats are involved, but probably not enough to be a problem. (If this is implemented later it would still be backward compatible with using ints.)
As a possible alternative, just after precedence values are defined and passed to def_construct they could always be multiplied by, say, 1000 and then rounded to an int. Then internally the representation would be as ints, but to the user they would look like limited-precision floats. Subtracting one for right associativity still works, and all parse-time comparisons are of ints. Any error messages would need to convert the values back, however. An exception could be raised if the rounding changed the value or if the resulting int would overflow.
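A hedged sketch of that conversion is below; the function name and the exact checks are assumptions for illustration (overflow is not an issue for Python ints, though it could be in other settings).

    def internal_prec(user_prec, scale=1000):
        """Convert a limited-precision float precedence to an internal int,
        raising if rounding would change the value."""
        scaled = user_prec * scale
        as_int = round(scaled)
        if abs(scaled - as_int) > 1e-9:   # rounding changed the value
            raise ValueError("precedence {} has too many digits".format(user_prec))
        return as_int

    print(internal_prec(4.333))   # 4333; subtracting 1 still gives right assoc
    print(internal_prec(5))       # 5000; plain int precedences still work
    # internal_prec(4.3333) would raise ValueError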