11. Appendix A: Possible (but currently unimplemented) generalizations

The following subsections discuss some possible generalizations which are not currently implemented.

11.1. Modifiable token precedence values

Currently the PrattParser class has the restriction that every token must have a fixed precedence: even when different preconditions are in effect, the precedence must be the same. This is not a huge restriction, but there is the question of how it could be removed. A reasonable generalization is to define precedence values to be attributes of tail handler functions or, in the Typped case, attributes of the constructs that hold tail handler functions. (Recall that precedences have no effect on head handler parsing.)

Assume each tail handler function can have an arbitrary precedence. In this case different preconditions can cause handler functions with different precedences to be dispatched.

The implementation problem arises in the recursive_parse function. In the loop for tail handlers there is always a lookahead to the precedence of the next token (which is compared to the subexpression precedence). So we need to find the precedence of the peek(1) token. Precedence can vary according to which tail handler (construct) would be dispatched, so we need to determine that. But which construct “wins” the dispatching can vary according to the conditions at the time when that peek token itself is processed.

There are ways to get a reasonable version of this, subject to some limitations. You essentially step the lexer ahead to approximate the time when the peek token would be processed, look up the handler/construct, get its precedence value, and then step the lexer back. This adds complexity and an extra go_back operation per subexpression for a feature that would probably be little-used. It is not currently implemented (but there is currently commented-out code that could be adapted to do it). Interactions with other features, such as jops, would also need to be considered.
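
A minimal sketch of the step-ahead idea follows. The lexer operations and dispatch lookup shown here (next, go_back, and a dispatch_tail_construct helper) are hypothetical stand-ins for the real internals, not the actual Typped API:

    def peek_precedence(lex, dispatch_tail_construct):
        # Step the lexer ahead so the peek(1) token becomes the current
        # token, approximating the conditions when it will be processed.
        lex.next()
        try:
            # Run the dispatcher to find which construct would win under
            # these conditions; with this generalization the construct
            # carries the precedence value.
            construct = dispatch_tail_construct(lex)
            return construct.prec
        finally:
            # The extra go_back operation per subexpression noted above.
            lex.go_back()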

11.2. Subexpression lookahead

In this generalization you would be able to use lookahead to the next subexpression, not just the next token. The advantage of this kind of lookahead is that infix operator overloading can then depend on the types (or other properties) of both of the fully-resolved operands, not just the left operand and the raw lookahead tokens.

Preconditioning on a token lookahead just after a subexpression lookahead could be one way (not the best way) to resolve things like ternary operations where the first operator is also an operator by itself: x ? y versus x ? y : z. Similarly, an if-then with optional else could be resolved that way: if <test> then <action> versus if <test> then <action> else <other-action>. The tail handler for processing the first operator can then be chosen based on the token just past the lookahead subexpression (which, when the middle operand is a single token, is simply the token two ahead).
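
As an illustration, a pair of preconditions functions distinguishing the two ternary forms by a raw two-token lookahead might look like the sketch below (the argument signature and token labels are hypothetical, and this simple version assumes the middle operand is a single token):

    def full_ternary_precond(tok, lex):
        # Fire the 'x ? y : z' construct only when a colon follows the
        # middle operand (here assumed to be a single token).
        return lex.peek(2).token_label == "k_colon"

    def bare_question_precond(tok, lex):
        # Otherwise fire the plain 'x ? y' construct.
        return lex.peek(2).token_label != "k_colon"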

This would be a useful feature, but it would obviously be more expensive since full subexpressions would have to be provisionally parsed. It has the potential to interact with other features, such as the jop feature. So the implementation would need to be carefully considered. Some limited form of this is likely to be implemented at some point, if only to allow for full overloading on types.

The required number of go_back lexer operations on failure could be limited by allowing preconditions functions to specify a subexpression lookahead, preferably evaluated only after the constraints based on already-available data are satisfied. You can already do something similar inside a precondition or handler function if you really must: save the state of the lexer, call recursive_parse to get a subexpression, and then restore the state of the lexer.
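
A rough sketch of that manual technique, assuming the lexer provides some way to save and restore its state (the names save_state and restore_state, and the recursive_parse calling convention, are illustrative):

    def provisional_subexpression(parser, lex, subexp_prec):
        # Provisionally parse the next subexpression so its resolved
        # types (or other properties) can be inspected, then rewind the
        # lexer as if nothing had been consumed.
        saved_state = lex.save_state()
        try:
            subtree = parser.recursive_parse(lex, subexp_prec)
        finally:
            lex.restore_state(saved_state)
        return subtree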

11.3. Constructs with multiple possible trigger tokens

The current implementation follows the Pratt parsing scheme of having handlers associated with triggering tokens, as well as with the head or tail position. In general, checking the kind of the current token could instead be implemented as part of the preconditions functions.

This would be less efficient in the sense that many more preconditions functions would potentially need to be run sequentially to find the handler for a given token. (The current implementation uses a tree to quickly narrow the search to a smaller set of constructs, based on the head or tail property and the token label.) There are, however, optimizations which could be applied to make this more efficient. Preconditions functions would also become more complex, since every construct which depends on the current kind of token (formerly the triggering token) would need to explicitly specify, in some way, which kinds of tokens it applies to.

The current setup follows Pratt parsing more closely, and the equivalent of the above is to declare multiple constructs, one for each kind of triggering token, as sketched below. A null-string token could also be used to do something similar.
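
Schematically, registering one construct per triggering token for a shared handler could be wrapped in a small helper like the one below (the def_construct arguments shown are schematic, not the real signature):

    def def_shared_tail(parser, handler, token_labels, prec):
        # Register the same tail handler once per kind of triggering
        # token; the def_construct arguments here are schematic.
        for token_label in token_labels:
            parser.def_construct("tail", handler, token_label, prec=prec)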

11.4. More complex types

Generally we might want:

  • Hierarchies of types and subtypes, with reasonable notions of type equivalence defined.
  • Unions.
  • Automatic conversions.
  • Parameterized types or templates.

11.5. Floating point precedence values

The builtin functions that set right associativity currently assume that the precedence values are ints. The examples also use ints. As explained below, this is not strictly required. Limited precision floating point values already work in the current framework, provided a different value is subtracted to get right associativity.

Notice that in the while loop of recursive_parse the comparison is:

    while lex.peek().prec() > subexp_prec:

Assume that precedence values are ints. Consider a tail handler for, say, infix * with a precedence of 5 and left associativity. Say we process 2*2*2. The head handler for the full expression calls recursive_parse to process the first * (with 2 as the processed_left value).

For left associativity the subexpression precedence of 5 is passed to this recursive_parse call. When the loop in that call of recursive_parse peeks at the second * token, with a precedence of 5, it breaks and returns because 5 > 5 is false.

If instead the subexpression precedence had been 4, for right associativity, the peek would again see the second * token with a precedence of 5, but since 5 > 4 the loop would continue. It continues until it peeks a token whose precedence is not strictly greater than 4, and then it breaks.

Notice that in the latter case the behavior on peeking a token with a precedence of 4 is still the same as in the first case: the loop breaks. The subexpression precedence for right associativity just needs to be less than 5 and greater than or equal to the next lowest precedence value (which in this case is 4 because we assumed ints).
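
The loop test itself can be checked with a few lines of plain Python that mimic just the comparison in recursive_parse:

    def loop_continues(peek_prec, subexp_prec):
        # Mimics the test: while lex.peek().prec() > subexp_prec.
        return peek_prec > subexp_prec

    # Left associativity: subexpression precedence 5 is passed down.
    assert not loop_continues(5, 5)  # Second '*' breaks: (2*2)*2.

    # Right associativity: subexpression precedence lowered to 4.
    assert loop_continues(5, 4)      # Second '*' continues: 2*(2*2).
    assert not loop_continues(4, 4)  # A precedence-4 token still breaks.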

Precedences are only used in comparisons, and the only arithmetic on precedences is subtracting from a precedence value to get a subexpression precedence that is smaller, but not too small. This means that we could equally well have used 5 - 0.1 as the subexpression precedence in the latter case of right associativity.

In general any kind of objects can be used for precedences, provided comparisons work correctly for them and there is a way to get a slightly smaller value that is still greater than or equal to the next smaller precedence value. In particular, precedences can be floating point numbers restricted to some fixed number of decimal places. If we restrict to three decimal places then precedences like 4.333 and 2.111 are allowed. To get the slightly lower value for right associativity just subtract some value smaller than the 0.001 gap between representable precedences, say 0.00001, instead of 1.
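
For example, with three decimal places the same comparison behaves correctly with float values (plain Python, mirroring the loop test above):

    prec = 4.333                       # A three-decimal-place precedence.
    right_subexp_prec = prec - 0.00001

    assert not (prec > prec)                 # Left associativity: break.
    assert prec > right_subexp_prec          # Right associativity: continue.
    assert not (4.332 > right_subexp_prec)   # Next-lower precedence breaks.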

This kind of thing is easy to implement, and has been tested, but is it a good idea? As of now the Typped builtins that set right associativity assume precedences are ints. When tweaking precedences during development, float precedences might sometimes be useful. There is a slight loss of efficiency in the comparison operations when floats are involved, but probably not enough to be a problem. (If this is implemented later it would still be backward compatible with using ints.)

As a possible alternative, just after precedence values are defined and passed to def_construct they could always be multiplied by, say, 1000 and then rounded to an int. Then internally the representation would be as ints, but to the user they would look like limited-precision floats. Subtracting one for right associativity still works, and all parse-time comparisons are of ints. Any error messages would need to convert the values back, however. An exception could be raised if the rounding changed the value or if the resulting int would overflow.
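
A sketch of that internal conversion, with the rounding check described above (the function names are illustrative; overflow is not a concern for Python ints but could matter in other implementations):

    SCALE = 1000  # Three decimal places of user-visible precision.

    def prec_to_int(prec):
        # Convert a user-supplied precedence to the internal int form,
        # raising if rounding would silently change the value.
        scaled = round(prec * SCALE)
        if abs(scaled - prec * SCALE) > 1e-9:
            raise ValueError("precedence {} has more than three decimal"
                             " places".format(prec))
        return scaled

    def prec_to_user(scaled):
        # Convert back, e.g. for error messages.
        return scaled / SCALE

    assert prec_to_int(4.333) == 4333
    assert prec_to_user(prec_to_int(4.333) - 1) == 4.332  # Right assoc.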