10.10. typped.matcher

A Matcher object for a TokenTable, which the Lexer uses for looking up tokens.

10.10.1. Using the matcher

The default matcher, simply called "python" uses individual Python regexes for the tokens, stored in a list. This is to implement the “longest not first” matching semantics. There are several alternative matchers which can be specified, or users can define their own as described in a later section.

To select a different matcher to save a pattern in you can set the matcher_options keyword argument in the call to def_token. That value is passed on to the matcher’s insert_pattern method. With the default matchers different save options can be chosen for different tokens. Each matcher with defined entries will be checked when looking for matches. The default matcher to use can be changed by setting the default_insert_options attribute of the matcher instance.

One alternative matcher uses Python regexes but with “first not longest” semantics. This makes it more efficient on lookup, but causes order-dependence in the definition of tokens. (Additions and deletions are more expensive because it builds one large regex.) To select it use matcher_options="python_fnl".

There are also hybrid matchers, but those are still experimental. They are an attempt to implement more-efficient longest-not-first matching. They are hybrid matchers that store simple patterns in either a trie or a "python_fnl" matcher, reverting to the default “python” matcher for more-complex patterns.

There is also an experimental matcher that uses only the RegexTrieDictScanner. The regex language it accepts is different from the Python regex language, though. (The regex-trie-dict package is not currently up on GitHub.)

The Python matcher has good insert and delete times, but can become inefficient for large numbers of patterns. The trie is less efficient for small numbers of patterns, but can search many simple patterns quickly in parallel by just going down the trie. Simple patterns are literal matches with no special characters or patterns containing only character-range wildcards. More complex patterns in the trie matcher are still experimental.

10.10.2. Using a custom matcher

It is possible for users to define their own matchers. For example, the current matcher always takes the longest match. More-efficient matchers can be implemented which instead take the prefix which matches the first-defined pattern (so more-general patterns are usually defined first). The Typped package tries to avoid this for semantic clarity and to avoid order-dependence, but there is no technical reason it cannot be done.

A matcher class needs to have the following methods:

  • insert_pattern – insert a labeled regex pattern and priority into the matcher
  • remove_pattern – remove an inserted pattern
  • get_next_token_label_and_value – return the label and value of the next match

And the following attributes:

  • ignore_tokens – the set of token labels which are defined as ignored

The classes need to have the signatures of their corresponding methods below. Note that the insert_pattern method can take any options it wants in the options argument, and can it do whatever it chooses based on the value.

Currently to switch matchers completely you need to 1) define a new matcher class, 2) get an empty instance of that class, 3) create an empty TokenTable class with that matcher passed to the initializer as the pattern_matcher_instance argument, 4) pass that token table instance as the token_table` paremeter of a new Lexer instance, and the 5) pass that lexer instance as the lexer argument to the initializer of a new PrattParser class. This could be automated by subclassing PrattParser and doing it in the __init__ method.

pattern_matcher = NewMatcher()
token_table = TokenTable(pattern_matcher=pattern_matcher)
lexer = Lexer(token_table=token_table)
parser = PrattParser(lexer=lexer)

10.10.3. Code

class typped.matcher.TokenPatternTuple(regex_string, compiled_regex, on_ties)

Bases: tuple

compiled_regex

Alias for field number 1

on_ties

Alias for field number 2

regex_string

Alias for field number 0

class typped.matcher.MatchedPrefixTuple(length, on_ties, matched_string, token_label)

Bases: tuple

length

Alias for field number 0

matched_string

Alias for field number 2

on_ties

Alias for field number 1

token_label

Alias for field number 3

class typped.matcher.Matcher[source]

Bases: object

A matcher class that stores pattern data and matches it.

insert_pattern(token_label, regex_string, on_ties=0, ignore=False, matcher_options=None)[source]

Insert the pattern in the list of regex patterns.

If ignore is true then the pattern is treated as an ignored pattern.

If matcher_options="python" then the patterns are saved as individual Python regexes. Each item on the list is checked against pattern for prefix matches. This is the default.

If matcher_options="python_fnl" then the patterns are combined into a single regex whenever necessary. This is faster, but gives “first not longest” (FNL) semantics. That is, the first-defined patterns take precedence regardless of length. In this case any on_ties values are used to pre-sort the list instead of breaking equal-length ties. Unlike the python and trie options this method cannot detect multiple token matches without an on_ties value to break the tie. The insertion ordering is implicitly used to break ties.

TODO: Reconsider the on_ties semantics… maybe just for comparing across methods?

If matcher_options="trie" then the pattern is inserted in a RegexTrieDict for matching (and must be in the correct format).

Any of the above options can be set arbitrarily for each insertion. The default insert options for a MatcherPythonRegex instance can be changed by setting the attribute default_insert_options to the desired value. The default can be changed between the above options at any time.

Two combined options are python_but_trie_for_simple and python_but_fnl_for_fixed_length. These use a hybrid approach to limit the number of patterns which need to be sequentially searched while still retaining longest-match behavior. The latter option cannot be used in combination with the ordinary FNL method because it always sorts its regexes by length. Like the ordinary FNL method this method cannot catch all multiple-matches (such as for “cat” and “ca[rt]”) but it catches more. It cannot detect multiple matches for same-length patterns which are matched by the FNL matcher (vs. the Python matcher).

Note that this method does not check for reinsertions of the same token label; the higher-level def_token routine which calls it is responsible for that.

remove_pattern(token_label)[source]

Remove the pattern for the token corresponding to token_label.

get_next_token_label_and_value(program, slice_indices, error_msg_text_snippet_size=20)[source]

Return the best prefix match as a tuple of the token label and the matched string. The slice_indices are an ordered pair of indices into the string program which reference the relevant part.

exception typped.matcher.MatcherException[source]

Bases: typped.shared_settings_and_exceptions.LexerException

typped.matcher.process_string_for_escapes(string_or_char_list, escape_char, open_group=None, close_group=None)[source]

A utility routine which takes a string or character list with possibly-escaped elements as an argument and returns a list of two-tuples.

The first element of a returned two-tuple is the actual character, and the second a boolean for whether or not it is escaped.

If open_group and/or close_group is set to an opening or closing paren or bracket character the function returns a list of three-tuples, where the last element gives the level of parenthesis nesting, starting at zero and increasing. An open and its corresponding close have the same level.

typped.matcher.is_fixed_length(regex)[source]

If the Python regex is fixed-length return its effective length (not its actual length). Otherwise, return false. Note that zero-length matches are not considered fixed-length. Only a subset of fixed-length patterns are recognized, currently those consisting of regular characters and/or character sets.

typped.matcher.convert_simple_python_regex_to_rtd_regex(regex, rtd_escape=None)[source]
typped.matcher.test_fixed_length()[source]
typped.matcher.test_simple_regex_conversion_python_to_rtd()[source]