10.10. typped.matcher¶
A Matcher object for a TokenTable, which the Lexer uses for looking up tokens.
10.10.1. Using the matcher¶
The default matcher, simply called "python", uses individual Python regexes for the tokens, stored in a list. This implements the "longest not first" matching semantics. There are several alternative matchers which can be specified, or users can define their own as described in a later section.
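As a rough illustration (not the actual typped code), longest-match behavior over a list of individually stored regexes can be sketched like this:

```python
import re

# Minimal sketch of "longest not first" matching: every stored pattern is
# tried as a prefix match at the start of the text, and the longest match
# wins regardless of the order in which the patterns were defined.
patterns = [("k_equals", re.compile(r"=")),
            ("k_arrow", re.compile(r"=>"))]

def longest_match(text):
    best = None
    for label, regex in patterns:
        match = regex.match(text)  # prefix match at position 0
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (label, match.group())
    return best

print(longest_match("=> x"))  # the longer "=>" wins even though "=" was defined first
```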
To select a different matcher to save a pattern in, set the matcher_options keyword argument in the call to def_token. That value is passed on to the matcher's insert_pattern method. With the default matchers, different save options can be chosen for different tokens. Each matcher with defined entries is checked when looking for matches. The default matcher to use can be changed by setting the default_insert_options attribute of the matcher instance.
One alternative matcher uses Python regexes but with "first not longest" semantics. This makes lookup more efficient, but it makes the definition of tokens order-dependent. (Additions and deletions are also more expensive because it builds one large regex.) To select it, use matcher_options="python_fnl".
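A sketch of why one combined regex gives first-not-longest semantics (this is not the typped internals, just an illustration of Python's re module, which takes the first matching alternative rather than the longest):

```python
import re

# "First not longest": all patterns are combined into one alternation, and
# the re module returns the first alternative that matches at the position.
combined = re.compile("(?P<k_equals>=)|(?P<k_arrow>=>)")
match = combined.match("=> x")
print(match.lastgroup, match.group())  # "=" wins because it was defined first

# Reversing the definition order changes the result, hence the order-dependence.
reordered = re.compile("(?P<k_arrow>=>)|(?P<k_equals>=)")
print(reordered.match("=> x").lastgroup)
```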
There are also hybrid matchers, though these are still experimental. They attempt to implement more-efficient longest-not-first matching by storing simple patterns in either a trie or a "python_fnl" matcher, reverting to the default "python" matcher for more-complex patterns.
There is also an experimental matcher that uses only the RegexTrieDictScanner. The regex language it accepts differs from the Python regex language, though. (The regex-trie-dict package is not currently up on GitHub.)
The Python matcher has good insert and delete times, but can become inefficient for large numbers of patterns. The trie is less efficient for small numbers of patterns, but can search many simple patterns quickly in parallel by just going down the trie. Simple patterns are literal matches with no special characters or patterns containing only character-range wildcards. More complex patterns in the trie matcher are still experimental.
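The parallel lookup for simple literal patterns can be illustrated with a toy trie (a simplified sketch, not the RegexTrieDict implementation):

```python
# Toy sketch of trie-based matching for literal patterns: all stored
# patterns are searched in parallel by walking character-by-character down
# one trie, so lookup cost does not grow with the number of patterns.
def make_trie(labeled_literals):
    root = {}
    for label, literal in labeled_literals:
        node = root
        for char in literal:
            node = node.setdefault(char, {})
        node["__label__"] = label  # mark the end of a stored pattern
    return root

def longest_trie_match(trie, text):
    node, best = trie, None
    for i, char in enumerate(text):
        if char not in node:
            break
        node = node[char]
        if "__label__" in node:  # remember the longest match seen so far
            best = (node["__label__"], text[:i + 1])
    return best

trie = make_trie([("k_if", "if"), ("k_in", "in"), ("k_int", "int")])
print(longest_trie_match(trie, "integer"))  # ('k_int', 'int'), the longest match
```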
10.10.2. Using a custom matcher¶
It is possible for users to define their own matchers. For example, the current matcher always takes the longest match. More-efficient matchers can be implemented which instead take the prefix which matches the first-defined pattern (so more-general patterns are usually defined first). The Typped package tries to avoid this for semantic clarity and to avoid order-dependence, but there is no technical reason it cannot be done.
A matcher class needs to have the following methods:
- insert_pattern – insert a labeled regex pattern and priority into the matcher
- remove_pattern – remove an inserted pattern
- get_next_token_label_and_value – return the label and value of the next match
And the following attributes:
- ignore_tokens – the set of token labels which are defined as ignored
The classes need to have the signatures of their corresponding methods below. Note that the insert_pattern method can take any options it wants in the matcher_options argument, and it can do whatever it chooses based on the value.
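A minimal skeleton of a custom matcher with this interface might look like the following. The method names and the insert_pattern signature follow the ones documented in this section; the internals and the get_next_token_label_and_value arguments are illustrative assumptions, not the exact typped interface:

```python
import re

# Hedged skeleton of a custom matcher class.  The required methods and the
# ignore_tokens attribute match the interface described above; everything
# else is an illustrative sketch.
class NewMatcher(object):
    def __init__(self):
        self.patterns = {}          # token_label -> (compiled_regex, on_ties)
        self.ignore_tokens = set()  # required attribute

    def insert_pattern(self, token_label, regex_string, on_ties=0,
                       ignore=False, matcher_options=None):
        # matcher_options can carry any matcher-specific settings.
        self.patterns[token_label] = (re.compile(regex_string), on_ties)
        if ignore:
            self.ignore_tokens.add(token_label)

    def remove_pattern(self, token_label):
        del self.patterns[token_label]
        self.ignore_tokens.discard(token_label)

    def get_next_token_label_and_value(self, program, position):
        # Longest-match lookup starting at `position` (these argument names
        # are assumptions, not the exact typped signature).
        best = None
        for label, (regex, on_ties) in self.patterns.items():
            match = regex.match(program, position)
            if match is None:
                continue
            key = (len(match.group()), on_ties)  # prefer longer, then higher on_ties
            if best is None or key > best[0]:
                best = (key, label, match.group())
        if best is None:
            raise ValueError("no pattern matches at position %d" % position)
        return best[1], best[2]

matcher = NewMatcher()
matcher.insert_pattern("k_number", r"[0-9]+")
matcher.insert_pattern("k_space", r"[ \t]+", ignore=True)
print(matcher.get_next_token_label_and_value("42 x", 0))  # ('k_number', '42')
```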
Currently, to switch matchers completely you need to 1) define a new matcher class, 2) get an empty instance of that class, 3) create an empty TokenTable class with that matcher passed to the initializer as the pattern_matcher_instance argument, 4) pass that token table instance as the token_table parameter of a new Lexer instance, and 5) pass that lexer instance as the lexer argument to the initializer of a new PrattParser class. This could be automated by subclassing PrattParser and doing it in the __init__ method.
pattern_matcher = NewMatcher()
token_table = TokenTable(pattern_matcher=pattern_matcher)
lexer = Lexer(token_table=token_table)
parser = PrattParser(lexer=lexer)
10.10.3. Code¶
- class typped.matcher.TokenPatternTuple(regex_string, compiled_regex, on_ties)¶
  Bases: tuple
  - compiled_regex¶ Alias for field number 1
  - on_ties¶ Alias for field number 2
  - regex_string¶ Alias for field number 0
- class typped.matcher.MatchedPrefixTuple(length, on_ties, matched_string, token_label)¶
  Bases: tuple
  - length¶ Alias for field number 0
  - matched_string¶ Alias for field number 2
  - on_ties¶ Alias for field number 1
  - token_label¶ Alias for field number 3
- class typped.matcher.Matcher[source]¶
  Bases: object
  A matcher class that stores pattern data and matches it.
  - insert_pattern(token_label, regex_string, on_ties=0, ignore=False, matcher_options=None)[source]¶
    Insert the pattern in the list of regex patterns.

    If ignore is true then the pattern is treated as an ignored pattern.

    If matcher_options="python" then the patterns are saved as individual Python regexes. Each item on the list is checked for a prefix match against the text. This is the default.

    If matcher_options="python_fnl" then the patterns are combined into a single regex whenever necessary. This is faster, but gives "first not longest" (FNL) semantics. That is, the first-defined patterns take precedence regardless of length. In this case any on_ties values are used to pre-sort the list instead of breaking equal-length ties. Unlike the python and trie options, this method cannot detect multiple token matches without an on_ties value to break the tie. The insertion ordering is implicitly used to break ties.

    TODO: Reconsider the on_ties semantics… maybe just for comparing across methods?

    If matcher_options="trie" then the pattern is inserted in a RegexTrieDict for matching (and must be in the correct format).

    Any of the above options can be set arbitrarily for each insertion. The default insert options for a MatcherPythonRegex instance can be changed by setting the attribute default_insert_options to the desired value. The default can be changed between the above options at any time.

    Two combined options are python_but_trie_for_simple and python_but_fnl_for_fixed_length. These use a hybrid approach to limit the number of patterns which need to be sequentially searched while still retaining longest-match behavior. The latter option cannot be used in combination with the ordinary FNL method because it always sorts its regexes by length. Like the ordinary FNL method, this method cannot catch all multiple matches (such as for "cat" and "ca[rt]") but it catches more. It cannot detect multiple matches for same-length patterns which are matched by the FNL matcher (vs. the Python matcher).

    Note that this method does not check for reinsertions of the same token label; the higher-level def_token routine which calls it is responsible for that.
- typped.matcher.process_string_for_escapes(string_or_char_list, escape_char, open_group=None, close_group=None)[source]¶
  A utility routine which takes a string or character list with possibly-escaped elements as an argument and returns a list of two-tuples. The first element of a returned two-tuple is the actual character, and the second is a boolean for whether or not it is escaped.

  If open_group and/or close_group is set to an opening or closing paren or bracket character, the function returns a list of three-tuples, where the last element gives the level of parenthesis nesting, starting at zero and increasing. An open and its corresponding close have the same level.
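The basic two-tuple behavior can be illustrated with a simplified reimplementation (an assumption based on the description above, not the typped source; the handling of the escape character itself may differ):

```python
# Illustrative sketch of escape processing: each output element is a
# (char, is_escaped) two-tuple, and the escape character is consumed when
# it escapes the following character.
def process_escapes(string, escape_char="\\"):
    result = []
    escaped = False
    for char in string:
        if escaped:
            result.append((char, True))
            escaped = False
        elif char == escape_char:
            escaped = True  # the next character is escaped
        else:
            result.append((char, False))
    return result

print(process_escapes(r"a\[b"))  # [('a', False), ('[', True), ('b', False)]
```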
- typped.matcher.is_fixed_length(regex)[source]¶
  If the Python regex is fixed-length, return its effective length (not its actual length). Otherwise, return false. Note that zero-length matches are not considered fixed-length. Only a subset of fixed-length patterns are recognized, currently those consisting of regular characters and/or character sets.
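The effective-length idea can be sketched as follows (an illustrative reimplementation, not the typped source): a character set like [rt] matches exactly one character, so "ca[rt]" has effective length 3 even though the pattern string itself is 6 characters long.

```python
# Hedged sketch of computing the effective length of a fixed-length regex
# made of regular characters and/or character sets; anything fancier is
# reported as not fixed-length.
def effective_fixed_length(regex_string):
    length = 0
    i = 0
    while i < len(regex_string):
        char = regex_string[i]
        if char in ".^$*+?{}()|":
            return False  # outside the simple recognized subset
        if char == "[":
            close = regex_string.find("]", i + 1)
            if close == -1:
                return False  # malformed character set
            i = close + 1  # a whole character set matches one character
        elif char == "\\":
            i += 2  # an escaped character matches one character
        else:
            i += 1
        length += 1
    return length if length > 0 else False  # zero-length is not fixed-length

print(effective_fixed_length("ca[rt]"))  # 3
print(effective_fixed_length("a+"))      # False
```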