-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/add inline flags #34
base: master
Are you sure you want to change the base?
Feature/add inline flags #34
Conversation
…implemented, and no testing was done
…inst bytes-like patterns
== r"(?L:test)" | ||
) | ||
|
||
InlineFlag.PATTERN_IS_BYTES_LIKE = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pre-commit did not like my formatting of these tests, and made these really line-heavy trees. Originally each call was in it's own line.
@@ -54,7 +56,9 @@ | |||
def re(pattern: AnyStr, flavor: Optional[Flavor] = None) -> AnyStr: | |||
# TODO: LRU cache | |||
if _is_bytes_like(pattern): | |||
asm.InlineFlag.PATTERN_IS_BYTES_LIKE = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be better placed in some global values-per-compilation dictionary somewhere.
…implemented, and no testing was done
…inst bytes-like patterns
…eature/add_inline_flags # Conflicts: # ke/asm.py
# reaches the to_regex() method is the first one in the | ||
# sequence (there may be other sequences). This is necessary | ||
# because the parenthesis wrapping and regex legality are | ||
# dependent on the whole flagging expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we drop all the validations, there is always the option to simply compile [ignore_case multiline 'string']
to
(?i:(?m:string))
instead of (?im:string)
, which should make the whole thing a lot simpler (and I very much doubt has much effect on the performance of the result, if this is even a concern).
These 3 flags are incompatible in regex: [ASCII, UNICODE, LOCALE]. This roughly means that the inline flags [a,u,L] cannot be used together in a same-nested-level flagging. When a when one of these flags is nested deeper than another one, it will override the outer flag for the substring it affects.
There are two problems with this implementation:
Python regex allows this behavior -
(?i:)
- while this PR does not allow[ignore_case]
or[ignore_case '']
(the last one is easy to fix with half a line in compiler.py line:255). However, I don't hate this and think it's probably better behavior.The bigger problem is the fact that
(?u:(?a:somestring))
is allowed in regex, with the ASCII overriding the UNICODE flag (same with all examples of the incompatible flags [ASCII, UNICODE, LOCALE] together).Here, this is not a valid expression -
[unicode [ascii_only 'somestring']]
- because nesting depth of an expression is not kept in the parsing flow. Note however, this is kind of generalization of the first problem - the concat operator makes it possible to write this[unicode [ascii_only 'somestring']'anything']
, so the expression is invalid only if the outer flag is not operating on anything, but in this case -[unicode [ascii_only 'somestring']]
- it seems less obvious to realize what's the problem when translating from regex to kleenexp.The 3 solutions I can think of for this are:
a. Add a nesting depth value to nodes when parsing - I don't want to do, because I don't understand enough of the parsing and this seems a major change.
b. Add this to a collection of 'problematic behaviors' somewhere.
c. Remove the unicode and locale_dependent flags completely from kleenexp - kind of an overreaction in my opinion.
d. Ignore, which is what I'm doing now and is probably an inferior solution to b.