Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/add inline flags #34

Draft
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

Yehuda-blip
Copy link

@Yehuda-blip Yehuda-blip commented Apr 7, 2023

These 3 flags are incompatible in regex: [ASCII, UNICODE, LOCALE]. This roughly means that the inline flags [a,u,L] cannot be used together in a same-nested-level flagging. When a when one of these flags is nested deeper than another one, it will override the outer flag for the substring it affects.
There are two problems with this implementation:
Python regex allows this behavior - (?i:) - while this PR does not allow [ignore_case] or [ignore_case ''] (the last one is easy to fix with half a line in compiler.py line:255). However, I don't hate this and think it's probably better behavior.
The bigger problem is the fact that (?u:(?a:somestring)) is allowed in regex, with the ASCII overriding the UNICODE flag (same with all examples of the incompatible flags [ASCII, UNICODE, LOCALE] together).
Here, this is not a valid expression - [unicode [ascii_only 'somestring']] - because nesting depth of an expression is not kept in the parsing flow. Note however, this is kind of generalization of the first problem - the concat operator makes it possible to write this [unicode [ascii_only 'somestring']'anything'], so the expression is invalid only if the outer flag is not operating on anything, but in this case - [unicode [ascii_only 'somestring']] - it seems less obvious to realize what's the problem when translating from regex to kleenexp.
The 3 solutions I can think of for this are:
a. Add a nesting depth value to nodes when parsing - I don't want to do, because I don't understand enough of the parsing and this seems a major change.
b. Add this to a collection of 'problematic behaviors' somewhere.
c. Remove the unicode and locale_dependent flags completely from kleenexp - kind of an overreaction in my opinion.
d. Ignore, which is what I'm doing now and is probably an inferior solution to b.

== r"(?L:test)"
)

InlineFlag.PATTERN_IS_BYTES_LIKE = False
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-commit did not like my formatting of these tests, and made these really line-heavy trees. Originally each call was in it's own line.

@@ -54,7 +56,9 @@
def re(pattern: AnyStr, flavor: Optional[Flavor] = None) -> AnyStr:
# TODO: LRU cache
if _is_bytes_like(pattern):
asm.InlineFlag.PATTERN_IS_BYTES_LIKE = True
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be better placed in some global values-per-compilation dictionary somewhere.

# reaches the to_regex() method is the first one in the
# sequence (there may be other sequences). This is necessary
# because the parenthesis wrapping and regex legality are
# dependent on the whole flagging expression.
Copy link
Author

@Yehuda-blip Yehuda-blip Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we drop all the validations, there is always the option to simply compile [ignore_case multiline 'string'] to
(?i:(?m:string)) instead of (?im:string), which should make the whole thing a lot simpler (and I very much doubt has much effect on the performance of the result, if this is even a concern).

@Yehuda-blip Yehuda-blip marked this pull request as draft May 19, 2023 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants