-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RegEx support using RE2 (rebased #665) #1039
base: master
Are you sure you want to change the base?
Conversation
Introduces 5 new built-in methods to the stdlib: - `regexFullMatch(pattern, str)` -- Full match regex - `regexPartialMatch(pattern, str)` -- Partial match regex - `regexQuoteMeta(str)` -- Escape regex metachararacters - `regexReplace(str, pattern, to)` -- Replace single occurance using regex - `regexGlobalReplace(str, pattern, to)` -- Replace globally using regex Since both `regexFullMatch` and `regexPartialMatch` can perform captures these functions return a "match" object upon match or `null` otherwise. For example: ``` $ ./jsonnet -e 'std.regexFullMatch("h(?P<mid>.*)o", "hello")' { "captures": [ "ell" ], "namedCaptures": { "mid": "ell" }, "string": "hello" } ``` Introduces a dependency on RE2 2019-06-01. Builds tested using make, CMake and Bazel on Ubuntu 18.04.
Anything else I should address? @sparkprime @sbarzowski |
@@ -32,8 +32,7 @@ CXXFLAGS += -Iinclude -Ithird_party/md5 -Ithird_party/json -Ithird_party/rapidya | |||
CFLAGS ?= -g $(OPT) -Wall -Wextra -pedantic -std=c99 -fPIC | |||
CFLAGS += -Iinclude | |||
MAKEDEPENDFLAGS += -Iinclude -Ithird_party/md5 -Ithird_party/json -Ithird_party/rapidyaml/rapidyaml/src/ -Ithird_party/rapidyaml/rapidyaml/ext/c4core/src/ | |||
LDFLAGS ?= | |||
|
|||
LDFLAGS ?= -lre2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the most controversial part. Introducing a dependency here may break existing users. The fix is easy for them, so maybe not a very big deal. Going from 0 to some dependencies might still be problematic in some contexts, but it's probably worth it for regexps.
We definitely need to add a README section for dependencies, though. It would also be good to check that the Python package works as expected and maybe mention the dependencies in the description.
@sparkprime What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into that when I do the release. I presume it will break existing windows build processes but whoever uses that ought to be able to figure it out.
} | ||
) && | ||
|
||
std.assertEqual(std.regexQuoteMeta(@'1.5-2.0?'), '1\\.5\\-2\\.0\\?') && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This quotes a string to be included in regexp verbatim, right?
I'm not sure if we want to offer this. Is there a very compelling use case? I would prefer to err on the side of having smaller API (easier to add than to remove).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we are already locking into RE2, reducing the API surface isn't going to make it any easier to migrate to a different regex provider. In terms of implementation, it's a small wrapper around the underlying call.
In terms of usefulness, if you want to search for a computed something inside something else I suppose it would be invaluable.
I'm ok with it.
Sorry it took so long to reply. Looks mostly good. Left some minor comments. |
After I've done the current release (hopefully today) let's move forward on regex support in both impls for the next release. |
Sounds good! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be an ability to process all matches, not only the first one, by returning matched indexes, or something like that?
std.assertEqual( | ||
std.regexPartialMatch(@'e(.*)o', 'hello'), | ||
{ | ||
string: 'hello', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be the 0th capture group, "ello"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's possible to do this with the underlying API. In fact I'm not sure why we repeat the given string back to the user at all. It is always just the string provided, which the caller already has.
If you want the behaviour of a "0th capture group" I suppose you can do @'(e(.*)o)' so it's not a loss of expressive power.
|
||
int num_groups = re.NumberOfCapturingGroups(); | ||
|
||
std::vector<std::string> rcaptures(num_groups); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of optional group, h(.+)?o
, should it be an empty string, or null?
std.regexFullMatch('h(.+)?o', 'ho').captures == ['']
std.regexFullMatch('h(.+)?o', 'ho').captures == [null]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the underlying library returns an empty string then it's clearly "OK" to do that. Would we choose to differ? I'm not sure there's much point, because you have to do != null
instead of != ''
right? So it doesn't help the programmer to use null
instead. On the other hand, it means there is a difference between the underlying library and the Jsonnet manifestation of it. That difference would have to be documented by us and remembered by everyone else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also make the argument that the only reason the RE2 C++ API returns the empty string is because it is much harder to return a sentinel in that language. I wonder what the Go library does? @rohitjangid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, empty match is not the same as the optional group is not matched.
Howewer, looks like both C++ re2 and golang regex implementations don't care, and leave default type value for such matches, unlike Rust (https://docs.rs/regex/latest/regex/struct.Captures.html#method.get) and js regexps.
Upstream issue: google/jsonnet#1039
Upstream issue: google/jsonnet#1039
With this PR, I'm adding a handful of methods to expose regular expressions in jsonnet, through std.native() I'm modeling them after [this open PR from jsonnet](google/jsonnet#1039), which was ported to jrsonnet. For now, they are in std.native() as they are not part of the default std package and use RE2 instead of the native regexp package (for performance and compatibility reasons with a future go-jsonnet implementation). - regexFullMatch(pattern, str) -- Full match regex - regexPartialMatch(pattern, str) -- Partial match regex - regexReplace(str, pattern, to) -- Replace single occurance using regex - regexGlobalReplace(str, pattern, to) -- Replace globally using regex and the utility function: - regexQuoteMeta(str) -- Escape regex metachararacters Those functions return a object: ``` std.native("regexFullMatch")("h(?P<mid>.*)o", "hello") { "captures": [ "ell" ], "string": "hello" } ``` This PR does not add support for the "namedCaptures" return field due to some complications with scalajs and scalanative. Those language both use the JDK Pattern class (js being powered by ECMA regex and Native being powered by RE2(!)), but JDK<20 Pattern class does not have a straightforward way to list the names of groups without some additional hacks. This will be dealt with in a follow up PR. This PR also adds the ability to cache patterns, and refactors all users of regexes to use it.
Rebased #665:
I ran
make
andmake test
on Linux, not sure how to trigger TravisCI.