Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RegEx support using RE2 (rebased #665) #1039

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Duologic
Copy link
Contributor

@Duologic Duologic commented Nov 29, 2022

Rebased #665:

Introduces 5 new built-in methods to the stdlib:

  • regexFullMatch(pattern, str) -- Full match regex
  • regexPartialMatch(pattern, str) -- Partial match regex
  • regexQuoteMeta(str) -- Escape regex metachararacters
  • regexReplace(str, pattern, to) -- Replace single occurance using regex
  • regexGlobalReplace(str, pattern, to) -- Replace globally using regex

Since both regexFullMatch and regexPartialMatch can perform captures these functions return a "match" object upon match or null otherwise. For example:

$ ./jsonnet -e 'std.regexFullMatch("h(?P<mid>.*)o", "hello")'
{
   "captures": [
      "ell"
   ],
   "namedCaptures": {
      "mid": "ell"
   },
   "string": "hello"
}

Introduces a dependency on RE2 2019-06-01.
Builds tested using make, CMake and Bazel on Ubuntu 18.04.

I ran make and make test on Linux, not sure how to trigger TravisCI.

Introduces 5 new built-in methods to the stdlib:
- `regexFullMatch(pattern, str)` -- Full match regex
- `regexPartialMatch(pattern, str)` -- Partial match regex
- `regexQuoteMeta(str)` -- Escape regex metachararacters
- `regexReplace(str, pattern, to)` -- Replace single occurance using regex
- `regexGlobalReplace(str, pattern, to)` -- Replace globally using regex

Since both `regexFullMatch` and `regexPartialMatch` can perform captures
these functions return a "match" object upon match or `null` otherwise.
For example:

```
$ ./jsonnet -e 'std.regexFullMatch("h(?P<mid>.*)o", "hello")'
{
   "captures": [
      "ell"
   ],
   "namedCaptures": {
      "mid": "ell"
   },
   "string": "hello"
}
```

Introduces a dependency on RE2 2019-06-01.
Builds tested using make, CMake and Bazel on Ubuntu 18.04.
@Duologic Duologic marked this pull request as ready for review November 29, 2022 19:30
@Duologic
Copy link
Contributor Author

Duologic commented Dec 7, 2022

Anything else I should address? @sparkprime @sbarzowski

@@ -32,8 +32,7 @@ CXXFLAGS += -Iinclude -Ithird_party/md5 -Ithird_party/json -Ithird_party/rapidya
CFLAGS ?= -g $(OPT) -Wall -Wextra -pedantic -std=c99 -fPIC
CFLAGS += -Iinclude
MAKEDEPENDFLAGS += -Iinclude -Ithird_party/md5 -Ithird_party/json -Ithird_party/rapidyaml/rapidyaml/src/ -Ithird_party/rapidyaml/rapidyaml/ext/c4core/src/
LDFLAGS ?=

LDFLAGS ?= -lre2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most controversial part. Introducing a dependency here may break existing users. The fix is easy for them, so maybe not a very big deal. Going from 0 to some dependencies might still be problematic in some contexts, but it's probably worth it for regexps.

We definitely need to add a README section for dependencies, though. It would also be good to check that the Python package works as expected and maybe mention the dependencies in the description.

@sparkprime What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into that when I do the release. I presume it will break existing windows build processes but whoever uses that ought to be able to figure it out.

}
) &&

std.assertEqual(std.regexQuoteMeta(@'1.5-2.0?'), '1\\.5\\-2\\.0\\?') &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This quotes a string to be included in regexp verbatim, right?

I'm not sure if we want to offer this. Is there a very compelling use case? I would prefer to err on the side of having smaller API (easier to add than to remove).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are already locking into RE2, reducing the API surface isn't going to make it any easier to migrate to a different regex provider. In terms of implementation, it's a small wrapper around the underlying call.

In terms of usefulness, if you want to search for a computed something inside something else I suppose it would be invaluable.

I'm ok with it.

@sbarzowski
Copy link
Collaborator

Sorry it took so long to reply. Looks mostly good. Left some minor comments.

@sparkprime
Copy link
Contributor

After I've done the current release (hopefully today) let's move forward on regex support in both impls for the next release.

@sbarzowski
Copy link
Collaborator

After I've done the current release (hopefully today) let's move forward on regex support in both impls for the next release.

Sounds good!

Copy link
Contributor

@CertainLach CertainLach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be an ability to process all matches, not only the first one, by returning matched indexes, or something like that?

std.assertEqual(
std.regexPartialMatch(@'e(.*)o', 'hello'),
{
string: 'hello',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be the 0th capture group, "ello"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's possible to do this with the underlying API. In fact I'm not sure why we repeat the given string back to the user at all. It is always just the string provided, which the caller already has.

If you want the behaviour of a "0th capture group" I suppose you can do @'(e(.*)o)' so it's not a loss of expressive power.


int num_groups = re.NumberOfCapturingGroups();

std::vector<std::string> rcaptures(num_groups);
Copy link
Contributor

@CertainLach CertainLach Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of optional group, h(.+)?o, should it be an empty string, or null?
std.regexFullMatch('h(.+)?o', 'ho').captures == ['']
std.regexFullMatch('h(.+)?o', 'ho').captures == [null]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the underlying library returns an empty string then it's clearly "OK" to do that. Would we choose to differ? I'm not sure there's much point, because you have to do != null instead of != '' right? So it doesn't help the programmer to use null instead. On the other hand, it means there is a difference between the underlying library and the Jsonnet manifestation of it. That difference would have to be documented by us and remembered by everyone else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also make the argument that the only reason the RE2 C++ API returns the empty string is because it is much harder to return a sentinel in that language. I wonder what the Go library does? @rohitjangid

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, empty match is not the same as the optional group is not matched.
Howewer, looks like both C++ re2 and golang regex implementations don't care, and leave default type value for such matches, unlike Rust (https://docs.rs/regex/latest/regex/struct.Captures.html#method.get) and js regexps.

CertainLach added a commit to CertainLach/jrsonnet that referenced this pull request Jun 15, 2023
CertainLach added a commit to CertainLach/jrsonnet that referenced this pull request Dec 10, 2023
stephenamar-db added a commit to databricks/sjsonnet that referenced this pull request Dec 31, 2024
With this PR, I'm adding a handful of methods to expose regular
expressions in jsonnet, through std.native()
I'm modeling them after [this open PR from
jsonnet](google/jsonnet#1039), which was ported
to jrsonnet.

For now, they are in std.native() as they are not part of the default
std package and use RE2 instead of the native regexp package (for
performance and compatibility reasons with a future go-jsonnet
implementation).

- regexFullMatch(pattern, str) -- Full match regex
- regexPartialMatch(pattern, str) -- Partial match regex
- regexReplace(str, pattern, to) -- Replace single occurance using regex
- regexGlobalReplace(str, pattern, to) -- Replace globally using regex

and the utility function:
- regexQuoteMeta(str) -- Escape regex metachararacters


Those functions return a object:
```
std.native("regexFullMatch")("h(?P<mid>.*)o", "hello")

{
   "captures": [
      "ell"
   ],
   "string": "hello"
}
```

This PR does not add support for the "namedCaptures" return field due to
some complications with scalajs and scalanative. Those language both use
the JDK Pattern class (js being powered by ECMA regex and Native being
powered by RE2(!)), but JDK<20 Pattern class does not have a
straightforward way to list the names of groups without some additional
hacks. This will be dealt with in a follow up PR.

This PR also adds the ability to cache patterns, and refactors all users
of regexes to use it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants