Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] PPL Sort with num, str, or ip #3180

Open
currantw opened this issue Dec 2, 2024 · 12 comments
Open

[FEATURE] PPL Sort with num, str, or ip #3180

currantw opened this issue Dec 2, 2024 · 12 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@currantw
Copy link
Contributor

currantw commented Dec 2, 2024

Is your feature request related to a problem?

PPL has unimplemented sorting syntax:

  • sort auto(field_name)
  • sort num(field_name)
  • sort str(field_name)
  • sort ip(field_name)

While this syntax is currently valid, it has no effect on the resulting sort order - sorting is always performed according to the field's data type.

Based on the discussion below, it seems like this would be a useful feature that would both easier to use, and consistent with sorting in Splunk.

What solution would you like?

  • Implement sorting with num, str, and ip keywords.
  • Remove auto keyword from sorting syntax (consensus seems to be that this isn't particularly useful).

What alternatives have you considered?

Variety of alternatives, including deprecating this syntax completely, were discussed below.

Do you have any additional context?

Related to #3145 (Add IP data type) and opensearch-project/opensearch-spark#963 (same issue for Spark). See also comment below.

@Swiddis
Copy link
Collaborator

Swiddis commented Dec 17, 2024

Thanks for the bug report!

Where exactly do we explain this syntax in the docs? I see from the Spark PR that there's some documentation examples without explanations, I guess those are vestigial examples that imply it's a leftover grammar quirk.

If it's a grammar quirk, I'm mostly concerned with the possibility of a breaking change when people previously used these queries fine (and perhaps expected them to work) after copying the examples. I think the "safe" route is to mark these functions as deprecated no-ops in the documentation/as part of the response for such queries, clean out other doc references to them, and then only remove them from the grammar next major release.

@Swiddis
Copy link
Collaborator

Swiddis commented Dec 17, 2024

In line with the above, I think this is more a documentation issue than a bug, since there's a clear workaround for the faulty behavior and the old sort operations haven't been supported for a very long time (or perhaps ever?). I'm not sure there's significant harm in letting the no-ops exist in peace.

@Swiddis Swiddis added documentation Improvements or additions to documentation and removed bug Something isn't working labels Dec 17, 2024
@anasalkouz
Copy link
Member

I agree with @Swiddis. I think we can remove it from our documentation safely. Removing them from the grammar might cause a breaking change for customers, we can remove those changes as part of our next major release.

@currantw
Copy link
Contributor Author

currantw commented Dec 17, 2024

For clarity, here is a test case that illustrates the bug:

Data

Dataset: addresses

street_number (string) street_name (string)
725 Main Street
1127 Arbutus Street
43 Victoria Drive

Query

search source=addresses | sort num(street_number) | fields street_number, street_name

Expected Output

street_number street_name
43 Victoria Drive
725 Main Street
1127 Arbutus Street

Actual Output

street_number street_name
1127 Arbutus Street
43 Victoria Drive
725 Main Street

@currantw
Copy link
Contributor Author

currantw commented Dec 17, 2024

I agree with @Swiddis. I think we can remove it from our documentation safely. Removing them from the grammar might cause a breaking change for customers, we can remove those changes as part of our next major release.

Thanks @anasalkouz and @Swiddis. Agreed, makes more sense to avoid introducing a breaking change.

The existing syntax seems to be modelled on Spunk (Splunk sort documentation). If it is desirable to keep this syntax, for ease of use for those familiar with Splunk, we could alternately fix it so that it does work. This would be relatively easily to implement, since these sort field keywords could simply work as a "shorthand" for casting (see below).

With Sort Field Keyword With Cast Function
sort auto(field_name) sort field_name
sort num(field_name) sort cast(field_name as double)
sort str(field_name) sort cast(field_name as string)
sort ip(field_name) sort cast(field_name as ip)

Please let me know what you think!

@currantw
Copy link
Contributor Author

currantw commented Dec 18, 2024

Thanks for the bug report!

Where exactly do we explain this syntax in the docs? I see from the Spark PR that there's some documentation examples without explanations, I guess those are vestigial examples that imply it's a leftover grammar quirk.

If it's a grammar quirk, I'm mostly concerned with the possibility of a breaking change when people previously used these queries fine (and perhaps expected them to work) after copying the examples. I think the "safe" route is to mark these functions as deprecated no-ops in the documentation/as part of the response for such queries, clean out other doc references to them, and then only remove them from the grammar next major release.

@Swiddis I don't that this syntax is actually referenced anywhere in the current documentation for the SQL project, which makes it less likely that customers may depend on it/have unwittingly copied it. But I don't know if it may have been part of the documentation in the past, so it's probably best to, as you suggest, wait until the next major release to remove it (if we actually do want to remove it - see my comment above).

@currantw currantw changed the title [BUG] Outdated PPL Sorting Syntax [BUG] Unimplemented PPL Sort Syntax Dec 19, 2024
@Swiddis
Copy link
Collaborator

Swiddis commented Dec 19, 2024

I like the idea of implementing it as a shorthand, type(var) == cast(var as type) is exactly how I had read the syntax in isolation anyways (as opposed to any extra semantics that may exist with "sort type"). I'll mark the issue as an enhancement, not sure what kind of timeline we're looking at to prioritize it.

I'm willing to mark this as help-wanted from external contributors since it shouldn't be very complex, if we add some links to the relevant code pieces we can also consider marking this as a good first issue.

@Swiddis Swiddis added enhancement New feature or request help wanted Extra attention is needed and removed documentation Improvements or additions to documentation labels Dec 19, 2024
@currantw
Copy link
Contributor Author

I like the idea of implementing it as a shorthand, type(var) == cast(var as type) is exactly how I had read the syntax in isolation anyways (as opposed to any extra semantics that may exist with "sort type"). I'll mark the issue as an enhancement, not sure what kind of timeline we're looking at to prioritize it.

I'm willing to mark this as help-wanted from external contributors since it shouldn't be very complex, if we add some links to the relevant code pieces we can also consider marking this as a good first issue.

@Swiddis Makes sense to me.

Some additional thoughts/considerations:

  • Would it make sense to generalize this to all data types (i.e. add a "shorthand" method for each supported data type)? This could obviously be completed as part of a separate follow-up issue.
  • This will require updating the PPL syntax. Currently, the syntax is quite restrictive: it only allows sorting by qualified field name - for example, the expression sort cast(field as integer) actually isn't valid. Rather than updating this PPL syntax to only support the "shorthand" sort syntax discussed, would it be better to allow sorting on any expression?
  • Should auto be removed completely? It doesn't do anything (it doesn't alter the behaviour, since sorting by the field's data type is the default), and potentially is more confusing if all the other shorthands correspond to data types - or does the parallel with Splunk make it worth keeping anyway?
  • What about num? We don't have a general "number" data type, only more specific numerical types (byte, double, float, long, short, and integer). Should we add shorthand for each of these types (e.g. byte(), double(), float(), long(), short(), and int())? Should we remove num (since it might lead to unexpected results), or should we keep it (for consistency with Splunk) and just cast to double "under the hood"?

Based on your answers to the above, perhaps it would make sense to tackle this in the following steps:

  1. Update syntax to support sorting by any expression (e.g. sort cast(field as integer)).
  2. Add "shorthand" data casting methods str, ip, byte, double, float, long, short, and int (e.g. sort int(field)). Remove support for auto and num sort syntax.
  3. Add "shorthand" methods for remaining data types: bool, date, time, timestamp. (And perhaps json as well. See [FEATURE] Add json function to cast string to json object #3209). (e.g. sort timestamp(field)).

Let me know what you think!

@Swiddis
Copy link
Collaborator

Swiddis commented Dec 23, 2024

I think num is probably fine since in principle sorting by all of the numeric types listed should be the same logic: everything listed would have equivalent semantics, except perhaps the Weird floating-point values. Maybe steal the OrderedFloat semantics? We can consider breaking it up for consistency but I don't see it as an issue.

I'm in favor of the other language changes you proposed (extending the PPL sort syntax, more sort types, and removing auto) and the plan you provided. If the system for sorting is sufficiently general, this may all be doable in one go, I'm not sure how that system looks though. Might be worth getting input from someone who knows this part of the system better.

One thing I have run into is that using complex sort logic in OpenSearch queries can kill performance, from experience even something as simple as going from sorting by one field -> sorting by two fields (primary and secondary) will slow down queries by ~6x. For large enough indices this will cause query failures, so an implementation of arbitrary sorting expressions should be robust against resource exhaustion.

@currantw currantw changed the title [BUG] Unimplemented PPL Sort Syntax [FEATURE] PPL Sort with num, str, or ip Jan 3, 2025
@currantw
Copy link
Contributor Author

currantw commented Jan 3, 2025

Thanks @Swiddis for the feedback. I've updated the issue above based on our discussion, and switched it from a bug to feature type (since it will be implementing new functionality).

You raised some points around performance that I hadn't considered, and which certainly might be a concern. As a result, I decided to keep this issue as simple as possible (i.e. to get sorting with num, str, and ip to work). I reckon that further issues can be raised to implement some of the other ideas (in particular, "shorthand" method for casting) if it's valuable.

Let me know if you have any further thoughts!

@YANG-DB
Copy link
Member

YANG-DB commented Jan 4, 2025

thanks @currantw
have you started working on a PR for this issue ?

@currantw
Copy link
Contributor Author

currantw commented Jan 6, 2025

have you started working on a PR for this issue ?

No, I have not, and wasn't planning to. Just wanted to get the issue to a decent state. Please let me know if/when you would like to start work on this, or if you're happy just keeping it in the backlog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants