Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PPL command implementation for appendCol #990

Merged
merged 64 commits into from
Jan 8, 2025

Conversation

andy-k-improving
Copy link
Contributor

@andy-k-improving andy-k-improving commented Dec 11, 2024

Description

Introduce the new PPL command appendCol which aim to aggregate result from multiple searches into a single comprehensive table for user to view.

This is accomplished by reading both main-search and sub-search in the form of node then transform it into the following of SQL with by adding _row_number_ for the dataset's natural order, then join both main and sub search with the _row_number_ column.

select t1.*, t2.* 

FROM (
     SELECT *, row_number() over (order by '1') as row_org 
     FROM employees) as t1

 FULL OUTER JOIN (
     SELECT *, row_number() over (order by '1') as row_app 
     FROM employees) as t2 

ON t1.row_org = t2.row_app;

Related Issues

Resolves: #956

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Test plan:

# Produce the artifact
sbt clean sparkPPLCosmetic/publishM2

# Start Spark with the plugin
bin/spark-sql --jars "/ABSOLUTE_PATH_TO_ARTIFACT/opensearch-spark-ppl_2.12-0.6.0-SNAPSHOT.jar" \
--conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions"  \
--conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog" \
--conf "spark.hadoop.hive.cli.print.header=true"

# Insert test table and data
CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT, con STRING);

INSERT INTO employees VALUES ("Lisa", "Sales", 10000, 35, 'test');
INSERT INTO employees VALUES ("Evan", "Sales", 32000, 38, 'test');
INSERT INTO employees VALUES ("Fred", "Engineering", 21000, 28, 'test');
INSERT INTO employees VALUES ("Alex", "Sales", 30000, 33, 'test');
INSERT INTO employees VALUES ("Tom", "Engineering", 23000, 33, 'test');
INSERT INTO employees VALUES ("Jane", "Marketing", 29000, 28, 'test');
INSERT INTO employees VALUES ("Jeff", "Marketing", 35000, 38, 'test');
INSERT INTO employees VALUES ("Paul", "Engineering", 29000, 23, 'test');
INSERT INTO employees VALUES ("Chloe", "Engineering", 23000, 25, 'test');

# Append one sub-search:

source=employees | FIELDS name, dept, salary | APPENDCOL  [ stats count() as event_count];

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL


# Append multiple sub-searches:

source=employees | FIELDS name, dept, salary | APPENDCOL  [ stats count() as event_count] | APPENDCOL [stats avg(age) as avg_age];

name	dept	salary	event_count	avg_age
Lisa	Sales------	10000	9	31.22222222222222
Fred	Engineering	21000	NULL	NULL
Paul	Engineering	29000	NULL	NULL
Evan	Sales------	32000	NULL	NULL
Chloe	Engineering	23000	NULL	NULL
Tom	Engineering	23000	NULL	NULL
Alex	Sales	30000	NULL	NULL
Jane	Marketing	29000	NULL	NULL
Jeff	Marketing	35000	NULL	NULL



# With override option (`salary` column from the main-search is being dropped and replaced by the `salary` column over the sub-search)

source=employees | FIELDS name, dept, salary | APPENDCOL OVERRIDE=true [stats avg(salary) as salary];

name	dept	salary
Lisa	Sales------	25777.777777777777
Fred	Engineering	NULL
Paul	Engineering	NULL
Evan	Sales------	NULL
Chloe	Engineering	NULL
Tom	Engineering	NULL
Alex	Sales	NULL
Jane	Marketing	NULL
Jeff	Marketing	NULL


@andy-k-improving andy-k-improving marked this pull request as draft December 11, 2024 02:26
@andy-k-improving andy-k-improving changed the title DRAFT: PPL command appendCol implementaion PPL command implementation for appendCol Dec 16, 2024
@andy-k-improving andy-k-improving marked this pull request as ready for review December 16, 2024 22:45
@LantaoJin
Copy link
Member

LantaoJin commented Dec 18, 2024

Two high level questions:

  1. appendCol command syntax is
    APPENDCOL <override=?> [sub-search]...

And the sub-search syntax is

subSearch
: searchCommand (PIPE commands)*
;

Seems this PR doesn't follow the sub-search syntax.
I prefer to follow the current sub-search syntax, in case we could combine columns from different tables. for examples:
source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]
But if this is intentional (appendcol only works for one table), I am okey for current syntax.

  1. why the result of query source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]
    is
name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL 

instead of

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9

PS, what is the expected result of query source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]?

@andy-k-improving
Copy link
Contributor Author

Two high level questions:

  1. appendCol command syntax is
    APPENDCOL <override=?> [sub-search]...

And the sub-search syntax is

subSearch
: searchCommand (PIPE commands)*
;

Seems this PR doesn't follow the sub-search syntax.
I prefer to follow the current sub-search syntax, in case we could combine columns from different tables. for examples:
source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]
But if this is intentional (appendcol only works for one table), I am okey for current syntax.
2. why the result of query source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]
is

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL 

instead of

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9

PS, what is the expected result of query source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]?

Yep, the sub-search under appendcol is restricted to use the same dataSource as the main PPL command, for the expected result, we can probably discuss it on the other thread, in order to centralise the convo.

@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support 0.7 labels Dec 19, 2024
@YANG-DB
Copy link
Member

YANG-DB commented Dec 31, 2024

Two high level questions:

  1. appendCol command syntax is
    APPENDCOL <override=?> [sub-search]...

And the sub-search syntax is

subSearch
: searchCommand (PIPE commands)*
;

Seems this PR doesn't follow the sub-search syntax.
I prefer to follow the current sub-search syntax, in case we could combine columns from different tables. for examples:
source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]
But if this is intentional (appendcol only works for one table), I am okey for current syntax.
2. why the result of query source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]
is

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL 

instead of

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9

PS, what is the expected result of query source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]?

Yep, the sub-search under appendcol is restricted to use the same dataSource as the main PPL command, for the expected result, we can probably discuss it on the other thread, in order to centralise the convo.

@andy-k-improving I would go with the same behaviour to match with Splunk ...

@YANG-DB
Copy link
Member

YANG-DB commented Jan 2, 2025

@andy-k-improving LGTM - please validate @LantaoJin question and respond here
thanks

andy-k-improving and others added 13 commits January 2, 2025 12:02
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Co-authored-by: Taylor Curran <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Co-authored-by: Taylor Curran <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Co-authored-by: Taylor Curran <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Copy link
Member

@YANG-DB YANG-DB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andy-k-improving I would like to simplify the method CatalystQueryPlanVisitor.java.visitAppendCol(..) so its simpler / smaller and move its complexity away into AppendColCatalystUtils

does it make sense ?

@andy-k-improving
Copy link
Contributor Author

@andy-k-improving I would like to simplify the method visitAppendCol(..) so its simpler / smaller and more the complexity away into AppendColCatalystUtils

does it make sense ?

On high level, that make sense, however how you foresee the structure will be like on visitAppendCol( ) after the refactor?

Do you foresee t he logical plan composition flow happen within visitAppendCol( ) or it will be moved entirely to AppendColCatalystUtils.

Ex:

visitAppendCol( ) {

      # Composite main query
      
      # Extract relation 
      
      # Composite sub-query
      
      # Some more operations.....
      
      # Return the result 
 }


v.s 


visitAppendCol( ) {
      
      # One util call to pass all argument into `AppendColCatalystUtils` for processing, expecting a logical plan with main and sub as return.
      
      # Some minor logic, which couldn't be done Util.
      
      # Return the result.

}

Copy link
Member

@LantaoJin LantaoJin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really appreciate your code revisions and clear explanations, which is improving the code quality. LGTM.

@YANG-DB YANG-DB merged commit 1cb538f into opensearch-project:main Jan 8, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.7 Lang:PPL Pipe Processing Language support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[PPL-Lang]add appendcol ppl command
4 participants