local spark ppl testing documentation #902

Merged: 10 commits merged into opensearch-project:main on Nov 14, 2024

Conversation

YANG-DB (Member) commented Nov 13, 2024

Description

add local spark ppl testing documentation and details

Related Issues

#896

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

YANG-DB added the documentation (Improvements or additions to documentation), Lang:PPL (Pipe Processing Language support), and 0.7 labels on Nov 13, 2024
LantaoJin (Member) commented:

@YANG-DB A high-level question: will we use local spark for sanity tests instead of an OpenSearch Domain in the future? This doesn't seem to be end-to-end testing. For example, we found issue #875 during sanity testing in a Domain env.

qianheng-aws (Contributor) commented Nov 14, 2024

@YANG-DB What's the motivation to add this doc?

I think we already have a guide on local spark PPL usage in the root README:
https://github.com/opensearch-project/opensearch-spark/blob/main/README.md#ppl-build--run

And the PPL commands testing somewhat duplicates the ppl-commands doc; that place should be the single source of truth for each command:
https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/README.md

## emails table
```sql
CREATE TABLE emails (name STRING, age INT, email STRING, street_address STRING, year INT, month INT) PARTITIONED BY (year, month);
INSERT INTO testTable (name, age, email, street_address, year, month) VALUES ('Alice', 30, '[email protected]', '123 Main St, Seattle', 2023, 4), ('Bob', 55, '[email protected]', '456 Elm St, Portland', 2023, 5), ('Charlie', 65, '[email protected]', '789 Pine St, San Francisco', 2023, 4), ('David', 19, '[email protected]', '101 Maple St, New York', 2023, 5), ('Eve', 21, '[email protected]', '202 Oak St, Boston', 2023, 4), ('Frank', 76, '[email protected]', '303 Cedar St, Austin', 2023, 5), ('Grace', 41, '[email protected]', '404 Birch St, Chicago', 2023, 4), ('Hank', 32, '[email protected]', '505 Spruce St, Miami', 2023, 5), ('Ivy', 9, '[email protected]', '606 Fir St, Denver', 2023, 4), ('Jack', 12, '[email protected]', '707 Ash St, Seattle', 2023, 5);
```

Review comment (Member):
`testTable` should be `emails`
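
Applying that change, the corrected statement would target the `emails` table; a minimal sketch showing the first rows (the remaining values are unchanged from the snippet above):

```sql
-- corrected table name: emails instead of testTable
INSERT INTO emails (name, age, email, street_address, year, month) VALUES
  ('Alice', 30, '[email protected]', '123 Main St, Seattle', 2023, 4),
  ('Bob', 55, '[email protected]', '456 Elm St, Portland', 2023, 5);
```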

# Testing PPL using local Spark

## Produce the PPL artifact
The first step would be to produce the spark-ppl artifact: `sbt clean sparkPPLCosmetic/publishM2`

Review comment (Member):

This action is dangerous when a user has write credentials and remote repo settings in their env. How about changing it to `sbt clean sparkPPLCosmetic/assembly`?
It will generate the spark-ppl artifact and print its location at the end:

[info] Built: ./opensearch-spark/sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar
[info] Jar hash: 71dd9c

YANG-DB (Member Author) replied:


@LantaoJin I've updated - please review and see if anything else is missing
thanks

YANG-DB requested a review from LantaoJin on November 14, 2024 at 03:03
YANG-DB (Member Author) commented Nov 14, 2024

> @YANG-DB A high-level question: will we use local spark for sanity tests instead of an OpenSearch Domain in the future? This doesn't seem to be end-to-end testing. For example, we found issue #875 during sanity testing in a Domain env.

Hi @LantaoJin,
The idea behind this is to allow an open-source user to experiment with the PPL language directly in the development environment itself. It serves as a fast way to experiment with a local spark cluster before moving on to more complicated use cases.
The ultimate goal is to have separate testing for an open-source environment that does not depend on a specific provider.
It doesn't function as a sanity test, but rather as a user tutorial for playing around with the language and understanding its capabilities.
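
For example, once spark-sql is running with the PPL artifact loaded, a user could try a simple query against the sample `emails` table from the tutorial (an illustrative sketch, assuming the table above has been created):

```
source = emails | where age > 30 | fields name, email;
```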

YANG-DB (Member Author) commented Nov 14, 2024

> @YANG-DB What's the motivation to add this doc?
>
> I think we already have a guide on local spark PPL usage in the root README: https://github.com/opensearch-project/opensearch-spark/blob/main/README.md#ppl-build--run
>
> And the PPL commands testing somewhat duplicates the ppl-commands doc; that place should be the single source of truth for each command: https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/README.md

Hi @qianheng-aws - thanks for the feedback.
As I mentioned above, this simple tutorial is a basic way of explaining how to quickly get started with PPL on a local spark cluster, and it extends the README section.
It's supposed to be used by developers who are trying to understand whether this open-source spark PPL solution fits their needs, without having to deploy a more complicated use case to a real spark cluster.

## Start Spark with the plugin
Once installed, run spark with the generated PPL artifact:
```shell
bin/spark-sql --jars "/PATH_TO_ARTIFACT/oopensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar" \
```

Review comment (Member):
double "o" typo here (`oopensearch` should be `opensearch`)
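
With that fixed, the jar reference would read roughly as follows (any trailing options from the original snippet are unchanged):

```shell
# corrected artifact name: single "o" in "opensearch"
bin/spark-sql --jars "/PATH_TO_ARTIFACT/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar"
```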

YANG-DB merged commit bf60e59 into opensearch-project:main on Nov 14, 2024 (4 checks passed)
kenrickyap pushed a commit to Bit-Quill/opensearch-spark that referenced this pull request Dec 11, 2024
* add local spark ppl testing documentation and details

Signed-off-by: YANGDB <[email protected]>

* update more sample test tables and commands

Signed-off-by: YANGDB <[email protected]>

* update more sample test tables and commands

Signed-off-by: YANGDB <[email protected]>

* update more sample test tables and commands

Signed-off-by: YANGDB <[email protected]>

* update for using opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar

Signed-off-by: YANGDB <[email protected]>

* update tutorial documentation on using a local spark-cluster with ppl queries

Signed-off-by: YANGDB <[email protected]>

* typo fix

Signed-off-by: YANGDB <[email protected]>

---------

Signed-off-by: YANGDB <[email protected]>