WIP: Docker integ test with async API #1003

Open
wants to merge 2 commits into
base: main

normanj-bitquill
Contributor

Description

Update the integration test Docker stack to support the OpenSearch Async API and to use MinIO as an S3-compatible storage engine. Everything is also configured on startup.

Related Issues

#992

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@normanj-bitquill
Contributor Author

This is not complete yet. Still need to:

  1. Create a replacement for EMRServerlessClient that will use the local docker environment. This will include starting a docker container of Spark to run the query. Given that the idea is to start the container with spark-submit, it should be possible to easily change to the Spark EMR image instead.
  2. Alter the Python integration tests to use the OpenSearch Async API instead
  3. Remove the spark and spark-worker containers since Spark would be run on demand

@YANG-DB
Member

YANG-DB commented Dec 31, 2024

@normanj-bitquill I've also started working on a similar PR with an Iceberg-based docker-compose.
Let's see if we can share knowledge ...

@normanj-bitquill
Contributor Author

Current Status

In my testing, I have added a second OpenSearch container. This resolves issues with the cluster and indices going into yellow state.

Changes in OpenSearch node:

  • Bind the docker socket
  • Replace the aws-java-sdk-emrserverless-*.jar file with one that uses a different EMRClient. The new EMRClient writes a file to a directory (/tmp/docker) with the arguments for running docker.
  • A script is run on the container that reads files from /tmp/docker and runs docker with the arguments.
  • Disable system indices and system indices permissions
  • Connect with second OpenSearch container
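
The file-based hand-off between the replacement EMRClient and the runner script could look roughly like the sketch below. This is only an illustration, not the script from this PR: the function name `process_command_dir`, the one-argument-per-line file format, and the optional runner override (useful for dry runs) are all assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical sketch: run one command per file found in a directory.
# Assumption: each file holds the docker arguments, one per line.

# Usage: process_command_dir DIR [RUNNER...]
# RUNNER defaults to "docker"; pass e.g. "echo" for a dry run.
process_command_dir() {
  local dir="$1"
  shift
  local -a runner=("$@")
  if [ "${#runner[@]}" -eq 0 ]; then
    runner=(docker)
  fi

  local file line
  local -a args
  for file in "$dir"/*; do
    # Skip the unexpanded glob (empty directory) and non-regular files.
    [ -f "$file" ] || continue

    # Read the arguments, keeping spaces inside an argument intact.
    args=()
    while IFS= read -r line || [ -n "$line" ]; do
      args+=("$line")
    done < "$file"

    # Remove the request first so it is not picked up twice.
    rm -f "$file"
    "${runner[@]}" "${args[@]}"
  done
}
```

Inside the container, a loop such as `while true; do process_command_dir /tmp/docker; sleep 1; done` would keep draining the directory. Removing the file before running the command trades retry-on-failure for protection against running the same request twice.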

What works:

  • Initiate a query with the Async API on the OpenSearch container
  • Starts a new Spark docker container to call spark-submit using information from the start job request that the EMRClient received
  • New container connects to spark cluster to execute the job
  • Creates the request and result indices
  • Saves the result to the result index

What is missing:

  • Detecting that the job has finished (this is on the spark-submit container)
  • The query is for a table on an S3 data source, and it fails with "table not found"
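
One possible way to detect that the job container has finished is `docker wait`, which blocks until a container stops and prints its exit code. The sketch below is only an idea, not something this PR implements: the function name `report_job_state`, the state labels, and the injectable docker command are assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical sketch: report whether a job container finished successfully.
# "docker wait" blocks until the container stops and prints its exit code.
# The labels SUCCESS/FAILED/UNKNOWN are made up for this example.

# Usage: report_job_state CONTAINER [DOCKER_CMD...]
# DOCKER_CMD defaults to "docker"; a stub can be passed for testing.
report_job_state() {
  local container="$1"
  shift
  local -a docker_cmd=("$@")
  if [ "${#docker_cmd[@]}" -eq 0 ]; then
    docker_cmd=(docker)
  fi

  local exit_code
  if ! exit_code="$("${docker_cmd[@]}" wait "$container")"; then
    # The wait itself failed, e.g. the container does not exist.
    echo "UNKNOWN"
    return 2
  fi

  if [ "$exit_code" = "0" ]; then
    echo "SUCCESS"
  else
    echo "FAILED"
    return 1
  fi
}
```

A replacement EMRClient could map such a result onto whatever job states it needs to return; that mapping is left out here.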

Can submit an async query and the result is written to the result index.

Need to create the external table in Spark before submitting the query

Signed-off-by: Norman Jordan <[email protected]>
@normanj-bitquill
Contributor Author

@YANG-DB I have a partially working Async API in the latest commit. These are my testing steps:

  1. Start the cluster using docker compose up
  2. Run spark-shell on the master spark container.
    1. Create an external table with a location under s3a://integ-test/
    2. Put some data in the table
  3. Submit a query:
    curl -u '...' -X POST -H 'Content-Type: application/json' -d '{"datasource": "mys3", "lang": "sql", "query": "SELECT * FROM mys3.default.foo"}'
    
  4. Retrieve the result (will need to wait until it is ready, maybe about 1 minute)
    curl -u '...' -X POST -H 'Content-Type: application/json' -d '{}' 'http://localhost:9200/query_execution_result_mys3/_search?pretty'
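
Since the result takes a while to appear, the wait in step 4 can be scripted with a small retry helper. This is a minimal sketch: the helper name `retry_until` and the readiness check in the comment (looking for `"_source"` in the search response) are assumptions, not part of this PR.

```shell
#!/bin/bash
# Hypothetical sketch: retry a command until it succeeds or attempts run out.

# Usage: retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
retry_until() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example use (assumption: a hit in the result index means the result is ready):
#   result_ready() {
#     curl -s -u '...' -X POST -H 'Content-Type: application/json' -d '{}' \
#       'http://localhost:9200/query_execution_result_mys3/_search' \
#       | grep -q '"_source"'
#   }
#   retry_until 12 10 result_ready && echo "result is available"
```

With 12 attempts and a 10 second delay, the example waits for roughly two minutes before giving up, which leaves some margin over the one minute observed above.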
    

The OpenSearch container will need to bind the docker socket /var/run/docker.sock. By default this isn't available on Mac, but can be enabled.
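
A quick check that the socket is actually usable can save some debugging. The sketch below is an assumption about what such a check might look like; the function name `check_docker_socket` and its return codes are invented for the example.

```shell
#!/bin/bash
# Hypothetical sketch: verify that the docker socket is present and usable.

# Usage: check_docker_socket [PATH]
# Returns 0 if PATH is a readable and writable socket,
# 1 if it is not a socket, 2 if permissions are insufficient.
check_docker_socket() {
  local sock="${1:-/var/run/docker.sock}"
  if [ ! -S "$sock" ]; then
    echo "not a socket: $sock" >&2
    return 1
  fi
  if [ ! -r "$sock" ] || [ ! -w "$sock" ]; then
    echo "socket not readable/writable by $(id -un): $sock" >&2
    return 2
  fi
  return 0
}
```

Run on the host, `check_docker_socket` with no argument tests the default path; inside the OpenSearch container the same call shows whether the bind mount took effect.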

The OpenSearch container will start another container to process the async query. This is where we could slot in the EMR Spark container (if there is any value in doing so).

@normanj-bitquill
Contributor Author

I have not tested retrieving results using the Async API. This is likely broken, since the EMR job status cannot be checked yet. I also haven't tested a streaming query (also likely broken).

@@ -8,29 +23,35 @@ services:
- "${UI_PORT:-4040}:4040"
- "${SPARK_CONNECT_PORT}:15002"
entrypoint: /opt/bitnami/scripts/spark/master-entrypoint.sh
user: root
Member

@normanj-bitquill Why is this mandatory?

Member

Can you please explain what issue we need to solve here?

Member

missing license header

@@ -0,0 +1,29 @@
package com.amazonaws.services.emrserverless;
Member

missing license header

Member

missing license header

Member

missing license header

Member

missing license header

Member

missing license header


su opensearch ./opensearch-docker-entrypoint.sh "$@"

kill -TERM `cat /var/run/docker-command-runner.pid`
Member

Why is this called?
Can you please explain the entire flow for the docker compose?

@@ -0,0 +1,88 @@
#!/bin/bash
Member

This is nice. Can you add some documentation of what is done here?

Labels
testing test related feature

2 participants