[DOC-2333] Data Out to S3 enhancement (3.11, 4.2) #181

Open
wants to merge 3 commits into base: 3.11
15 changes: 14 additions & 1 deletion modules/querying/pages/data-types.adoc
@@ -458,11 +458,24 @@ A `FILE` object is a sequential data storage object, associated with a text file
When referring to a `FILE` object, we always capitalize the word `FILE` to distinguish it from ordinary files.
====

=== Local disk file
When a `FILE` object is declared and associated with a particular text file, any existing content in that text file is erased.
During the execution of the query, content written to the `FILE` object is appended to it.
When the query in which the `FILE` was declared finishes running, the `FILE` contents are saved to the text file.

A `FILE` object can be passed as a parameter to another query. When a query receives a `FILE` object as a parameter, it can append data to that `FILE`, as can every other query that receives the same `FILE` object as a parameter, as shown in the sketch below.
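For illustration, a minimal sketch of one query appending to a `FILE` object it receives as a parameter (the graph `Social`, vertex type `Person`, and attribute `name` are hypothetical):

[source,gsql]
----
// Sub-query that appends rows to a FILE object received as a parameter.
CREATE QUERY append_names(FILE f) FOR GRAPH Social {
  start = {Person.*};
  s = SELECT p FROM start:p
      POST-ACCUM f.println(p.name);
}

CREATE QUERY export_names(STRING file_path) FOR GRAPH Social {
  FILE f (file_path);   // any existing content of the text file is erased here
  f.println("name");    // header row
  append_names(f);      // the sub-query appends to the same FILE object
}
----
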
=== S3 object
[#_define_s3_file_object]
==== Define an S3 FILE object
The path must start with `s3://`, followed by the bucket name and the object key, e.g., `s3://bucket-name/queryoutput/output.csv`. During query execution, the content written to the `FILE` object is uploaded to the S3 bucket. Note that an S3 object cannot be modified or appended to; if an S3 object with the same path already exists, it is overwritten.
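For example, a minimal sketch of a query that writes its output directly to S3 (the bucket name, graph `Social`, vertex type `Person`, and attribute `name` are illustrative):

[source,gsql]
----
CREATE QUERY export_to_s3() FOR GRAPH Social {
  // Bind the FILE object to an S3 path instead of a local file path.
  FILE f ("s3://bucket-name/queryoutput/output.csv");
  start = {Person.*};
  s = SELECT p FROM start:p
      POST-ACCUM f.println(p.name);
  // When the query finishes, the content is uploaded as a single S3 object;
  // any existing object at this path is overwritten.
}
----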

[NOTE]
====
A `FILE` object can be passed as a parameter to another query.
For a `FILE` object on the local disk, the receiving query can append data to that `FILE`, as can every other query that receives the same `FILE` object as a parameter.
However, an S3 `FILE` object cannot be appended to: when you write to an S3 path, any existing object is overwritten.
====


== Query parameter types

36 changes: 36 additions & 0 deletions modules/querying/pages/write-query-data-to-cloud.adoc
@@ -0,0 +1,36 @@
= Write Query Output to Cloud

[#_write_query_data_to_s3]
== Write Query Output to S3 File Object
To output query results to S3, connection credentials must be set first.
S3 connection credentials can be set at the cluster level using `gadmin config`, or per user and per session using GSQL session parameters, which supports multi-user, multi-session use.

To use an S3 path, ensure that the necessary permissions and configurations are properly set up to allow read/write access to the specified S3 bucket.

To see how to define an S3 `FILE` object, refer to xref:data-types.adoc#_define_s3_file_object[Define an S3 FILE object].

[#_step_1]
==== Configure S3 connection credentials using gadmin config
[source,bash]
----
gadmin config set GPE.QueryOutputS3AWSAccessKeyID YOUR_AWS_ACCESS_KEY_ID
gadmin config set GPE.QueryOutputS3AWSSecretAccessKey YOUR_AWS_SECRET_ACCESS_KEY
# Apply the new configuration and restart GPE so it takes effect
gadmin config apply -y
gadmin restart gpe -y
----

==== Configure S3 connection credentials using GSQL session parameters
[source,gsql]
----
set s3_aws_access_key_id = "YOUR_AWS_ACCESS_KEY_ID"
set s3_aws_secret_access_key = "YOUR_AWS_SECRET_ACCESS_KEY"
----
The connection credentials are configured only for the current user and the current session. After these session parameters are set, all existing and future queries in the session use these credentials to write query output to S3 `FILE` objects.
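For example, a GSQL session might look like the following, assuming a hypothetical query `export_to_s3` that writes to an S3 `FILE` object (as sketched on the data types page) has already been installed:

[source,gsql]
----
GSQL > set s3_aws_access_key_id = "YOUR_AWS_ACCESS_KEY_ID"
GSQL > set s3_aws_secret_access_key = "YOUR_AWS_SECRET_ACCESS_KEY"
GSQL > RUN QUERY export_to_s3()
----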

==== Output
Since S3 is a shared storage system, multiple nodes in a cluster can upload to the same S3 bucket.

To handle potential conflicts and ensure unique output files, the S3 path can include a prefix based on the instance name, such as `\GPE_{PartitionId}_{ReplicaId}`. For distributed queries, additional suffixes are used to differentiate between the manager and worker roles on the same GPE: the suffix `_coordinator` identifies the worker manager, and `_worker` identifies a worker node.

The prefixes for files written by the worker manager node and a worker node are therefore `\GPE_{PartitionId}_{ReplicaId}_coordinator` and `\GPE_{PartitionId}_{ReplicaId}_worker`, respectively.
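For illustration only (the exact object key layout may differ), a distributed query on a two-partition, single-replica cluster writing to `s3://bucket-name/queryoutput/output.csv` might produce objects such as:

----
s3://bucket-name/queryoutput/GPE_1_1_coordinator_output.csv
s3://bucket-name/queryoutput/GPE_2_1_worker_output.csv
----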

==== Error code
For S3 bucket connection errors, refer to error code `GSQL-5301`.