DOC-2085 fix(dataloading): improve doc for Load from External Kafka (3.9?-4.1) #491

Open

wants to merge 6 commits into base: 4.1

Changes from 4 commits
43 changes: 6 additions & 37 deletions modules/data-loading/examples/config-avro
@@ -1,8 +1,8 @@
connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
source.cluster.alias=hello
target.cluster.alias=world
source.cluster.bootstrap.servers=source.kafka.server:9092
target.cluster.bootstrap.servers=localhost:30002
source.cluster.bootstrap.servers=<source.broker1:port,source.broker2:port,...>
target.cluster.bootstrap.servers=<local.broker1:port,local.broker2:port,...>
source->target.enabled=true
topics=avro-without-registry-topic
replication.factor=1
@@ -18,41 +18,10 @@ emit.heartbeats.interval.seconds=5
world.scheduled.rebalance.max.delay.ms=35000
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
header.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=com.tigergraph.kafka.connect.converters.TigerGraphAvroConverterWithoutSchemaRegistry

producer.security.protocol=SASL_SSL
producer.sasl.mechanism=GSSAPI
producer.sasl.kerberos.service.name=kafka
producer.sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true storeKey=true keyTab=\"/path/to/kafka-producer.keytab\" principal=\"[email protected]\";
producer.ssl.endpoint.identification.algorithm=
producer.ssl.keystore.location=/path/to/client.keystore.jks
producer.ssl.keystore.password=******
producer.ssl.key.password=******
producer.ssl.truststore.location=/path/to/client.truststore.jks
producer.ssl.truststore.password=******

consumer.security.protocol=SASL_SSL
consumer.sasl.mechanism=GSSAPI
consumer.sasl.kerberos.service.name=kafka
consumer.sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true storeKey=true keyTab=\"/path/to/kafka-consumer.keytab\" principal=\"[email protected]\";
consumer.ssl.endpoint.identification.algorithm=
consumer.ssl.keystore.location=/path/to/client.keystore.jks
consumer.ssl.keystore.password=******
consumer.ssl.key.password=******
consumer.ssl.truststore.location=/path/to/client.truststore.jks
consumer.ssl.truststore.password=******

source.admin.security.protocol=SASL_SSL
source.admin.sasl.mechanism=GSSAPI
source.admin.sasl.kerberos.service.name=kafka
source.admin.sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true storeKey=true keyTab=\"/path/to/kafka-admin.keytab\" principal=\"[email protected]\";
source.admin.ssl.endpoint.identification.algorithm=
source.admin.ssl.keystore.location=/path/to/client.keystore.jks
source.admin.ssl.keystore.password=******
source.admin.ssl.key.password=******
source.admin.ssl.truststore.location=/path/to/client.truststore.jks
source.admin.ssl.truststore.password=******
transforms=TigerGraphAvroTransform
transforms.TigerGraphAvroTransform.type=com.tigergraph.kafka.connect.transformations.TigergraphAvroWithoutSchemaRegistryTransformation
transforms.TigerGraphAvroTransform.errors.tolerance=none

[connector_1]
name=avro-test-without-registry
tasks.max=10
tasks.max=10
4 changes: 2 additions & 2 deletions modules/data-loading/pages/data-loading-overview.adoc
@@ -38,7 +38,7 @@ TigerGraph uses the same workflow for both local file and Kafka Connect loading:
. *Specify a graph*.
Data is always loaded into exactly one graph (though that graph could have global vertices and edges which are shared with other graphs). For example:
+
[source,php]
[source,gsql]
USE GRAPH ldbc_snb

. If you are using Kafka Connect, *define a `DATA_SOURCE` object*.
@@ -64,7 +64,7 @@ image::data-loading:loading_arch_3.9.3.png[Architectural diagram showing support
== Loading Jobs
A loading job tells the database how to construct vertices and edges from data sources.

[source,php]
[source,gsql]
.CREATE LOADING JOB syntax
----
CREATE LOADING JOB <job_name> FOR GRAPH <graph_name> {
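To make the truncated syntax concrete, here is a minimal sketch of a complete loading job. The job name, file path, and attribute mapping are illustrative assumptions, not part of this diff:

[source,gsql]
----
USE GRAPH ldbc_snb
CREATE LOADING JOB load_person FOR GRAPH ldbc_snb {
  // Declare a filevar first, then map its columns in a LOAD statement.
  DEFINE FILENAME file_Person = "/data/person.csv";
  LOAD file_Person TO VERTEX Person
      VALUES ($0, $1, $2) USING header="true", separator="|";
}
----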
53 changes: 2 additions & 51 deletions modules/data-loading/partials/kafka/kafka-data-source-details.adoc
@@ -13,61 +13,12 @@ To configure the data source object, the minimum requirement is the address of t
.Data source configuration for external Kafka
----
{
"type": "mirrormaker",
"source.cluster.bootstrap.servers": "<broker_addrs>"
"type": "mirrormaker",
"source.cluster.bootstrap.servers": "<broker_addrs>"
}
----

If the source cluster is configured for SSL or SASL protocols, you must provide the following SSL/SASL credentials to communicate with it.

* If the source cluster uses SASL, you need to upload the keytab of each Kerberos principal to every node of your TigerGraph cluster at the same absolute path.
* If the source cluster uses SSL, see our documentation xref:tigergraph-server:data-loading:kafka-ssl-security-guide.adoc[]
Contributor:

https://graphsql.atlassian.net/browse/DOC-1972?focusedCommentId=152682 see comment for kafka-ssl-security-guide.adoc.

  • SSL for internal Kafka
  • SSL for CRR
  • SSL for loading from external Kafka

They should belong to three different categories (under three different pages) instead of being put together.

Author @wuqingjuntg commented on Aug 12, 2024:

  • Let's clarify here. We're not going to cover SSL for CRR here, right?
  • For internal and external Kafka, I don't see a major difference. The "internal" Kafka is set up for testing loading from a regular Kafka cluster, including external Kafka. Let me know if there's any understanding gap. We can have a short discussion in Tuesday's meeting and get consensus about this with the team. (At the moment I still have a couple of JIRA tickets to handle, which also have high priority.)
  • I am removing the SASL and SASL/SSL parameters because without SSL, SASL is not secure: SASL is an authentication method, but without encryption the connection is not protected. So we should simply recommend that users enable regular SSL to secure the connection between Kafka clusters. Users don't need both SASL and SSL; SSL is enough. And we don't officially support SASL yet.
  • My suggestion is that, for all the configurations/settings in the doc, we should already support the feature and have test cases covering those configs.

cc @arunramasami, FYI. Any comments regarding this DOC change I am making?

* If the source cluster uses SASL *and* SSL, you need to upload the keytab of each Kerberos principal, as well as the key store and truststore to every node of your TigerGraph cluster.
Each file must be at the same absolute path on all nodes.

The following configurations are required for admin, producer, and consumer. To supply the configuration for the corresponding component, replace `<prefix>` with `source.admin`, `producer`, or `consumer`.
Contributor:

Why remove them?

For example, to specify `GSSAPI` as the SASL mechanism for the consumer, include `"consumer.sasl.mechanism": "GSSAPI"` in the data source configuration.

[%header,cols="1,2"]
|===
| Field | Description

| <prefix>.security.protocol
| Protocol used to communicate with brokers.
Valid values are: `PLAINTEXT`, `SSL`, `SASL_PLAINTEXT`, `SASL_SSL`.
The default is `PLAINTEXT`.

| <prefix>.sasl.mechanism
| SASL mechanism used for client connections.
This may be any mechanism for which a security provider is available. GSSAPI is the default mechanism.

| <prefix>.sasl.kerberos.service.name
| The Kerberos principal name used by your Kafka brokers.
This could be defined in either JAAS configuration or Kafka’s configuration.

| <prefix>.sasl.jaas.config
| JAAS login context parameters for SASL connections in the format used by JAAS configuration files.
See https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/LoginConfigFile.html[JAAS Login Configuration File] for details.

| <prefix>.ssl.endpoint.identification.algorithm
| The endpoint identification algorithm used to validate server hostname in the server certificate. Default is `https`.
If the value is set to an empty string, this will disable server host name verification.

| <prefix>.ssl.keystore.location
| The location of the key store file.

| <prefix>.ssl.keystore.password
| The password of the key store file.

| <prefix>.ssl.key.password
| The password of the private key in the key store file or the PEM key specified in `ssl.keystore.key`.

| <prefix>.ssl.truststore.location
| The location of the trust store file.

| <prefix>.ssl.truststore.password
| The password for the trust store file.
|===
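For reference, the fields described in this table combine into a data source configuration along the lines of the following sketch, shown here with the `consumer` prefix. The paths, service name, and passwords are placeholders taken from the removed example config, not working values:

[source,json]
----
{
  "type": "mirrormaker",
  "source.cluster.bootstrap.servers": "<broker_addrs>",
  "consumer.security.protocol": "SASL_SSL",
  "consumer.sasl.mechanism": "GSSAPI",
  "consumer.sasl.kerberos.service.name": "kafka",
  "consumer.ssl.keystore.location": "/path/to/client.keystore.jks",
  "consumer.ssl.keystore.password": "******",
  "consumer.ssl.truststore.location": "/path/to/client.truststore.jks",
  "consumer.ssl.truststore.password": "******"
}
----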

If there is a https://docs.confluent.io/platform/current/schema-registry/index.html[schema registry service] containing the record schema of the source topic, please add it to the data source configuration:

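The example that follows is elided in this view; it presumably adds the registry URL to the same JSON, roughly as in this sketch. The property name `value.converter.schema.registry.url` is an assumption based on the standard Kafka Connect converter setting and is not confirmed by this diff:

[source,json]
----
{
  "type": "mirrormaker",
  "source.cluster.bootstrap.servers": "<broker_addrs>",
  "value.converter.schema.registry.url": "<schema_registry_url>"
}
----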
@@ -2,8 +2,8 @@

The following is an example loading job from an external Kafka cluster.

[source,php,linenums]
.Example loading job for BigQuery
[source,gsql,linenums]
.Example loading job from external Kafka
----
USE GRAPH ldbc_snb
CREATE DATA_SOURCE s1 = "ds_config.json" FOR GRAPH ldbc_snb
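The example is truncated here; a loading job reading from the external Kafka data source might continue along these lines. The `"$s1:<topic>"` filevar form, topic name, and attribute mapping are assumptions for illustration:

[source,gsql]
----
CREATE LOADING JOB load_from_kafka FOR GRAPH ldbc_snb {
  // Bind the filevar to a topic on data source s1 (assumed syntax).
  DEFINE FILENAME f_person = "$s1:person-topic";
  LOAD f_person TO VERTEX Person VALUES ($0, $1, $2) USING separator="|";
}
----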
@@ -6,7 +6,7 @@ We will call out whether a particular step is common for all loading or specific
== Example Schema
This example uses part of the LDBC_SNB schema:

[source,php]
[source,gsql]
.Example schema taken from LDBC_SNB
----
//Vertex Types:
@@ -8,13 +8,13 @@ Inline mode is required when creating data sources for TigerGraph Cloud instance

In the following example, we create a data source named `s1`, and read its configuration information from a file called `ds_config.json`.

[source,php]
[source,gsql]
USE GRAPH ldbc_snb
CREATE DATA_SOURCE s1 = "ds_config.json" FOR GRAPH ldbc_snb
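For context, the referenced `ds_config.json` simply holds the data source JSON; a minimal external-Kafka sketch, mirroring the configuration shown earlier in this PR, would be:

[source,json]
----
{
  "type": "mirrormaker",
  "source.cluster.bootstrap.servers": "<broker_addrs>"
}
----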

Older versions of TigerGraph required a keyword after `DATA_SOURCE` such as `STREAM` or `KAFKA`.

[source,php]
[source,gsql]
.Inline JSON data format when creating a data source
CREATE DATA_SOURCE s1 = "{
type: <type>,
@@ -24,7 +24,7 @@ key: <value>
String literals can be enclosed with a double quote `"`, triple double quotes `"""`, or triple single quotes `'''`.
Double quotes `"` in the JSON can be omitted if the key name does not contain a colon `:` or comma `,`.

[source,php]
[source,gsql]
.Alternate quote syntax for inline JSON data
CREATE DATA_SOURCE s1 = """{
"type": "<type>",
@@ -8,7 +8,7 @@ These can refer to actual files or be placeholder names. The actual data sources
. LOAD statements specify how to map the data fields from files onto vertices and edges.

////
[source,php]
[source,gsql]
.CREATE LOADING JOB syntax
----
CREATE LOADING JOB <job_name> FOR GRAPH <graph_name> {
@@ -4,7 +4,7 @@ First we define _filenames_, which are local variables referring to data files (
[NOTE]
The terms `FILENAME` and `filevar` are used for legacy reasons, but a `filevar` can also be an object in a data object store.

[source,php]
[source,gsql]
.DEFINE FILENAME syntax
----
DEFINE FILENAME filevar ["=" file_descriptor ];
@@ -13,7 +13,7 @@ DEFINE FILENAME filevar ["=" file_descriptor ];
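To make the syntax concrete, a compile-time descriptor might look like this sketch; the variable name and path are illustrative:

[source,gsql]
DEFINE FILENAME file_Person = "/data/ldbc_snb/person.csv";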
The file descriptor can be specified at compile time or at runtime.
Runtime settings override compile-time settings:

[source,php]
[source,gsql]
.Specifying file descriptor at runtime
----
RUN LOADING JOB job_name USING filevar=file_descriptor_override
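Reusing the assumed names from the sketches above, a runtime override could look like:

[source,gsql]
RUN LOADING JOB load_person USING file_Person="/data/ldbc_snb/person_update.csv"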
@@ -1,7 +1,7 @@
=== Specify the data mapping
Next, we use LOAD statements to describe how the incoming data will be loaded to attributes of vertices and edges. Each LOAD statement handles the data mapping, and optional data transformation and filtering, from one filename to one or more vertex and edge types.

[source,php]
[source,gsql]
.LOAD statement syntax
----
LOAD [ source_object|filevar|TEMP_TABLE table_name ]
@@ -12,7 +12,7 @@ LOAD [ source_object|filevar|TEMP_TABLE table_name ]
<1> As of v3.9.3, TAGS are deprecated.

Let's break down one of the LOAD statements in our example:
[source,php]
[source,gsql]
.Example loading job for local files
----
LOAD file_Person TO VERTEX Person
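The statement is cut off in this view; its continuation presumably maps token positions to vertex attributes, roughly as in this sketch. The column order, token function, and filter are illustrative assumptions:

[source,gsql]
----
LOAD file_Person TO VERTEX Person
    VALUES ($1, $2, gsql_trim($0))
    WHERE $1 != "NULL"
    USING header="true", separator="|";
----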
@@ -3,7 +3,7 @@
When a loading job starts, the GSQL server assigns it a job ID and displays it for the user to see.
There are three key commands to monitor and manage loading jobs:

[source,php]
[source,gsql]
----
SHOW LOADING STATUS job_id|ALL
ABORT LOADING JOB job_id|ALL
@@ -12,7 +12,7 @@ RESUME LOADING JOB job_id

`SHOW LOADING STATUS` shows the current status of either a specified loading job or all current jobs. This command must be run within the scope of a graph:

[source,php]
[source,gsql]
GSQL > USE GRAPH graph_name
GSQL > SHOW LOADING STATUS ALL
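`ABORT` and `RESUME` take the job ID reported when the job started. A hedged usage sketch; the job ID shown is a made-up placeholder in the usual `graph.job.file.machine.timestamp` shape:

[source,gsql]
GSQL > ABORT LOADING JOB ldbc_snb.load_person.file.m1.1718852321234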
