
docs: Update python client's transports + some small fixes #278

Merged 1 commit on Feb 8, 2024
55 changes: 47 additions & 8 deletions docs/client/python.md
@@ -56,11 +56,32 @@ The list of available environment variables can be found [here](../development/d
##### HTTP
- type - string (required)
- url - string (required)
- endpoint - string specifying the endpoint to which events are sent. Default: `api/v1/lineage`. (optional)
- timeout - float specifying a timeout value when sending an event. Default: 5 seconds. (optional)
- verify - boolean specifying whether the client should verify TLS certificates from the backend. Default: true. (optional)
- auth - dictionary specifying authentication options. Requires the type property. (optional)
  - type - string specifying "api_key" or the fully qualified class name of your custom TokenProvider (a sketch of such a provider follows the example below). (required if `auth` is provided)
  - api_key - string setting the Authorization HTTP header as the Bearer token. (required if `type` is `api_key`)

Example:
```
transport:
  type: http
  url: https://backend:5000
  endpoint: events/receive
  auth:
    type: api_key
    api_key: f048521b-dfe8-47cd-9c65-0cb07d57591e
```
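
If the built-in `api_key` option isn't enough, the `type` field can instead name a custom token provider. Below is a minimal sketch of what such a class might look like; the `TokenProvider` base class, its module path, the dict-based constructor, and the `get_bearer()` hook are assumptions based on the client's HTTP transport module, so verify them against your installed client version.

```
import os

from openlineage.client.transport.http import TokenProvider


class EnvTokenProvider(TokenProvider):
    """Hypothetical provider that reads a bearer token from an environment variable."""

    def __init__(self, config: dict):
        # The auth section of the YAML config is expected to be passed in as a dict.
        self.var_name = config.get("var_name", "OL_API_TOKEN")

    def get_bearer(self):
        # The returned value is used as the Authorization header.
        return f"Bearer {os.environ.get(self.var_name, '')}"
```

Such a provider would then be referenced by its fully qualified name under `auth`, e.g. `type: my_package.auth.EnvTokenProvider`.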

##### Console
- type - string (required)

Example:
```
transport:
  type: console
```

##### Kafka

@@ -70,20 +91,38 @@ It can also be installed by specifying the kafka client extension: `pip install open
- type - string (required)
- config - dictionary containing Kafka producer configuration (required)
- topic - string specifying the topic (required)
- flush - boolean specifying whether Kafka should flush after each event. Default: true. (optional)

There's a caveat for using `KafkaTransport` with the Airflow integration: a Kafka producer needs to be created
for each OpenLineage event.
This is due to Airflow's execution and plugin model, which requires messages to be sent from worker processes,
and these are created dynamically for each task execution.


Example:
```
transport:
  type: kafka
  config:
    bootstrap.servers: mybroker
    acks: all
    retries: 3
  topic: my_topic
  flush: true
```

##### File

- type - string (required)
- log_file_path - string specifying the path of the file (if `append` is true, a file path is expected; otherwise, a file prefix is expected). (required)
- append - boolean. If set to true, each event is appended to a single file (`log_file_path`); otherwise, each event is written to a separate file whose name is `log_file_path` suffixed with a timestamp. Default: false. (optional)

Example:
```
transport:
  type: file
  log_file_path: ol_events_
  append: false
```

### Custom Transport Type

To implement a custom transport, follow the instructions in [`transport.py`](https://github.com/OpenLineage/OpenLineage/blob/main/client/python/openlineage/client/transport/transport.py).
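
As a rough illustration of the shape described there, a custom transport usually pairs a `Config` subclass with a `Transport` subclass that implements `emit`. The base-class names, the `from_dict` hook, and the `kind` attribute below are assumptions drawn from `transport.py`, so check them against your client version before relying on this sketch.

```
import logging

from openlineage.client.transport.transport import Config, Transport


class LoggingConfig(Config):
    def __init__(self, level: str = "INFO"):
        self.level = level

    @classmethod
    def from_dict(cls, params: dict) -> "LoggingConfig":
        # Called with the remaining keys of the `transport` section from openlineage.yml.
        return cls(level=params.get("level", "INFO"))


class LoggingTransport(Transport):
    """Hypothetical transport that writes each event to a Python logger instead of a backend."""

    kind = "logging"  # the value users would put under `transport.type`

    def __init__(self, config: LoggingConfig):
        self.logger = logging.getLogger("openlineage.logging_transport")
        self.logger.setLevel(config.level)

    def emit(self, event):
        # repr() keeps the sketch self-contained; real transports serialize the event to JSON.
        self.logger.info("OpenLineage event: %s", event)
```

How the transport and its config class get registered with the factory is covered in `transport.py` itself, so follow the instructions there for the wiring details.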
@@ -135,9 +174,9 @@ In the python_scripts directory, create a Python script (we used the name `gener
In `openlineage.yml`, define a transport type and URL to tell OpenLineage where and how to send metadata:

```
transport:
  type: http
  url: http://localhost:5000
```

In `generate_events.py`, import the Python client and the methods needed to create a job and datasets. You will also need the `datetime` and `uuid` packages to create a run.
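
A minimal sketch of what that setup might look like follows. The import paths and the argument-free `OpenLineageClient()` constructor (which is expected to pick up `openlineage.yml`) match recent client versions but are worth verifying against yours; the namespaces and dataset names are made up for illustration.

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# Reads transport settings from openlineage.yml (or environment variables).
client = OpenLineageClient()

producer = "https://example.com/my-pipeline"  # hypothetical producer URI
job = Job(namespace="my-namespace", name="example_job")
run = Run(runId=str(uuid4()))
inventory = Dataset(namespace="food_delivery", name="public.inventory")

# Emit a START event that declares the run, the job, and its input dataset.
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer=producer,
        inputs=[inventory],
        outputs=[],
    )
)
```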
3 changes: 2 additions & 1 deletion docs/development/developing/python/tests/airflow.md
@@ -8,7 +8,8 @@ OpenLineage provides an integration with Apache Airflow. As Airflow is actively
* 2.2.4
* 2.3.4 (TaskListener API introduced)
* 2.4.3
* 2.5.2
* 2.6.1

### Unit tests
To make running unit tests against multiple Airflow versions easier, you can use [tox](https://tox.wiki/).
6 changes: 3 additions & 3 deletions docs/integrations/about.md
@@ -10,9 +10,9 @@ sidebar_position: 1
This matrix is not yet complete.
:::

The matrix below shows the relationship between an input facet and various mechanisms OpenLineage uses to gather metadata. Not all mechanisms collect data to fill in all facets, and some facets are specific to one integration.
✔️: The mechanism does implement this facet.
✖️: The mechanism does not implement this facet.
An empty column means it is not yet documented if the mechanism implements this facet.

| Mechanism | Integration | Metadata Gathered | InputDatasetFacet | OutputDatasetFacet | SqlJobFacet | SchemaDatasetFacet | DataSourceDatasetFacet | DataQualityMetricsInputDatasetFacet | DataQualityAssertionsDatasetFacet | SourceCodeJobFacet | ExternalQueryRunFacet | DocumentationDatasetFacet | SourceCodeLocationJobFacet | DocumentationJobFacet | ParentRunFacet |
8 changes: 5 additions & 3 deletions docs/spec/run-cycle.md
@@ -12,7 +12,7 @@ Each Run State Update contains the run state (i.e., `START`) along with metadata

## Run States

There are six run states currently defined in the OpenLineage [spec](https://openlineage.io/apidocs/openapi/):

* `START` to indicate the beginning of a Job

@@ -24,7 +24,9 @@

* `FAIL` to signify that the Job has failed

* `OTHER` to send additional metadata outside the standard run cycle

We assume events describing a single run are **accumulative** and
`COMPLETE`, `ABORT` and `FAIL` are terminal events. Sending any of these terminal events
means that no other events related to this run will be emitted.
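
For example, a run that starts and then finishes successfully is reported as two events sharing the same `runId`, with `COMPLETE` closing the run. A sketch using the Python client (import paths and the argument-free constructor are assumptions to verify against your client version):

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient()
job = Job(namespace="example", name="daily_job")
run = Run(runId=str(uuid4()))  # the same runId ties both events to one run
producer = "https://example.com/my-scheduler"  # hypothetical producer URI

# START opens the run; COMPLETE is terminal, so nothing more should be sent for this runId.
for state in (RunState.START, RunState.COMPLETE):
    client.emit(
        RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run,
            job=job,
            producer=producer,
        )
    )
```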

@@ -41,6 +43,6 @@ A batch Job - e.g., an Airflow task or a dbt model - will typically be represent

![image](./run-cycle-batch.svg)

A long-running Job - e.g., a microservice or a stream - will typically be represented by a `START` event followed by a series of `RUNNING` events that report changes in the run or emit performance metrics. Occasionally, a `COMPLETE`, `ABORT`, or `FAIL` event will occur, often followed by a `START` event as the job is reinitiated.

![image](./run-cycle-stream.svg)