This issue was moved to a discussion.
How to get attribute-level lineage "across" Spark jobs through Spline-UI/AQL? #1088
Comments
From what I understand, you are looking for attribute-level lineage. Spline collects all the information (#114) required to build attribute-level lineage, and already shows the path within the boundaries of a single execution plan (go to the execution plan detail and click on any attribute in the operation details pane). However, the AQL query, API and UI for building attribute-level lineage across jobs have yet to be implemented. The discussion #937 might give you a few hints on how it could be achieved. Basically, every operation has an output schema that consists of attributes. Each attribute has a reference to the expression from which it was created, or to another attribute. There is no link between attributes from different execution plans. So, to build an end-to-end attribute-level lineage you can reuse the same query that builds the high-level lineage, build a partial attribute-level graph for every visited execution plan, and then connect the input attributes of one graph with the output attributes of another at a later stage of the traversal.
No, Spark doesn't provide this to the listener, nor does the Spline project aim to capture actual data.
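The stitching idea described above can be sketched in plain Python. This is only an illustration under assumed data shapes: the `Plan` class, its fields (`reads`, `writes`, `derived_from`), the example URIs, and the matching of attributes by name across plans are all hypothetical and are not Spline's data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical, simplified view of one execution plan."""
    plan_id: str
    reads: list   # data source URIs the plan reads from
    writes: str   # data source URI the plan writes to
    # output attribute name -> input attribute names it was derived from
    derived_from: dict = field(default_factory=dict)


def trace(plans, target_uri, attr_name):
    """Walk backwards from (target_uri, attr_name) across plans.

    Returns (plan_id, output_attr, input_attrs) tuples, newest plan first.
    """
    by_output = {p.writes: p for p in plans}
    steps, seen = [], set()
    frontier = [(target_uri, attr_name)]
    while frontier:
        uri, name = frontier.pop()
        if (uri, name) in seen:
            continue
        seen.add((uri, name))
        plan = by_output.get(uri)
        if plan is None:
            continue  # reached a source that no tracked plan produced
        inputs = plan.derived_from.get(name, [])
        steps.append((plan.plan_id, name, list(inputs)))
        # Assumption: an input attribute is matched to the upstream
        # plan's output attribute by name.
        for src in plan.reads:
            for inp in inputs:
                frontier.append((src, inp))
    return steps


plans = [
    Plan("job-1", ["hdfs:///in"], "hdfs:///a", {"c3": ["c3"]}),
    Plan("job-2", ["hdfs:///a"], "hdfs:///b", {"c3": ["c3"]}),
]
lineage = trace(plans, "hdfs:///b", "c3")
```

In a real setup the per-plan `derived_from` mapping would have to be built from the attribute and expression references stored for each execution plan, and matching by name is a simplification that may not hold when columns are renamed between jobs.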
Thanks @wajda for the pointers. I've an input dataset consisting of 3 columns, say c1 (string), c2 (string) and c3 (int). For above lineage, following are the Ids of my
and following are the Ids of my
I'm now in process of building up an attribute level lineage wherein user will provide the
Which gives me result as:
[
{
"lineage": [
{
"parent_attribute_id": "f3c32a85-ae94-5978-9611-187a7440c694:attr-2",
"parent_attribute_name": "c3",
"from_operation_id": "f3c32a85-ae94-5978-9611-187a7440c694:op-4"
},
{
"intake_attribute_id": "f3c32a85-ae94-5978-9611-187a7440c694:attr-4",
"intake_attribute_name": "c3",
"from_operation_id": "f3c32a85-ae94-5978-9611-187a7440c694:op-1"
}
]
}
]
So here my starting point is I now need to do some addition to this query to capture the expressions/attributes that were used in each operation of origin in the lineage pipeline. I understand that we have the
[
{
"lineage": [
{
"parent_attribute_id": "f3c32a85-ae94-5978-9611-187a7440c694:attr-2",
"parent_attribute_name": "c3",
"from_operation_id": "f3c32a85-ae94-5978-9611-187a7440c694:op-4",
"derived_using":""
},
{
"intake_attribute_id": "f3c32a85-ae94-5978-9611-187a7440c694:attr-4",
"intake_attribute_name": "c3",
"from_operation_id": "f3c32a85-ae94-5978-9611-187a7440c694:op-1",
"derived_using":""
}
]
}
]
In the above, I need to fill in the "derived_using" property, and I need to capture a value similar to the format that SplineUI uses when showing the lambda against a column (so if
Any help/pointers on enriching the above AQL will be highly appreciated. Thanks!
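As a rough illustration of what filling in "derived_using" could involve, the sketch below renders an expression tree as a readable string. The node shape used here (`type`, `name`, `value`, `symbol`, `children`) is entirely hypothetical and is not the structure Spline stores; it only shows the recursive rendering idea.

```python
def render(node):
    """Render a hypothetical expression tree as a readable string."""
    kind = node["type"]
    if kind == "attr":
        # A plain attribute reference, i.e. a pass-through.
        return node["name"]
    if kind == "literal":
        return repr(node["value"])
    if kind == "binary":
        left, right = (render(child) for child in node["children"])
        return f"({left} {node['symbol']} {right})"
    raise ValueError(f"unknown node type: {kind}")


# Hypothetical expression: c3 multiplied by a literal 3.
expr = {
    "type": "binary",
    "symbol": "*",
    "children": [
        {"type": "attr", "name": "c3"},
        {"type": "literal", "value": 3},
    ],
}
derived_using = render(expr)
```

For a pass-through attribute the same function would simply return the attribute name, which gives a uniform value to place in "derived_using".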
@adcb6gt, I'm sorry for the late response. According to your question:
In the Spline graph model, there is an edge
Regarding the
I've an application that sources data from upstream, processes it and then passes it to downstream.
The application does this using an in-house concept called "Workflows".
Workflows are nothing but a set of activities (that can be sequential or parallel). There is a start activity, then set of transformation activities (which can be sequential or parallel) and then an end activity.
Each activity behind the scene translates to a spark-submit job.
As an example, one of our simple workflows would look like:
Source a file --> Select few columns from file --> Apply a filter condition --> Modify the value of a column based on a condition --> Save the modified file
Each of the activities mentioned above is submitted as a PySpark job. At each step we create a dataframe and then persist it to HDFS.
We then had a requirement to get data lineage.
To achieve this objective, I explored the Spline framework and incorporated it in my application.
We placed the spark-spline-agent bundle on the spark driver (my --deploy-mode is client and --master is yarn) (placed the jar under the jars folder of spark-home).
We also brought up the spark-rest-gateway and spline-ui along with arango db (used the spline 0.7 version).
On submitting our spark-jobs, we initialized Spline via code by calling enableLineageTracking on SparkLineageInitializer (for some reason codeless initialization via --packages didn't work).
I can see the spark-jobs listed on Spline-UI under the ExecutionEvents screen (one record per spark-job; so for the above example, we have 5 spark-jobs listed on Spline-UI).
On clicking on the Execution Plan hyperlink against each job, I can see the lineage that Spline-UI brings through ArangoDB.
Also in ArangoDB, I can see the vertices and edges in graph-view.
However, what I'm looking for is:
We now have a requirement to figure out data lineage "across" the spark jobs we submit.
For above example, I need some way where I provide "col2" as input and it gives me a lineage that should logically translate as:
col2 (Activity-1) -> col2 (Activity-2) -> col2 (Activity-3) -> col2*3 (Activity-4) -> col2*3 (Activity-5)
So if my col2 value at source is, say, value 1, then lineage should be:
1 -> 1 -> 1 -> 3 -> 3
Also, against each value, I need to know the activity applied: whether it was a pass-through (i.e., no transformation applied) or whether there was a transformation on it (in the above example, at activity 4, the column value was multiplied by 3).
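To make the expected output concrete, here is a small plain-Python sketch of the trail described above. It only illustrates the desired result shape and does not query Spline or ArangoDB; the activity list and the `trace_value` helper are hypothetical.

```python
# Hypothetical activities: (name, function applied to col2, or None for pass-through)
activities = [
    ("Activity-1", None),
    ("Activity-2", None),
    ("Activity-3", None),
    ("Activity-4", lambda v: v * 3),  # col2 is multiplied by 3 here
    ("Activity-5", None),
]


def trace_value(value, steps):
    """Apply each step in order and record what happened to the value."""
    trail = []
    for name, fn in steps:
        if fn is not None:
            value = fn(value)
        kind = "transformation" if fn is not None else "pass-through"
        trail.append((name, kind, value))
    return trail


trail = trace_value(1, activities)
print(" -> ".join(str(v) for _, _, v in trail))  # prints "1 -> 1 -> 1 -> 3 -> 3"
```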
Is there a way I can achieve it by using AQL ?
I understand every spark job will get created as a Document in the document Collection (executionPlan), but if I have to get the lineage of a specific attribute across these collections, how can I do it? In other words, how can I get the lineage of an attribute across different spark jobs (i.e., across different execution IDs)?
Also can I capture the runtime value of the attribute as it goes through different transformations ?
I read through ArangoDB's traverse path graph but I'm not able to work out how I can achieve it via AQL (or through any other approach in Spline).
Any pointers on this will be highly appreciated.
Thanks !