Attribute lineage #114
Comments
Hi @wajda,
What we want is the ability to say which output dataset columns depend on which columns of the input datasets (and also to know if that's [...]).
What's the state of attribute / per-column lineage in Spline? It seems that it's not supported by design? In Spline, some operations have [...]. Are there any plans to add this?
P.S. There might be a hacky way to parse existing Spline operations to obtain limited column lineage. However, it doesn't seem very reliable, and it seems it might not catch all column lineage:
data lineage: operations like [...]
control lineage: operations like [...]
It would be much better if each operation clearly defined which output attributes depend on which input data...
Hi @vidma,
I would like to hear more about how it's implemented (maybe you could point me to the UI code that finds the attribute lineage) and how limited it is. I guess that if the query had multiple columns of the same name (e.g. coming from two different datasets which are joined), or complex expressions with unnamed "attributes" in them, it might sometimes fail?
Sorry, I was wrong to say that it's based on attribute names (it used to be in earlier versions). In 0.3.6 it is actually based on attribute IDs, just like in Spark. Basically, Spline simply takes Spark attributes and converts them to Spline ones one by one, and does the same for operations. So if some operations share the same attribute, the Spline UI will simply highlight those operations.
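To illustrate the ID-based matching described above, here is a minimal Python sketch. It is not Spline's code: the `Attribute` and `Operation` classes and the `operations_sharing` function are hypothetical names made up for this example, and the plan is invented. It only shows why matching on IDs, rather than names, keeps two joined columns that are both called `id` apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    attr_id: int  # unique ID (assumption: plays the role of Spark's attribute ID)
    name: str     # display name, not necessarily unique


@dataclass(frozen=True)
class Operation:
    name: str
    attributes: tuple  # attributes present in this operation's output schema


def operations_sharing(attr_id, operations):
    """Return the names of operations whose schema contains the given attribute ID."""
    return [
        op.name
        for op in operations
        if any(a.attr_id == attr_id for a in op.attributes)
    ]


# Two input datasets both expose a column named "id", but with distinct IDs.
users_id = Attribute(1, "id")
orders_id = Attribute(2, "id")
total = Attribute(3, "total")

plan = [
    Operation("read users", (users_id,)),
    Operation("read orders", (orders_id, total)),
    Operation("join", (users_id, orders_id, total)),
    Operation("project", (users_id, total)),
]

# Selecting the users' "id" highlights only the operations that carry ID 1.
highlighted = operations_sharing(users_id.attr_id, plan)
print(highlighted)  # ['read users', 'join', 'project']
```

Note that this only tells you where one attribute travels unchanged. It says nothing about a derived attribute that was computed from it under a new ID.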
It would be more interesting to highlight end-to-end lineage from input dataset attributes to output dataset attributes (which may be multiple complex operations apart). As I understand it, this is not available yet because of the issues mentioned earlier?
No, that's not available yet, but we have a plan to eventually get there.
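For readers wondering what such end-to-end attribute lineage could look like, here is a rough Python sketch. It assumes that every operation declares, for each attribute it derives, the set of attributes it was computed from. That is the information requested earlier in this thread, not something Spline is confirmed to expose. The `end_to_end_lineage` function, the dict-based operation format and the column names are all hypothetical.

```python
def end_to_end_lineage(operations, output_attrs):
    """Map each output attribute to the set of source attributes it derives from.

    `operations` is a list of dicts {derived_attr: set_of_input_attrs}, given in
    execution order (hypothetical format). An attribute that no operation
    derives is treated as a source attribute.
    """
    deps = {}
    for op in operations:
        for derived, inputs in op.items():
            resolved = set()
            for attr in inputs:
                # Earlier operations are already resolved down to sources,
                # so one lookup is enough to expand an intermediate attribute.
                resolved |= deps.get(attr, {attr})
            deps[derived] = resolved
    return {attr: deps.get(attr, {attr}) for attr in output_attrs}


operations = [
    {"full_name": {"users.first", "users.last"}},  # first projection
    {"label": {"full_name", "orders.total"}},      # second projection
]

result = end_to_end_lineage(operations, ["label", "users.id"])
print(result["label"])     # the three source columns behind "label"
print(result["users.id"])  # a pass-through column depends only on itself
```

This covers data dependencies only. Control dependencies, such as the columns used in a filter or join condition, would have to be recorded separately.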
Hi, any update on this? Is per-attribute lineage supported already, or are there any plans for it?
Hi,
A few points to note from our experience at Kensu Inc. You might want to consider a few interesting special cases where part of the lineage may need to be provided semi-manually, e.g.:
Also, making things easier to customize if needed would be great: avoid private/final methods. Finally, struct field support probably needs some special care.
Almost there. There is one outstanding issue, however, that we'll be addressing in future releases - #791 - data lineage for [...]
@vidma, thanks again for your input. Please let us know if it's what you expected. I'm really interested to hear your feedback. The doc location - https://github.com/AbsaOSS/spline/tree/gh-pages/docs
As for the RDD lineage, there is a separate issue in the spark-agent repo - AbsaOSS/spline-spark-agent#33. The UDF case is worth investigating, so I created another issue for that - AbsaOSS/spline-spark-agent#181