Support intermediate artifacts #683
Hi @PertuyF, thanks so much for creating this issue. We are indeed working on this and really appreciate your input! We will get back with more detailed design proposals soon.
Hi @PertuyF, thanks again for bringing this issue up. It looks like you were running the following code to generate your two artifacts:

```python
import lineapy  # needed for lineapy.save / lineapy.to_pipeline
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
    target=[iris.target_names[i] for i in iris.target]
)
iris_agg = df.set_index("target")
master_data = lineapy.save(iris_agg, 'master_data')

iris_clean = iris_agg.dropna().assign(test="test")
dataset = lineapy.save(iris_clean, 'dataset')
```

and you were using the following code to generate your pipeline:

```python
lineapy.to_pipeline(
    artifacts=[master_data.name, dataset.name],
    dependencies={dataset.name: {master_data.name}},
    framework='AIRFLOW',
    pipeline_name='my_great_airflow_pipeline',
    output_dir='airflow',
)
```

We totally agree that the output you are currently getting is not ideal. The reason is that LineaPy does not track the dependencies between artifacts generated within the same session, and we are working on this. There is probably more than one reasonable pipeline that can be generated from the artifact-creating code, depending on what your end goal is.
In order to generate pipelines for each scenario, LineaPy will first detect the dependencies between artifacts within the session in the background, and the user will then be able to generate pipelines for each scenario based on how they call `lineapy.to_pipeline()`. The following are some proposed solutions (not finalized yet, and your input would be very welcome!), starting with scenario 1: outputting both the `master_data` and `dataset` artifacts, with `dataset` building on `master_data` rather than recomputing it.
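For illustration only (the scenario-specific behavior is still being designed, and the pipeline names below are assumptions rather than a finalized API), different end goals might map to different `to_pipeline` calls along these lines:

```python
# Hypothetical sketch, not finalized LineaPy behavior.
# Assumes `import lineapy` and the master_data / dataset artifacts created above.

# End goal A: keep both artifacts as pipeline outputs,
# with dataset reusing master_data instead of recomputing it.
lineapy.to_pipeline(
    artifacts=[master_data.name, dataset.name],
    dependencies={dataset.name: {master_data.name}},
    framework="AIRFLOW",
    pipeline_name="iris_pipeline_both_outputs",
    output_dir="./airflow",
)

# End goal B: only the final dataset matters;
# master_data is treated as an intermediate step rather than a pipeline output.
lineapy.to_pipeline(
    artifacts=[dataset.name],
    framework="AIRFLOW",
    pipeline_name="iris_pipeline_dataset_only",
    output_dir="./airflow",
)
```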
Thank you so much @mingjerli for sharing this reflection and giving me an opportunity to provide feedback! The strategies you mention totally make sense to me, and scenarios 1 and 3 would probably fit my use cases best. The reason is that, for now, I foresee LineaPy as an assistant for productionizing prototypes. To me this would typically cover three main stages:

A typical example would be to explore a hypothesis, starting by creating master data from an existing relational DB, then engineering features into a dataset, then training a model. This is stage 0, and LineaPy would help ensure reusability of the code (e.g., re-executing on an updated data source, revisiting after a while, handing over to another data scientist, ...). If the models eventually prove worth it, I will want to integrate the pipeline with my DataOps stack and my MLOps stack. This would be stage 2, and this is where my semantics come in:

In my current vision, LineaPy would be involved to facilitate the transition from prototype to production. Hence, if we consider the three artifacts master data, dataset, and model from above: Your additional points are very relevant. I hope this makes sense to you! I probably still have to give it some thought, although I prefer to throw ideas out here so we can engage in a discussion 🙂 Also, this is the use case of just one user; I would completely understand that other ways of working could require different behaviours. Happy to continue this discussion as needed!
Hi Fabien, thanks again for your detailed feedback! We've been working on supporting scenarios 1 and 3 for a couple of weeks and will keep you posted once it's ready for use. In the meantime, we would love to explore more how you are using LineaPy.
Hi Fabien (@PertuyF), hope all is well with you. We are happy to share that we finally have support for scenario 1! (We are working on the other two scenarios and will keep you posted as they get ready too.) This means that, with the latest version of LineaPy, artifacts generated within the same session can now reuse each other's results instead of recomputing them from scratch. Hence, with the same example discussed earlier, i.e.,

```python
import lineapy
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
    target=[iris.target_names[i] for i in iris.target]
)
iris_agg = df.set_index("target")
master_data = lineapy.save(iris_agg, "master_data")

iris_clean = iris_agg.dropna().assign(test="test")
dataset = lineapy.save(iris_clean, "dataset")
```

running

```python
lineapy.to_pipeline(
    artifacts=[master_data.name, dataset.name],
    dependencies={dataset.name: {master_data.name}},
    framework="AIRFLOW",
    pipeline_name="iris_pipeline",
    output_dir="./airflow",
)
```

would generate a module file that looks like:

```python
### ./airflow/iris_pipeline_module.py
import pandas as pd
from sklearn.datasets import load_iris


def get_master_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    return iris_agg


def get_dataset(iris_agg):
    iris_clean = iris_agg.dropna().assign(test="test")
    return iris_clean


def run_session_including_master_data():
    # Given multiple artifacts, we need to save each right after
    # its calculation to protect from any irrelevant downstream
    # mutations (e.g., inside other artifact calculations)
    import copy

    artifacts = dict()
    iris_agg = get_master_data()
    artifacts["master_data"] = copy.deepcopy(iris_agg)
    iris_clean = get_dataset(iris_agg)
    artifacts["dataset"] = copy.deepcopy(iris_clean)
    return artifacts


def run_all_sessions():
    artifacts = dict()
    artifacts.update(run_session_including_master_data())
    return artifacts


if __name__ == "__main__":
    run_all_sessions()
```

As shown, the modularized code now contains "non-overlapping" functions: `get_dataset()` takes the output of `get_master_data()` as its input instead of recomputing it from scratch. Moreover, LineaPy can now smartly identify any "common" computation among different artifacts and factor it out into its own function, even if that common computation has not been stored as an artifact (check our recent GH discussion post for a concrete example). This of course further reduces any redundant computation in the pipeline. Given your valuable inputs earlier, we would love to get your feedback on this new style of pipeline generation. Please give it a try and let us know what you think (or any questions)!
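For readers wondering how the generated module plugs into Airflow, below is a minimal, hypothetical sketch of how the two functions could be wired into a DAG by hand. The file name, task boundaries, and pickle-based hand-off are assumptions for illustration; the DAG file that LineaPy actually emits may differ.

```python
### Hypothetical ./airflow/iris_pipeline_dag.py -- illustrative sketch, not LineaPy's verbatim output
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

import iris_pipeline_module  # assumes the generated module sits next to this DAG file


def task_master_data():
    # Build the master data with the generated module and hand it off via a pickle file
    iris_agg = iris_pipeline_module.get_master_data()
    iris_agg.to_pickle("/tmp/iris_pipeline_master_data.pkl")


def task_dataset():
    # Reuse the upstream result instead of recomputing it from scratch
    iris_agg = pd.read_pickle("/tmp/iris_pipeline_master_data.pkl")
    iris_clean = iris_pipeline_module.get_dataset(iris_agg)
    iris_clean.to_pickle("/tmp/iris_pipeline_dataset.pkl")


with DAG(
    dag_id="iris_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    master_data = PythonOperator(task_id="master_data", python_callable=task_master_data)
    dataset = PythonOperator(task_id="dataset", python_callable=task_dataset)

    # Encode the dependency passed to to_pipeline: dataset depends on master_data
    master_data >> dataset
```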
Hi all, thank you so much for developing LineaPy, looks great and I'm really excited about it!
Is your feature request related to a problem? Please describe.
When I develop a pipeline, I may want to integrate semantic steps to build my refined dataset table. As an illustration, `master_data` would be data loaded and assembled from a relational DB, whereas `dataset` would be the same table refined with some feature engineering.

Currently, if I try to do this I would save both `master_data` and `dataset` as artifacts, then create a pipeline like the `to_pipeline` call quoted in the comments above.

My issue is that LineaPy would then create steps to build `master_data` from scratch, and also to create `dataset` from scratch instead of loading `master_data` as a starting point. Like:
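(To make the problem concrete, here is a hypothetical sketch of the kind of redundant module this produces, reconstructed from the iris example in the comments above; it is not LineaPy's verbatim output.)

```python
### Hypothetical sketch of the current (redundant) output
import pandas as pd
from sklearn.datasets import load_iris


def get_master_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    return df.set_index("target")


def get_dataset():
    # dataset is rebuilt from scratch: the master_data computation is repeated here
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    return iris_agg.dropna().assign(test="test")
```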
Describe the solution you'd like
Ideally LineaPy would capture the dependency and build:
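(Again as a hypothetical sketch, using the same iris example; the function name and signature are illustrative.)

```python
### Hypothetical sketch of the desired output
def get_dataset(master_data):
    # Start from the already-built master_data instead of recreating it from scratch
    return master_data.dropna().assign(test="test")
```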
Is it planned to support this behavior?
Am I missing something?