[Draft] Node grouping #4427

Draft: wants to merge 3 commits into base: main
Conversation

@ankatiyar (Contributor) commented Jan 17, 2025

Description

Draft PR to demo - #4376

Development notes

Add attributes to the Pipeline class that expose nodes and their dependencies grouped by namespace. (The property names are placeholders for the prototype.)
This PR offers a version of the information produced by the group_by_namespace() method added to kedro-airflow in kedro-org/kedro-plugins#981.

Questions to be considered:

  • Does it make sense to have this API in the Pipeline class?
  • The plugins would still have to discern whether they're executing a node or a group of nodes, i.e. they would have to pick kedro run --node=<nodename> or kedro run --namespace=<namespace> for each "task". How do we make this easier?
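
To make the second question concrete, here is a minimal sketch of the branching each plugin would need per task. The `Task` type and `kedro_command` helper are hypothetical names used only for illustration, and the flag spellings are copied from the description above; they may differ between Kedro versions.

```python
from typing import NamedTuple


class Task(NamedTuple):
    """Hypothetical deployment task: a single node or a whole namespace."""

    name: str
    kind: str  # assumed to be either "node" or "namespace"


def kedro_command(task: Task) -> list[str]:
    """Build the CLI invocation for one task.

    Flag names follow the PR description; check them against the
    Kedro version actually in use.
    """
    if task.kind == "namespace":
        return ["kedro", "run", f"--namespace={task.name}"]
    if task.kind == "node":
        return ["kedro", "run", f"--node={task.name}"]
    raise ValueError(f"Unknown task kind: {task.kind!r}")
```

The branch on `task.kind` is the part every plugin would have to repeat unless the grouping API also reports the type of each group.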

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@ankatiyar ankatiyar requested a review from DimedS January 17, 2025 15:08
@DimedS (Member) left a comment

Thanks for the PR, @ankatiyar! That’s a great algorithm!

I have a few questions and comments:

  1. I think the Pipeline class is an excellent place for this API.

  2. I agree with you that we should not return just the names but also include object types. This would allow the plugin to understand what exactly needs to be executed (e.g., a node or a namespace).

  3. Additionally, I think it would be more beneficial to return a topologically sorted list (similar to how pipeline.nodes currently works) instead of a dictionary. This way, the list can be executed in order.

    Specifically, the format could be:
    [object_name, object_type, full_list_of_nodes], where:

    • object_name: The name of the object, such as a namespace or a node.
    • object_type: The type of the object, either a namespace or a node.
    • full_list_of_nodes: A list of all nodes included under the object_name. For example, if the object_name is a namespace, it will contain all the nodes within that namespace. If the object_name is a node, this list would just contain the node itself.
    • list_of_dependencies (to consider): Perhaps we should include the list of dependencies in the same list, rather than separating it into another method. This way, all the information needed for deployment would be available in a single call.

    Example structure:

    [
        [ns1, namespace, [n1, n2, n3]],  # Namespace containing nodes n1, n2, n3
        [n4, node, [n4]]                 # Single node n4
    ]

    Each element in the list would represent one deployment step that the plugin should create. The full_list_of_nodes is included for informational purposes and isn’t required for execution.
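
A rough sketch of how such a topologically sorted list could be produced, written against plain dictionaries rather than the real Pipeline class. The `NODES` toy data, the `group_nodes` function, and the choice of `graphlib.TopologicalSorter` are all assumptions for illustration and are not taken from the PR.

```python
from graphlib import TopologicalSorter

# Hypothetical toy pipeline: node name -> (namespace or None, upstream node names).
NODES = {
    "n1": ("ns1", set()),
    "n2": ("ns1", {"n1"}),
    "n3": ("ns1", {"n2"}),
    "n4": (None, {"n3"}),
}


def group_nodes(nodes):
    """Return [name, type, member_nodes, dependencies] entries in execution order."""
    # A namespaced node belongs to its namespace; any other node is its own group.
    group_of = {name: (ns or name) for name, (ns, _) in nodes.items()}
    members, kinds, deps = {}, {}, {}
    for name, (ns, parents) in nodes.items():
        group = group_of[name]
        members.setdefault(group, []).append(name)
        kinds[group] = "namespace" if ns else "node"
        # Keep only dependencies that cross a group boundary.
        deps.setdefault(group, set()).update(
            group_of[p] for p in parents if group_of[p] != group
        )
    # TopologicalSorter takes a mapping of item -> predecessors.
    order = TopologicalSorter(deps).static_order()
    return [[g, kinds[g], members[g], sorted(deps[g])] for g in order]


steps = group_nodes(NODES)
# steps[0] is the "ns1" namespace group, steps[1] is the single node "n4".
```

Note that this sketch keys groups by bare name, so it would not distinguish a node and a namespace that share a name.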

We could also consider whether it would be beneficial to avoid coding the logic for handling node grouping separately in each plugin. Instead, we could provide a generic API in Kedro, allowing plugins to query Kedro for what should be deployed. This API could accept optional parameters, such as the type of node grouping, and return a list of objects to be deployed. The plugin’s responsibility would then simply be to take this list and handle the pipeline conversion, streamlining the process.

Lastly, a small question: can node.name be equal to namespace.name? If so, how would the algorithm handle this scenario?
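
To illustrate why the question matters, the sketch below shows how a lookup keyed only by name would silently merge two entries, while a key that includes the object type keeps them apart. The step names and both helper functions are hypothetical; this does not describe how the PR resolves the case.

```python
def index_by_name(steps):
    """Key groups by bare name; a later entry overwrites an earlier one."""
    return {name: kind for name, kind in steps}


def index_by_kind_and_name(steps):
    """Key groups by (type, name) so equal names cannot collide."""
    return {(kind, name): name for name, kind in steps}


# Hypothetical case: a namespace and a standalone node share one name.
steps = [("preprocess", "namespace"), ("preprocess", "node")]

flat = index_by_name(steps)
typed = index_by_kind_and_name(steps)
# flat ends up with one entry, typed keeps both.
```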

        Returns:
            The pipeline nodes dependencies grouped by namespace.
        """
        node_dependencies_by_namespace = defaultdict(dict)

minor: maybe this should be defaultdict(set), to avoid lines 407-409
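
A small sketch of the behaviour this suggestion relies on. The referenced lines 407-409 are not shown here, so this assumes they initialise missing keys by hand; the `deps` variable and its contents are illustrative only.

```python
from collections import defaultdict

# With a set factory, a missing key is created as an empty set on first access,
# so no explicit "if key not in mapping" initialisation is needed.
deps = defaultdict(set)

deps["ns1"]  # merely accessing the key registers it with an empty set
deps["n4"].add("ns1")  # add a dependency without checking for the key first
deps["n4"].add("ns1")  # duplicates are ignored because the values are sets
```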

@DimedS (Member) left a comment

Thanks for addressing the comments, @ankatiyar! It looks good to me - let's see how it works with Airflow!
