Initialization overhead at scale #4

Open
rgioiosa78 opened this issue Nov 29, 2023 · 2 comments

Comments


rgioiosa78 commented Nov 29, 2023

Initializing the RDMA data structure and the process/NIC table at scale (256+ nodes) takes about 80 s. This process is mostly serialized at node 0 and should be reworked to use better collectives.
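To illustrate why serialization at node 0 hurts at this scale, here is a minimal Python sketch that only counts communication rounds. It is not ROFI code: the function names are hypothetical, the entries are placeholder strings, and recursive doubling is just one well-known allgather scheme, assumed here for illustration with a power-of-two node count.

```python
def serialized_rounds(num_nodes: int) -> int:
    """Rounds needed if node 0 handshakes with every other node one at a time."""
    return max(num_nodes - 1, 0)


def recursive_doubling_allgather(entries):
    """Simulate a recursive-doubling allgather.

    entries[rank] is the value contributed by that rank (for example, a
    NIC table entry). Returns (tables, rounds), where tables[rank] maps
    every rank to its entry. Assumes a power-of-two number of ranks.
    """
    n = len(entries)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two rank count")

    # Each rank starts out knowing only its own entry.
    tables = [{rank: entry} for rank, entry in enumerate(entries)]
    rounds = 0
    distance = 1
    while distance < n:
        # All pairs exchange at the same time, so read from a snapshot
        # of the state at the start of the round.
        snapshot = [dict(table) for table in tables]
        for rank in range(n):
            partner = rank ^ distance
            tables[rank].update(snapshot[partner])
        distance *= 2
        rounds += 1
    return tables, rounds
```

With 256 ranks, the serialized scheme needs 255 sequential handshakes at node 0, while the simulated allgather completes in 8 rounds. Round counts are only a rough proxy for wall-clock time, but they show the linear versus logarithmic growth.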

rgioiosa78 self-assigned this Nov 29, 2023
rgioiosa78 added the enhancement (New feature or request) label Nov 29, 2023
rgioiosa78 added this to the Version 0.3 milestone Nov 29, 2023
rgioiosa78 (Author) commented

This might also occur when standing up new RMA regions.

rdfriese (Contributor) commented Dec 1, 2023

While this is likely not the complete solution we want for version 0.3, I replaced the serial handshaking with a libPMI-based exchange.

This should help a bit during the init process and when creating new RMA regions.
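As a rough illustration of the general shape of such an exchange, the sketch below models a key-value pattern (publish, synchronize once, then look up peers) with an in-process toy store. Everything here is an assumption made for illustration: `ToyKeyValueStore`, `exchange_region_info`, and the `rma_region` key are hypothetical names, and this does not reproduce the actual Rofi change or the libPMI API.

```python
class ToyKeyValueStore:
    """In-process stand-in for a PMI-style key-value space (hypothetical)."""

    def __init__(self):
        self._pending = {}  # entries staged by ranks, not yet visible
        self._visible = {}  # entries visible to every rank after a fence

    def put(self, rank, name, value):
        """Stage a value published by one rank."""
        self._pending[(rank, name)] = value

    def fence(self):
        """Single synchronization point: make all staged entries visible."""
        self._visible.update(self._pending)
        self._pending.clear()

    def get(self, rank, name):
        """Look up what a given rank published; KeyError if not visible."""
        return self._visible[(rank, name)]


def exchange_region_info(local_infos):
    """Every rank publishes its region info, then reads all of its peers'.

    local_infos[rank] is that rank's local info (for example, a dict with
    an address and a key). Returns one complete peer table per rank.
    """
    kvs = ToyKeyValueStore()
    num_ranks = len(local_infos)
    for rank, info in enumerate(local_infos):
        kvs.put(rank, "rma_region", info)
    kvs.fence()  # one collective step instead of per-peer handshakes
    return [
        [kvs.get(peer, "rma_region") for peer in range(num_ranks)]
        for _ in range(num_ranks)
    ]
```

The point of the pattern is that no rank waits on a chain of pairwise handshakes: each rank publishes once, all ranks synchronize once, and lookups afterwards are independent of one another.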

If you are using this with Lamellar, the impact may be minimal for versions <=0.5: after changing Rofi, I found another portion of the init process that was also inefficient. My recommendation is to use Lamellar >=0.6.
