hang in MPI_Init with unbalanced ranks #222
Comments
-N might be flipping where the unused core is located. Example: 2 nodes with 4 cores each (see the sketch below).
It might be worth doing a little audit here to see if anything stands out with these layouts in mind. |
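A rough sketch of what that flip could look like, with 7 tasks on 2 nodes of 4 cores each (this illustration is mine, not from the issue; `block_layout` is a hypothetical helper, not flux's actual placement logic):

```python
def block_layout(ntasks, nnodes, cores_per_node, reverse=False):
    """Assign task ranks to nodes in blocks, filling each node up to its core count."""
    order = list(range(nnodes))[::-1] if reverse else list(range(nnodes))
    layout, rank = {n: [] for n in range(nnodes)}, 0
    for node in order:
        take = min(cores_per_node, ntasks - rank)
        layout[node] = list(range(rank, rank + take))
        rank += take
    return layout

# The unused core ends up on a different node depending on the node order:
print(block_layout(7, 2, 4))                # {0: [0, 1, 2, 3], 1: [4, 5, 6]}
print(block_layout(7, 2, 4, reverse=True))  # {0: [4, 5, 6], 1: [0, 1, 2, 3]}
```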
I think @garlick meant to put this comment here:
I wonder if the two jobs have the same R? I'll try to reproduce this. |
yes sorry! |
Hm, this is interesting (did we know this and just forgot?)

```console
$ flux run -N2 -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 1727383096.7284338, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}
```

The R without `-N2`:

```console
$ flux run -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-3"}}, {"rank": "1", "children": {"core": "0-2"}}], "starttime": 1727383280.7969263, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}
```

This seems to be explicit in the jobspec created by the first case:

```console
$ flux run -N2 -n7 --dry-run hostname | jq .resources
[
  {
    "type": "node",
    "count": 2,
    "with": [
      {
        "type": "slot",
        "count": 4,
        "with": [
          {
            "type": "core",
            "count": 1
          }
        ],
        "label": "task"
      }
    ]
  }
]
```

There is even a comment in the code:

```python
if num_nodes is not None:
    num_slots = int(math.ceil(num_tasks / float(num_nodes)))
    if num_tasks % num_nodes != 0:
        # N.B. uneven distribution results in wasted task slots
        task_count_dict = {"total": num_tasks}
    else:
        task_count_dict = {"per_slot": 1}
    slot = cls._create_slot("task", num_slots, children)
    resource_section = cls._create_resource(
        "node", num_nodes, [slot], exclusive
    )
```
|
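To make the wasted-slot arithmetic in the snippet above concrete (this is my own sketch of the calculation, not code from flux-core): with `-N2 -n7`, `num_slots = ceil(7/2) = 4`, so the jobspec requests 2 nodes × 4 task slots = 8 slots while the task count stays at 7, leaving one slot unused.

```python
import math

# Illustrative only: reproduce the slot arithmetic from the snippet above.
def slots_for(num_tasks, num_nodes):
    num_slots = math.ceil(num_tasks / num_nodes)
    wasted = num_nodes * num_slots - num_tasks
    counts = {"total": num_tasks} if num_tasks % num_nodes else {"per_slot": 1}
    return num_slots, wasted, counts

print(slots_for(7, 2))    # (4, 1, {'total': 7})     -> one wasted task slot
print(slots_for(190, 2))  # (95, 0, {'per_slot': 1}) -> even split, no waste
print(slots_for(191, 2))  # (96, 1, {'total': 191})  -> one wasted task slot
```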
Anyway, maybe the extra task slot is confusing the taskmap stuff into running the wrong number of tasks on one of the nodes? |
I think the taskmaps are actually correct and I was confused. Fluxion is packing 4 ranks onto the first node in both cases, and 3 on the second, but for some reason when -N is specified, the order of nodes is reversed. |
FWIW I ran the two cases and dumped the apinfo. The apinfo comes out the same for both jobs (on both nodes). The environment seems to differ only in the expected ways. I did notice that slurm is now up to version 5 of the apinfo struct, while we are on version 0. Also slurm sets |
slurm also sets several PMI variables that we don't set. I assumed these would be set by cray's libpmi, if at all, since they are normally for the benefit of MPI, but since we're seeing a problem it's maybe worth noting. I took the extra step of adding a |
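A generic way to check which variables differ between two runs (my own debugging sketch; the file names are made up and nothing here is flux- or slurm-specific):

```python
# Diff two `env` dumps captured from the good and bad cases.
def read_env(path):
    env = {}
    with open(path) as f:
        for line in f:
            if "=" in line:
                key, _, value = line.rstrip("\n").partition("=")
                env[key] = value
    return env

good, bad = read_env("good-case.env"), read_env("bad-case.env")
for key in sorted(set(good) | set(bad)):
    if good.get(key) != bad.get(key):
        print(f"{key}: {good.get(key)!r} != {bad.get(key)!r}")
```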
Looks like cray pmi can be made to print debug output on stderr. In @ryanday36's failing case above, the first rank on the second node (96) correctly identifies the address and port that the PMI rank 0 node is listening on, apparently connects successfully, and sends a barrier request there on behalf of its other local ranks. The rank 0 node never sees the connection. Where did it connect? In the good case, the connection is logged, a barrier release packet is returned, and things proceed.

Sanitized failing log of PE_0 and PE_96, with some noise removed:

Sanitized log of PE_0 and PE_95 (1st rank on second node) for the good case:
|
FWIW, this does not require MPI to reproduce. This example reproduces the hang with logging using a client that only uses PMI:
|
I tried setting |
As described in https://rzlc.llnl.gov/jira/browse/ELCAP-705:

(these are all run with `-o mpibind=off`, fwiw) In a two node allocation (`flux alloc -N2`), running `flux run -n190 ...` puts 96 tasks on one node and 94 on the other and hangs until I ctrl-c. If I run with `flux run -N2 -n190 ...`, flux puts 95 tasks on each node and things run fine (if slowly). If I use flux's pmi2 (`-o pmi=pmi2`) instead of whatever cray mpi is using by default, the original case runs fine.

I did some good old fashioned printf debugging, and it looks like the hang is in MPI_Init, but I haven't gotten any deeper than that. I suspect that this is an HPE issue, but I'm opening it here too in case you all have any insight. The bit that seems extra confusing is that `flux run -n191 ...` hangs, but `flux run -N2 -n191 ...` doesn't. Both of those should have 96 tasks on one node and 95 on the other, so that doesn't fit super well with my characterization of this as an issue with unbalanced ranks / node.
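To spell out the task-count arithmetic in the cases above (my own sketch; it assumes 96 cores per node, which is implied by 96 tasks landing on one node): without `-N`, tasks are packed onto the first node up to its core count, while `-N2` requests ceil(ntasks/2) task slots per node. That explains why 190 tasks land differently in the two forms, while 191 tasks land the same either way.

```python
import math

CORES_PER_NODE = 96  # assumption inferred from the 96/94 split reported above

def packed(ntasks, nnodes=2):
    """Without -N: fill each node to its core count, first node first."""
    counts, left = [], ntasks
    for _ in range(nnodes):
        take = min(CORES_PER_NODE, left)
        counts.append(take)
        left -= take
    return counts

def block(ntasks, nnodes=2):
    """With -N: ceil(ntasks/nnodes) slots per node; the last node gets the remainder."""
    per_node = math.ceil(ntasks / nnodes)
    counts, left = [], ntasks
    for _ in range(nnodes):
        take = min(per_node, left)
        counts.append(take)
        left -= take
    return counts

print(packed(190), block(190))  # [96, 94] [95, 95] -> layouts differ; only -n190 hangs
print(packed(191), block(191))  # [96, 95] [96, 95] -> same layout, yet only -n191 hangs
```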