(3.6.0‐3.6.1) Slurm NodeHostName and NodeAddr mismatch for MultiNIC instance when managed DNS is disabled and EC2 Hostnames are used
When using Slurm compute nodes backed by an instance type with multiple network cards (e.g. p4d.24xlarge, hpc6id.32xlarge), the Slurm node `NodeHostName` attribute may not match the `NodeAddr` attribute when cluster managed DNS is disabled and EC2 hostnames are used.

The mismatch between `NodeHostName` and `NodeAddr` is caused by the random order of the network interfaces in the EC2 DescribeInstances API output, which ParallelCluster uses to enumerate those interfaces. The mismatch can break jobs that rely on knowing the node hostname in order to run, such as MPI jobs.
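The sketch below (plain Python with invented values, not ParallelCluster code) illustrates how the mismatch arises. On a multi-NIC instance every network card exposes an interface with `DeviceIndex` 0, and the patch below suggests the pre-patch helper selects the first interface matching `DeviceIndex == 0` alone, while the hostname is taken from the instance-level `PrivateDnsName`:

```python
# Illustrative DescribeInstances fragment (values invented). On a multi-NIC
# instance each network card has an interface with DeviceIndex 0, and the API
# returns NetworkInterfaces in no guaranteed order.
instance_info = {
    "PrivateIpAddress": "192.168.93.90",
    "PrivateDnsName": "ip-192-168-93-90.ec2.internal",  # primary interface
    "NetworkInterfaces": [
        {
            # Interface on a secondary network card; EC2 happened to list it first.
            "PrivateIpAddress": "192.168.90.6",
            "PrivateDnsName": "ip-192-168-90-6.ec2.internal",
            "Attachment": {"DeviceIndex": 0, "NetworkCardIndex": 1},
        },
        {
            # The primary interface (NetworkCardIndex 0).
            "PrivateIpAddress": "192.168.93.90",
            "PrivateDnsName": "ip-192-168-93-90.ec2.internal",
            "Attachment": {"DeviceIndex": 0, "NetworkCardIndex": 0},
        },
    ],
}

# Pre-patch selection: the first interface with DeviceIndex == 0 wins, so the
# secondary card's IP can be picked depending on the API's ordering ...
private_ip = next(
    nic["PrivateIpAddress"]
    for nic in instance_info["NetworkInterfaces"]
    if nic["Attachment"].get("DeviceIndex") == 0
)

# ... while the hostname comes from the instance-level PrivateDnsName.
hostname = instance_info["PrivateDnsName"].split(".")[0]

print(private_ip, hostname)  # -> 192.168.90.6 ip-192-168-93-90 (mismatch)
```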
The issue can be identified by looking at the instance launch log, either `/var/log/parallelcluster/slurm_resume.log` (for dynamic nodes) or `/var/log/parallelcluster/clustermgtd.log` (for static nodes), where the `hostname` attribute does not correspond to the value of `private_ip`, as in the following example log:

```
2023-08-25 09:44:39,979 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('q1-dy-c1-1', EC2Instance(id='i-03faa591c09e638cc', private_ip='192.168.90.6', hostname='ip-192-168-93-90', launch_time=datetime.datetime(2023, 8, 25, 9, 44, 34, tzinfo=tzlocal()), slurm_node=None))"]
```

Here `private_ip='192.168.90.6'` does not match `hostname='ip-192-168-93-90'`.
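To spot affected nodes programmatically, one option is to derive the expected EC2 hostname from the IP (EC2 private hostnames have the form `ip-a-b-c-d`) and compare it with the logged value. A quick sketch, not part of ParallelCluster; it assumes the log path and line format shown above:

```python
import re

# Assumed path: use clustermgtd.log instead for static nodes.
LOG_FILE = "/var/log/parallelcluster/slurm_resume.log"

# Matches pairs such as: private_ip='192.168.90.6', hostname='ip-192-168-93-90'
PAIR = re.compile(r"private_ip='([\d.]+)', hostname='([\w-]+)'")

with open(LOG_FILE) as log:
    for line in log:
        for ip, hostname in PAIR.findall(line):
            expected = "ip-" + ip.replace(".", "-")
            if hostname != expected:
                print(f"mismatch: private_ip={ip} hostname={hostname} expected={expected}")
```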
The issue affects clusters with all of the following:

- ParallelCluster 3.6.0 - 3.6.1
- Slurm scheduler
- cluster managed DNS disabled, via `SlurmSettings/Dns/DisableManagedDns=true`
- EC2 hostnames enabled, via `SlurmSettings/Dns/UseEc2Hostnames=true` (see the configuration fragment after this list)
- multi-NIC instance types, e.g. p4d.24xlarge, hpc6id.32xlarge, etc.
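For reference, the two DNS settings above map to the following fragment of the cluster configuration file (a minimal, illustrative excerpt):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
```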
The following mitigation has been tested on ParallelCluster version 3.6.1:
- Save the following text as `pcluster.patch` in `/tmp` on your head node:
```diff
diff --git a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/common/ec2_utils.py b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/common/ec2_utils.py
index 9c21a48..5a42772 100644
--- a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/common/ec2_utils.py
+++ b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/common/ec2_utils.py
@@ -27,3 +27,23 @@ def get_private_ip_address(instance_info):
             private_ip = network_interface["PrivateIpAddress"]
             break
     return private_ip
+
+
+def get_private_ip_address_and_dns_name(instance_info):
+    """
+    Return the PrivateIpAddress and PrivateDnsName of the EC2 instance.
+
+    The PrivateIpAddress and PrivateDnsName are considered to be the ones for the
+    network interface with DeviceIndex = NetworkCardIndex = 0.
+    :param instance_info: the dictionary returned by a EC2:DescribeInstances call.
+    :return: the PrivateIpAddress and PrivateDnsName of the instance.
+    """
+    private_ip = instance_info["PrivateIpAddress"]
+    private_dns_name = instance_info["PrivateDnsName"]
+    for network_interface in instance_info["NetworkInterfaces"]:
+        attachment = network_interface["Attachment"]
+        if attachment.get("DeviceIndex", -1) == 0 and attachment.get("NetworkCardIndex", -1) == 0:
+            private_ip = network_interface["PrivateIpAddress"]
+            private_dns_name = network_interface["PrivateDnsName"]
+            break
+    return private_ip, private_dns_name
diff --git a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/fleet_manager.py b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/fleet_manager.py
index 4bdd291..c757ce5 100644
--- a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/fleet_manager.py
+++ b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/fleet_manager.py
@@ -15,7 +15,7 @@ from abc import ABC, abstractmethod
 
 import boto3
 from botocore.exceptions import ClientError
-from common.ec2_utils import get_private_ip_address
+from common.ec2_utils import get_private_ip_address, get_private_ip_address_and_dns_name
 
 logger = logging.getLogger(__name__)
@@ -48,10 +48,11 @@ class EC2Instance:
     @staticmethod
     def from_describe_instance_data(instance_info):
         try:
+            private_ip, private_dns_name = get_private_ip_address_and_dns_name(instance_info)
             return EC2Instance(
                 instance_info["InstanceId"],
-                get_private_ip_address(instance_info),
-                instance_info["PrivateDnsName"].split(".")[0],
+                private_ip,
+                private_dns_name.split(".")[0],
                 instance_info["LaunchTime"],
             )
         except KeyError as e:
diff --git a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/instance_manager.py b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/instance_manager.py
index 7ec9bc8..646287f 100644
--- a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/instance_manager.py
+++ b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/instance_manager.py
@@ -21,7 +21,7 @@ from typing import Iterable
 import boto3
 from botocore.config import Config
 from botocore.exceptions import ClientError
-from common.ec2_utils import get_private_ip_address
+from common.ec2_utils import get_private_ip_address, get_private_ip_address_and_dns_name
 from common.schedulers.slurm_commands import update_nodes
 from common.utils import grouper
 from slurm_plugin.common import ComputeInstanceDescriptor, log_exception, print_with_count
@@ -349,11 +349,12 @@ class InstanceManager:
         instances = []
         for instance_info in filtered_iterator:
             try:
+                private_ip, private_dns_name = get_private_ip_address_and_dns_name(instance_info)
                 instances.append(
                     EC2Instance(
                         instance_info["InstanceId"],
-                        get_private_ip_address(instance_info),
-                        instance_info["PrivateDnsName"].split(".")[0],
+                        private_ip,
+                        private_dns_name.split(".")[0],
                         instance_info["LaunchTime"],
                     )
                 )
```
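With the patch applied, both values are taken from the interface with `DeviceIndex = NetworkCardIndex = 0`, so the hostname and address always come from the same NIC. As a quick illustration, running the new helper against the sample `instance_info` sketched earlier would give:

```python
# Assumes the patched module is importable and instance_info is the sample
# DescribeInstances fragment from the sketch above (invented values).
from common.ec2_utils import get_private_ip_address_and_dns_name

private_ip, private_dns_name = get_private_ip_address_and_dns_name(instance_info)
print(private_ip, private_dns_name.split(".")[0])
# -> 192.168.93.90 ip-192-168-93-90 (hostname and address now agree)
```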
- Create and run the following script on the head node as the root user:
```bash
#!/bin/bash
set -e

# The patch must be applied from the root path
pushd /

# Apply the patch, saving a backup of each original file as *.orig
cat /tmp/pcluster.patch | patch -p1 -b

# Restart clustermgtd so it picks up the patched code
/opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/supervisorctl restart clustermgtd

popd
```
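Once clustermgtd has restarted, newly launched compute nodes should report matching `private_ip` and `hostname` values in the log files referenced above.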