prov/efa: avoid gdr_pin/gdr_map for dmabuf mrs #10445
Conversation
prov/efa/src/efa_mr.c
Outdated
@@ -227,7 +221,32 @@ static int efa_mr_hmem_setup(struct efa_mr *efa_mr,
	efa_mr->needs_sync = true;
	efa_mr->peer.device.cuda = attr->device.cuda;

	if (cuda_is_gdrcopy_enabled()) {
	if ((flags & FI_MR_DMABUF) != 0) {
		if (!cuda_is_dmabuf_supported()) {
I don't think we need this check - shouldn't NCCL already check this before passing the dmabuf fd? We can add an assert(cuda_is_dmabuf_supported()) if needed.
I'm happy to go with whatever you'd prefer.
Part of the difficulty on our end with this feature is that we don't have a great way to get a direct answer from libfabric around dmabuf support ahead of time -- all we can do is probe for 1.20 + FI_HMEM and then proceed optimistically. (Well, we do have the option to query the allocation granularity, and then use that to create a mempool, and then reserve an address range, and then map the created mempool on that address range, and export that address range, and then initialize a provider, and then attempt a registration, and then tear it all down... but this becomes fairly intrusive for us). Long term, I think the API as it exists today (where dmabuf support is implied for FI_HMEM beyond some api version) is the ideal. But today there are integration pains.
Example of the pain: there exist real systems (i.e., p5 + AL2 on the 5.10 kernel) where dmabuf export through cuda works, but the ib ioctls don't exist on the kernel side until linux 5.12, meaning ib_reg_mr_dmabuf will statically fail. In those cases, it's not wrong for CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED to return true, because it is a valid exporter for importers like mesa/v4l/whatever, but it's insufficient for our purposes. Whose responsibility should it be to check for kernel import support? In practice, ofi-nccl added a ~5 line function to call uname during init so that we can test the major/minor version. But IMO this feels like an abstraction has leaked, and I would have preferred to be able to assume that libfabric is validating this for us. Prior to adding the check, libfabric didn't do anything to stop it from getting to the point of making a call to ib_reg_mr_dmabuf. (I think this fallback logic in efa_domain_hmem_info_check_p2p_support_cuda is responsible; on the impacted 5.10 systems, where CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED returns true, ibv_reg_dmabuf_mr will fail but the second call to ibv_reg_mr can succeed, meaning p2p_supported_by_device remains hot.)
The reason I'm bringing this case up in the context of an assert here is just to mention that there are a multitude of reasons that nccl + ofi-nccl can believe dmabuf support is available whereas you might not, and while I think we're covered today, if and when that happens we will end up hitting this assert. This is the first opportunity that we get to receive a strong yes/no signal from libfabric that there's agreement that dmabuf requirements are met... When we disagree, I'd prefer to get control returned so that we can print an explanation of what probably happened and provide instructions on how to force the non-dmabuf path.
prov/efa/src/efa_mr.c
Outdated
@@ -227,7 +221,32 @@ static int efa_mr_hmem_setup(struct efa_mr *efa_mr,
	efa_mr->needs_sync = true;
	efa_mr->peer.device.cuda = attr->device.cuda;

	if (cuda_is_gdrcopy_enabled()) {
	if ((flags & FI_MR_DMABUF) != 0) {
just if (flags & FI_MR_DMABUF)
ack, can do.
prov/efa/src/efa_mr.c
Outdated
	 * valid exporter. Warn once, optimistically assume that the
	 * user succeeded despite our earlier failure, and proceed.
	 */
	EFA_WARN_ONCE(FI_LOG_MR,
This shouldn't be only a warning. In L850 we will skip IB registration if p2p_supported_by_device is false, which is undesirable for the NCCL plugin on p series (because you requested FI_OPT_CUDA_API_PERMITTED as false in setopt, we will make that setopt fail if p2p_supported_by_device is false).
ack, this is just a miss on my part; I wanted to do something here but wasn't sure what. This check is wrong. Please see the other thread where I mention that ib_reg_mr_dmabuf can fail without necessarily resulting in p2p_supported_by_device assuming false.
prov/efa/src/efa_mr.c
Outdated
	EFA_WARN_ONCE(FI_LOG_MR,
		"Unexpected FI_HMEM_CUDA dmabuf MR, export test failed during init.");
	}
} else if (cuda_is_gdrcopy_enabled()) {
See my earlier comments. I think we only need one if here:

if (cuda_is_gdrcopy_enabled() && !(flags & FI_MR_DMABUF))
Totally fine with this, but wanted to provide space for the comments.
@shijin-aws we spoke offline today, and I think there's agreement that the conditions being tested for here are relevant on the whole, but unintentionally, and while they should be checked somewhere, they shouldn't necessarily be checked here. I know there have been patchsets previously that looked to unify some of these flows, and I believe you're looking at reposting them. So I'm going to repost this patch with just the minimal checks required to elide the gdr calls when dmabuf arguments are provided to fi_mr_regattr, with the understanding that the two error cases covered here are going to be irrelevant at some point in the near future.

This doesn't block anything in aws-ofi-nccl (our nightlies are running against libfabric as it exists in efa-installer-latest.tar.gz just fine), so there's no real urgency here. But I'm concerned about how this flow would react under HMM/ATS, and I feel it's important to ensure that gdrdrv stays out of the picture.
Force-pushed from 59f4b2b to 577c4a1:
efa_mr_hmem_setup previously always called ofi_hmem_dev_register on all FI_HMEM_CUDA calls, regardless of the presence of FI_MR_DMABUF in flags. When gdrcopy is enabled, this means deconstructing the fi_mr_dmabuf into a struct iovec from its {base, offset, len} 3-tuple, then passing the resulting iovec to gdr_pin followed by gdr_map.

A dmabuf cannot be exported by the nvidia module without an implicit promise that the address space is already reserved and mapped in the current pid, of appropriate size and alignment, and that all pages/ranges backing it can be made available to an importer. All requirements are enforced by the cuda APIs used to acquire one. At best, calls to libgdrcopy here are unnecessary for dmabufs, and at worst the pgprots set by gdrdrv are different enough from the ones set up by cuda proper to cause issues, or the redundant mappings become costly for the driver to maintain.

Prior to this patch, apps can only prevent these gdr_map calls on dmabuf arguments by disabling gdrcopy entirely through environment variables before launch. But apps may wish to use fi_mr_regattr with dmabuf arguments in the default case, while still reserving the right to call fi_mr_regattr with iov arguments on the same domain, where the gdr flow may still be desired in the latter case. This makes that possible.

Signed-off-by: Nicholas Sielicki <[email protected]>
Force-pushed from 577c4a1 to ef6b325.
Thanks! We will refine the dmabuf support check in a follow-up PR.
When FI_MR_DMABUF flags are passed, provided that dmabuf export requirements for
cuda have been confirmed during init, skip ofi_hmem_dev_register calls.
efa_mr_hmem_setup previously always called ofi_hmem_dev_register on all
FI_HMEM_CUDA calls, regardless of the presence of FI_MR_DMABUF in flags. When
gdrcopy is enabled, this means deconstructing the fi_mr_dmabuf into a struct
iovec from its {base, offset, len} 3-tuple, then passing the resulting iovec to
gdr_pin followed by gdr_map.
The very fact that we have been provided a cuda dmabuf means that the memory is
already pinned, properly aligned, and ready for import by other kmods, which is
enforced in the cudart apis for acquiring one. At best, this is unnecessary for
dmabuf arguments, and at worst the pgprot modifications done by gdrdrv are
different enough from the ones set up by cuda proper to cause issues, or the
redundant mappings become costly for the driver to maintain.
Prior to this patch, apps can only prevent these gdr_map calls on dmabuf
arguments by disabling gdrcopy entirely through environment variables before
launch. But apps may wish to use fi_mr_regattr with dmabuf arguments in the
default case, while still reserving the right to call fi_mr_regattr with iov
arguments on the same domain, where the gdr flow may still be desired in the
latter case. This makes that possible.