prov/efa: avoid gdr_pin/gdr_map for dmabuf mrs #10445
Conversation
prov/efa/src/efa_mr.c
Outdated
@@ -227,7 +221,32 @@ static int efa_mr_hmem_setup(struct efa_mr *efa_mr,
	efa_mr->needs_sync = true;
	efa_mr->peer.device.cuda = attr->device.cuda;

	if (cuda_is_gdrcopy_enabled()) {
	if ((flags & FI_MR_DMABUF) != 0) {
		if (!cuda_is_dmabuf_supported()) {
I don't think we need this check - shouldn't NCCL already check this before passing the dmabuf fd? We can add an assert(cuda_is_dmabuf_supported()) if needed.
I'm happy to go with whatever you'd prefer.
Part of the difficulty on our end with this feature is that we don't have a great way to get a direct answer from libfabric around dmabuf support ahead of time -- all we can do is probe for 1.20 + FI_HMEM and then proceed optimistically. (Well, we do have the option to query the allocation granularity, and then use that to create a mempool, and then reserve an address range, and then map the created mempool on that address range, and export that address range, and then initialize a provider, and then attempt a registration, and then tear it all down... but this becomes fairly intrusive for us). Long term, I think the API as it exists today (where dmabuf support is implied for FI_HMEM beyond some api version) is the ideal. But today there are integration pains.
Example of the pain: there exist real systems (i.e., p5 + AL2 on the 5.10 kernel) where dmabuf export through cuda works, but the ib ioctls don't exist on the kernel side until linux 5.12, meaning ib_reg_mr_dmabuf will statically fail. In those cases, it's not wrong for CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED to return true, because it is a valid exporter for importers like mesa/v4l/whatever, but it's insufficient for our purposes. Whose responsibility should it be to check for kernel import support? In practice, ofi-nccl added a ~5 line function to call uname during init so that we can test the major/minor version. But IMO this feels like an abstraction has leaked, and I would have preferred to be able to assume that libfabric is validating this for us. Prior to adding the check, libfabric didn't do anything to stop it from getting to the point of making a call to ib_reg_mr_dmabuf. (I think this fallback logic in efa_domain_hmem_info_check_p2p_support_cuda is responsible; on the impacted 5.10 systems, where CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED returns true, ibv_reg_dmabuf_mr will fail but the second call to ibv_reg_mr can succeed, meaning p2p_supported_by_device remains hot.)
The reason I'm bringing this case up in the context of an assert here is just to mention that there are a multitude of reasons that nccl + ofi-nccl can believe dmabuf support is available whereas you might not, and while I think we're covered today, if and when that happens we will end up hitting this assert. This is the first opportunity that we get to receive a strong yes/no signal from libfabric that there's agreement that dmabuf requirements are met... When we disagree, I'd prefer to get control returned so that we can print an explanation of what probably happened and provide instructions on how to force the non-dmabuf path.
prov/efa/src/efa_mr.c
Outdated
@@ -227,7 +221,32 @@ static int efa_mr_hmem_setup(struct efa_mr *efa_mr,
	efa_mr->needs_sync = true;
	efa_mr->peer.device.cuda = attr->device.cuda;

	if (cuda_is_gdrcopy_enabled()) {
	if ((flags & FI_MR_DMABUF) != 0) {
just if (flags & FI_MR_DMABUF)
ack, can do.
prov/efa/src/efa_mr.c
Outdated
	 * valid exporter. Warn once, optimistically assume that the
	 * user succeeded despite our earlier failure, and proceed.
	 */
	EFA_WARN_ONCE(FI_LOG_MR,
This shouldn't be only a warning. In L850 we will skip IB registration if p2p_supported_by_device is false, which is undesirable for the NCCL plugin on p series (because you requested FI_OPT_CUDA_API_PERMITTED as false in setopt, we will make that setopt fail if p2p_supported_by_device is false).
ack, this is just a miss on my part; I wanted to do something here but wasn't sure what. This check is wrong. Please see the other thread where I mention that ib_reg_mr_dmabuf can fail without necessarily resulting in p2p_supported_by_device assuming false.
prov/efa/src/efa_mr.c
Outdated
	EFA_WARN_ONCE(FI_LOG_MR,
		"Unexpected FI_HMEM_CUDA dmabuf MR, export test failed during init.");
	}
} else if (cuda_is_gdrcopy_enabled()) {
See my earlier comments. I think we only need one if here:

if (cuda_is_gdrcopy_enabled() && !(flags & FI_MR_DMABUF))
Totally fine with this, but wanted to provide space for the comments.
@shijin-aws we spoke offline today, and I think there's agreement that the conditions being tested for here are relevant on the whole, but unintentionally, and while they should be checked somewhere, they shouldn't necessarily be checked here. I know there have been patchsets previously that looked to unify some of these flows, and I believe you're looking at reposting them. So I'm going to repost this patch with just the minimal checks required to elide the gdr calls when dmabuf arguments are provided to fi_mr_regattr, with the understanding that the two error cases covered here are going to be irrelevant at some point in the near future.

This doesn't block anything in aws-ofi-nccl (our nightlies are running against libfabric as it exists in efa-installer-latest.tar.gz just fine), so there's no real urgency here. But I'm concerned about how this flow would react under HMM/ATS, and I feel it's important to ensure that gdrdrv stays out of the picture.
Force-pushed from 59f4b2b to 577c4a1:
efa_mr_hmem_setup previously always called ofi_hmem_dev_register on all FI_HMEM_CUDA calls, regardless of the presence of FI_MR_DMABUF in flags. When gdrcopy is enabled, this means deconstructing the fi_mr_dmabuf into a struct iovec from its {base, offset, len} 3-tuple, then passing the resulting iovec to gdr_pin followed by gdr_map.

A dmabuf cannot be exported by the nvidia module without an implicit promise that the address space is already reserved and mapped in the current pid, of appropriate size and alignment, and that all pages/ranges backing it can be made available to an importer. All requirements are enforced by the cuda APIs used to acquire one. At best, calls to libgdrcopy here are unnecessary for dmabufs, and at worst the pgprots set by gdrdrv are different enough from the ones set up by cuda proper to cause issues, or the redundant mappings become costly for the driver to maintain.

Prior to this patch, apps can only prevent these gdr_map calls on dmabuf arguments by disabling gdrcopy entirely through environment variables before launch. But apps may wish to use fi_mr_regattr with dmabuf arguments in the default case, while still reserving the right to call fi_mr_regattr with iov arguments on the same domain, where the gdr flow may still be desired in the latter case. This makes that possible.

Signed-off-by: Nicholas Sielicki <[email protected]>
Force-pushed from 577c4a1 to ef6b325.
Thanks! We will refine the dmabuf support check in a follow-up PR.
When FI_MR_DMABUF flags are passed, provided that dmabuf export requirements for
cuda have been confirmed during init, skip ofi_hmem_dev_register calls.
efa_mr_hmem_setup previously always called ofi_hmem_dev_register on all
FI_HMEM_CUDA calls, regardless of the presence of FI_MR_DMABUF in flags. When
gdrcopy is enabled, this means deconstructing the fi_mr_dmabuf into a struct
iovec from its {base, offset, len} 3-tuple, then passing the resulting iovec to
gdr_pin followed by gdr_map.
The very fact that we have been provided a cuda dmabuf means that the memory is
already pinned, properly aligned, and ready for import by other kmods, which is
enforced in the cudart apis for acquiring one. At best, this is unnecessary for
dmabuf arguments, and at worst the pgprot modifications done by gdrdrv are
different enough from the ones set up by cuda proper to cause issues, or the
redundant mappings become costly for the driver to maintain.
Prior to this patch, apps can only prevent these gdr_map calls on dmabuf
arguments by disabling gdrcopy entirely through environment variables before
launch. But apps may wish to use fi_mr_regattr with dmabuf arguments in the
default case, while still reserving the right to call fi_mr_regattr with iov
arguments on the same domain, where the gdr flow may still be desired in the
latter case. This makes that possible.