Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tuner: Fix missing config.h include #362

Merged

Conversation

rajachan
Copy link
Member

Without this, we were incorrectly including neuron headers instead of CUDA ones when HAVE_CUDA was not defined. Also fail compilation if neither definition was found rather than having a fall-through case.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

aws-nslick
aws-nslick previously approved these changes Mar 14, 2024
Copy link
Contributor

@aws-nslick aws-nslick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that #351 has the same fix and integrates with clang-tidy to prevent this in perpetuity.

@aws-nslick
Copy link
Contributor

Why did this pass distcheck given the unguarded usage of ncclInvalidUsage here?

@rauteric
Copy link
Contributor

Why did this pass distcheck given the unguarded usage of ncclInvalidUsage here?

I think it's because nccl_ofi_tuner.c includes nccl-headers/nvidia/tuner.h (which includes nccl-headers/net.h which includes nccl-headers/neuron/net.h which includes nccl-headers/neuron/error.h) before including nccl_ofi_tuner.h (which includes config.h which defines HAVE_CUDA) 🙃🙃🙃

tl;dr nccl_ofi_tuner.c should include nccl_ofi_tuner.h first.

@bwbarrett
Copy link
Contributor

bwbarrett commented Mar 15, 2024

tl;dr nccl_ofi_tuner.c should include nccl_ofi_tuner.h first.

Incorrect. The first include in any .c file should be #include "config.h". I missed this in the PR review, but that needs to be fixed.

include/nccl_ofi_tuner.h Outdated Show resolved Hide resolved
@rajachan rajachan force-pushed the missing-config-stop-relying-on-getting-lucky branch 2 times, most recently from 890e1c3 to 8476fb9 Compare March 28, 2024 23:50
bwbarrett
bwbarrett previously approved these changes Mar 30, 2024
@bwbarrett bwbarrett added the BuildTriggerRequest CI build will be triggered when this label is set label Mar 30, 2024
@bwbarrett
Copy link
Contributor

The failure looks like a real issue:

In file included from ../include/nccl-headers/nvidia/tuner.h:12,
                 from tuner/nccl_ofi_model.c:3:
../include/nccl-headers/error.h:13:2: error: #error "Neither CUDA nor Neuron support is available"
   13 | #error "Neither CUDA nor Neuron support is available"
      |  ^~~~~
make[2]: *** [Makefile:559: tuner/nccl_ofi_model.lo] Error 1
make[1]: *** [Makefile:439: all-recursive] Error 1
make: *** [Makefile:371: all] Error 2

bwbarrett and others added 2 commits March 30, 2024 15:28
To make autoconf config.h useful, it needs to be the first include
in any .c file.  Fix a couple of occurances that were missed.  With
those fixups, there's no need for the bandaid includes in some of
the header files, so remove those as well.

Signed-off-by: Brian Barrett <[email protected]>
Fail compilation if neither definition was found rather than
having a fall-through case.

Signed-off-by: Raghu Raja <[email protected]>
@bwbarrett bwbarrett force-pushed the missing-config-stop-relying-on-getting-lucky branch from 8476fb9 to b13a103 Compare March 30, 2024 15:28
@AmedeoSapio AmedeoSapio added BuildTriggerRequest CI build will be triggered when this label is set and removed BuildTriggerRequest CI build will be triggered when this label is set labels Mar 30, 2024
@bwbarrett bwbarrett merged commit 8f26257 into aws:master Mar 31, 2024
13 checks passed
@rajachan rajachan deleted the missing-config-stop-relying-on-getting-lucky branch April 1, 2024 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BuildTriggerRequest CI build will be triggered when this label is set
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants