
feat: [INFRA-2749] change labels to match for gpu to use gpu-based instances for build #119

Merged
philsippl merged 191 commits into main from feat/INFRA-2749 on Aug 5, 2024

Conversation

jozwior
Contributor

@jozwior jozwior commented Jul 17, 2024

https://linear.app/worldcoin/issue/INFRA-2749/add-github-self-hosted-gpu-runners

change labels to match for gpu to use gpu-based instances for build


linear bot commented Jul 17, 2024

INFRA-2749 Add GitHub Self-hosted GPU runners

This is required to run CI tests on https://github.com/worldcoin/gpu-iris-mpc

[Screenshot: 2024-07-12 at 11.40.41]

The instance needs to have at least 3 GPUs, e.g. g4dn.12xlarge.

[Screenshot: 2024-07-12 at 15.01.59]

@github-actions github-actions bot added the `enhancement` (New feature or request) label on Jul 17, 2024
@dkales
Collaborator

dkales commented Jul 17, 2024

It seems the self-hosted runner images do not have the same tooling as the GitHub-provided ones; I ran into that myself once or twice.
I would recommend https://github.com/dtolnay/rust-toolchain/ as an action to install Rust, which should work across both images.
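A minimal workflow fragment using that action might look like the following (hypothetical job and runner labels for illustration; not taken from this repo's actual workflow):

```yaml
# Hypothetical CI fragment: dtolnay/rust-toolchain installs Rust the
# same way on both GitHub-hosted and self-hosted runner images.
jobs:
  build:
    runs-on: [self-hosted, gpu]   # assumed labels per the PR title
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo build --release
```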

@dkales
Collaborator

dkales commented Jul 17, 2024

I added a job to run the e2e tests on the GPU runner. The image is missing the CUDA and NCCL libraries at the moment; should they be baked into the image, or installed at runtime?

Also, the build is pretty slow on these runners; it could probably use some form of caching, like https://github.com/Swatinem/rust-cache
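The suggested cache action drops into a workflow as a single step before the build (sketch only, assuming a plain Cargo-based job):

```yaml
# Hypothetical fragment: Swatinem/rust-cache caches ~/.cargo and the
# target/ directory between runs, which helps most on slower
# self-hosted instances.
      - uses: Swatinem/rust-cache@v2
      - run: cargo build --release
```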

@philsippl
Contributor

After a long battle with this PR, it seems I finally found the issue.
As soon as m (lda) in the gemm_ex is not divisible by 4, we get wrong results on the NVIDIA L40 with CUDA 12.2 (which the GitHub runner uses). I've verified this issue with a CUDA C implementation.

Meanwhile, on the NVIDIA H100 this works without any issues, which is why we didn't find it for so long. This PR now makes sure that m = chunksize is always divisible by 4 and asserts that in the gemm_ex call. I'm also fairly sure this is the cause of the staging issues (NVIDIA T4), but there it doesn't just silently give wrong results; it raises a cuBLAS error. This PR also tries to get more details about that error by calling cublasGetStatusString.
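The divisibility constraint can be sketched as a small Rust helper (the name `pad_chunk_size` is hypothetical and only illustrates the invariant the PR enforces, not the PR's actual code):

```rust
/// Round `m` up to the next multiple of 4, so that the `m`/`lda`
/// argument passed to cuBLAS `gemm_ex` is always divisible by 4.
/// (Hypothetical helper illustrating the PR's invariant.)
fn pad_chunk_size(m: usize) -> usize {
    (m + 3) & !3
}

fn main() {
    assert_eq!(pad_chunk_size(5), 8);
    assert_eq!(pad_chunk_size(8), 8);
    // Mirroring the assertion made before calling gemm_ex:
    let chunk_size = pad_chunk_size(1001);
    assert_eq!(chunk_size % 4, 0);
}
```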

@wojciechsromek what's the deal with this Dockerfile? That's the only failing check now.

src/dot/share_db.rs Outdated Show resolved Hide resolved
src/dot/share_db.rs Outdated Show resolved Hide resolved
src/helpers/device_manager.rs Outdated Show resolved Hide resolved
src/server/actor.rs Outdated Show resolved Hide resolved
src/server/actor.rs Outdated Show resolved Hide resolved
src/server/actor.rs Show resolved Hide resolved
src/server/actor.rs Outdated Show resolved Hide resolved
src/threshold_ring/protocol.rs Outdated Show resolved Hide resolved
src/helpers/mod.rs Outdated Show resolved Hide resolved
@philsippl philsippl merged commit d725f84 into main Aug 5, 2024
9 checks passed
@philsippl philsippl deleted the feat/INFRA-2749 branch August 5, 2024 14:17