
feat: [INFRA-2749] change labels to match for gpu to use gpu-based instances for build #119

Merged
philsippl merged 191 commits into main from feat/INFRA-2749 on Aug 5, 2024

Conversation

jozwior
Contributor

@jozwior jozwior commented Jul 17, 2024

https://linear.app/worldcoin/issue/INFRA-2749/add-github-self-hosted-gpu-runners

change labels to match for gpu to use gpu-based instances for build


linear bot commented Jul 17, 2024

INFRA-2749 Add GitHub Self-hosted GPU runners

This is required to run CI tests on https://github.com/worldcoin/gpu-iris-mpc

[Screenshot: 2024-07-12 at 11.40.41]

The instance needs to have at least 3 GPUs, e.g. g4dn.12xlarge.

[Screenshot: 2024-07-12 at 15.01.59]

@github-actions github-actions bot added the `enhancement` (New feature or request) label on Jul 17, 2024
@dkales
Collaborator

dkales commented Jul 17, 2024

It seems the self-hosted runner images do not have the same tooling as the GitHub-provided ones; I ran into that myself once or twice.
I would recommend https://github.com/dtolnay/rust-toolchain/ as an action to install Rust, which should work across both images.
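A minimal workflow fragment using that action might look like the following (hypothetical job and runner labels for illustration; not taken from this repo's actual workflow):

```yaml
# Hypothetical CI fragment: dtolnay/rust-toolchain installs Rust the
# same way on both GitHub-hosted and self-hosted runner images.
jobs:
  build:
    runs-on: [self-hosted, gpu]   # assumed labels per the PR title
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo build --release
```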

@dkales
Collaborator

dkales commented Jul 17, 2024

I added a job to run the e2e tests on the GPU runner. The image is missing the CUDA and NCCL libraries at the moment; should they be baked into the image, or installed at runtime?

Also, the build is pretty slow on these runners; it could probably use some form of caching, like https://github.com/Swatinem/rust-cache
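The suggested cache action drops into a workflow as a single step before the build (sketch only, assuming a plain Cargo-based job):

```yaml
# Hypothetical fragment: Swatinem/rust-cache caches ~/.cargo and the
# target/ directory between runs, which helps most on slower
# self-hosted instances.
      - uses: Swatinem/rust-cache@v2
      - run: cargo build --release
```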

@philsippl
Contributor

After a long battle with this PR, it seems I finally found the issue.
As soon as m (lda) in the gemm_ex is not divisible by 4, we get wrong results on the NVIDIA L40 with CUDA 12.2 (which the GitHub runner uses). I've verified this issue with a CUDA C implementation.

Meanwhile, on the NVIDIA H100 this works without any issues, which is why we didn't find it for so long. This PR now makes sure that m = chunksize is always divisible by 4 and asserts that in the gemm_ex call. I'm also fairly sure this is the cause of the staging issues (NVIDIA T4), but there it doesn't just silently give wrong results; it raises a cuBLAS error. This PR also tries to get more details about that error by calling cublasGetStatusString.
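The divisibility constraint can be sketched as a small Rust helper (the name `pad_chunk_size` is hypothetical and only illustrates the invariant the PR enforces, not the PR's actual code):

```rust
/// Round `m` up to the next multiple of 4, so that the `m`/`lda`
/// argument passed to cuBLAS `gemm_ex` is always divisible by 4.
/// (Hypothetical helper illustrating the PR's invariant.)
fn pad_chunk_size(m: usize) -> usize {
    (m + 3) & !3
}

fn main() {
    assert_eq!(pad_chunk_size(5), 8);
    assert_eq!(pad_chunk_size(8), 8);
    // Mirroring the assertion made before calling gemm_ex:
    let chunk_size = pad_chunk_size(1001);
    assert_eq!(chunk_size % 4, 0);
}
```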

@wojciechsromek what's the deal with this Dockerfile? That's the only failing check now.

src/dot/share_db.rs Outdated Show resolved Hide resolved
src/dot/share_db.rs Outdated Show resolved Hide resolved
src/helpers/device_manager.rs Outdated Show resolved Hide resolved
src/server/actor.rs Outdated Show resolved Hide resolved
src/server/actor.rs Outdated Show resolved Hide resolved
src/server/actor.rs Show resolved Hide resolved
src/server/actor.rs Outdated Show resolved Hide resolved
src/threshold_ring/protocol.rs Outdated Show resolved Hide resolved
src/helpers/mod.rs Outdated Show resolved Hide resolved
@philsippl philsippl merged commit d725f84 into main Aug 5, 2024
9 checks passed
@philsippl philsippl deleted the feat/INFRA-2749 branch August 5, 2024 14:17