
Add AMDGPU extension #470

Merged: 4 commits into FluxML:master, Feb 13, 2023
Conversation

@pxl-th (Member) commented Feb 3, 2023

This PR adds an AMDGPU extension, similar to #445.
It relies on JuliaGPU/AMDGPU.jl#370 and we'd need to tag a new version of AMDGPU.jl for compatibility, so I'll mark it as WIP for now.

For now it has the following layers, which use MIOpen:

  • Convolutions.
  • Mean/max pooling.
  • Softmax.
  • Activations.

Things like pooling require passing the workspace buffer that was used in the forward pass to the backward pass.
The same goes for batch normalization, which allows passing the computed mini-batch mean & inverse variance to the backward pass to speed up the computation
(although batch norm is not defined in this package).

Therefore I've defined rrules for these operations as follows.
I decided this is much simpler than introducing some global caching for workspaces.

function rrule(::typeof(maxpool), x) # high-level description
    y, workspace = MIOpen.maxpool(x)
    pullback(dy) = NoTangent(), MIOpen.∇maxpool(dy, y, x, workspace)  # NoTangent() for maxpool itself
    return y, pullback
end
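For illustration, a minimal usage sketch (CPU arrays and made-up sizes, not code from this PR; with a ROCArray the MIOpen path above would be taken instead). Zygote picks up such an rrule, so the workspace captured by the pullback closure in the forward pass is reused in the backward pass without any global cache:

using NNlib, Zygote

x = rand(Float32, 16, 16, 3, 4)                        # W × H × C × N
y, back = Zygote.pullback(x -> maxpool(x, (2, 2)), x)  # forward pass builds the closure
dx, = back(ones(Float32, size(y)))                     # backward pass reuses the captured state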

Questions

  1. Do we need to register and maintain an NNlibROC.jl package for compatibility with older Julia versions?
    Julia 1.8 does not work properly with AMDGPU, and the MIOpen artifact is only available for Julia 1.9,
    so older versions would need to use a system installation of the ROCm stack.
    Maybe we leave it only as an extension for AMDGPU?

  2. MIOpen supports only cross-correlation as convolution (flipkernel(conv_dims) is true).
    If a user wants a regular convolution, we can either:

  • detect that at the Flux level and transpose the kernel at creation time,
  • or transpose it right before the call to the MIOpen functions (see the sketch below).

For now I just throw an error.
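For reference, a minimal sketch of the second option (the helper name is made up, and a 2-D kernel stored as W × H × C_in × C_out is assumed). Cross-correlation with a spatially reversed kernel is equivalent to a true convolution:

# Hypothetical helper: flip the kernel along its spatial dimensions only.
flip_spatial(w::AbstractArray{<:Any,4}) = reverse(w; dims=(1, 2))

# Inside the conv wrapper (pseudocode):
#   NNlib.flipkernel(cdims) || (w = flip_spatial(w))   # user asked for a true convolution
#   ... call the MIOpen convolution with the flipped `w` ...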

PR Checklist

  • Tests are added
  • Documentation, if applicable
  • Update AMDGPU compat.

@ToucheSir (Member) commented:

I like this workspace model far more than global caching, so it seems reasonable. Do the backwards functions ever overwrite the workspace? If so, defensive copying may be required.
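If they did, a hypothetical fix would be to take a copy on each backward call, at the cost of one extra allocation per pullback; a sketch against the maxpool rrule above:

pullback(dy) = NoTangent(), MIOpen.∇maxpool(dy, y, x, copy(workspace))  # defensive copy per call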

  1. Do we need to register and maintain NNlibROC.jl package for compatibility with older Julia versions?
    Julia 1.8 does not work properly with AMDGPU, and MIOpen artifact is only available for Julia 1.9.
    So older versions would need to use system installation of ROCm stack.
    Maybe we leave it only as an extension for AMDGPU?

If it's OK with you, I'd like to avoid maintaining a separate package. Anyone who wants AMDGPU support can just use 1.9.

2. MIOpen supports only cross-correlation as convolution (flipkernel(conv_dims) is true).
If a user wants regular convolution, we can either:

Is this still the case when considering the row-major to column-major memory layout interop? I would've assumed it works like the CUDNN integration, where we don't need to transpose for "normal" convs.

@pxl-th (Member, Author) commented Feb 3, 2023

Do the backwards functions ever overwrite the workspace? If so, defensive copying may be required.

In the case where a pullback is called more than once?
I need to check if they modify it.
UPD: no, they do not modify it.

If it's OK with you, I'd like to avoid maintaining a separate package.

Yeah, I didn't want to maintain it either :)

I would've assumed it works like the CUDNN integration where we don't need to transpose for "normal" convs.

You mean that for CUDNN we specify regular or cross-correlation convolution?
If so, then MIOpen only has cross-correlation for convolution.

@ToucheSir (Member) commented:

Good to know. I think we'll have to support transposing eventually since Flux.Conv relies on it, but erroring for now is fine.

I also forgot to mention that normalization can wait for #452. I had NNlibROC (and NNlibCUDA) explicitly in mind while developing that PR.

@pxl-th marked this pull request as ready for review February 6, 2023 16:32
@pxl-th (Member, Author) commented Feb 6, 2023

Not sure how to proceed with AMDGPU in Project.toml.
Should it be in the [extras] section?
Julia 1.6 CI fails because AMDGPU.jl cannot be installed: there are no artifacts for it, they are only available for 1.7+.
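For context, a minimal sketch of the 1.9-style declaration (assumed names and bounds, not the exact file from this PR). The [weakdeps] and [extensions] sections are ignored by older Julia versions, but anything listed under [extras] and the test target still has to be installable there:

# Hypothetical excerpt of NNlib's Project.toml; UUID taken from AMDGPU.jl, compat bound illustrative.
[weakdeps]
AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"

[extensions]
NNlibAMDGPUExt = "AMDGPU"

[compat]
AMDGPU = "0.4"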

@CarloLucibello (Member) commented:

We can just support 1.7+, or even 1.9+

@pxl-th (Member, Author) commented Feb 6, 2023

I just realized that MIOpen for Julia 1.7 might not even support certain functions that I'm using. We can go with 1.9 only, since it supports most of the devices (if not all) that 1.7 supports anyway.

Is it ok then for 1.6 CI to fail for now?

@ToucheSir (Member) commented:

As-is it seems like we wouldn't be able to install NNlib on 1.6? Is there something we can do to fix that?

@pxl-th (Member, Author) commented Feb 7, 2023

As-is it seems like we wouldn't be able to install NNlib on 1.6? Is there something we can do to fix that?

We can remove AMDGPU from the test targets and add it manually when we want to run tests.
Or we can have a test/Project.toml if we just want to be able to install on 1.6.

(accidentally closed the PR)

@pxl-th closed this Feb 7, 2023
@pxl-th reopened this Feb 7, 2023
@pxl-th (Member, Author) commented Feb 7, 2023


Removed AMDGPU.jl from targets.

When running the Buildkite job for AMDGPU, we populate the 'test' target with AMDGPU before running the tests.
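Roughly, the GPU job amounts to something like this (a hedged approximation with assumed commands; the actual pipeline populates the 'test' target rather than the main environment):

using Pkg
Pkg.add("AMDGPU")    # make AMDGPU available to the test environment on the GPU runner
Pkg.test("NNlib")    # then run the full test suite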
(A review comment on .buildkite/pipeline.yml was marked outdated and resolved.)
@ToucheSir (Member) left a comment:


Got confirmation that tests pass locally and the CI failure may have been a machine hiccup. Let's merge and see if Buildkite has recovered.

@ToucheSir merged commit 1b24915 into FluxML:master Feb 13, 2023
@pxl-th deleted the amdgpu-extension branch February 13, 2023 09:44