
Add some initial code. #1

Open · wants to merge 3 commits into main

Conversation

chriselrod (Collaborator)

This is a WIP. Just wanted to have something up in case anyone wanted to look.

mcabbott (Member) commented Aug 8, 2021

How is it envisioned that this interacts with NNlib?

One version would look like this, but perhaps you have something else in mind:

  • NNlib defines a new function dense(f, W, X, b) and the gradient rules for this. It could just call * but can take a few steps through dense! etc. to provide places to hook in. Likewise conv_bias_act. (A rough sketch of this follows the list.)
  • Flux calls those functions, instead of writing things like σ.(... .+ b) directly.
  • CUDA obviously handles mul! and broadcasting; NNlibCUDA overloads only the in-place functions conv!(y::CuArray, ...) etc. Flux's Conv cannot be used with CUDA without this.
  • NNlibCPU wants to hook in slightly earlier, to fuse dense! instead. There should be a dense∇weight! for it, too?
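
A rough sketch of what that could look like (the names and signatures here are my illustration of the first bullet, not NNlib's actual API):

```julia
# Hypothetical names, sketching the shape of such an API: an out-of-place
# `dense` that does the matmul, plus an in-place `dense!` that fuses the
# bias + activation broadcast and gives backends something to overload.
function dense(f::F, W::AbstractMatrix, X::AbstractVecOrMat, b) where {F}
    Y = W * X              # plain matmul; CUDA etc. already dispatch on this
    return dense!(f, Y, b) # fused bias + activation, overwriting Y
end

dense!(f::F, Y::AbstractVecOrMat, b) where {F}         = Y .= f.(Y .+ b)
dense!(f::F, Y::AbstractVecOrMat, ::Nothing) where {F} = Y .= f.(Y)
```

Flux's Dense would then call dense(σ, W, x, b) instead of writing σ.(W*x .+ b) itself, and a backend package could overload dense! (or, as in the last bullet, hook in a step earlier).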

chriselrod (Collaborator, Author)

@DhairyaLGandhi

My plan here was to commit type piracy and add methods to existing functions.

However, I think adding dense(f, W, X, b) (and the associated ChainRules) would be a good idea.

> NNlibCPU wants to hook in slightly earlier, to fuse dense! instead. There should be a dense∇weight! for it, too?

What do you mean, in particular regarding fusion with conv?

mcabbott (Member) commented Aug 9, 2021

Maybe clearer with code: I pushed the dense_bias_act(f, w, x, b) I was fiddling with, which is more or less what dense(f, W, X, b) could piratically extend.

As that works now, it only defines a gradient for bias_act!, which takes Wx and does σ.(Wx .+ b), since that's where the fusion could save memory. But I think you want to do one more layer of fusion here and fold the matmul into the same function, in which case it perhaps needs to change to match, rather than relying on the existing gradient for *.
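
For concreteness, a rough sketch of that current arrangement (the names are illustrative, specialized to relu, and out of place for simplicity; the real thing handles general σ and works in place). The rule covers only the bias + activation broadcast, so the matmul's gradient still comes from the existing generic rule for `*`:

```julia
using ChainRulesCore

# Illustrative stand-in for the fused broadcast: y = relu.(Wx .+ b),
# where Wx = W * X was computed outside this function.
bias_act_relu(Wx::AbstractMatrix, b::AbstractVector) = max.(Wx .+ b, 0)

function ChainRulesCore.rrule(::typeof(bias_act_relu), Wx::AbstractMatrix, b::AbstractVector)
    z = Wx .+ b
    y = max.(z, 0)
    function bias_act_relu_pullback(ȳ)
        dWx = unthunk(ȳ) .* (z .> 0)   # relu′ fused into the same broadcast
        return (NoTangent(), dWx, vec(sum(dWx; dims=2)))
    end
    return y, bias_act_relu_pullback
end
```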

I see you fuse out[m,n] = f(out[m,n] + b[m]) for the forward pass; what are your thoughts on the gradient of f? In my PR it gets computed once and then used for Δx, Δw, Δb. Computing this is the copy I thought you could avoid by overwriting the forward pass, until I remembered that jacobian is going to do several backward passes off one forward pass.
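
And a sketch of the more fused variant under discussion, where the matmul is folded into the same function so its gradient no longer comes from `*`'s rule, and the activation derivative is computed once and reused for ΔW, ΔX, and Δb (again relu-only and out of place, purely for illustration):

```julia
using ChainRulesCore

# Forward fuses matmul + bias + activation; the pullback computes the
# activation derivative once and reuses it for all three input gradients.
dense_fused(W::AbstractMatrix, X::AbstractMatrix, b::AbstractVector) =
    max.(W * X .+ b, 0)

function ChainRulesCore.rrule(::typeof(dense_fused), W, X, b)
    z = W * X .+ b
    y = max.(z, 0)
    function dense_fused_pullback(ȳ)
        Δ  = unthunk(ȳ) .* (z .> 0)   # relu′ evaluated once...
        ΔW = Δ * X'                   # ...then reused for ΔW,
        ΔX = W' * Δ                   # ΔX,
        Δb = vec(sum(Δ; dims=2))      # and Δb
        return (NoTangent(), ΔW, ΔX, Δb)
    end
    return y, dense_fused_pullback
end
```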

Convolutional stuff is more WIP. There is a fused conv_bias_act!, but only for relu & identity, so it sometimes needs to call bias_act! in addition. Do you plan a fused convlayer!(f, out, ...) here too?

chriselrod (Collaborator, Author)

> As that works now, it only defines a gradient for bias_act!, which takes Wx and does σ.(Wx .+ b), since that's where the fusion could save memory. But I think you want to do one more layer of fusion here and fold the matmul into the same function, in which case it perhaps needs to change to match, rather than relying on the existing gradient for *.

I agree. You noted that on the GPU, tanh is as fast as identity. Maybe tanh_faster isn't there yet on the CPU, but relu almost certainly is, so I think it's worth fusing, especially when multithreading.

For data locality, I think a simple pattern would be for all NNlibCPU functions to thread only across the batch dimension by default, splitting the data into per-thread chunks the same way everywhere. The idea is that this minimizes movement of data between cores.

Because some applications may have smaller batch sizes than the number of available cores, we probably want to add an option to instead break up a problem among threads in another way. That probably won't be able to yield as good locality.
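
A minimal sketch of that default batch-only threading, assuming a hypothetical serial per-chunk kernel passed in as kernel!:

```julia
# Split the batch (last) dimension into one contiguous chunk per thread, so
# each core keeps touching the same columns of X and Y across calls.
function thread_over_batch!(kernel!, Y::AbstractMatrix, X::AbstractMatrix, args...)
    n  = size(X, 2)                      # batch size
    nt = min(Threads.nthreads(), n)
    Threads.@threads for t in 1:nt
        cols = (1 + div((t - 1) * n, nt)):div(t * n, nt)   # contiguous per-thread chunk
        kernel!(view(Y, :, cols), view(X, :, cols), args...)
    end
    return Y
end
```

When the batch is smaller than nthreads(), the fallback split mentioned above would have to divide the work along other dimensions instead.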

mcabbott (Member) commented Aug 9, 2021

Sounds like we're on the same page, good.

Re tanh: even if the computation dominates in the forward pass, the reverse is much cheaper. You'd have to do it in each of the 3 different operations, probably? It would, I think, save you one more allocation too.
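
A quick note on why the reverse is so cheap: tanh's derivative can be written from the saved forward output, tanh′(x) = 1 − tanh(x)² = 1 − y², so the pullback needs no tanh evaluations at all:

```julia
# tanh′(x) = 1 - tanh(x)^2, so the pullback only needs the saved output y:
tanh_pullback(ȳ, y) = ȳ .* (1 .- y .^ 2)
```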
