Improved time to first gradient #151
Seems to have eased up some memory pressure as well - Ubuntu tests on nightly are now passing 🎉
This is the first time that the compilation-time share of a first gradient has gone under 90%. I can't believe my eyes. Is it safe to say that DenseNet TTFG is no longer a concern either?
Well, there's probably a way to get it down even further, but for now this improvement looks pretty surreal.

Before:
julia> model = DenseNet();
julia> ip = rand(Float32, 224, 224, 3, 1);
julia> @time Zygote.gradient((m,x) -> sum(m(x)), model, ip);
78.400696 seconds (124.71 M allocations: 11.321 GiB, 1.71% gc time, 96.65% compilation time)

This PR:
julia> @time Zygote.gradient((m,x) -> sum(m(x)), model, ip);
28.161918 seconds (88.19 M allocations: 8.970 GiB, 3.66% gc time, 89.48% compilation time)
This might be slightly misleading, I think I left the REPL running 😅. The exact benchmarks of the improvements vary over runs, but one thing is clear: in every case (including a completely fresh REPL), there's at least a 2x improvement. There is also a common trend of about 17-18 seconds that Zygote itself takes to compile the first gradient call. Not sure if there's some way to bring that down, but it would help, because I think the models are currently doing all they can.
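Since a REPL that was left running can skew these numbers, one way to pin down a cold-start figure is to time the first gradient in a brand-new Julia process. The following is only a sketch: `cold_start_seconds` and `densenet_script` are hypothetical names, and the `using` line assumes Flux, Zygote and Metalhead are installed in the active project.

```julia
# Hypothetical helper: run `script` in a fresh Julia process and parse the
# last line it prints as a number of seconds.
function cold_start_seconds(script::AbstractString)
    proj = Base.active_project()
    # Reuse the active project (if any) so the child process sees the same packages.
    flags = proj === nothing ? String[] : ["--project=$(dirname(proj))"]
    cmd = `$(Base.julia_cmd()) --startup-file=no $flags -e $script`
    out = read(cmd, String)
    return parse(Float64, last(split(strip(out), '\n')))
end

# The DenseNet measurement from this thread, written as a script for the child
# process. The `using` line is an assumption about the environment.
densenet_script = """
using Flux, Zygote, Metalhead
model = DenseNet()
ip = rand(Float32, 224, 224, 3, 1)
t = @elapsed Zygote.gradient((m, x) -> sum(m(x)), model, ip)
println(t)
"""

# Expensive, so left commented out:
# cold_start_seconds(densenet_script)
```

Because every call starts a new process, repeated calls are independent cold starts, which makes it easier to see how much the timings vary from run to run.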
That's right and something I showed in #150 as well. Difference being that I see an order of magnitude more on first compile which is very curious.
Yes, there is. I have experimented with precompile statements. In fact we used to do a call to gradient in Zygote during precompilation for exactly this reason. We were able to shave the "compile zygote" time by almost an order of magnitude iirc.
That's not strictly the case 😅. If you were to try an older Metalhead + Flux + Zygote, you would see ~4x faster TTFGs in some cases. There are still some tricks we can apply to get compilation pressure eased off, mostly to do with caching and stability.
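For readers unfamiliar with precompile statements, here is a toy sketch of the mechanism using Base's `precompile`. The function `toy_loss` is a made-up stand-in, not Zygote code, and this does not reproduce what Zygote used to do during precompilation.

```julia
# Toy stand-in for an expensive entry point (hypothetical; not part of Zygote).
toy_loss(w::Vector{Float32}, x::Vector{Float32}) = sum(abs2, w .* x)

# Ask Julia to compile the method for these concrete argument types now,
# instead of on the first call. Returns `true` if compilation succeeded.
ok = precompile(toy_loss, (Vector{Float32}, Vector{Float32}))

# The first real call can now reuse the compiled code.
val = toy_loss(Float32[1, 2], Float32[3, 4])
```

Placed at the top level of a package module, a statement like this runs while the package is being precompiled, so the work can be cached rather than repeated in every session. How much of a gradient call can be cached this way is a separate question that this sketch does not answer.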
I'm only aware of the switch from custom layer types to |
Can you test this without returning a This is really a sad state of affairs for Zygote that Julia 1.5 -> 1.6 caused such major performance regressions for the most basic operation in ML. |
Even the "old" Metalhead used a flat chain for VGG, and DenseNet used a flat chain only replacing |
@ToucheSir I tried testing with FluxML/Zygote.jl#1195 and an Another curiosity is that the CI test times have not gone down for this PR. @theabhirath are both the screenshots of the tests above with the same Julia version? I know you like to run nightly so I'd be curious if Julia versions are making a big difference here. |
There doesn't seem to be much difference in gradient times but it shaves some time off the forward pass (returning |
Well, they're both nightly 😅 But there's been no major PRs to master that I think could've changed things this drastically, and nothing else has changed between the runs
That may be limited by memory? I'm on a 16 GB machine, while IIRC the runners have less to work with (7, I think?). Not sure if the difference should still show up in some fashion though.
@darsnack that PR won't help TTFG much since Zygote + IRTools still has to churn through all of the control flow in What might help is optimizing the AD compilation pass itself. I have a local IRTools branch that shaves ~10s off TTFG for ViT through a combination of precompilation and reducing memory allocations in one particularly time-consuming function. However, it's unclear how much mileage is left for this approach, as profiling suggests a lot of time is spent in inference or LLVM. Perhaps 1.8/9 will help with those? |
Just for clarity: you're saying that |
Great, thanks @theabhirath ! I think this is good to go since there is plenty of improvement in there already and we can move ahead with the compilation tirade. Returning nested One final thing would be to reenable testing gradients out of the models. Those are skipped currently. |
The memory issues on GitHub Actions prevent this. Testing locally does take a lot of memory (I've had to intervene to ensure it doesn't write too much into my swap).
This is my fault for not commenting, but I would actually prefer a follow up PR to remove the nesting. Not just because arbitrary nesting makes iteration and indexing inconvenient, but more practically because nesting is a breaking change. And it doesn't seem necessary when using |
No, a breaking change is okay if it is actually making a difference.
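To make the nesting trade-off concrete, here is a small sketch comparing a flat, Vector-backed `Chain` with a nested one, assuming Flux v0.13 or later. `toy_conv_bn` is a hypothetical stand-in and not Metalhead's actual `conv_bn`.

```julia
using Flux

# Hypothetical stand-in for a conv + batchnorm helper; returns a Vector of layers.
toy_conv_bn(k, ch) = [Conv(k, ch; pad = SamePad()), BatchNorm(last(ch), relu)]

layers = vcat(toy_conv_bn((3, 3), 3 => 8), toy_conv_bn((3, 3), 8 => 16))

# Flat model: a single Chain backed by a Vector (valid since Flux v0.13).
flat = Chain(layers)

# Nested model built from the same layer objects, grouped two by two.
nested = Chain(Chain(layers[1:2]...), Chain(layers[3:4]...))

x = rand(Float32, 32, 32, 3, 1)

# Same computation, different indexing: flat[2] and nested[1][2] are the same layer.
same_output = flat(x) ≈ nested(x)
same_layer = flat[2] === nested[1][2]
```

The flat version needs one index to reach any layer, while the nested version needs two, which is the inconvenience for iteration and indexing raised above. Both compute the same output here because they share the same layer objects.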
Why is the test time so different for AlexNet? It contains no I feel like we need a more rigorous benchmarking environment beyond |
True, this is completely local and it's not really much of a benchmark because I've just run the tests twice 😅 |
Edit: Initially this had some benchmarks that weren't completely accurate because I'd left the REPL running and it wasn't the first Zygote.gradient. The DenseNet benchmark is pretty accurate in this regard.

This PR (building on the work done by @DhairyaLGandhi in #150) uses a Flux v0.13 feature (namely, the fact that Chain(::Vector) is valid syntax, along with returning a Chain as an output from conv_bn) to halve compilation time for most models (and for some models, even better). From a cold start (first Zygote.gradient):