Performance issues when switching from local machine to HPC cluster. #2484
Comments
Timing like this mostly measures compilation time. There's some chance this is much slower on your server because it's the first run with these package versions, or something like that. I'd suggest as a first step timing separately the first and subsequent runs, like so:
@time for i = 1:2  # this will print only compilation time
    Flux.train!(loss, ps, dataset, opt)
end
@time for i = 3:1000  # this will print the time to run after compilation
    Flux.train!(loss, ps, dataset, opt)
end
Aside, I highly recommend changing this to use "explicit" style, as in current docs https://fluxml.ai/Flux.jl/stable/guide/training/training/ . This "implicit" style with
loss(m, x, y) = Flux.mse(m(x), y)  # no longer closes over global variable `model`
opt_state = Flux.setup(ADAGrad(), model)
for i = 1:1000
    Flux.train!(loss, model, dataset, opt_state)
end
Thank you for your suggestions. I cannot achieve the same speedup for the model I am actually interested in training. Output local OSX machine: Output HPC:
# code [edited to add code block, three `s]
using Flux
network_width = 32
perceptron = Chain(
    Dense(6, network_width, leakyrelu),
    Dense(network_width, network_width, leakyrelu),
    Dense(network_width, network_width, leakyrelu),
    Dense(network_width, network_width, leakyrelu),
    Dense(network_width, 3, relu)
)
opt_state = Flux.Optimiser(Flux.Adam(1e-6),ClipValue(1e-5))
state = Flux.setup(opt_state, perceptron)
k = Float32.(Vector(range(1,100,1000)))
b = Float32.(Vector(range(1,100,1000)))
w = Float32.(Vector(range(1,100,1000)))
r_k = Float32.(vcat(fill(0.1, length(k))...))
r_b = Float32.(vcat(fill(0.01, length(k))...))
p = Float32.(vcat(fill(1, length(k))...))
pi_ = Float32.(vcat(fill(0.01, length(k))...))
function abs_appr(x)
    y = sqrt.(x.^2 .+ Float32(1e-6))
    return y
end
function Residuals(perceptron, r_k, r_b, k, b, w, p, pi_,weights)
    n = size(w, 1)
    s = hcat(r_k, r_b, k, b, w, p)'
    x = perceptron(s)
    c = x[1, :]
    k1 = x[2, :]
    b1 = x[3, :]
    d = k1 .- (1 .+ r_k) .* k
    rknext = Float32.(max.(exp.(log.(1 .+ r_k) .* 0.9 .+ 0.1 .+ 0.1 * randn(Float32, n)) .- 1,0))
    rbnext = Float32.(exp.(log.(1 .+ r_b) .* 0.9 .+ 0.1 .+ 0.1 * randn(Float32, n)) .- 1)
    pinext = Float32.(exp.(log.(1 .+ pi_) .* 0.9 .+ 0.1 .+ 0.1 * randn(Float32, n)) .- 1)
    wnext = Float32.(w .* 0.9 .+ 0.1 .+ 0.1 * randn(Float32, n))
    p1 = Float32.(p .* (1 .+ pinext))
    s = hcat(rknext, rbnext, k1, b1, wnext, p1)'
    x = perceptron(s)
    c1 = x[1, :]
    k2 = x[2, :]
    d1 = k2 .- (1 .+ rknext) .* k1
    R1 = Float32.(1 .- 0.95 .* (1 .+ rbnext) .* (c1 ./ c ).^(-1.5) .* (p ./ p1))
    R2 = Float32.(w .+ (1 .+ r_b) .* b .+ (1 .+ r_k) .* k .- c .* p .- b1 .- 0.01 .* abs_appr.(d).^1.5 .- k1)
    R3 = Float32.(1 .+ d .* 0.01 .* 1.5 .* abs_appr.(d).^(1.5 - 2) .- 0.95 .* (1 .+ rknext) .* (c1 ./ c ).^(-1.5) .* (p ./ p1) .* (1 .+ d1 .* 0.01 .* 1.5 .* abs_appr.(d1).^(1.5 - 2)))
    R_squared = sum(weights[1] * R1.^2 + weights[2] *R2.^2 + weights[3] *R3.^2)/n
    return R_squared
end
function train_me!(epochs, perceptron, w, k, b, r_k, r_b, p, pi_, state; weights = [1,1,1])
    for epoch in 1:epochs
        # Compute the value and gradients of the loss function
        val, grads = Flux.withgradient(perceptron) do m
            loss = Residuals(m, r_k, r_b, k, b, w, p, pi_, weights)
        end
        Flux.update!(state, perceptron, grads[1])
    end
end
@time train_me!(2, perceptron, w, k, b, r_k, r_b, p, pi_, state; weights = [1, 0.1, 1]);
@time train_me!(1000, perceptron, w, k, b, r_k, r_b, p, pi_, state; weights = [1, 0.1, 1]);
No time to look closely now. But there is some chance your mac is just fast! M-processor memory is very quick. Some chance matmul isn't taking advantage of threads -- what does
I don't think it necessarily has to do with my Mac being faster. The first example I gave was faster on the cluster than on my Mac after I implemented the changes you suggested. The cluster took about 0.027025 seconds (97.80 k allocations: 6.853 MiB, 38.61% gc time) for 1000 training steps while my Mac took 0.469306 seconds (399.92 k allocations: 41.605 MiB, 0.97% gc time). This makes me think the performance differences are due to some non-optimal code... I get 8 threads on my Mac machine and 48 threads on the Linux machine. What does this mean in terms of performance? Finally, I am new to asking for help on forums like GitHub. Is cross-posting bad form? I thought that this would perhaps get some extra insights. Thanks!
Fine to ask in a few places, I just like cross-linking so that anyone can check whether what they're about to say has already been typed up nicely elsewhere. I'm not so surprised if relative performance looks different at different sizes. For large enough matrices, matmul will usually dominate the time, and this will depend on what BLAS library & how it works with your processor. Here at sizes like 32x1000 I'm not super-sure. Standard advice would be to profile & see where time is spent, but IMO this is seldom revealing once Zygote is involved. I don't see obvious performance-killing mistakes.
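For anyone comparing two machines, here is a minimal sketch for checking which BLAS library is loaded and how many threads it uses. It relies only on the standard `LinearAlgebra` library and assumes Julia 1.7 or newer for `BLAS.get_config()`. The variable names are just for illustration, and the printed values will differ per machine.

```julia
using LinearAlgebra

# Number of threads the BLAS backend will use for matmul
blas_threads = BLAS.get_num_threads()

# Which BLAS/LAPACK libraries are currently loaded (Julia 1.7+)
blas_config = BLAS.get_config()

# Julia's own thread count, which is separate from the BLAS setting
julia_threads = Threads.nthreads()

println("BLAS threads:  ", blas_threads)
println("BLAS config:   ", blas_config)
println("Julia threads: ", julia_threads)
```

Running this on both the laptop and the cluster node gives a quick side-by-side of the two setups before doing any deeper profiling.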
The user p_f suggested increasing the number of BLAS threads. Strangely, setting the number of BLAS threads lower improved performance. I have experimented a bit with different thread counts and got the following results: Running on node: fcn1 Apparently fewer threads are better in this case? Setting BLAS threads to one makes the HPC as fast as my Mac (excluding compilation time).
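A rough way to reproduce this kind of comparison without Flux is to time a single matmul of about the size this model uses, at several BLAS thread counts. The sketch below assumes a 32x32 weight matrix and a 32x1000 batch, matching the hidden layers above. The helper `time_matmul` and the thread counts tried are hypothetical choices, not something from the thread, and the timings will vary by machine.

```julia
using LinearAlgebra

# Shapes assumed from the model above: 32x32 weights times a 32x1000 batch
W = randn(Float32, 32, 32)
X = randn(Float32, 32, 1000)
Y = similar(X)

# Hypothetical helper: time `reps` in-place matmuls with `nthreads` BLAS threads
function time_matmul(Y, W, X, nthreads; reps = 1000)
    BLAS.set_num_threads(nthreads)
    mul!(Y, W, X)  # warm-up call so compilation is not included in the timing
    t = @elapsed begin
        for _ in 1:reps
            mul!(Y, W, X)
        end
    end
    return t
end

original = BLAS.get_num_threads()
thread_counts = (1, 2, 4, 8)  # arbitrary choice for illustration
timings = Dict(n => time_matmul(Y, W, X, n) for n in thread_counts)
BLAS.set_num_threads(original)  # restore the previous setting

for n in thread_counts
    println("BLAS threads = $n: $(round(timings[n]; digits = 4)) s")
end
```

If the single-thread timing is the fastest here too, that would be consistent with threading overhead outweighing the benefit for matrices this small. For training itself, calling `BLAS.set_num_threads(1)` once at the top of the script applies the same setting.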
Hey everyone,
First time posting here so let me know if I did not do it correctly or need to add more information.
I am dealing with a performance issue when switching from my local machine to an HPC cluster.
I found that the bottleneck is Flux. I got a minimal working example on both machines.
This is the minimal working example: [Edited to use three backticks for a code block]
On both machines I use Julia v1.10.4 and Flux v0.14.19.
On my local machine (OSX 14.5) I get Elapsed time: 0.5202600955963135 seconds.
On the HPC cluster (Linux 5.14.0-427.31.1.el9_4.x86_64) I get 12.309736967086792 seconds.
I run many small models simultaneously and sequentially, so this slowdown makes the simulations I need to run infeasible.
Does anyone have an idea what would cause this slowdown?
Thank you in advance.
Best, Steven