
WIP: Hacky(!) but faster nbody implementations #44

Draft
wants to merge 1 commit into master
Conversation

@smallnamespace commented Sep 6, 2019

Here are a couple of messy implementations that, at least on my laptop with --target-cpu=core2 (the architecture of the actual BenchmarksGame test machine), beat the current ...simd.jl by about 40% and 60% respectively:

| impl | 1st run | 2nd run | speedup vs. simd |
| --- | --- | --- | --- |
| simd | 5.95s | 5.75s | - |
| unsafe_simd | 7.6s | 4.15s | 40% |
| unsafe_simd_unroll | 7.3s | 3.6s | 60% |
| Rust #7 | - | 3.1s | 85% |

I'd like some feedback before cleaning this up further (and before getting too deep in this rabbit hole 🙂), in particular on whether this is helpful for showing off the language, since the code is getting far from idiomatic Julia.

A few caveats:

  • Not idiomatic Julia because we're porting gcc #4 and rust #7, which liberally use SIMD intrinsics, lay memory out by hand, etc.
  • Compilation time is much longer (probably due to using StaticArrays), so this awaits Julia AOT (prototype to compile julia scripts to exe #35) to show real gains; or I can switch to using NTuples with unsafe stores.

Btw, the ...unroll.jl file has a hacky macro that fully unrolls some of the inner loops. This mimics how Rust #7 achieves its speedup: rustc is smart enough to automatically unroll the (outer) for loops inside advance, e.g. rsqrt is seen 5 times in decompiled asm.

I didn't go all the way to unrolling the stride-2 loop, but I could be persuaded to hack something up just to see how much improvement can be found.
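For context, here's a minimal sketch of how a fully-unrolling macro can work in Julia. This is an illustrative toy, not the PR's actual `...unroll.jl` macro; the macro name and signature are made up:

```julia
# Toy unrolling macro (illustrative only, not the PR's implementation).
# `@unroll var n body` splices `n` copies of `body` at macro-expansion
# time, with `var` bound to the literal values 1..n in each copy, so
# the compiler sees straight-line code instead of a loop.
macro unroll(var, n, body)
    n isa Integer || error("trip count must be a literal integer")
    copies = [:(let $(esc(var)) = $i
                    $(esc(body))
                end) for i in 1:n]
    return Expr(:block, copies...)
end

# Usage: the body runs 3 times with k = 1, 2, 3; no loop is emitted.
squares = Int[]
@unroll k 3 push!(squares, k^2)
```

Because each copy sees `k` as a compile-time constant, downstream constant folding and instruction scheduling get the same opportunities rustc exploits when it auto-unrolls the outer loops in `advance`.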

@KristofferC Thanks again for your help getting intrinsics working.

const PAIRS = Tuple((i,j) for i = 1:5, j = 1:5 if j > i)

struct Bodies
x::MMatrix{NBODIES, 3, Float64}
Collaborator
This is cheating imo. Are other languages really encoding the number of bodies as a compile time constant?

Author
@smallnamespace Sep 6, 2019

Very much so, e.g.:

I don't have a strong opinion about this, although it seems inevitable that benchmark implementations slowly slide towards more compile-time computation in the absence of clear, explicit guidelines. It's also quite arguable whether hand-vectorizing everything shows much, other than how easily a language lets you call intrinsics.

Julia's well-equipped to play this game though, especially through the macro facilities. Up to you whether you think it's a net benefit to the community to showcase this.

@smallnamespace force-pushed the smallnamespace-nbody-wip branch from 846e6b5 to f6ddb3f on September 6, 2019 at 20:45
Contributor
@non-Jedi commented Sep 7, 2019

Very cool. I think we can probably polish it to be more idiomatic Julia over time. I've been trying to get this one faster using simd intrinsics on my machine for a while now and mostly failing.

My preference would be to just use NTuples with unsafe_store! for now. I think getting AOT compilation working consistently on the benchmarks-game machine might be a ways off. And it might not be accepted at all depending on whether the maintainer wants to deal with the headache.
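As a rough illustration of that direction (a hedged sketch, not code from this PR; the buffer layout here is made up), a plain flat buffer plus raw-pointer writes avoids the StaticArrays dependency entirely:

```julia
# Hedged sketch of the plain-buffer + unsafe_store! approach
# (illustrative, not the PR's code). A flat Vector{Float64} holds
# 5 bodies × 3 coordinates; writes go through a raw pointer, and
# GC.@preserve keeps the buffer alive while the pointer is in use.
coords = zeros(Float64, 5 * 3)
GC.@preserve coords begin
    p = pointer(coords)
    unsafe_store!(p, 1.25, 1)   # x-coordinate of body 1
    unsafe_store!(p, -0.5, 4)   # x-coordinate of body 2
end
```

The trade-off is that all bounds and lifetime reasoning moves to the programmer, which is exactly why this style is far from idiomatic Julia.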


rsqrt(f::__m128) = ccall("llvm.x86.sse.rsqrt.ps", llvmcall, __m128, (__m128, ), f);
_mm_cvtpd_ps(f::__m128d) = ccall("llvm.x86.sse2.cvtpd2ps", llvmcall, __m128, (__m128d, ), f);
_mm_cvtps_pd(f::__m128) = llvmcall(("", "
Contributor

How difficult would it be to contribute these to SIMD.jl instead of including them inline here? It seems like rsqrt at least should be relatively trivial.

Author

@non-Jedi I think it's doable, but probably beyond my current Julia abilities.

I did find that LLVM will only emit the rsqrt assembly instructions for single and 4x floats (under -ffast-math and SSE3, I believe), see https://godbolt.org/z/v5DF_s . It does a single Newton-Raphson step with FMA.

The tricky bit here is that we would need to explicitly llvmcall the double to float conversions and word packing (for 2x case), as LLVM doesn't help us out here.

Also, I think SIMD.jl covers a subset of Julia math functions - would it be a bit strange to add a SIMD primitive that isn't in Base?
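For reference, the single Newton-Raphson step mentioned above looks like this in scalar form (an illustrative sketch; the actual codegen works on packed floats with FMA, and the function name is made up):

```julia
# Scalar sketch of one Newton-Raphson refinement of an rsqrt estimate
# (illustrative; LLVM emits the packed-single, FMA version of this).
# Given y0 ≈ 1/sqrt(x), one step roughly doubles the correct bits:
#     y1 = y0 * (1.5 - 0.5 * x * y0^2)
rsqrt_newton(x, y0) = y0 * (1.5 - 0.5 * x * y0 * y0)

y0 = 0.51                       # pretend hardware estimate for x = 4.0
y1 = rsqrt_newton(4.0, y0)      # much closer to the true 0.5
```

This is why the ~12-bit hardware `rsqrtps` estimate is usable here: one cheap refinement step recovers enough precision for the benchmark's tolerance.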

Contributor

I guess rsqrt is only simple if you define it for only Vec{4,Float32}, which the current architecture of SIMD.jl doesn't really allow. To me the functionality still belongs in there even if it's not a Base Julia function. I'll open an issue on the repo and see what the maintainer thinks.

unsafe_store!(dx_v2d_ptr, dx3, k_v2d + NPAIRS)

dsq = dx1^2 + dx2^2 + dx3^2
drsqrt = rsqrt_pd_newton(dsq)
Contributor
@non-Jedi Sep 10, 2019

Have you tried a version of this that uses all 4 available Float32 and only has 3 calls to _mm_rsqrt_ps instead of 5? It may be slower since it would require caching the dsq values and drsqrt values so you could break this single loop into 3.

Author

I haven't yet, will give it a try.

The double-single conversions and shuffling may end up being a bottleneck, at least for the competition's core2 target, since they are only 2-wide; in particular, cvtps_pd only expands a contiguous pair of floats into a contiguous pair of doubles.

If we allow AVX, a more optimal implementation may end up using Vec4 coordinates + 4-wide rsqrt, e.g. switching between broadcasting across coordinates vs. across planets.

Author

A 4-wide _mm_rsqrt_ps is slightly slower due to the extra caching involved (~4.3s vs. 4.15s on my machine).

Would you like me to clean up this PR?
