
WIP: Hacky(!) but faster nbody implementations #44

Draft
wants to merge 1 commit into master
Conversation

@smallnamespace commented Sep 6, 2019

Here are a couple of messy implementations that, at least on my laptop with --target-cpu=core2 (the architecture of the actual BenchmarksGame test machine), beat the current ...simd.jl by about 40% and 60% respectively:

| impl | 1st run | 2nd run | speedup vs. simd |
| --- | --- | --- | --- |
| simd | 5.95s | 5.75s | - |
| unsafe_simd | 7.6s | 4.15s | 40% |
| unsafe_simd_unroll | 7.3s | 3.6s | 60% |
| Rust #7 | - | 3.1s | 85% |

I'd like some feedback before cleaning this up further (and before getting too deep in this rabbit hole 🙂), in particular on whether this is helpful for showing off the language, since the code is getting far from idiomatic Julia.

A few caveats:

  • Not idiomatic Julia because we're porting gcc #4 and rust #7, which liberally use SIMD intrinsics, lay memory out by hand, etc.
  • Compilation time is much longer (probably due to using StaticArrays), so this awaits Julia AOT (prototype to compile julia scripts to exe #35) to show real gains; or I can switch to using NTuples with unsafe stores.

Btw, the ...unroll.jl file has a hacky macro that fully unrolls some of the inner loops. This mimics how Rust #7 achieves its speedup: rustc is smart enough to automatically unroll the (outer) for loops inside advance, e.g. rsqrt is seen 5 times in decompiled asm.

I didn't go all the way to unrolling the stride-2 loop, but I could be persuaded to hack something up just to see how much improvement can be found.
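For context, here's a minimal sketch of how a fully-unrolling macro can work in Julia. This is an illustrative toy, not the PR's actual `...unroll.jl` macro; the macro name and signature are made up:

```julia
# Toy unrolling macro (illustrative only, not the PR's implementation).
# `@unroll var n body` splices `n` copies of `body` at macro-expansion
# time, with `var` bound to the literal values 1..n in each copy, so
# the compiler sees straight-line code instead of a loop.
macro unroll(var, n, body)
    n isa Integer || error("trip count must be a literal integer")
    copies = [:(let $(esc(var)) = $i
                    $(esc(body))
                end) for i in 1:n]
    return Expr(:block, copies...)
end

# Usage: the body runs 3 times with k = 1, 2, 3; no loop is emitted.
squares = Int[]
@unroll k 3 push!(squares, k^2)
```

Because each copy sees `k` as a compile-time constant, downstream constant folding and instruction scheduling get the same opportunities rustc exploits when it auto-unrolls the outer loops in `advance`.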

@KristofferC Thanks again for your help getting intrinsics working.

const PAIRS = Tuple((i,j) for i = 1:5, j = 1:5 if j > i)

struct Bodies
x::MMatrix{NBODIES, 3, Float64}
Collaborator
This is cheating imo. Are other languages really encoding the number of bodies as a compile time constant?

Author
@smallnamespace Sep 6, 2019

Very much so, e.g.:

I don't have a strong opinion about this, although it seems inevitable that benchmark implementations slowly slide towards more compile-time computation in the absence of clear, explicit guidelines. It's also quite arguable whether hand-vectorizing everything shows much, other than how easily a language lets you call intrinsics.

Julia's well-equipped to play this game though, especially through the macro facilities. Up to you whether you think it's a net benefit to the community to showcase this.

@smallnamespace force-pushed the smallnamespace-nbody-wip branch from 846e6b5 to f6ddb3f on September 6, 2019 at 20:45
Contributor
@non-Jedi commented Sep 7, 2019

Very cool. I think we can probably polish it to be more idiomatic Julia over time. I've been trying to get this one faster using simd intrinsics on my machine for a while now and mostly failing.

My preference would be to just use NTuples with unsafe_store! for now. I think getting AOT compilation working consistently on the benchmarks-game machine might be a ways off. And it might not be accepted at all depending on whether the maintainer wants to deal with the headache.
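As a rough illustration of that direction (a hedged sketch, not code from this PR; the buffer layout here is made up), a plain flat buffer plus raw-pointer writes avoids the StaticArrays dependency entirely:

```julia
# Hedged sketch of the plain-buffer + unsafe_store! approach
# (illustrative, not the PR's code). A flat Vector{Float64} holds
# 5 bodies × 3 coordinates; writes go through a raw pointer, and
# GC.@preserve keeps the buffer alive while the pointer is in use.
coords = zeros(Float64, 5 * 3)
GC.@preserve coords begin
    p = pointer(coords)
    unsafe_store!(p, 1.25, 1)   # x-coordinate of body 1
    unsafe_store!(p, -0.5, 4)   # x-coordinate of body 2
end
```

The trade-off is that all bounds and lifetime reasoning moves to the programmer, which is exactly why this style is far from idiomatic Julia.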


rsqrt(f::__m128) = ccall("llvm.x86.sse.rsqrt.ps", llvmcall, __m128, (__m128, ), f);
_mm_cvtpd_ps(f::__m128d) = ccall("llvm.x86.sse2.cvtpd2ps", llvmcall, __m128, (__m128d, ), f);
_mm_cvtps_pd(f::__m128) = llvmcall(("", "
Contributor

How difficult would it be to contribute these to SIMD.jl instead of including them inline here? It seems like rsqrt at least should be relatively trivial.

Author

@non-Jedi I think it's doable, but probably beyond my current Julia abilities.

I did find that LLVM will only emit the rsqrt assembly instructions for single and 4x floats (under -ffast-math and SSE3, I believe), see https://godbolt.org/z/v5DF_s . It does a single Newton-Raphson step with FMA.

The tricky bit here is that we would need to explicitly llvmcall the double to float conversions and word packing (for 2x case), as LLVM doesn't help us out here.

Also, I think SIMD.jl covers a subset of Julia math functions - would it be a bit strange to add a SIMD primitive that isn't in Base?
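For reference, the single Newton-Raphson step mentioned above looks like this in scalar form (an illustrative sketch; the actual codegen works on packed floats with FMA, and the function name is made up):

```julia
# Scalar sketch of one Newton-Raphson refinement of an rsqrt estimate
# (illustrative; LLVM emits the packed-single, FMA version of this).
# Given y0 ≈ 1/sqrt(x), one step roughly doubles the correct bits:
#     y1 = y0 * (1.5 - 0.5 * x * y0^2)
rsqrt_newton(x, y0) = y0 * (1.5 - 0.5 * x * y0 * y0)

y0 = 0.51                       # pretend hardware estimate for x = 4.0
y1 = rsqrt_newton(4.0, y0)      # much closer to the true 0.5
```

This is why the ~12-bit hardware `rsqrtps` estimate is usable here: one cheap refinement step recovers enough precision for the benchmark's tolerance.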

Contributor

I guess rsqrt is only simple if you define it for only Vec{4,Float32}, which the current architecture of SIMD.jl doesn't really allow. To me the functionality still belongs in there even if it's not a Base Julia function. I'll open an issue on the repo and see what the maintainer thinks.

unsafe_store!(dx_v2d_ptr, dx3, k_v2d + NPAIRS)

dsq = dx1^2 + dx2^2 + dx3^2
drsqrt = rsqrt_pd_newton(dsq)
Contributor
@non-Jedi Sep 10, 2019

Have you tried a version of this that uses all 4 available Float32 and only has 3 calls to _mm_rsqrt_ps instead of 5? It may be slower since it would require caching the dsq values and drsqrt values so you could break this single loop into 3.

Author

I haven't yet, will give it a try.

The double-single conversions and shuffling may end up being a bottleneck, at least for the competition's core2 target, since they are only 2-wide; in particular, cvtps_pd only expands a contiguous pair of floats into a contiguous pair of doubles.

If we allow AVX, a more optimal implementation may end up using Vec4 coordinates + 4-wide rsqrt, e.g. switching between broadcasting across coordinates vs. across planets.

Author

A 4-wide _mm_rsqrt_ps is slightly slower due to the extra caching involved (~4.3s vs. 4.15s on my machine).

Would you like me to clean up this PR?
