Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include rsqrt function for Vec{4,Float32} #57

Open
non-Jedi opened this issue Sep 10, 2019 · 12 comments
Open

Include rsqrt function for Vec{4,Float32} #57

non-Jedi opened this issue Sep 10, 2019 · 12 comments

Comments

@non-Jedi
Copy link
Contributor

llvm.x86.sse.rsqrt.ps calls the _mm_rsqrt_ps intrinsic which can greatly speed up calculations in some cases. It would be very convenient if this was included in SIMD.jl (see discussion here: JuliaPerf/BenchmarksGame.jl#44 (comment)). I see two barriers to this:

  1. rsqrt is not a function in Julia Base like the other functions defined in this package. To me, this seems like a non-issue, but it does expand the conceptual scope of this package.
  2. rsqrt could only be defined for Vec{4,Float32}. The current architecture of the package doesn't cleanly allow this. While llvmins could be defined for Float32 but not for Float64, there's no way of defining it for length 4 vectors but not for length 2. Perhaps this makes "(Also currently missing: Type conversions, reinterpretation that changes the vector size)" a prerequisite for including something like this.

Thoughts?

@KristofferC
Copy link
Collaborator

KristofferC commented Sep 10, 2019

Just so we know what we are getting into when starting to add intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Start scrolling :)

@eschnett
Copy link
Owner

In principle I'd be happy to add support for functions that are not in Base. However, I'd like to have a reasonably performing fall-back that works on all architectures (and for all vector types and sizes).

What I don't like about the rsqrt intrinsic is that it only calculates an approximation, i.e. it is much less accurate than 1/sqrt. Since I understand the need for performance, and since there's nothing wrong with SIMD.jl experimenting with things that might (or might not) be added to Base later, I'd also be happy to do that. However, the name should then be something like approx_rsqrt (please start bikeshedding).

The way to go about this would be:

  • add a function approx_rsqrt to SIMD
  • provide a generic implementation via 1/sqrt
  • provide a specific implementation for Intel, Float32, vector lengths 4 or 8, depending on what LLVM provides. I don't know how to check for this, but this should be determined automatically.

For bonus points:

  • provide an implementation for Intel and Float64 via explicit type conversions (no need to introduce generic mechanisms here)

@non-Jedi
Copy link
Contributor Author

provide a generic implementation via 1/sqrt

In my opinion, it would be better to only define the method for Vec{4,Float32}. This way, people don't mistakenly think they're getting the performance benefit of _mm_rsqrt_ps when using it with different Vec sizes or types. But if you'd prefer to have the fallback available for symmetry, I can understand that desire.

I suppose it's also worthwhile to provide versions for lengths 1-3 that just fill the remainder of the Vec with 1f0 and then drop the extra values on return.

@eschnett
Copy link
Owner

Yes, I'd rather have a function that works anywhere. This will allow e.g. people running tests on any kind of machine. You could provide a mechanism to tell callers whether there is an efficient implementation, e.g. by overloading a function.

approx_is_fast(f) = false
approx_is_fast(::typeof(approx_rsqrt)) = true

Fallbacks for other vector lengths would also be a good idea. There could also be fallbacks for longer vectors that split them into 4-wide or 8-wide chunks.

@non-Jedi
Copy link
Contributor Author

Are you comfortable taking a dependency on https://github.com/m-j-w/CpuId.jl? Much cleaner than reimplimenting in a cross-platform way.

@eschnett
Copy link
Owner

I think Cpuid only works for Intel CPUs.

Can you instead call LLVM with that intrinsic in a try block, and see if it fails? That would be the easiest way to check whether an intrinsic is supported.

@smallnamespace
Copy link

smallnamespace commented Sep 11, 2019

FYI, LLVM will automatically convert inverse sqrt IR into rsqrt_ss (single float) and rsqrt_ps (4x) assembly as long as SSE/fastmath is enabled, and automatically includes one Newton-Raphson refinement step for ~single-precision acccuracy (it may lose a couple ulp), so intrinsic + 1 step is isn't much less accurate, as long as limited exponent range and different inf/0 handling is fine for user.

Refining to double precision takes 2 NR steps, questionable whether it's favorable on Skylake+ where sqrt throughput is improved. This SO post is a useful summary.

On the other hand, making rsqrt emit bare intrinsic to LLVM allows caller to control how much refinement occurs.

@eschnett Is there a preference between bare intrinsic or depending on LLVM fastmath conversion?

@eschnett
Copy link
Owner

As mentioned above, mapping rsqrt to the bare intrinsic would be misleading due to its inaccuracy. I'd be happy with either or both of an rsqrt that is accurate (preferably letting LLVM handle the details) and an approx_rsqrt that is the bare intrinsic, with a fallback that works for all accuracies.

@smallnamespace
Copy link

Related: JuliaLang/julia#33220

If this is fixed, I believe SIMD.jl should automatically emit vectorized rsqrt for 4x floats?

@eschnett
Copy link
Owner

I'm not sure any more what the intent behind this PR is:
(1) Ensure that the expression 1/sqrt is vectorized efficiently, using e.g. the llvm.x86.sse.rsqrt.ps intrinsic with an additional Newton iteration
(2) Give people access to the raw llvm.x86.sse.rsqrt.ps intrinsic, so that they can calculate fast but inaccurate inverse square roots

Either are good goals, but we shouldn't mix up the discussions.

@smallnamespace
Copy link

smallnamespace commented Sep 12, 2019

Ah yes, sorry for crossing the streams.

From the user's standpoint, it's probably(?) best to let Julia/LLVM do the optimization (with fast-math as the hint). Exposing the bare intrinsic or approx_sqrt here is a stopgap.

Will take a crack at a PR soon.

@nlw0
Copy link
Contributor

nlw0 commented Jan 16, 2021

I'm surprised the approximate rsqrt instruction could be emitted "automatically" even with fast-math flag on. In my opinion it should be treated as a very special instruction, only emitted via intrinsics or via some special, dedicated high-level function that always does it.

I do think it's OK to replace it with with the accurate-but-slow version for the sake of compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants