Include rsqrt function for Vec{4,Float32} #57
Comments
Just so we know what we are getting into when starting to add intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Start scrolling :)
In principle I'd be happy to add support for functions that are not in Base. However, I'd like to have a reasonably performing fall-back that works on all architectures (and for all vector types and sizes). What I don't like about the […]

The way to go about this would be: […]

For bonus points: […]
In my opinion, it would be better to only define the method for […]. I suppose it's also worthwhile to provide versions for lengths 1-3 that just fill the remainder of the […].
Yes, I'd rather have a function that works anywhere. This will allow e.g. people running tests on any kind of machine. You could provide a mechanism to tell callers whether there is an efficient implementation, e.g. by overloading a function:

```julia
approx_is_fast(f) = false
approx_is_fast(::typeof(approx_rsqrt)) = true
```

Fallbacks for other vector lengths would also be a good idea. There could also be fallbacks for longer vectors that split them into 4-wide or 8-wide chunks.
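A minimal sketch of that pattern (the tuple-based fallback here is purely illustrative and stands in for SIMD.jl's `Vec`; only `approx_rsqrt` and `approx_is_fast` are names proposed in this thread):

```julia
# Accurate-but-portable fallback: works on any architecture.
approx_rsqrt(x::NTuple{4,Float32}) = map(v -> 1.0f0 / sqrt(v), x)

# Capability query: callers can ask whether a fast path exists.
approx_is_fast(f) = false
approx_is_fast(::typeof(approx_rsqrt)) = true  # overload once a fast path is wired up

# Longer vectors can fall back to 4-wide chunks, as suggested above.
function approx_rsqrt(x::NTuple{8,Float32})
    lo = approx_rsqrt(ntuple(i -> x[i], 4))
    hi = approx_rsqrt(ntuple(i -> x[i + 4], 4))
    (lo..., hi...)
end
```

With this shape, test suites can run the fallback everywhere while performance-sensitive callers branch on `approx_is_fast(approx_rsqrt)`.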
Are you comfortable taking a dependency on https://github.com/m-j-w/CpuId.jl? Much cleaner than reimplementing in a cross-platform way.
I think […]. Can you instead call LLVM with that intrinsic in a […]?
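For reference, a sketch of what binding the intrinsic directly through `Base.llvmcall` could look like. This is x86-only, the exact IR is my own assumption rather than anything SIMD.jl ships, and a real implementation would need a feature-detection guard with a `1/sqrt` fallback:

```julia
const Float32x4 = NTuple{4,VecElement{Float32}}

# Hypothetical direct binding to the SSE rsqrt intrinsic (~12-bit estimate).
# Valid only on x86 targets with SSE.
@inline rsqrt_sse(v::Float32x4) = Base.llvmcall(
    ("declare <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float>)",
     """
     %r = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %0)
     ret <4 x float> %r
     """),
    Float32x4, Tuple{Float32x4}, v)
```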
FYI, LLVM will automatically convert inverse […]. Refining to double precision takes 2 NR steps; it's questionable whether that's favorable on Skylake+ where […]. On the other hand, making […]. @eschnett: Is there a preference between the bare intrinsic or depending on LLVM fast-math conversion?
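The refinement being discussed is the standard Newton-Raphson iteration for 1/sqrt(x); a scalar sketch (the vectorized version is elementwise identical):

```julia
# One Newton-Raphson step: given y ≈ 1/sqrt(x), return a refined estimate.
# Derived from f(y) = 1/y^2 - x, giving y' = y * (3 - x*y^2) / 2.
# Each step roughly doubles the number of correct bits, so a ~12-bit hardware
# estimate needs one step for Float32 and two for Float64 accuracy.
nr_rsqrt_step(x, y) = y * (1.5 - 0.5 * x * y * y)

x  = 2.0
y0 = 1 / sqrt(x) * (1 + 1e-3)  # simulate a crude hardware estimate
y1 = nr_rsqrt_step(x, y0)      # error shrinks quadratically
```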
As mentioned above, mapping […].
Related: JuliaLang/julia#33220. If this is fixed, I believe SIMD.jl should automatically emit vectorized rsqrt for 4x floats?
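To illustrate the "let LLVM do it" route: with fast-math flags, the compiler is permitted to lower `1/sqrt` to a hardware rsqrt estimate plus Newton refinement. Whether `vrsqrtps` is actually emitted depends on the CPU target and LLVM version, so treat this as a sketch, not a guarantee:

```julia
# Fast-math gives LLVM license to use the approximate-reciprocal lowering
# when it is profitable for the target.
rsqrt_fastmath(x::Float32) = @fastmath inv(sqrt(x))

xs = Float32[1.0, 2.0, 4.0, 9.0]
ys = rsqrt_fastmath.(xs)  # the broadcast loop is a vectorization candidate
# Inspect what was actually generated with: @code_native rsqrt_fastmath(1.0f0)
```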
I'm not sure any more what the intent behind this PR is: […] Either are good goals, but we shouldn't mix up the discussions.
Ah yes, sorry for crossing the streams. From the user's standpoint, it's probably(?) best to let Julia/LLVM do the optimization (with fast-math as the hint). Exposing the bare intrinsic or […]. Will take a crack at a PR soon.
I'm surprised the approximate rsqrt instruction could be emitted "automatically" even with the fast-math flag on. In my opinion it should be treated as a very special instruction, only emitted via intrinsics or via some special, dedicated high-level function that always does it. I do think it's OK to replace it with the accurate-but-slow version for the sake of compatibility.
`llvm.x86.sse.rsqrt.ps` calls the `_mm_rsqrt_ps` intrinsic, which can greatly speed up calculations in some cases. It would be very convenient if this was included in SIMD.jl (see discussion here: JuliaPerf/BenchmarksGame.jl#44 (comment)). I see two barriers to this:

1. `rsqrt` is not a function in Julia Base like the other functions defined in this package. To me, this seems like a non-issue, but it does expand the conceptual scope of this package.
2. `rsqrt` could only be defined for `Vec{4,Float32}`. The current architecture of the package doesn't cleanly allow this. While `llvmins` could be defined for `Float32` but not for `Float64`, there's no way of defining it for length-4 vectors but not for length 2. Perhaps this makes "(Also currently missing: Type conversions, reinterpretation that changes the vector size)" a prerequisite for including something like this.

Thoughts?