If you have questions that aren't answered here please use the issue tracker to ask! We're happy to answer questions, and if you have a question odds are good someone else has the same question so we might want to add an answer here.
- Do I have to maintain two versions, one using SIMDe and one not?
- How much overhead does SIMDe introduce?
- Will SIMDe be less efficient than rewriting my code to target another ISA extension?
- Do I need to change my code to use SIMDe?
- What are "native aliases"?
- Does SIMDe do dynamic dispatching based on available ISA extensions?
- I don't want to use OpenMP.
- My project doesn't work with SIMDe; when will you add the feature(s) I need?
- Is it possible to tell if my code is using an unoptimized implementation?
- How can I help?
Not for technical reasons. If your target natively supports the function SIMDe is emulating, SIMDe won't emulate it at all; it will just call the native function, even at -O1.
If you need convincing, I suggest playing around with SIMDe on Compiler Explorer to see the (lack of) difference between the SIMDe version and the native version.
None. If you compile with the same options, virtually any compiler will completely optimize SIMDe out and just call the native version, even at fairly low levels of optimization; all it needs to do is some very basic inlining, which any compiler written in the last few decades will do.
SIMDe should never make your code less efficient, just more portable.
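The "just call the native function" pattern can be sketched roughly as follows. This is a simplified illustration and not SIMDe's actual source: the demo_i32x4 type and the demo_add_i32x4 function are hypothetical names, and the struct-based vector representation is an assumption made only to keep the example self-contained.

```c
#include <stdint.h>

#if defined(__SSE2__)
#  include <emmintrin.h> /* native SSE2 intrinsics */
#endif

/* Hypothetical 4 x int32 vector type; not a SIMDe type. */
typedef struct { int32_t lane[4]; } demo_i32x4;

/* Hypothetical wrapper: uses the native instruction when the target
   has SSE2, and a plain loop everywhere else. */
static inline demo_i32x4 demo_add_i32x4(demo_i32x4 a, demo_i32x4 b) {
  demo_i32x4 r;
#if defined(__SSE2__)
  __m128i va = _mm_loadu_si128((const __m128i *) a.lane);
  __m128i vb = _mm_loadu_si128((const __m128i *) b.lane);
  _mm_storeu_si128((__m128i *) r.lane, _mm_add_epi32(va, vb));
#else
  for (int i = 0; i < 4; i++) {
    /* Add as unsigned so wrap-around does not rely on signed overflow. */
    r.lane[i] = (int32_t) ((uint32_t) a.lane[i] + (uint32_t) b.lane[i]);
  }
#endif
  return r;
}
```

On an SSE2 target with optimization enabled, one would expect a wrapper like this to inline away, leaving only the native add; Compiler Explorer is a convenient way to check that for your own compiler.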
Probably. SIMDe itself can't fuse multiple operations into one, and there will usually be a bit of a mismatch between the ISA extension your code is written against and the one your compiler is targeting. A human can make adjustments which SIMDe simply can't.
That said, it's perfectly reasonable to use SIMDe and rewrite your code to target another ISA extension. In fact, SIMDe makes this much easier to do! With SIMDe you can port your SIMD code to almost any architecture with almost no effort. Once that's done, you can start adding ifdefs to rewrite as much or as little of your code as you deem necessary when you deem it necessary. This allows you to do the rewrite incrementally, testing as you go and never breaking the code. If some parts of the port are fast enough without a rewrite you can skip them altogether and save some time.
Basically, you don't lose anything by using SIMDe. The result may not be optimal, but it will likely work in places it didn't and you can still optimize as desired.
For example, here is an original implementation in AVX which also has a manually optimized AArch64 version:
#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/simde/x86/avx.h"

/* Source AVX implementation copied from
   https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-on-x86 */
int32_t hadd(simde__m128i x) {
  /* The FORCE_X86 option is just so we can see the
     difference between the versions, it's not necessary
     in your code. */
  #if \
      defined(SIMDE_ARCH_ARM_NEON) && \
      defined(SIMDE_ARCH_AARCH64) && \
      !defined(FORCE_X86)
    return vaddvq_s32(*((int32x4_t*) &x));
  #else
    __m128i hi64 = _mm_unpackhi_epi64(x, x);
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32 = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);
  #endif
}
If you look at it on Compiler Explorer you can easily see some different outputs; the first is the original AVX, the second is the optimized AArch64, and the third is the code SIMDe would have resulted in for AArch64 had we not created an optimized version.
You'll need to change your includes to include the SIMDe header for whatever ISA extension(s) you're interested in using. For example, if your code uses SSE, just change #include <xmmintrin.h> to #include "path_to_simde/simde/x86/sse.h".
Beyond that, you might be able to get away with using "native aliases". To use native aliases, just define SIMDE_ENABLE_NATIVE_ALIASES prior to including SIMDe. You can just add #define SIMDE_ENABLE_NATIVE_ALIASES to your code, or pass it to your compiler on the command line (typically with -DSIMDE_ENABLE_NATIVE_ALIASES). If you don't know what native aliases are, see the next question.
Optionally, to make sure your code is compiled to the fastest version, you should enable OpenMP 4 SIMD support if your compiler supports it. If you don't want to use OpenMP, please see the "I don't want to use OpenMP" question, where this is discussed in more detail.
Finally, some SIMDe functions depend heavily on the compiler's autovectorization capabilities, which are generally not enabled by default. Increasing the compiler's optimization level can result in drastic performance improvements for SIMDe, so if possible we recommend you do so.
By default, SIMDe puts everything in a "simde_" namespace. For example, _mm_add_epi32 becomes simde_mm_add_epi32, __m128 becomes simde__m128, and vaddq_s32 becomes simde_vaddq_s32. This means that you'll have to change your code to call the SIMDe versions instead. If you enable native aliases then SIMDe will also use the preprocessor to define macros which map the original versions to the SIMDe versions so you can use your code essentially unaltered. Essentially, each function in SIMDe also has a bit after it that looks like:
#if defined(SIMDE_ENABLE_NATIVE_ALIASES)
# define foo(bar, baz) simde_foo(bar, baz)
#endif
Unfortunately, some native APIs aren't designed to be portable. For example, Intel uses int to mean 32-bit signed integer, char to mean 8-bit signed integer, and so on. These assumptions don't always hold; int may be 64 bits on some platforms, or possibly even 16 bits (it is on AVR). On ARM, char is typically unsigned.
The SIMDe-prefixed functions "correct" this; for example, we use int32_t and int8_t instead of int and char. Unfortunately, when there is a mismatch this can trigger compiler warnings or cause code to fail. Casting in the native alias macros can solve this sometimes, but not always.
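To make the type mismatch a little more concrete, here is a minimal sketch. The names demo_prefixed_widen_i8, demo_native_widen_i8 and demo_char_is_signed are hypothetical and not part of SIMDe; the example only illustrates how a cast inside an alias macro can bridge a plain char argument and an int8_t parameter.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical "prefixed" function: takes an exact-width type,
   in the same spirit as the SIMDe-prefixed API. */
static int32_t demo_prefixed_widen_i8(int8_t value) {
  return (int32_t) value; /* sign-extends the 8-bit value */
}

/* Hypothetical "native alias": the explicit cast converts whatever the
   caller passes (plain char, int, ...) to int8_t before the call. */
#define demo_native_widen_i8(value) demo_prefixed_widen_i8((int8_t) (value))

/* Returns 1 where plain char is signed (typical on x86) and
   0 where it is unsigned (typical on ARM). */
static int demo_char_is_signed(void) {
  return CHAR_MIN < 0;
}
```

A cast like this mainly addresses implicit-conversion warnings. It cannot help where the native type is simply too narrow for the data, for example a 16-bit int being asked to carry a 32-bit value.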
The good news is that native aliases aren't going to make your code less portable. The bad news is that your code won't be as portable as it would be if you use the prefixed versions. It's a trade-off between portability and ease of use, and the decision is up to you.
It's worth noting that you can always start with native aliases to test SIMDe with your project, then move to the prefixed versions in the future if necessary. This was actually the original use case, but honestly they work well enough that a lot of projects will be fine just using them permanently.
No. SIMDe operates at too low of a level for dynamic dispatch. If we implemented it in SIMDe then every time you call a function we would have to add a trampoline to call the "right" implementation. Without a SIMDe init function we would also have to check to make sure the trampolines have been initialized. Dynamic dispatch should be done at a higher level so you only have to do the dispatching rarely.
SIMDe can, however, be combined with additional code to do dynamic dispatching, just like you would if you were using the APIs SIMDe emulates natively (i.e., calling SSE/AVX/NEON/etc. functions directly). If you're looking for code to help with that, you may want to check out the cpu_features library. Unfortunately it requires build-system integration (it uses CMake); if you know of a better library please let us know!
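As a rough sketch of dispatching at a higher level, the example below picks an implementation once per process and then calls it for a whole buffer, rather than checking CPU features on every intrinsic. All demo_ names are hypothetical. It assumes GCC or clang on x86 for __builtin_cpu_supports and falls back to the baseline elsewhere; it does not use cpu_features, and the "wide" kernel is only a stand-in for code that would be built with different target flags.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*demo_sum_fn)(const int32_t *data, size_t n);

/* Baseline kernel: in a real project this would be the build that
   targets the lowest common denominator. */
static int32_t demo_sum_baseline(const int32_t *data, size_t n) {
  uint32_t acc = 0;
  for (size_t i = 0; i < n; i++) acc += (uint32_t) data[i];
  return (int32_t) acc;
}

/* Stand-in for a kernel compiled separately with, e.g., AVX2 enabled. */
static int32_t demo_sum_wide(const int32_t *data, size_t n) {
  uint32_t acc = 0;
  for (size_t i = 0; i < n; i++) acc += (uint32_t) data[i];
  return (int32_t) acc;
}

/* Decide which kernel to use; called once, not once per element. */
static demo_sum_fn demo_pick_sum(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return demo_sum_wide;
#endif
  return demo_sum_baseline;
}

/* Public entry point: caches the choice in a function pointer.
   Note: this simple caching is not thread-safe. */
static int32_t demo_sum(const int32_t *data, size_t n) {
  static demo_sum_fn impl = NULL;
  if (impl == NULL) impl = demo_pick_sum();
  return impl(data, n);
}
```

In practice each variant would typically live in its own translation unit, built with the matching flags, and the SIMD code inside each one could be written with SIMDe as usual.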
SIMDe only uses the SIMD functionality from OpenMP, which is basically just a set of annotations added to the code to help the compiler automatically vectorize it. It doesn't depend on the OpenMP runtime. In many compilers you can even enable it without enabling the rest of OpenMP (or even linking to OpenMP):
- GCC and clang: -fopenmp-simd
- Intel C/C++ Compiler: -qopenmp-simd
That said, if you use these flags SIMDe has no way to know; compilers define _OPENMP only if the full OpenMP standard is supported, not just the SIMD part. To let SIMDe know to turn on the OpenMP SIMD annotations, you'll need to define SIMDE_ENABLE_OPENMP; this is generally done in the build system at the same time you set the OpenMP flag.
If your compiler supports the SIMD-only feature then I'm not aware of any reason not to use it, and using it may result in significant performance gains. Please enable it.
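For readers unfamiliar with OpenMP SIMD, the annotation is just a pragma placed on a loop. The sketch below shows the general idea with hypothetical names (DEMO_ENABLE_OPENMP, DEMO_VECTORIZE, demo_add_f32); it is not SIMDe's own macro or source, only an illustration of an opt-in annotation that needs no OpenMP runtime.

```c
#include <stddef.h>

/* Hypothetical opt-in macro, playing the role that
   SIMDE_ENABLE_OPENMP plays for SIMDe. */
#if defined(DEMO_ENABLE_OPENMP)
#  define DEMO_VECTORIZE _Pragma("omp simd")
#else
#  define DEMO_VECTORIZE
#endif

/* Element-wise addition; the pragma (when enabled) asks the compiler
   to vectorize the loop, but the code is correct either way. */
static void demo_add_f32(float *out, const float *a, const float *b, size_t n) {
  DEMO_VECTORIZE
  for (size_t i = 0; i < n; i++) {
    out[i] = a[i] + b[i];
  }
}
```

Built with something like -fopenmp-simd -DDEMO_ENABLE_OPENMP the pragma is active; built without those flags the macro expands to nothing and the loop is compiled as ordinary C.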
Please contact us using the issue tracker. There are a lot of functions to get through, but knowing what people need helps us prioritize. Please don't be shy… we're planning on implementing them anyway, and we would rather work on something we know people will use!
Not really. Even if you read the SIMDe source code the answer isn't always clear. Just because a function lacks an implementation that explicitly uses intrinsics for the ISA extension you are targeting (e.g., NEON, AltiVec, MSA, etc.) doesn't mean the code will be slow, since the compiler can do a lot:
- We use GCC-style vector extensions on compilers which support them. This typically results in the best code possible for all targets.
- We use compiler-specific builtins/intrinsics such as __builtin_shufflevector, __builtin_convertvector, and __builtin_shuffle when possible. Again, these typically result in the best code possible.
- Even when falling back on the fully portable implementations, compilers can often automatically vectorize SIMDe implementations into the best code available for your platform.
We do all we can to convince the compiler to do this whenever possible by adding annotations to loops requesting vectorization and providing extra information to the compiler so it knows it is safe to do so. There are several different implementations which are chosen from automatically at compile time depending on your compiler and settings, but in general the preferred implementation is OpenMP SIMD (if you don't like this, see "I don't want to use OpenMP").
So, if you compile with something like -O3 -fopenmp-simd -DSIMDE_ENABLE_OPENMP most functions should be quite fast.
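To give a feel for what a GCC-style vector extension looks like, here is a small sketch with a plain-loop fallback. The names demo_v4i32 and demo_add4 are hypothetical, and this is not how any particular SIMDe function is written; it only shows why such code needs no target-specific intrinsics.

```c
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__)
/* GCC-style vector type: four 32-bit integers in 16 bytes. */
typedef int32_t demo_v4i32 __attribute__((vector_size(16)));
#endif

/* Adds four int32 lanes. With vector extensions the compiler picks
   the instructions for whatever target it is building for. */
static void demo_add4(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
#if defined(__GNUC__)
  demo_v4i32 va, vb, vr;
  memcpy(&va, a, sizeof va);
  memcpy(&vb, b, sizeof vb);
  vr = va + vb; /* element-wise addition on the whole vector */
  memcpy(out, &vr, sizeof vr);
#else
  /* Fully portable fallback; compilers can often auto-vectorize this. */
  for (int i = 0; i < 4; i++) {
    out[i] = a[i] + b[i];
  }
#endif
}
```

The same source can then be compiled for x86, ARM, POWER, and so on, with the compiler choosing the vector instructions each target offers.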
Generally, the best thing you can do is profile your code (for example, using gprof) to find where it is spending the most time, then take a close look at the top functions. If there is already a target-specific implementation, there is almost certainly nothing we can do.
A lot of projects have multiple possible implementations; for example, an SSE2 version and an SSE4.1 version. Unfortunately there is no rule for which would provide a faster implementation using SIMDe; it will depend heavily on which functions you use in each implementation, how often they are called, and what the target architecture is.
The good news is that it should be pretty straightforward to port both to SIMDe, then benchmark to see which is faster. Keep in mind that the result may also vary by target; it might be better to use your SSE4.1 path on NEON, but SSE2 on POWER.
First off, if you're reading this: thank you! Even considering contributing to SIMDe is very much appreciated!
There is no shortage of tasks which could benefit from some help. If you're not sure what you'd like to do please take a look at the issue tracker to see if anything interests you. General areas include:
- Implementing portable versions of currently unsupported functions.
- Implementing functions from one ISA extension using another, for example SSE using NEON, or AVX using SSE.
- Optimizing existing portable implementations.
- Writing documentation.
- Using SIMDe to make other projects portable.
There is some basic documentation on implementing a new function.
If you need any help please feel free to reach out on the issue tracker. We're happy to answer questions, provide advice, and generally help however we can. We know the source can be a bit hard to understand at times, especially with some of the macros, so if there is anything you don't understand please ask; maybe we can even turn the answer into documentation to help other people with the same question.