
Releases: simd-everywhere/simde

SIMDe 0.7.2

24 Jan 18:27
12069d7

Summary

Post-v0.7.0 fixes; more portable implementations of NEON intrinsics

Details

  • common: fix SIMDE_FLOAT64_C macro when SIMDE_FLOAT64_TYPE is defined 1d28a5d @rosbif
  • complex: split complex math out into separate header 0678336 @nemequ
  • diagnostic: silence a few -Weverything diagnostics on clang < 5 6f8d285 @nemequ

Implementation of NEON intrinsics:

x86 intrinsics

SSE*

AVX

AVX512

  • permutex2var: fix some signed/unsigned mismatch warnings 951caa1 @nemequ
  • avx512/s{r,l}li: the imm8 parameters should be unsigned ecc388d @nemequ

XOP

Testing with Docker/Podman & CI

Misc

SIMDe 0.7.0

27 Dec 12:37
f68981d

Version 0.7.0 Summary

  • Portable implementation of the NEON intrinsics: 57% finished
  • Some more WASM implementations of x86 intrinsics
  • Various SSE*, AVX*, and SVML enhancements
  • Various new and improved implementations for the AltiVec, NEON, and POWER architectures.
  • The "new" SSE2 _mm_{load,store}u_si{16,32,64} intrinsics are now implemented along with the SSE _MM_HINT_* defines.
  • All of the CLMUL intrinsics have been implemented (see the "CLMUL instruction set" article on Wikipedia and the CLMUL entries in the Intel Intrinsics Guide).

Please see the v0.7.0-rc-1 and v0.7.0-rc2 release notes for more details.

Changes since 0.7.0-rc2

Implementation of NEON intrinsics:

neon/orn: add AVX-512VL (ternarylogic) implementations d667aa8 @nemequ
neon/ld3, neon/ld4: disable -Wmaybe-uninitialized on GCC eaaa71f @nemequ

x86 intrinsics

SSE*

sse: cast _MM_HINT_* values to enum _mm_hint on GCC 3f7e6f7 @nemequ

AVX512

avx512/permutex2var: add remaining intrinsics and translations 5d8d9d2

Misc

math: add modf 580e401 @nemequ

Cleanups of SIMDE_BUG_* definitions e090746 @mr-c

SIMDe v0.7.0-rc2

22 Dec 11:20
Pre-release

Summary

Two issues found in SIMDe v0.7.0-rc-1 by testing on Debian Experimental across the Debian release architectures (amd64, arm64, armel, armhf, i386, mips64el, mipsel, ppc64el, s390x) have been fixed.

Various new and improved implementations for the AltiVec, NEON, and POWER architectures.

The "new" SSE2 _mm_{load,store}u_si{16,32,64} intrinsics are now implemented along with the SSE _MM_HINT_* defines.

All of the x86 CLMUL intrinsics have been implemented (see the CLMUL articles on Wikipedia and in the Intel Intrinsics Guide).

Details

Implementation of NEON intrinsics:

neon/cnt: _vcntq_s8 & _vcntq_u8, add AltiVec implementations 1d56b8c @nemequ
neon/shr_n: _vshrq_n_s8, avoid shift-negative-value diagnostics 26aeda4 @rosbif
neon/bic: _vbicq_s8 & _vbicq_s64, correct PPC implementations 2779ba0 @rosbif
neon/ld3: disable -Wmaybe-uninitialized on GCC < 10 c97093f @nemequ
neon/ld3: load entire vectors sequentially 4097372 @nemequ
neon/bsl, neon/mvn: use ternary logic on AVX-512VL 1660b73 @nemequ
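The AVX-512 ternary-logic instructions take an 8-bit immediate that acts as a truth table over three inputs, which is why operations such as bitwise select and NOT can each collapse into a single instruction. The following is a minimal scalar sketch of how such an immediate can be derived and applied; it is not SIMDe's code, and every name in it (ternlog_imm8, ternlog_u32, f_bsl, f_not_c, f_orn) is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper type: a boolean function of three 1-bit inputs. */
typedef unsigned (*bool3_fn)(unsigned a, unsigned b, unsigned c);

/* Build the 8-bit truth table. Bit i of the result is f(a, b, c),
 * where i = (a << 2) | (b << 1) | c. */
static uint8_t ternlog_imm8(bool3_fn f) {
  uint8_t imm = 0;
  for (unsigned i = 0; i < 8; i++) {
    unsigned a = (i >> 2) & 1u, b = (i >> 1) & 1u, c = i & 1u;
    imm |= (uint8_t) ((f(a, b, c) & 1u) << i);
  }
  return imm;
}

/* Bitwise select: take b where a is set, otherwise c. */
static unsigned f_bsl(unsigned a, unsigned b, unsigned c) {
  return (a & b) | (~a & c);
}

/* Bitwise NOT of the third operand; a and b are ignored. */
static unsigned f_not_c(unsigned a, unsigned b, unsigned c) {
  (void) a; (void) b;
  return ~c;
}

/* OR-NOT: a | ~b; c is ignored. */
static unsigned f_orn(unsigned a, unsigned b, unsigned c) {
  (void) c;
  return a | ~b;
}

/* Scalar model of the instruction: look up each bit position in the table. */
static uint32_t ternlog_u32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm) {
  uint32_t r = 0;
  for (unsigned bit = 0; bit < 32; bit++) {
    unsigned idx = (((a >> bit) & 1u) << 2)
                 | (((b >> bit) & 1u) << 1)
                 |  ((c >> bit) & 1u);
    r |= ((uint32_t) ((imm >> idx) & 1u)) << bit;
  }
  return r;
}
```

Under this bit ordering, the select function yields the table 0xCA, NOT yields 0x55, and OR-NOT yields 0xF3.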

SVML

svml: add fallbacks on shorter functions to div/rem/hypot/erfc (#598) 9199002 @himanshi18037

x86 intrinsics

features: GFNI needs <immintrin.h> 80a2e3d @rosbif

SSE*

sse: correct POWER versions in _mm_cmpunord_ps, add POWER6 version. 2b851a5 @rosbif
sse: correct PPC P5 to P6 in _mm_store_ps f889439 @rosbif
sse: include _MM_HINT_* defines, test for _mm_prefetch 6b2a873 @mr-c @nemequ
sse: added NEON impl for _mm_shuffle_ps 1777224 @masterchef2209
sse: work around missing vrndiq_f32 on GCC on armv8 with NEON b56248b @nemequ

sse, sse2: use ternary logic on AVX-512VL for NOT functions 97ac0a5 @nemequ

sse2: fix rounding of _mm_cvtps_epi32 on POWER on clang 0e60b5f @nemequ
sse2: implement the new instructions _mm_{load,store}u_si{16,32,64} b7f467f @nemequ
sse2: added NEON impl for _mm_shuffle_epi32, _mm_shuffle{lo,hi}_epi16 8525eba; _mm_mul_su32 5102af0; _mm_cvtsd_f64 6800867 @masterchef2209
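The _mm_{load,store}u_si{16,32,64} intrinsics listed above move 2, 4, or 8 bytes between unaligned memory and the low end of a 128-bit vector, and the loads zero the remaining bytes. A portable version of that behavior can be sketched with memcpy; the type and function names below (vec128_sketch, loadu_low_sketch, storeu_low_sketch) are hypothetical and this is not SIMDe's implementation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit integer vector, viewed as bytes. */
typedef struct { uint8_t bytes[16]; } vec128_sketch;

/* Load n bytes (2, 4, or 8) from a possibly unaligned address into the low
 * end of the vector and zero the rest. memcpy avoids the undefined behavior
 * of dereferencing a misaligned pointer. */
static vec128_sketch loadu_low_sketch(const void *mem, size_t n) {
  vec128_sketch r;
  memset(&r, 0, sizeof(r));
  memcpy(r.bytes, mem, n);
  return r;
}

/* Store the low n bytes of the vector to a possibly unaligned address. */
static void storeu_low_sketch(void *mem, vec128_sketch v, size_t n) {
  memcpy(mem, v.bytes, n);
}
```

Compilers generally turn small fixed-size memcpy calls like these into single load or store instructions, so the portable form need not be slow.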

sse4.1: regenerate _mm_dp_ps test to avoid rare rounding difference 8358e3c @nemequ

AVX / AVX2

Normalize SIMDE_NATURAL_VECTOR_SIZE usage 98213b3 @mr-c

AVX512

avx512/test: implement _mm512{,_mask}_test_epi{8,16,32,64}_mask ab6c230 @rosbif
avx512/kshift: implement _kshift[lr]i_mask{8,16,32,64} 6bf0dfd @rosbif
avx512/shuffle: implement _mm512_{,mask_,maskz_}shuffle_[fi]{32x4,64x2} e5352c3 @rosbif
Add defines for AVX512VBMI 11c88e2 @rosbif
avx512/permutexvar: add _mm512_{,mask_,maskz_}permutexvar_epi{8,16} and _mm512_{,mask_,mask2_,maskz_}permutex2var_epi{8,16} intrinsics b341db7 35c0e5d @rosbif
avx512/permutexvar: many AVX, SSE, NEON, PPC, and WASM implementations c2aa66b @rosbif
avx512/permutexvar: add 128- and 256-bit intrinsics and translations 7ff4af6 @rosbif

CLMUL

All CLMUL intrinsics implemented including _mm_clmulepi64_si128 7ced766 @nemequ
don't use __builtin_shufflevector on XLC 52848ad @nemequ
remove ' && 0' which I accidentally left in place fedae0b @nemequ
work around MSVC warning-turned-error 91fe7f4 @mr-c
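For readers unfamiliar with CLMUL: _mm_clmulepi64_si128 multiplies two 64-bit halves as polynomials over GF(2), so partial products are combined with XOR rather than addition, and its immediate selects which half of each input is used. A minimal scalar sketch of the 64x64 carry-less product follows. It illustrates the operation only and is not SIMDe's code; u128_sketch and clmul64_sketch are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical 128-bit result type, split into two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_sketch;

/* Carry-less multiply of two 64-bit values. For every set bit i of b,
 * XOR a shifted left by i into the 128-bit accumulator. */
static u128_sketch clmul64_sketch(uint64_t a, uint64_t b) {
  u128_sketch r = { 0, 0 };
  for (unsigned i = 0; i < 64; i++) {
    if ((b >> i) & 1u) {
      r.lo ^= a << i;
      if (i != 0) {
        r.hi ^= a >> (64 - i);  /* skip i == 0 to avoid an undefined 64-bit shift */
      }
    }
  }
  return r;
}
```

For example, 3 times 3 is 5 here rather than 9, because (x + 1)^2 = x^2 + 1 when coefficients are reduced modulo 2.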

Testing with Docker/Podman & CI

docker: use an argument for selecting the release eaee500 @nemequ
docker: add crypto and CRC to GCC 10 cross file ca05a1f @nemequ
docker: replace clang-8 cross file with one for clang-11 c326808 @nemequ

Misc

meson: bump version to 0.7.0-rc.1 ed4d5a0 @mr-c
CONTRIBUTING: switch documentation from CMake to Meson. 15f0e24 @nemequ
drone: use Ubuntu instead of Fedora for AArch64 build c5945ca @nemequ
update icc package name for oneapi gold release 820f684 @rscohn2
Document minimum GCC version for -fopenmp-simd 01c7aeb @mr-c
GitHub Actions CI: adjust macOS versions ad6e881 @mr-c

v0.7.0-rc-1

21 Nov 14:14
Pre-release

Summary

Portable implementation of the NEON intrinsics: 57% finished
Some more WASM implementations of x86 intrinsics
Various SSE*, AVX*, and SVML enhancements

Details

Implementation of NEON intrinsics:

neon/min: correctly handle (and test) NaNs 07d3a1f @nemequ
neon/zip1: add MMX/SSE, AltiVec, and shuffle vector implementations 56b9205 @nemequ
neon/zip2: add AltiVec, SSE, shuffle vector, etc. implementations f7f36e0 @nemequ
neon/uzp1, neon/uzp2: add AltiVec, SSE, shuffle, etc. implementations 7bcfd75 @nemequ
neon/shl: use SIMDE_POWER_ALTIVEC_BOOL instead of bool aadf0ff @nemequ
neon/addv: initial implementation 49681b6 @nemequ
neon/aba: initial implementation 22c27ec @nemequ
neon/abdl: initial implementation 84c2167 @nemequ
neon/addlv: initial implementation 6b17af2 @nemequ
neon/bic: initial implementation 76d755c @nemequ
neon/bic: add x86, WASM, and AltiVec implementations 9379e5c @nemequ
neon/cnt: initial implementation b15352c @nemequ
neon/hadd: initial implementation 5da4667 @nemequ
neon/hsub: initial implementation 19454d3 @nemequ
neon/maxv: initial implementation a5522ba @nemequ
neon/minv: initial implementation d241170 @nemequ
neon/mls: initial implementation 08a3957 @nemequ
neon/mlsl: initial implementation fd2d782 @nemequ
neon/mull_high: initial implementation c50c836 @nemequ
neon/mlsl_high: initial implementation 93276e0 @nemequ
neon/rbit: add GFNI implementations of vrbit functions fad5a93 @nemequ
neon/dup_lane: initial implementation 2a063f1 @nemequ
neon/orn: initial implementation d788736 @nemequ
neon/bic: fix search & replace error in license 6a1664c @nemequ
neon/qneg: initial implementation 93d6999 @nemequ
neon/maxnm: initial implementation 928834a @nemequ
neon/max: add NaN tests, fix implementations 0d69e18 @nemequ
neon/minv: fix NaN handling, add relevant tests 73044a5 @nemequ
neon/qadd: add scalar functions and the tests to go with them 25f398c @nemequ
neon/qabs: initial implementation fc38506 @nemequ
neon/qneg: add scalar functions and tests 1bf6283 @nemequ
neon/clz: initial implementation c8d74a5 @nemequ
neon/clz: add GFNI implementation of 8x8 functions 7fd22a9 @nemequ
neon/minnm: initial implementation fbd0fd0 @nemequ
neon/uzp1, neon/uzp2: add vuzp{,q}_* implementations for armv7 1d09549 @nemequ
neon/subw: initial implementation 0008eb3 @nemequ
neon/subw_high: initial implementation 4935cd4 @nemequ
neon/addw_high: initial implementation adf12f2 @nemequ
neon/uqadd: initial implementation 451136b @nemequ
neon/mul_lane: initial implementation 92e9df1 @nemequ
neon/mlsl_n: initial implementation 72497e7 @nemequ
neon/cls: initial implementation e6dde92 @nemequ
neon/qshl: initial implementation b266b2b @nemequ
neon/max: fix unsafe SSE2 implementation of vmaxq_f64 b45b259 @nemequ
neon/minnm, neon/maxnm: correct C&P errors in floating point functions 6958298 @rosbif
neon/shl_n, neon/shr_n: add GFNI-based 8-bit shifts 177e5e1 @nemequ
neon/movn_high: initial implementation 0e3e3fd @nemequ
neon/rnd: initial implementation 1bbc67e @nemequ
neon: fix detection of A32 functionality 8ff3a8f @nemequ
neon/mlal_n: initial implementation 7a2f504 @nemequ
neon/qsub: initial implementation 6db7032 @nemequ
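Several of the entries above (qadd, qsub, qneg, qabs, uqadd) belong to NEON's saturating family, where results clamp to the type's range instead of wrapping. A rough per-lane sketch of that behavior in plain C, using hypothetical function names and making no claim about how SIMDe implements it:

```c
#include <stdint.h>

/* Signed 8-bit saturating add: compute in a wider type, then clamp. */
static int8_t qadd_s8_sketch(int8_t a, int8_t b) {
  int16_t sum = (int16_t) (a + b);
  if (sum > INT8_MAX) return INT8_MAX;
  if (sum < INT8_MIN) return INT8_MIN;
  return (int8_t) sum;
}

/* Unsigned 8-bit saturating subtract: clamps at zero instead of wrapping. */
static uint8_t qsub_u8_sketch(uint8_t a, uint8_t b) {
  return (a > b) ? (uint8_t) (a - b) : 0;
}

/* Signed 8-bit saturating negate: -INT8_MIN does not fit, so it clamps. */
static int8_t qneg_s8_sketch(int8_t a) {
  return (a == INT8_MIN) ? INT8_MAX : (int8_t) -a;
}
```

The scalar forms are also a convenient reference when writing tests for the vector versions.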

SVML

svml: add shorter fallbacks for remaining functions 4400413 @nemequ
svml: GCC bug #53784 also occurs on s390x 5c2d66f @nemequ
svml: fix portable fallback for simde_x_mm512_deg2rad_{pd,ps} d33d0c7 @nemequ
svml: more work-arounds for GCC bug #53784 615ba1b @nemequ

x86 intrinsics

Fix compilation failures when targeting 32-bit x86 with >= SSE2 25b5fbc 82d0065 @nemequ
test/x86: add test_simde_mm{,256}_mask{,z}_xxx_epi{8,16} to skel f1c824f @ashnewmanjones
test/x86: add NaN test case generation functions to x86 d3384dd @nemequ
x86: add SIMDE_REQUIRE{,_CONSTANT}_RANGE macros to many functions 396a018 @ashnewmanjones

MMX

mmx: fix NEON implementation of _mm_srai_pi16 7c416cf @nemequ
mmx: work around some clang <= 11 bugs on POWER9 99c0b39 @nemequ

SSE*

sse/sse2/ssse3: more WASM implementations: _mm_srli_epi{16,32,64}, _mm_srl_epi{32,64} 63e63ed; _mm_cvt{epi32,si32,si64,si128}_* dd21f30; _mm_sra{,i}_epi{16,32} 3bd7ea9; mm_cmp{un}ord_ps ef06821; simde_mm_sign_epi{8,16,32} 55c5619 @masterchef2209
sse2: add WASM implementation of _mm_unpackhi_pd 4cd0b90 @zekehul
sse, neon/abs: _mm512_abs_ps was introduced in GCC 7.1 fb2a06f @milot-mirdita
sse2: simde_x_mm_abs_pd throws cast errors before GCC 7.4 f70e34c @milot-mirdita
sse2: fix NEON simde_mm_cmp_pd implementation 8bc8b12 @nemequ
sse, sse2: add several AltiVec, WASM, and NEON implementations 08db479 @nemequ
sse: add __builtin_nontemporal_store version of simde_mm_stream_ps 9a8001e @nemequ
sse2: rewrite the NEON implementation of simde_mm_sad_epu8 c520b2d @nemequ
sse2: improve simde_mm_madd_epi16 NEON & AltiVec implementations 55f703f @nemequ
sse4.1: add SSE2 and shuffle-based fallbacks for _mm_cvtepi*_epi* 197610c @nemequ
sse4.1: improve AArch64 _mm_dp_{ps,pd} implementations 3ebf82f @nemequ
sse: fix NaN handling for _mm_max_ps, update test case 15aa0c4 @nemequ
sse2: add shuffle-based implementation of _mm_mul_epu32 e2da067 @nemequ
sse2: improve NEON implementations of _mm_mulhi_ep{i,u}16 f7546c7 @nemequ
sse3: improve some NEON implementations 444cae1 @nemequ
ssse3: formatting fixes a560e2e @nemequ
ssse3: improve some NEON implementations 858d169 @nemequ
sse3: armv7 implementations of deinterleave functions fa158d1 @nemequ
sse3: improve NEON implementation of hadd/hsub functions d9e860e @nemequ
ssse3: many new or improved NEON implementations of pairwise functions 94b9c2f @nemequ
sse2: add missing mm_cmpngt_{pd,sd} 8a2d249 @ktgw0316
sse, sse2, sse4.1: fix ties-toward-even rounding 3208aeb @nemequ
sse4.1: better testing of _mm_round_ps b6a7310 @nemequ
sse: add simde_x_mm_round_ps with lax_rounding argument 24e5926 @nemequ
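NaN handling is a recurring theme in the fixes above because the rules differ between instruction sets. On x86, the max instruction returns its second operand whenever either input is NaN, whereas other APIs either propagate NaN or prefer the numeric operand. The per-lane sketch below models the three rules in plain C; the function names are hypothetical and this is an illustration, not SIMDe's code.

```c
#include <math.h>

/* x86 MAXPS rule: if the comparison is false for any reason, including
 * either input being NaN, the second operand is returned. */
static float x86_max_sketch(float a, float b) {
  return (a > b) ? a : b;
}

/* NaN-propagating rule (as in NEON vmax): any NaN input gives NaN. */
static float nan_propagating_max_sketch(float a, float b) {
  if (isnan(a)) return a;
  if (isnan(b)) return b;
  return (a > b) ? a : b;
}

/* Number-preferring rule (as in NEON vmaxnm or C's fmaxf): a single NaN
 * input is ignored in favor of the numeric operand. */
static float number_preferring_max_sketch(float a, float b) {
  if (isnan(a)) return b;
  if (isnan(b)) return a;
  return (a > b) ? a : b;
}
```

Note that the x86 rule is not symmetric: swapping the arguments changes the result when exactly one of them is NaN.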

AVX

avx: require x86_64 for _mm256_insert_epi64 82d0065 @nemequ
avx: simplify some broadcast functions bbcba0a @nemequ
avx, avx512: add missing undef directives for native aliases bb944be @nemequ

AVX2

avx2: squash clang -Weverything warning in portable _mm256_movemask_epi8 f3de4d9 @nemequ
avx2: add NEON and 128-bit implementations of several shift functions 31fe86d @nemequ
avx2, avx512/madd: add non-vector fallbacks 90503ed @nemequ
avx2: add some fallbacks on 128-bit functions 080c2e6 @nemequ

AVX512

avx512: refactor AVX-512 implementations to be structured like NEON bc7bfdc @nemequ
avx512/add: implement simde_mm_mask{,z}_add_ss d4bb2ad @himanshi18037
avx512/add: _mm_mask{,z}_add_ss was not available in GCC until 8.1 4af1c3a @nemequ
avx512/broadcast: correct feature checks for several functions 17f11f7 @nemequ
avx512: correct many feature tests 344a666 @nemequ
gh-actions: add avx512 builds face9ad @nemequ
avx512/extract: work around ICE on GCC 6 249d926 @nemequ
avx512/s{l,r}li: use CONSTIFY macros on certain GCC versions 9ecf9f2 @nemequ
avx512/s{l,r}li: add missing native versions of _mm512_s{l,r}li_epi16 239d484 @nemequ
avx512/add: fix simde_mm_mask{,z}_add_ss 12a2b5c @nemequ
avx512/extract: work around GCC 6 ICE fffe70f @nemequ
test/avx512: fix function for writing mmask variables 8c806d3 @nemequ
avx512/srl: fix portable fallbacks ffb8515 @nemequ
avx512/fm*: fix typo in portable _mm512_fm*_{ps,pd} fallbacks 119de0b @nemequ
avx512/loadu: add remaining loadu functions and tests cfe173d @nemequ
avx512/mov_mask: implement simde_mm{,256}_movepi{8,16,32,64}_mask e54dde8 @nemequ
avx512/srlv: add simde_mm512_srlv_epi{32,64} e253dff @anrodrig
avx512/srlv: implement several srlv functions and tests d05d2eb @nemequ
avx512/blend: implement remaining blend functions 16d99c3 @nemequ
avx, avx512: add missing undef directives for native aliases bb944be @nemequ
avx512/fma: use fmaf instead of fma for 32-bit floats f578fd5 @nemequ
avx512/div: add 256-bit fallbacks abfb353 @nemequ
avx512bw: implement mm512_mask{,z}_unpackhi_epi{8,16} 0484698 @ashnewmanjones
avx512/avg: implement simde_mm_mask{,z}_avg_epu{8,16} 542c52b @himanshi18037
avx512/setzero: add mm512_setzero_p{s,d} tests a26d3d1 @ashnewmanjones
avx512/set: add mm512_set_{epi{8,16,32,64},pd} tests 305e134 @ashnewmanjones
avx512vp2intersect: initial implementation a67e1be @ashnewmanjones
avx512/madd: initial implementation e8882b9 @ashnewmanjones
avx2, avx512/madd: add non-vector fallbacks 90503ed @nemequ
avx512/maddubs: implement maddubs functions 42ca3bd @ashnewmanjones
avx512/sll: add simde_mm512_mask{,z}_sll_epi16 functions 26ac148 @ashnewmanjones
avx512/avg: implement remaining avg functions abf7bd2 @ashnewmanjones
avx512/abs: add fallbacks on shorter vectors c82542d @nemequ
avx512/abs: add NEON and AltiVec implementations b47f166 @nemequ

GFNI

gfni: lower requirements for some functions 5dba288 @nemequ

Testing with Docker/Podman & CI

test: add code to generate special vectors for better coverage d0be929 @nemequ

azure-pipelines: add commented out loongson build b860895 @nemequ

travis: add gcc-6 and clang-3.5 builds 721c925 @nemequ
travis: use GCC 10 for AArch64 build b3a1794 @nemequ
travis: Add MIPS Loongson-MMI (Compile Only) 6537329 @FlyGoat
travis: new package name for intel oneapi beta10 6f6a0b1 @rscohn2

gh-actions: add avx512 builds face9ad @nemequ
gh-actions: disable xcode 10.3 build fe52903 @nemequ
gh-actions: update repo before (trying to) install pcre2grep e92f9ae @nemequ
gh-actions: read /proc/cpuinfo 8b3b405 @nemequ

testing with docker improvements a5c5826 c0c8c01 cf0cf14 @nemequ
docker: assorted clean-ups and documentation improvements a5c5826 @nemequ
docker: add 32-bit x86 builds c0c8c01 0d5a036 @nemequ
docker: add POWER clang builds cf0cf14 @nemequ
docker: add loongson and mips64el+msa builds c30b910 @nemequ
docker: add -futur...


v0.6.0

24 Aug 15:50

379 commits from 9 contributors, changing 273 files!

Full changelog

0.5.0

22 Jun 19:12

I’m pleased to announce the availability of the first release of SIMD
Everywhere (SIMDe), version 0.5.0, representing more than three years
of work by over a dozen developers.

SIMDe is a permissively-licensed (MIT) header-only library which
provides fast, portable implementations of SIMD intrinsics for
platforms which aren’t natively supported by the API in question.

For example, with SIMDe you can use SSE on ARM, POWER, WebAssembly, or
almost any platform with a C compiler. That includes, of course, x86
CPUs which don't support the ISA extension in question (e.g., calling
AVX-512F functions on a CPU which doesn't natively support them).

If the target natively supports the SIMD extension in question there
is no performance penalty for using SIMDe. Otherwise, accelerated
implementations, such as NEON on ARM, AltiVec on POWER, WASM SIMD on
WebAssembly, etc., are used when available to provide good
performance.
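The layering described above (native instruction when the target has it, otherwise a fallback) can be sketched in a few lines of C. This is a simplified illustration with only two tiers, a native SSE2 path and a plain C loop, and it uses hypothetical names (v4i32_sketch, add_epi32_sketch) rather than SIMDe's actual types or functions.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical portable vector type: four 32-bit integers. */
typedef struct { int32_t i32[4]; } v4i32_sketch;

/* Add four 32-bit lanes with wraparound, as _mm_add_epi32 does. */
static v4i32_sketch add_epi32_sketch(v4i32_sketch a, v4i32_sketch b) {
  v4i32_sketch r;
#if defined(__SSE2__)
  /* Native path: the target supports SSE2, so use the real instruction. */
  __m128i va = _mm_loadu_si128((const __m128i *) a.i32);
  __m128i vb = _mm_loadu_si128((const __m128i *) b.i32);
  _mm_storeu_si128((__m128i *) r.i32, _mm_add_epi32(va, vb));
#else
  /* Portable path: plain C. The addition is done on unsigned values so
   * that overflow wraps instead of being undefined behavior. */
  for (int i = 0; i < 4; i++) {
    r.i32[i] = (int32_t) ((uint32_t) a.i32[i] + (uint32_t) b.i32[i]);
  }
#endif
  return r;
}
```

A fuller version would add further branches for the accelerated back ends mentioned above, such as NEON, AltiVec, or WASM SIMD, between the native and plain C paths.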

SIMDe has already been used to port several packages to additional
architectures through either upstream support or distribution
packages, particularly on Debian.

If you'd like to play with SIMDe online, you can do so on Compiler
Explorer.

What is in 0.5.0

The 0.5.0 release is SIMDe’s first release. It includes complete
implementations of:

  • MMX
  • SSE
  • SSE2
  • SSE3
  • SSSE3
  • SSE4.1
  • AVX
  • FMA
  • GFNI

We also have rapidly progressing implementations of many other
extensions including NEON, AVX2, SVML, and several AVX-512 extensions
(AVX-512F, AVX-512BW, AVX-512VL, etc.).

Additionally, we have an extensive test suite to verify our
implementations.

What is coming next

Work on SIMDe is proceeding rapidly, but there are a lot of functions
to implement… x86 alone has about 6,000 SIMD functions, and we’ve
implemented about 2,000 of them. We will keep adding more functions
and improving the implementations we already have.

Our NEON implementation is being worked on very actively right now
by Sean Maher and Christopher Moore, and is expected to continue
progressing rapidly.

We currently have two Google Summer of Code students working on the
project as well; Hidayat Khan is working on finishing up AVX2, and
Himanshi Mathur is focused on SVML.

If you're interested in using SIMDe but need some specific functions
to be implemented first, please file an issue and we may be able to
prioritize those functions.

Getting Involved

If you're interested in helping out please get in touch. We have a
chat room on Gitter which is fairly active if you have questions, or
of course you can just dive right in on the issue tracker.