Add AVX2 support (#23035)

Followup to #22430. Each 256-bit AVX2 intrinsic is emulated on top of 128-bit intrinsics that wasm supports directly.
emscripten-core · Jan 23, 2025 · ee32d3a
1 parent 2607cbf
commit ee32d3a

Showing 10 changed files with 3,172 additions and 35 deletions.
diff --git a/ChangeLog.md b/ChangeLog.md
@@ -20,6 +20,9 @@ See docs/process.md for more on how version tagging works.
 
 4.0.1 (in development)
 ----------------------
+- Added support for compiling AVX2 intrinsics. Each 256-bit wide intrinsic is
+  emulated on top of the 128-bit Wasm SIMD instruction set. Pass
+  `-msimd128 -mavx2` to enable targeting AVX2. (#23035)
 - The system JS libraries in `src/` were renamed from `library_foo.js` to
   `lib/libfoo.js`. They are still included via the same `-lfoo.js` flag so
   this should not be a user-visible change. (#23348)

diff --git a/emcc.py b/emcc.py
@@ -76,7 +76,7 @@
   'fetchSettings'
 ]
 
-SIMD_INTEL_FEATURE_TOWER = ['-msse', '-msse2', '-msse3', '-mssse3', '-msse4.1', '-msse4.2', '-msse4', '-mavx']
+SIMD_INTEL_FEATURE_TOWER = ['-msse', '-msse2', '-msse3', '-mssse3', '-msse4.1', '-msse4.2', '-msse4', '-mavx', '-mavx2']
 SIMD_NEON_FLAGS = ['-mfpu=neon']
 LINK_ONLY_FLAGS = {
     '--bind', '--closure', '--cpuprofiler', '--embed-file',
@@ -474,6 +474,9 @@ def array_contains_any_of(hay, needles):
   if array_contains_any_of(user_args, SIMD_INTEL_FEATURE_TOWER[7:]):
     cflags += ['-D__AVX__=1']
 
+  if array_contains_any_of(user_args, SIMD_INTEL_FEATURE_TOWER[8:]):
+    cflags += ['-D__AVX2__=1']
+
   if array_contains_any_of(user_args, SIMD_NEON_FLAGS):
     cflags += ['-D__ARM_NEON__=1']
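
The slicing in the hunk above relies on the flag list being ordered from oldest to newest instruction set, so a tail slice selects "this feature or anything newer". The sketch below restates that idea as a standalone snippet. The body of `array_contains_any_of` and the wrapper `intel_simd_defines` are assumptions made for illustration; only the list contents and the two `if` checks come from the diff.

```python
# Standalone illustration of the feature-tower check; not the actual emcc.py code path.
SIMD_INTEL_FEATURE_TOWER = ['-msse', '-msse2', '-msse3', '-mssse3',
                            '-msse4.1', '-msse4.2', '-msse4', '-mavx', '-mavx2']


def array_contains_any_of(hay, needles):
    """Assumed behaviour: True if any element of `needles` occurs in `hay`."""
    return any(n in hay for n in needles)


def intel_simd_defines(user_args):
    """Hypothetical helper: collect the AVX-related -D flags implied by user_args."""
    cflags = []
    # Index 7 is '-mavx', so the tail slice is ['-mavx', '-mavx2']:
    # either flag implies __AVX__.
    if array_contains_any_of(user_args, SIMD_INTEL_FEATURE_TOWER[7:]):
        cflags += ['-D__AVX__=1']
    # Index 8 is '-mavx2', so the tail slice is ['-mavx2'] only.
    if array_contains_any_of(user_args, SIMD_INTEL_FEATURE_TOWER[8:]):
        cflags += ['-D__AVX2__=1']
    return cflags
```

Under these assumptions, passing `-mavx2` yields both defines, while `-mavx` alone yields only `-D__AVX__=1`.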
 

diff --git a/site/source/docs/porting/simd.rst b/site/source/docs/porting/simd.rst
@@ -12,7 +12,7 @@ Emscripten supports the `WebAssembly SIMD <https://github.com/webassembly/simd/>
 1. Enable LLVM/Clang SIMD autovectorizer to automatically target WebAssembly SIMD, without requiring changes to C/C++ source code.
 2. Write SIMD code using the GCC/Clang SIMD Vector Extensions (``__attribute__((vector_size(16)))``)
 3. Write SIMD code using the WebAssembly SIMD intrinsics (``#include <wasm_simd128.h>``)
-4. Compile existing SIMD code that uses the x86 SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 or AVX intrinsics (``#include <*mmintrin.h>``)
+4. Compile existing SIMD code that uses the x86 SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX or AVX2 intrinsics (``#include <*mmintrin.h>``)
 5. Compile existing SIMD code that uses the ARM NEON intrinsics (``#include <arm_neon.h>``)
 
 These techniques can be freely combined in a single program.
@@ -153,6 +153,7 @@ Emscripten supports compiling existing codebases that use x86 SSE instructions b
 * **SSE4.1**: pass ``-msse4.1`` and ``#include <smmintrin.h>``. Use ``#ifdef __SSE4_1__`` to gate code.
 * **SSE4.2**: pass ``-msse4.2`` and ``#include <nmmintrin.h>``. Use ``#ifdef __SSE4_2__`` to gate code.
 * **AVX**: pass ``-mavx`` and ``#include <immintrin.h>``. Use ``#ifdef __AVX__`` to gate code.
+* **AVX2**: pass ``-mavx2`` and ``#include <immintrin.h>``. Use ``#ifdef __AVX2__`` to gate code.
 
 Currently only the SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and AVX instruction sets are supported. Each of these instruction sets add on top of the previous ones, so e.g. when targeting SSE3, the instruction sets SSE1 and SSE2 are also available.
 
@@ -1145,6 +1146,90 @@ The following table highlights the availability and expected performance of diff
 
 Only the 128-bit wide instructions from AVX instruction set are listed. The 256-bit wide AVX instructions are emulated by two 128-bit wide instructions.
 
+The following table highlights the availability and expected performance of different AVX2 intrinsics. Refer to `Intel Intrinsics Guide on AVX2 <https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX2>`_.
+
+.. list-table:: x86 AVX2 intrinsics available via #include <immintrin.h> and -mavx2
+   :widths: 20 30
+   :header-rows: 1
+
+   * - Intrinsic name
+     - WebAssembly SIMD support
+   * - _mm_broadcastss_ps
+     - 💡 emulated with a general shuffle
+   * - _mm_broadcastsd_pd
+     - 💡 emulated with a general shuffle
+   * - _mm_blend_epi32
+     - 💡 emulated with a general shuffle
+   * - _mm_broadcastb_epi8
+     - 💡 emulated with a general shuffle
+   * - _mm_broadcastw_epi16
+     - 💡 emulated with a general shuffle
+   * - _mm_broadcastd_epi32
+     - 💡 emulated with a general shuffle
+   * - _mm_broadcastq_epi64
+     - 💡 emulated with a general shuffle
+   * - _mm256_permutevar8x32_epi32
+     - ❌ scalarized
+   * - _mm256_permute4x64_pd
+     - 💡 emulated with two general shuffles
+   * - _mm256_permutevar8x32_ps
+     - ❌ scalarized
+   * - _mm256_permute4x64_epi64
+     - 💡 emulated with two general shuffles
+   * - _mm_maskload_epi32
+     - ❌ scalarized
+   * - _mm_maskload_epi64
+     - ❌ scalarized
+   * - _mm_maskstore_epi32
+     - ❌ scalarized
+   * - _mm_maskstore_epi64
+     - ❌ scalarized
+   * - _mm_sllv_epi32
+     - ❌ scalarized
+   * - _mm_sllv_epi64
+     - ❌ scalarized
+   * - _mm_srav_epi32
+     - ❌ scalarized
+   * - _mm_srlv_epi32
+     - ❌ scalarized
+   * - _mm_srlv_epi64
+     - ❌ scalarized
+   * - _mm_mask_i32gather_pd
+     - ❌ scalarized
+   * - _mm_mask_i64gather_pd
+     - ❌ scalarized
+   * - _mm_mask_i32gather_ps
+     - ❌ scalarized
+   * - _mm_mask_i64gather_ps
+     - ❌ scalarized
+   * - _mm_mask_i32gather_epi32
+     - ❌ scalarized
+   * - _mm_mask_i64gather_epi32
+     - ❌ scalarized
+   * - _mm_mask_i32gather_epi64
+     - ❌ scalarized
+   * - _mm_mask_i64gather_epi64
+     - ❌ scalarized
+   * - _mm_i32gather_pd
+     - ❌ scalarized
+   * - _mm_i64gather_pd
+     - ❌ scalarized
+   * - _mm_i32gather_ps
+     - ❌ scalarized
+   * - _mm_i64gather_ps
+     - ❌ scalarized
+   * - _mm_i32gather_epi32
+     - ❌ scalarized
+   * - _mm_i64gather_epi32
+     - ❌ scalarized
+   * - _mm_i32gather_epi64
+     - ❌ scalarized
+   * - _mm_i64gather_epi64
+     - ❌ scalarized
+
+All the 128-bit wide instructions from the AVX2 instruction set are listed.
+Only a small part of the 256-bit AVX2 instruction set is listed; most of the
+256-bit wide AVX2 instructions are emulated by two 128-bit wide instructions.
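
The two strategies named in the table and the note above, splitting a 256-bit operation into two 128-bit halves and scalarizing an operation lane by lane, can be modelled in a few lines of Python. This is only a semantic sketch. Vectors are plain lists of unsigned 32-bit lanes, and the helper names are hypothetical, not Emscripten or Intel APIs. The variable shift follows the documented x86 behaviour, where a count of 32 or more yields 0. It is presumably scalarized because Wasm SIMD shifts apply a single scalar count to every lane.

```python
# Illustrative model only: lists of unsigned 32-bit lanes stand in for SIMD registers.
MASK32 = 0xFFFFFFFF


def add_epi32_128(a, b):
    """Model of a natively supported 128-bit op: wrapping add of 4 x 32-bit lanes."""
    assert len(a) == len(b) == 4
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add_epi32_256(a, b):
    """Model of a 256-bit op emulated as two independent 128-bit ops."""
    assert len(a) == len(b) == 8
    lo = add_epi32_128(a[:4], b[:4])  # low 128-bit half
    hi = add_epi32_128(a[4:], b[4:])  # high 128-bit half
    return lo + hi


def sllv_epi32(a, count):
    """Model of a scalarized op: every lane is shifted left by its own count.

    Counts are assumed to be non-negative; a count of 32 or more gives 0.
    """
    assert len(a) == len(count) == 4
    return [((x << c) & MASK32) if c < 32 else 0 for x, c in zip(a, count)]
```

In this model the 256-bit add costs exactly two 128-bit operations, while the variable shift does one scalar operation per lane, which is why the table marks such intrinsics as scalarized.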
 
 ====================================================== 
 Compiling SIMD code targeting ARM NEON instruction set