Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x86 sse2/avx2: rewrite loongarch immediate operand shift instruction #1247

Closed
wants to merge 1 commit into from

Conversation

HecaiYuan
Copy link
Contributor

The x86 instructions like _mm_srli_epi8() can accept variable or immediate operand as the second parm, however, the corresponding loongarch instructions like __lsx_vsrli_b only accept immediate operand as the second parm, so we need to rewirite them to avoid compilation error.

@HecaiYuan HecaiYuan force-pushed the master branch 2 times, most recently from 3afebc4 to 1555b14 Compare December 12, 2024 08:48
@jinboson
Copy link
Contributor

Looks good to me.

@mr-c
Copy link
Collaborator

mr-c commented Dec 12, 2024

@jinboson @HecaiYuan Thanks. Do you know why the CI didn't catch this already?

@jinboson
Copy link
Contributor

jinboson commented Dec 13, 2024

@jinboson @HecaiYuan Thanks. Do you know why the CI didn't catch this already?

Hi, @mr-c . In the CI test , for example, when running test for simde_mm_slli_epi16(), the second shift parm passed is immediate value, like below:

test/x86/sse2.c:7866:test_simde_mm_slli_epi16(SIMDE_MUNIT_TEST_ARGS) {
test/x86/sse2.c:7908: simde__m128i r = simde_mm_slli_epi16(test_vec[i].a, 3);
test/x86/sse2.c:7911: r = simde_mm_slli_epi16(test_vec[i].a, 0);
test/x86/sse2.c:7914: r = simde_mm_slli_epi16(test_vec[i].a, 32);
test/x86/sse2.c:7917: r = simde_mm_slli_epi16(test_vec[i].a, 33);

however, we found that in some projects(like ,aom) that user would pass varibales to the second shift parm, like below:

aom_dsp/x86/highbd_loopfilter_sse2.c:35: *blt = _mm_slli_epi16(x, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:38: *lt = _mm_slli_epi16(x, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:41: *thr = _mm_slli_epi16(x, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:59: *blt_out = _mm_slli_epi16(x0, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:64: *lt_out = _mm_slli_epi16(x0, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:69: *thr_out = _mm_slli_epi16(x0, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:198: th = _mm_slli_epi16(th, bd - 8);
aom_dsp/x86/highbd_loopfilter_sse2.c:208: th = _mm_slli_epi16(th, bd - 8);

In this case loongarch implementation for simde_mm_slli_epi16() would emit error at compile time before the PR.

It seems other arch ran into the problem already, see here : #905 (comment)

simde/x86/sse2.h Outdated
@@ -6163,7 +6163,7 @@ simde_mm_sll_epi16 (simde__m128i a, simde__m128i count) {
#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
r_.neon_u16 = vshlq_u16(a_.neon_u16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, count_.u64[0])));
#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
r_.lsx_i64 = __lsx_vslli_h(a_.lsx_i64, count_.u64[0]);
r_.lsx_i64 = __lsx_vsll_h(a_.lsx_i64, __lsx_vreplgr2vr_h(count_.u64[0]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the implementation to the line 6160, Othewise the codepath will step into #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) branch.

simde/x86/sse2.h Outdated
@@ -6199,7 +6199,7 @@ simde_mm_sll_epi32 (simde__m128i a, simde__m128i count) {
#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
r_.neon_u32 = vshlq_u32(a_.neon_u32, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, count_.u64[0])));
#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
r_.lsx_i64 = __lsx_vslli_w(a_.lsx_i64, count_.u64[0]);
r_.lsx_i64 = __lsx_vsll_w(a_.lsx_i64, __lsx_vreplgr2vr_w(count_.u64[0]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the implementation to the line 6196, Othewise the codepath will step into #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) branch.

@rosbif
Copy link
Collaborator

rosbif commented Dec 13, 2024

I haven't looked at SIMDe recently but I feel that I should chip in on this.
The initial premise is false: the Intel s[rl][al]i intrinsics should only accept an immediate constant for the shift count.
This is clear from the Intel Intrinsics Guide and the 'i' in the intrinsic stands for immediate.
It is even clearer from the assembler instruction which should be generated.
The problem is that unfortunately some compilers accept a variable.
This presents a dilemma for SIMDe: a variable shift count should be refused but in this case users will complain that their compiler accepts a variable but SIMDe doesn't.

@jinboson
Copy link
Contributor

I haven't looked at SIMDe recently but I feel that I should chip in on this. The initial premise is false: the Intel s[rl][al]i intrinsics should only accept an immediate constant for the shift count. This is clear from the Intel Intrinsics Guide and the 'i' in the intrinsic stands for immediate. It is even clearer from the assembler instruction which should be generated. The problem is that unfortunately some compilers accept a variable. This presents a dilemma for SIMDe: a variable shift count should be refused but in this case users will complain that their compiler accepts a variable but SIMDe doesn't.

That's really a dilemma, thank you for setting me straight.

@rosbif
Copy link
Collaborator

rosbif commented Dec 16, 2024

If one really must allow a variable shift count, I can see two possible workarounds.

  1. If the target architecture has a shift instruction with the shift counts in a vector, you can splat the shift count across a vector and use this. Obviously this has a performance penalty if the shift count really is a constant.
  2. Use one of the SIMDE_CONSTIFY macros. They are basically a switch over all reasonable shift counts. If the shift count really is a constant (or an expression evaluating to a constant), any decent compiler will optimize away the switch. Otherwise there will be a big impact on code size and performance.

But, for me, the real question is whether SIMDe should accept the invalid use of an intrinsic.
I think not, particularly in view of the potential performance penalty for those who use the intrinsic correctly.
Otherwise I suggest a macro such as SIMDE_ALLOW_VARIABLE_SHIFT_COUNTS to condition this.
Michael? Evan?

@jinboson
Copy link
Contributor

Hi, @mr-c . This is ready for review, then is there anything else needed to move this forward ?

@HecaiYuan
Copy link
Contributor Author

Hi, @mr-c . Regarding this issue, do you have any good suggestions? What should we do next?

@yinshiyou
Copy link
Contributor

I am discussing with the compiler team about allowing the slli instruction to accept variables as parameters, as well as allowing immediate values expanding the range(b 8, h 16, w 32) .
Currently, it is still recommended to merge this PR, at least it eliminates the need to check if the immediate value exceeds the limit.

The x86 instructions like _mm_srli_epi8() can accept variable or
immediate operand as the second parm, however, the corresponding
loongarch instructions like __lsx_vsrli_b only accept immediate
operand as the second parm, so we need to rewirite them to
avoid compilation error.
@jinboson
Copy link
Contributor

This can be closed now, see #1263 for details. Thank you every one.

@mr-c mr-c closed this Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants