Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise LoadMaskBits for uint8_t on SVE #2453

Conversation

wbb-ccl
Copy link
Contributor

@wbb-ccl wbb-ccl commented Jan 27, 2025

SVE can load a uint8_t pointer directly into a mask through casting.

This technique is used by Arm and is known to work for GCC and Clang.
See https://gitlab.arm.com/networking/ral/-/blob/7ea3a5f7a03e7a3dd1fecd898db9802d4cd36e89/src/intrinsics.h#L422

Copy link

google-cla bot commented Jan 27, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@jan-wassenberg
Copy link
Member

Thanks for the pull request! Are you able to sign a CLA?

This is a nice improvement. I thought the only documented way was PMOV (SVE 2.1), can you point me to any related documentation?

I did see this but it doesn't specifically mention predicate regs:

" SVE LDR and STR ^
There are no separate ACLE intrinsics for SVE LDR and STR instructions, but the code generator can use these instructions to save and restore SVE registers, or to optimize calls to intrinsics like svld1 and svst1."

Also, it looks like GCC failed to handle this until about 2021. Please add a check for HWY_COMPILER_GCC_ACTUAL >= 1201 for 12.1 or whatever is required.

For clang also, it seems like version 19 is required, here's a godbolt for testing.

@wbb-ccl
Copy link
Contributor Author

wbb-ccl commented Jan 27, 2025

We're currently looking into the company's CLA.

I'm not convinced this method is strictly documented but has been used in Arm's public code and does appear to work. It looks to compile into an ldr instruction that directly loads into the predicate register.

I believe the PMOV instruction coming in SVE2.1 will formalise this into a proper intrinsic.

Whilst the above bug report lists this being fixed in GCC 12.0, godbolt appears to be happy to compile down to 10.2. I've stuck to GCC 12.0 for now.

@jan-wassenberg
Copy link
Member

Sounds good. Thanks for looking into the CLA and checking versions.
If you want, we can simplify the check to HWY_COMPILER_CLANG >= 1901 || HWY_COMPILER_GCC_ACTUAL >= 1200, but I'm OK to leave as-is.

@cc-wvda
Copy link

cc-wvda commented Jan 28, 2025

Thanks for the pull request! Are you able to sign a CLA?

@jan-wassenberg I have been informed that our organisation has indeed signed the CLA, but we are still failing the CLA check on the PRs. We have attempted to query this using the email address for CLA submissions, however we are receiving no response. Would it be possible to flag this internally?

@jan-wassenberg
Copy link
Member

Thanks for following up! I just checked the email address and it's "wil******er​@cambridgeconsultants.com>". Oh, we had a similar issue with Azim's series of recent patches.

What does https://cla.developers.google.com/clas say for you?

I manually confirmed that your corporate CLA is indeed signed. This can take a couple of days to propagate to the automated check. In the meantime, I can approve this PR manually.

jan-wassenberg
jan-wassenberg previously approved these changes Jan 28, 2025
hwy/ops/arm_sve-inl.h Outdated Show resolved Hide resolved
jan-wassenberg
jan-wassenberg previously approved these changes Jan 28, 2025
@copybara-service copybara-service bot merged commit 140f307 into google:master Jan 31, 2025
39 of 40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants