Vectorize find_first_not_of/find_last_not_of member functions (multiple characters overloads) #5206

Open

AlexGuteniev
Contributor

These are the two remaining functions in the find_meow_of family.
Together with #5102, this should complete basic_string vectorization coverage.

A surprisingly non-trivial change. The not flavor has no early return from the inner (needle) loop: a haystack element can only be reported after it has been checked against the whole needle. This severely impacts the paths that have this inner loop.
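To illustrate the difference, here is a minimal scalar sketch of the two flavors. The function names are hypothetical and this is not the code from this PR; it only shows where the early return is available and where it is not.

```cpp
#include <cstddef>
#include <string_view>

// Positive flavor: any single needle match settles the answer immediately.
constexpr std::size_t first_of_sketch(std::string_view haystack, std::string_view needle) {
    for (std::size_t i = 0; i < haystack.size(); ++i) {
        for (const char n : needle) {
            if (haystack[i] == n) {
                return i; // early return from inside the needle loop
            }
        }
    }
    return std::string_view::npos;
}

// "not" flavor: the element is reported only after the whole needle was scanned.
constexpr std::size_t first_not_of_sketch(std::string_view haystack, std::string_view needle) {
    for (std::size_t i = 0; i < haystack.size(); ++i) {
        bool found = false;
        for (const char n : needle) {
            if (haystack[i] == n) {
                found = true; // a match only disqualifies this element; nothing to return yet
            }
        }
        if (!found) {
            return i; // known only once every needle character has been considered
        }
    }
    return std::string_view::npos;
}
```

In the positive flavor the result is available as soon as one comparison succeeds. In the not flavor a single comparison never yields a result on its own, so the per-needle results have to be accumulated.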

⚙️ Product code changes

Added the implementation of find_meow_not_of for 8-bit and 16-bit characters.

There is no vectorization for 32-bit and 64-bit characters. We happen to support them in find_first_of, because it exists as a free function callable with integers or pointers, but supporting them in find_first_not_of would require severely altering the specific AVX2 algorithm, which doesn't need to be altered otherwise.

The implementation is added to the existing functions via a template parameter, like in #5102. For the bitmap algorithms and the small needle path it is only a matter of negating the results or inverting the bit mask, which is done.

The fallback nested loop has a separate compile-time branch without early return.
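As a rough illustration of the template-parameter approach on the bitmap path, here is a scalar sketch for 8-bit characters. The name `find_first_bitmap_sketch` and the `Not` parameter are assumptions made for this example, and the actual implementation is vectorized; the sketch only shows that the two flavors differ by a single negation of the lookup result.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical scalar model: Not == false gives find_first_of,
// Not == true gives find_first_not_of.
template <bool Not>
constexpr std::size_t find_first_bitmap_sketch(std::string_view haystack, std::string_view needle) {
    bool in_needle[256] = {}; // one flag per possible 8-bit value
    for (const char n : needle) {
        in_needle[static_cast<unsigned char>(n)] = true;
    }

    for (std::size_t i = 0; i < haystack.size(); ++i) {
        const bool hit = in_needle[static_cast<unsigned char>(haystack[i])];
        if (hit != Not) { // the only difference between flavors: the result is negated
            return i;
        }
    }
    return std::string_view::npos;
}
```

Because the needle is folded into the table up front, there is no inner needle loop left in the search itself, which is why this path needs no structural change for the not flavor.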

For the SSE4.2 large needle branch, in addition to the negation in the intrinsic parameter, it is also necessary to switch to an inner loop without early return and to combine the results. The _Test_whole_needle lambda now has a different loop depending on the template parameter. It was also changed to return the position and to contain an inner lambda, _Step, instead of the two being separate. The lambda change can potentially affect codegen in the non-not control path, but I don't expect much impact, if any at all.
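The idea of combining per-part results can be modeled in scalar code. The sketch below is an assumption about the general shape, not the PR's implementation: `not_in_part_mask` is a hypothetical stand-in for one mask-producing comparison of a 16-element haystack part against a 16-element needle part with a negated result, and the AND-combination across needle parts is one plausible way to merge them.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

constexpr std::size_t kPart = 16; // 8-bit elements per 128-bit vector

// Scalar stand-in for one negated mask comparison:
// bit i is set when hay_part[i] equals none of the characters in needle_part.
constexpr std::uint16_t not_in_part_mask(std::string_view hay_part, std::string_view needle_part) {
    std::uint16_t mask = 0;
    for (std::size_t i = 0; i < hay_part.size(); ++i) {
        bool found = false;
        for (const char n : needle_part) {
            if (hay_part[i] == n) {
                found = true;
            }
        }
        if (!found) {
            mask = static_cast<std::uint16_t>(mask | (1u << i));
        }
    }
    return mask;
}

constexpr std::size_t first_not_of_parts_sketch(std::string_view haystack, std::string_view needle) {
    for (std::size_t h = 0; h < haystack.size(); h += kPart) {
        const std::string_view hay_part = haystack.substr(h, kPart);
        // Start with every valid position set, then let each needle part clear bits.
        std::uint16_t combined = static_cast<std::uint16_t>((1u << hay_part.size()) - 1);
        for (std::size_t n = 0; n < needle.size(); n += kPart) {
            // No early return here: every needle part must be visited.
            combined = static_cast<std::uint16_t>(
                combined & not_in_part_mask(hay_part, needle.substr(n, kPart)));
        }
        if (combined != 0) {
            std::size_t bit = 0;
            while ((combined & (1u << bit)) == 0) {
                ++bit; // lowest set bit is the first element absent from the whole needle
            }
            return h + bit;
        }
    }
    return std::string_view::npos;
}
```

A surviving bit means the element was absent from every needle part, so the position can be returned only after the inner loop has finished. For any inputs this sketch should agree with `std::string_view::find_first_not_of`.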

🏁 Benchmark code changes

The fill strategy was altered to:

  • Avoid limits for the needle length to benchmark any needle length
  • Provide different coverage for the not member functions, which makes more sense for them

So the iota was dropped. Incremental values are still used to fill the needle, because it is boring to just std::fill it.

💹 Performance expectations

The not functions are expected to perform almost the same as their positive counterparts. But of course we can't have supersymmetry here.

The noticeably distinct thing is the SSE4.2 path, which uses different instructions. It has less control flow, but it uses PCMPESTRM instead of PCMPESTRI. Their performance is about the same overall, but there are small differences on some CPUs: decent Intels tend to prefer PCMPESTRI, decent AMDs tend to show no difference, and older AMDs and power-saving Intels tend to prefer PCMPESTRM.

See the comparison on uops.info.

Apparently we're good at the large scale, and fine-tuning cannot be addressed anyway, so I didn't attempt to look for new thresholds for the not functions.

⏱️ Benchmark results

i5 1235U

| Benchmark | main | this |
|---|---|---|
| bm<AlgType::str_member_first_not, char>/2/3 | 5.43 ns | 5.67 ns |
| bm<AlgType::str_member_first_not, char>/6/81 | 33.2 ns | 22.9 ns |
| bm<AlgType::str_member_first_not, char>/7/4 | 9.72 ns | 15.6 ns |
| bm<AlgType::str_member_first_not, char>/9/3 | 5.64 ns | 13.8 ns |
| bm<AlgType::str_member_first_not, char>/22/5 | 12.9 ns | 14.1 ns |
| bm<AlgType::str_member_first_not, char>/58/2 | 5.51 ns | 13.9 ns |
| bm<AlgType::str_member_first_not, char>/75/85 | 49.6 ns | 50.9 ns |
| bm<AlgType::str_member_first_not, char>/102/4 | 10.1 ns | 13.4 ns |
| bm<AlgType::str_member_first_not, char>/200/46 | 85.1 ns | 43.0 ns |
| bm<AlgType::str_member_first_not, char>/325/1 | 4.83 ns | 13.6 ns |
| bm<AlgType::str_member_first_not, char>/400/50 | 134 ns | 56.4 ns |
| bm<AlgType::str_member_first_not, char>/1011/11 | 15.3 ns | 21.9 ns |
| bm<AlgType::str_member_first_not, char>/1280/46 | 339 ns | 124 ns |
| bm<AlgType::str_member_first_not, char>/1502/23 | 379 ns | 142 ns |
| bm<AlgType::str_member_first_not, char>/2203/54 | 565 ns | 217 ns |
| bm<AlgType::str_member_first_not, char>/3056/7 | 13.5 ns | 18.2 ns |
| bm<AlgType::str_member_first_not, wchar_t>/2/3 | 5.54 ns | 5.86 ns |
| bm<AlgType::str_member_first_not, wchar_t>/6/81 | 40.9 ns | 47.4 ns |
| bm<AlgType::str_member_first_not, wchar_t>/7/4 | 11.8 ns | 15.4 ns |
| bm<AlgType::str_member_first_not, wchar_t>/9/3 | 6.03 ns | 15.9 ns |
| bm<AlgType::str_member_first_not, wchar_t>/22/5 | 11.5 ns | 15.7 ns |
| bm<AlgType::str_member_first_not, wchar_t>/58/2 | 5.28 ns | 15.6 ns |
| bm<AlgType::str_member_first_not, wchar_t>/75/85 | 75.1 ns | 55.7 ns |
| bm<AlgType::str_member_first_not, wchar_t>/102/4 | 11.9 ns | 21.5 ns |
| bm<AlgType::str_member_first_not, wchar_t>/200/46 | 106 ns | 53.6 ns |
| bm<AlgType::str_member_first_not, wchar_t>/325/1 | 4.78 ns | 15.5 ns |
| bm<AlgType::str_member_first_not, wchar_t>/400/50 | 179 ns | 64.8 ns |
| bm<AlgType::str_member_first_not, wchar_t>/1011/11 | 15.6 ns | 26.7 ns |
| bm<AlgType::str_member_first_not, wchar_t>/1280/46 | 488 ns | 179 ns |
| bm<AlgType::str_member_first_not, wchar_t>/1502/23 | 564 ns | 187 ns |
| bm<AlgType::str_member_first_not, wchar_t>/2203/54 | 819 ns | 264 ns |
| bm<AlgType::str_member_first_not, wchar_t>/3056/7 | 13.1 ns | 24.4 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2/3 | 5.41 ns | 6.53 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/6/81 | 55.5 ns | 28.5 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/7/4 | 15.8 ns | 14.1 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/9/3 | 12.4 ns | 14.7 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/22/5 | 19.3 ns | 14.8 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/58/2 | 5.34 ns | 14.0 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/75/85 | 378 ns | 190 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/102/4 | 16.1 ns | 14.5 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/200/46 | 951 ns | 264 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/325/1 | 4.70 ns | 15.1 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/400/50 | 1905 ns | 627 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1011/11 | 49.1 ns | 17.2 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1280/46 | 6032 ns | 1605 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1502/23 | 7054 ns | 896 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2203/54 | 10340 ns | 3180 ns |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/3056/7 | 27.7 ns | 15.1 ns |
| bm<AlgType::str_member_last_not, char>/2/3 | 5.06 ns | 5.26 ns |
| bm<AlgType::str_member_last_not, char>/6/81 | 32.0 ns | 22.7 ns |
| bm<AlgType::str_member_last_not, char>/7/4 | 5.49 ns | 13.9 ns |
| bm<AlgType::str_member_last_not, char>/9/3 | 5.58 ns | 13.4 ns |
| bm<AlgType::str_member_last_not, char>/22/5 | 6.38 ns | 13.4 ns |
| bm<AlgType::str_member_last_not, char>/58/2 | 4.61 ns | 13.4 ns |
| bm<AlgType::str_member_last_not, char>/75/85 | 52.5 ns | 45.8 ns |
| bm<AlgType::str_member_last_not, char>/102/4 | 5.33 ns | 11.7 ns |
| bm<AlgType::str_member_last_not, char>/200/46 | 79.3 ns | 38.1 ns |
| bm<AlgType::str_member_last_not, char>/325/1 | 4.53 ns | 13.3 ns |
| bm<AlgType::str_member_last_not, char>/400/50 | 139 ns | 54.2 ns |
| bm<AlgType::str_member_last_not, char>/1011/11 | 11.8 ns | 15.8 ns |
| bm<AlgType::str_member_last_not, char>/1280/46 | 333 ns | 131 ns |
| bm<AlgType::str_member_last_not, char>/1502/23 | 412 ns | 144 ns |
| bm<AlgType::str_member_last_not, char>/2203/54 | 609 ns | 215 ns |
| bm<AlgType::str_member_last_not, char>/3056/7 | 7.04 ns | 15.0 ns |
| bm<AlgType::str_member_last_not, wchar_t>/2/3 | 4.98 ns | 5.53 ns |
| bm<AlgType::str_member_last_not, wchar_t>/6/81 | 36.9 ns | 42.2 ns |
| bm<AlgType::str_member_last_not, wchar_t>/7/4 | 5.19 ns | 13.9 ns |
| bm<AlgType::str_member_last_not, wchar_t>/9/3 | 5.74 ns | 14.4 ns |
| bm<AlgType::str_member_last_not, wchar_t>/22/5 | 5.46 ns | 14.3 ns |
| bm<AlgType::str_member_last_not, wchar_t>/58/2 | 4.71 ns | 14.1 ns |
| bm<AlgType::str_member_last_not, wchar_t>/75/85 | 78.0 ns | 54.6 ns |
| bm<AlgType::str_member_last_not, wchar_t>/102/4 | 5.19 ns | 19.5 ns |
| bm<AlgType::str_member_last_not, wchar_t>/200/46 | 113 ns | 49.8 ns |
| bm<AlgType::str_member_last_not, wchar_t>/325/1 | 4.60 ns | 15.6 ns |
| bm<AlgType::str_member_last_not, wchar_t>/400/50 | 185 ns | 64.4 ns |
| bm<AlgType::str_member_last_not, wchar_t>/1011/11 | 13.3 ns | 26.7 ns |
| bm<AlgType::str_member_last_not, wchar_t>/1280/46 | 496 ns | 155 ns |
| bm<AlgType::str_member_last_not, wchar_t>/1502/23 | 559 ns | 169 ns |
| bm<AlgType::str_member_last_not, wchar_t>/2203/54 | 823 ns | 274 ns |
| bm<AlgType::str_member_last_not, wchar_t>/3056/7 | 9.67 ns | 21.8 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/2/3 | 4.77 ns | 5.05 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/6/81 | 45.3 ns | 29.9 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/7/4 | 4.94 ns | 13.3 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/9/3 | 7.65 ns | 13.5 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/22/5 | 5.16 ns | 13.6 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/58/2 | 4.73 ns | 13.2 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/75/85 | 303 ns | 220 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/102/4 | 4.95 ns | 14.2 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/200/46 | 723 ns | 318 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/325/1 | 4.53 ns | 14.4 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/400/50 | 1402 ns | 719 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1011/11 | 26.0 ns | 15.6 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1280/46 | 4439 ns | 2008 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1502/23 | 5203 ns | 1389 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/2203/54 | 7621 ns | 3879 ns |
| bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/3056/7 | 5.43 ns | 13.9 ns |

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner December 25, 2024 12:59
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jan 4, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 4, 2025
@AlexGuteniev
Contributor Author

Looks like there's a significant regression in some cases, which needs to be looked into further.

Labels
performance Must go faster
Projects
Status: Initial Review