Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add XOR swap and XCHG assembly optimization for integral types #5215

Closed
wants to merge 1 commit into from

Conversation

ohhmm
Copy link

@ohhmm ohhmm commented Jan 2, 2025

  • Add _ENABLE_XOR_SWAP macro with default value of 0
  • Implement XOR swap optimization for integral types
  • Add x86/x64 assembly XCHG optimization
  • Add comprehensive test coverage for swap operations

Tests are initially disabled to allow for review before enabling.

@ohhmm ohhmm requested a review from a team as a code owner January 2, 2025 12:51
stl/inc/utility Outdated Show resolved Hide resolved
stl/inc/utility Outdated Show resolved Hide resolved
tests/std/tests/XOR_swap_demo/test.cpp Outdated Show resolved Hide resolved
@ohhmm ohhmm force-pushed the devin/-xor-swap-optimization branch 4 times, most recently from 3c3058b to fc2ef56 Compare January 2, 2025 18:34
- Add _ENABLE_XOR_SWAP macro with default value of 0
- Implement XOR swap optimization for integral types
- Add x86/x64 assembly XCHG optimization
- Add comprehensive test coverage for swap operations

Tests are initially disabled to allow for review before enabling.

Co-Authored-By: Serg Kryvonos <[email protected]>
@ohhmm ohhmm force-pushed the devin/-xor-swap-optimization branch from fc2ef56 to 9362cb3 Compare January 2, 2025 20:02
@StephanTLavavej
Copy link
Member

This doesn't seem like a pure performance win, and would therefore fall under our non-goals. Notably, compilers and processors are good at recognizing temporary variables used for swaps, and I believe that processors can even accomplish it sometimes via register renaming, i.e. no actual instructions are spent. In contrast, the need for a branch to defend against self-swap, and the actual instructions used for xoring, are not an obvious win.

We'll talk about this at the next weekly maintainer meeting, but I am not inclined to proceed with this PR - although we always appreciate interest in improving the repo!

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Jan 4, 2025

This looks like a dis-optimization multiple reasons (ordered the most important to the least important):

  1. xchg instruction is implicitly atomic, without lock prefix. Note how _InterlockedExchange compiles to the same https://godbolt.org/z/vWGnTbv1M. The resulting perf is terrible
  2. By itself inline assembly in MSVC inhibits optimization, as it forces the compiler to serialize and spill in surrounding code. (It is not that bad for other compilers, albeit it is better to avoid inline assembly for them too)
  3. The xor trick, while seemingly eliminating variable, is not a win against the plain swap code, as @StephanTLavavej observed.
  4. Some compilers are able to vectorize swapping of an array when it is written as a plain loop. Not MSVC yet, though, that's why there's manual swap vectorization of swap, and Auto-vectorize arrays swap #4991. But we have other targets, besides MSVC (officially Clang and NVCC, unofficially Intel too). The xor trick is not frequent in real code, so unlikely to be recognized as a swap, and so may be not vectorized, or vectorized as xor sequence, which would be way worse compared to the usual vector swap.
  5. The self-swap check is an extra branch, as @StephanTLavavej observed, while it may be optimized away in some cases, it is likely to persist in others.

@StephanTLavavej StephanTLavavej added the performance Must go faster label Jan 6, 2025
@StephanTLavavej
Copy link
Member

Closing for the reasons explained by @AlexGuteniev and myself. Thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants