WIP: Instruction set detection/dispatch #16

Open
wants to merge 8 commits into
base: master

drbenmorgan
Member

This is a small WIP on tools and examples for coding and packaging instruction-set-specific code (SIMD etc). At present, it simply implements:

  1. A shell script to query the system (macOS/Linux only at present) and print the SIMD instruction sets supported by the host.
  2. A small C++14 program that does the same.

I'm requesting an initial review now to solicit comments on the remaining items:

  • How to format SIMD flags for portability, e.g. macOS sysctl gives "SSE4.1", Linux /proc/cpuinfo gives "sse4_1"?
  • Demonstration of runtime dispatch/fat binaries. I think this is useful, but also needs documentation on performance penalties, and some actual benchmarks.
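One possible answer to the formatting question is to map every platform spelling to a single canonical form. The sketch below is only an illustration: `normalize_simd_flag` is a hypothetical helper, not part of this PR, and the choice of the lowercase Linux spelling as the canonical form is an assumption. The two aliases cover cases where the platforms disagree on more than punctuation (macOS reports AVX as "AVX1.0", Linux reports SSE3 as "pni").

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: convert a platform-specific SIMD flag spelling
// (e.g. "SSE4.1" from macOS sysctl, "sse4_1" from /proc/cpuinfo) to one
// canonical form. The canonical form chosen here (lowercase, '_' instead
// of '.') is an assumption for illustration.
std::string normalize_simd_flag(std::string flag) {
  std::transform(flag.begin(), flag.end(), flag.begin(), [](unsigned char c) {
    return c == '.' ? '_' : static_cast<char>(std::tolower(c));
  });
  // Aliases where the two platforms use different names for the same set.
  if (flag == "avx1_0") return "avx";  // macOS spelling of AVX
  if (flag == "pni") return "sse3";    // Linux spelling of SSE3
  return flag;
}
```

With this, `normalize_simd_flag("SSE4.1")` and `normalize_simd_flag("sse4_1")` both give `"sse4_1"`. A real table would likely need more aliases than the two shown.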

Let me know what you think.

Use /proc/cpuinfo (Linux), sysctl (macOS) to print list of available
capabilities of host CPU.
Vendor VectorClass v1.25 code for instruction set detection from upstream
library:

http://www.agner.org/optimize/#vectorclass

Implement minimal use of interface to print an integer representing
the highest instruction set provided by host system.

Add a basic CMake script to build program, and extend README to
document its use.
Vendor VectorClass v1.28 code for instruction set detection from upstream
library:

http://www.agner.org/optimize/#vectorclass

Implement minimal use of interface to print an integer representing
the highest instruction set provided by host system.

Add a basic CMake script to build program, and extend README to
document its use.
Add method to filter and print just the SIMD capabilities from the
full CPU caps listing.
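
The filtering step can be sketched in C++ as follows. This is not the shell code from the commit; `filter_simd_flags` and its prefix list are assumptions chosen for illustration, and the input is expected in the lowercase, whitespace-separated form that /proc/cpuinfo uses.

```cpp
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: keep only the flags that name a SIMD instruction set.
// The prefix list is an illustrative assumption (x86 only, not exhaustive).
std::vector<std::string> filter_simd_flags(const std::string& all_flags) {
  static const char* const prefixes[] = {"mmx", "sse", "ssse3", "avx", "fma"};
  std::vector<std::string> simd;
  std::istringstream in(all_flags);
  std::string flag;
  while (in >> flag) {  // split on whitespace
    for (const char* p : prefixes) {
      // True when `flag` starts with the prefix `p`.
      if (flag.compare(0, std::strlen(p), p) == 0) {
        simd.push_back(flag);
        break;
      }
    }
  }
  return simd;
}
```

Given `"fpu vme sse sse2 ht ssse3 avx2 ss"`, the function keeps `sse`, `sse2`, `ssse3` and `avx2`, and drops the rest.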

Add CLI arguments to make script a friendlier program for querying
all or just SIMD capabilities. Implement usage/help arguments/functions.
Make it print out supported SIMD sets in human readable form.
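
A minimal sketch of turning the integer level into a readable name. The level-to-name table is an assumption based on the comments in VCL's `instrset.h` and should be checked against the vendored version; `instrset_name` is a hypothetical helper, not a VCL function.

```cpp
#include <string>

// Hypothetical helper: name the highest instruction set for an integer level.
// Assumed mapping (from the comments in VCL's instrset.h; verify against the
// vendored copy): 0 = 80386 ... 10 = AVX512VL/BW/DQ.
std::string instrset_name(int level) {
  static const char* const names[] = {
      "80386", "SSE",  "SSE2", "SSE3",    "SSSE3",          "SSE4.1",
      "SSE4.2", "AVX", "AVX2", "AVX512F", "AVX512VL/BW/DQ"};
  const int max_level = static_cast<int>(sizeof(names) / sizeof(names[0])) - 1;
  if (level < 0) return "unknown";
  // Levels are cumulative, so an unknown higher level still implies
  // at least the highest set listed here.
  if (level > max_level) level = max_level;
  return names[level];
}
```

Because the levels are cumulative, a program could also print every name from index 1 up to the detected level to list all supported sets rather than only the highest.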
@drbenmorgan
Member Author

@amadio I couldn't add you as a reviewer, but your feedback would be very welcome here in light of the overlap with VecCore!

Implement dumb program to print message when SIMD preprocessor macros
like __SSE__ are defined. Compile the program into several exes,
distinguished by different values for the -march or -m flags.

Document behaviour and ability to compile "Illegal instruction" code.
Briefly outline "dispatch by configuration management" method.
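
A minimal sketch of the macro check described in this commit, not the actual source from the PR. `compiled_simd_macros` is a hypothetical name; the macros themselves are the ones GCC and Clang predefine when the matching `-m` or `-march` flag is in effect.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: report which SIMD macros the compiler defined for
// this translation unit. This reflects the build flags, not the host CPU
// the binary later runs on.
std::vector<std::string> compiled_simd_macros() {
  std::vector<std::string> found;
#ifdef __SSE__
  found.push_back("__SSE__");
#endif
#ifdef __SSE2__
  found.push_back("__SSE2__");
#endif
#ifdef __AVX__
  found.push_back("__AVX__");
#endif
#ifdef __AVX2__
  found.push_back("__AVX2__");
#endif
#ifdef __AVX512F__
  found.push_back("__AVX512F__");
#endif
  return found;
}
```

Building the same source with, say, `-mavx2` and again without it gives executables that report different lists. Since the result is fixed at compile time, a build whose flags enable instructions the host lacks can still hit "Illegal instruction" at run time.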
@amadio
Contributor

amadio commented Jun 26, 2018

Hi @drbenmorgan, interesting project. However, I don't understand the objective that well. Do you want to query SIMD properties of a machine to add proper build flags in the build system? Or do you want to have some way for testing at runtime what is supported to call the right code? I will go through the code with more time and add specific comments later.

For your reference, I gave a talk for the vectorization working group of the IXPUG a while ago, and you can check out the slides here. The IXPUG has lots of resources for this sort of thing. There is also another project made by a Gentoo dev that does part of what you are doing here. It's meant to detect what SIMD is supported by the CPU, so you can add the proper configuration to Portage. It currently supports Intel and ARM CPUs. I think the way it's implemented there is simpler than what is in VCL.

@drbenmorgan
Member Author

Hi @amadio,

> Hi @drbenmorgan, interesting project. However, I don't understand the objective that well. Do you want to query SIMD properties of a machine to add proper build flags in the build system? Or do you want to have some way for testing at runtime what is supported to call the right code? I will go through the code with more time and add specific comments later.

It's the latter more than the former. Given that we'd like to distribute binary packages, and these may run on a range of CPU families, what techniques are available to ensure that the "compatible and most performant" code is run on a client CPU?

> For your reference, I gave a talk for the vectorization working group of the IXPUG a while ago, and you can check out the slides here. The IXPUG has lots of resources for this sort of thing. There is also another project made by a Gentoo dev that does part of what you are doing here. It's meant to detect what SIMD is supported by the CPU, so you can add the proper configuration to Portage. It currently supports Intel and ARM CPUs. I think the way it's implemented there is simpler than what is in VCL.

Thanks, those are very useful! I think this PR as it stands, though, is more focussed on runtime than build time, and the latter could be addressed separately (indeed, part of the project would be to not be smart about selecting flags!).

@amadio
Contributor

amadio commented Jun 26, 2018

If your intent is to do runtime checks for SIMD features, I think that implementing something like the intrinsic _may_i_use_cpu_feature from ICC in a way that works for all compilers would be the best way to go. Also, we should map the CPU features to the ones used there. I think that in some places you simplified AVX512 support, which has different versions (e.g. KNL, Skylake) with different subsets supported.

As for selecting flags, if you want a multi-arch binary, you have to select them anyway, so a mechanism needs to be in place for it. Vc has a system to compile for multiple architectures; it may be worth a look.
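
To make the runtime-check idea concrete, here is a rough sketch of dispatch through a function pointer. It uses the GCC/Clang builtin `__builtin_cpu_supports` as a stand-in for ICC's `_may_i_use_cpu_feature`; that substitution, and all function names below, are assumptions for illustration. It checks AVX2 only and does not address the different AVX512 subsets.

```cpp
#include <cstddef>

using SumFn = float (*)(const float*, std::size_t);

// Baseline implementation, safe on any CPU the binary targets.
float sum_scalar(const float* x, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += x[i];
  return s;
}

// Placeholder: in a real multi-arch build this would be defined in a
// separate translation unit compiled with -mavx2. Here it simply forwards
// to the scalar code so the sketch stays self-contained.
float sum_avx2(const float* x, std::size_t n) { return sum_scalar(x, n); }

// Runtime feature query. Falls back to "not supported" on compilers or
// targets where the builtin is not known to be available.
bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2");
#else
  return false;
#endif
}

// Pick the implementation for the host CPU.
SumFn select_sum() { return cpu_has_avx2() ? sum_avx2 : sum_scalar; }

// Public entry point: the selection runs once, on first call.
float sum(const float* x, std::size_t n) {
  static const SumFn impl = select_sum();
  return impl(x, n);
}
```

The indirect call through `impl` is the per-call cost of this approach, so dispatching at the level of whole kernels rather than small inner functions keeps the overhead small.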

@drbenmorgan
Member Author

@amadio I think I oversold the intent of this PR, so I'll make a few changes to clarify the very limited nature of its aim as a minimal demo (but I agree with your points long term!)
