Implement RVV backend #372
c/CMakeLists.txt
Outdated
set(CMAKE_GENERATOR Ninja)
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_CROSSCOMPILING_EMULATOR qemu-riscv64-static)
set(CMAKE_ASM_COMPILER clang-17)
set(CMAKE_ASM_COMPILER_TARGET riscv64-unknown-linux-gnu)
set(CMAKE_ASM_FLAGS_INIT "-march=rv64gcv1p0")
set(CMAKE_C_COMPILER clang-17)
set(CMAKE_C_COMPILER_TARGET riscv64-unknown-linux-gnu)
set(CMAKE_C_FLAGS_INIT "-march=rv64gcv1p0")
set(CMAKE_CXX_COMPILER clang++-17)
set(CMAKE_CXX_COMPILER_TARGET riscv64-unknown-linux-gnu)
set(CMAKE_CXX_FLAGS_INIT "-flto=thin -march=rv64gcv1p0")
set(CMAKE_EXE_LINKER_FLAGS "-fuse-ld=lld-17")
I recommend using CMakePresets, as they are quite a bit more ergonomic.
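As a rough illustration of that suggestion, the toolchain settings from the snippet above could be expressed as a configure preset in a `CMakePresets.json` file. This is only a sketch: the preset name `riscv64-rvv-clang` and the build directory layout are hypothetical, and passing the cross-compilation settings as cache variables (rather than through a separate toolchain file) is an assumption, not something the PR does.

```json
{
  "version": 3,
  "configurePresets": [
    {
      "name": "riscv64-rvv-clang",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build/${presetName}",
      "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Release",
        "CMAKE_SYSTEM_NAME": "Linux",
        "CMAKE_CROSSCOMPILING_EMULATOR": "qemu-riscv64-static",
        "CMAKE_C_COMPILER": "clang-17",
        "CMAKE_C_COMPILER_TARGET": "riscv64-unknown-linux-gnu",
        "CMAKE_C_FLAGS_INIT": "-march=rv64gcv1p0",
        "CMAKE_ASM_COMPILER": "clang-17",
        "CMAKE_ASM_COMPILER_TARGET": "riscv64-unknown-linux-gnu",
        "CMAKE_ASM_FLAGS_INIT": "-march=rv64gcv1p0",
        "CMAKE_EXE_LINKER_FLAGS": "-fuse-ld=lld-17"
      }
    }
  ]
}
```

With a file like this next to the top-level `CMakeLists.txt`, the configuration would be selected with `cmake --preset riscv64-rvv-clang` (preset schema version 3 needs CMake 3.21 or newer).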
Force-pushed from af4a32f to 99db257.
Force-pushed from 99db257 to 5445a52.
Force-pushed from 5445a52 to 9c46892.
I have less free time for code reviews than I used to, so apologies in advance for taking a while to get to this. You might be interested in an RVV assembly implementation that I've been working on here: https://github.com/BLAKE3-team/BLAKE3/blob/guts_api/rust/guts/src/riscv_rva23u64.S. Unfortunately that branch is tied to a large refactoring, which makes it hard for me to land it in master.
@oconnor663 Oh cool, I didn't realize there was already some implementation work for RVV. I'll probably give it a closer look soon, but just out of curiosity, what state is it in? Any idea about its performance characteristics, or anything else interesting to note? Also, have you done any work on an SVE backend?
(I just pushed a commit to clean up some function names, so you might need to refresh the page if you still have that .S file open.)
My implementation uses the Zbb and Zvbb extensions, so I don't think it will run on most real chips yet, even those that support V 1.0. I've been doing all the development under Qemu, so I've never done any real benchmarks, but it is passing tests.
The missing work that makes it hard to land this is porting other SIMD implementations to this new API. I've done AVX-512 on that branch, but I need to do SSE2/4.1 and AVX2. There was also a minor perf regression in AVX-512 that I'll need to track down. Then there are loose ends to tie up around e.g. MSVC-flavored assembly.
Most of the heavy lifting in the parallel implementation (which is what really matters for performance) is in
I haven't tried ARM SVE yet, no. (Also the NEON implementation in master almost certainly has some perf mistakes that someone more experienced could spot.)
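For context on why the Zbb/Zvbb dependency matters: the BLAKE3 G function rotates 32-bit words right by 16, 12, 8, and 7 bits, and Zbb and Zvbb provide scalar and vector rotate instructions respectively. Without them, a rotate is typically built from two shifts and an OR. The portable C sketch below shows that fallback pattern on the scalar G function; the names `rotr32` and `g` are illustrative and are not taken from the branch linked above.

```c
#include <stdint.h>

/* Rotate right without a dedicated rotate instruction: two shifts and an OR.
   Valid for 0 < n < 32, which covers the BLAKE3 rotation amounts. */
static inline uint32_t rotr32(uint32_t x, unsigned n) {
  return (x >> n) | (x << (32 - n));
}

/* The BLAKE3 G function applied to four words of a 16-word state,
   mixing in two message words mx and my. */
static void g(uint32_t *s, int a, int b, int c, int d,
              uint32_t mx, uint32_t my) {
  s[a] = s[a] + s[b] + mx;
  s[d] = rotr32(s[d] ^ s[a], 16);
  s[c] = s[c] + s[d];
  s[b] = rotr32(s[b] ^ s[c], 12);
  s[a] = s[a] + s[b] + my;
  s[d] = rotr32(s[d] ^ s[a], 8);
  s[c] = s[c] + s[d];
  s[b] = rotr32(s[b] ^ s[c], 7);
}
```

A vector backend applies the same sequence to many lanes at once, so each of the four rotations costs either one rotate instruction (with Zvbb) or a shift, shift, OR triple (without it).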
Interesting. Thanks for the information. I've also been doing most of my experimentation under qemu. I did recently get hold of a Pioneer (SG2042), but it only supports the older RVV 0.7.1 and I haven't even tried to get tooling to work with that yet (in fact I've only just gotten it to boot, heh). But it might be interesting to try to adapt what you have (sans the Zbb/Zvbb and whatever else is missing).
I'd be interested in helping with that effort if you'd like. If you could give me some pointers on where to start or whatever, I'd certainly take a look.
Yeah, I noticed that. Seemed interesting. I'm also wondering how that will work out.
I was really kind of looking for an interesting project to try something VLA-related, but since it seems like you've mostly solved the RVV side, maybe I will give SVE a try instead.
I actually made an attempt to finish the missing parts for the NEON implementation at #369. I'm certainly not an expert though, and this was my first real attempt at using NEON for anything. Like you suggested though, implementing
One thing I was thinking about though, for better performance on Apple Silicon at least, is to try an implementation using Metal, but making use of the unified memory modes to try to avoid the latency issues that made the Vulkan version (and a SYCL version I saw elsewhere) not very usable. Another thing I've been wondering about is whether it might be possible to use the AMX coprocessor for some parts of the algorithm, perhaps genlut in particular.
Anyway, interesting stuff. Let me know if there's some way I can help with that branch, or if you have some suggestions for other ideas worth exploring.