Discussion for new level-1v/-1m-like operations #762
Not sure if these are in scope, but a function like einsum would be useful; einsum is a dependency for Gaussian mixture modeling (and adjacent algorithms). Generally, any functionality in BLIS that could reflect, or be used to implement, functionality found in numpy would be especially valuable: https://numpy.org/doc/stable/reference/routines.linalg.html |
@ct-clmsn some of the things I have in mind can be used as kernels for various combinations of einsum parameters. Providing einsum itself is almost certainly out-of-scope, but yes we'd definitely like to be able to provide all of the basic operations that einsum would use internally. Is there some taxonomy/list of these? |
Source implementation is here. Looks like dot product is the only thing happening under the hood. There's an optimization effort for the functionality that's worth noting. |
I found that the performance of the level-1f and level-2 operations (axpyf, gemv, etc.) is lower than expected on SKX. It seems the current kernels were not designed with AVX512 in mind. Should we take AVX512 into account when planning the level-1v/-1m operations? |
Fair question. I think for now we're only gathering candidate operations to add to BLIS. Which microarchitectures and instruction sets to optimize those (or existing) operations for is mostly orthogonal IMO.
Multithreading for non-level-3 operations is one of those action items we've wanted for a long time, but not badly enough relative to everything else we want. 😂 It will happen someday! |
Good initiative, I would add to the original list:
|
@hominhquan for 2./3., are these building blocks for something like |
@hominhquan 4./5. are on the radar! |
@realab-ai thanks for letting us know about this! Would you mind opening an issue for this with some basic performance comparison data? I think that actually using the AVX2 level1v/f kernels from Zen could help a lot. I've heard conflicting things about the benefits of AVX512 for such operations due to the more severe throttling. |
Kind of. I've seen some DSP algorithms on which they can be useful. They are like the next-gen of FMA operations with more flops. |
OK. I'll do this soon. |
@hominhquan Cool. I ask because if the goal is N simultaneous element-wise multiplications where N > 1 but not necessarily just 2, then this is more like a level1f kernel than level1. I'll add to the list. |
Native support for einsum would be awesome. Don't limit yourselves just because the creators of BLAS stopped at two indices in the 1980s. Why shouldn't BLIS try to solve directly one of the most widely used linear algebra APIs in all of computing? |
Another feature I'd like to see: for element-wise operations, since there is neither a reduction nor a (backward) spatio-temporal dependency in memory access as in level-3, support for user-defined custom ukernels applied to each (vectorized) chunk of the inputs (like what @devinamatthews has done in level-3):
|
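A rough sketch of the idea in plain Python, not the interface proposed above: a driver walks the inputs in fixed-size chunks and hands each chunk to a user-supplied kernel. The names `apply_elementwise`, `ukernel`, and `chunk` are hypothetical and do not correspond to anything in BLIS.

```python
def apply_elementwise(ukernel, x, y, chunk=4):
    """Apply a user-defined kernel to successive chunks of x and y.

    ukernel takes two equal-length lists and returns a list of the
    same length. Chunks are independent, so no state is carried
    between calls (no reduction, no backward dependency).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    out = []
    for start in range(0, len(x), chunk):
        stop = start + chunk  # the last chunk may be shorter
        out.extend(ukernel(x[start:stop], y[start:stop]))
    return out


def multiply_kernel(xs, ys):
    """Example user kernel: element-wise product of one chunk."""
    return [a * b for a, b in zip(xs, ys)]
```

In a real library the chunk would presumably map to a vector register width and the kernel to a compiled function; here the chunking only illustrates the control flow.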
I introduced invscal in libflame, and that operation is now in BLIS as well. It performs x := (1/alpha) x. Much cleaner than passing 1/alpha into scal. About 10 years ago, while reading LAPACK code, I ran across some specialized code for an edge case. It is possible that 1/alpha underflows (meaning (1/alpha) x yields the zero vector) even though dividing all elements of x by alpha would give you the expected result. What I saw them do was: call a routine, SLAMCH, that returns the largest value that underflows when inverted; check alpha against this value; if it is OK, call scal with 1/alpha; otherwise, execute a loop that does the individual divides. This suggests:
Now if only I could remember in which LAPACK routine this happens... |
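The guarded reciprocal described above might look roughly like this in plain Python. This is a sketch under assumptions: the threshold is taken to be the smallest normal double (`sys.float_info.min`), which merely stands in for whatever SLAMCH returns, and `safe_invscal` is a hypothetical name, not a BLIS or LAPACK routine.

```python
import sys


def safe_invscal(alpha, x):
    """Return x / alpha, using one reciprocal only when it is safe.

    Assumption: "safe" means 1/alpha neither overflows nor falls
    into the subnormal range, checked against the smallest normal
    double. alpha == 0 raises ZeroDivisionError, as plain division would.
    """
    sfmin = sys.float_info.min  # smallest positive normal double
    if sfmin <= abs(alpha) <= 1.0 / sfmin:
        inv = 1.0 / alpha  # fast path: one divide, then multiplies
        return [inv * xi for xi in x]
    # slow path: divide each element individually
    return [xi / alpha for xi in x]
```

For a subnormal alpha the reciprocal would overflow to infinity, while the element-wise divides still give finite results when x is comparably small.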
Would sparse operations be on the table? NIST has an existing implementation available here that could be used as a template. |
I found that all the band-related operations (GB, SB, HB, TB) belonging to level-1/2 exist only in the f2c version rather than as native BLIS implementations. What's the reason for this? |
The banded operations, like the other level2 operations, are memory bandwidth limited. Also, because of the band structure they are somewhat more difficult to optimize (you get some flavor of triangular operations e.g.). So, they basically never became a target of more in-depth optimization and inclusion in the BLIS interface. Are you interested in BLIS-style banded operations (with separate row and column strides, conjugation without transposition, etc.)? |
Take a look at the attached file, in particular the bit at the end. I use this to vectorize chunks of "level1-like" code in C++ using lambdas, even quite complicated code with function calls etc. |
Successive band reduction algorithms based on two-sided orthogonal transformations can be reorganized to expose higher, "level3-like" arithmetic intensity. For Householder-based approaches, the resulting kernels can be efficiently expressed as a sequence of TRMM and SYMM/GEMM calls (compressed band layouts handled by stride manipulation) --- see Fig. 1. The last reduction step (to bidiagonal, tridiagonal, or Hessenberg form) will of course be level2-like. As a complementary optimization, you can chase a train of bulges to improve locality --- see Fig. 2. All of this can be handled at the libflame/LAPACK level. I don't really think there is much to be done within BLIS for this. (EDIT: Actually, IIRC, BLIS is missing "native TRMM", so maybe this is a gap?) For Givens-based approaches, it's a different story, and custom kernels would be beneficial. See, e.g., This comment is, of course, completely off-topic for the present discussion on level1 stuff. |
Definitely. Givens kernels (including fused Givens kernels) were always something that I anticipated we would revisit once BLIS "settled down." Of course, here we are, 10+ years later, and it's still settling. 😂 Thanks for reminding me of this, @nick-knight! |
I'm now looking for a custom band-like kernel that could benefit the Givens-based successive band reduction case. I would appreciate any further references on this, @nick-knight. Sorry that the request is off-topic. Would you please share further information to: [email protected] |
Well, it seems a bit late to add to this discussion, but I'd have a lot of use for a level 3 routine that calculates |
@chillenb that's great to know. There's a high likelihood that we'll add this in a near future version. |
@chillenb You may want to watch Devin's talk at the most recent BLIS retreat: https://www.cs.utexas.edu/~flame/BLISRetreat2024/Talks.html#DevinTalk1 |
@devinamatthews @rvdg |
Let's discuss new operations that we might like to add to BLIS, specifically those that would fall into level-1v or level-1m families (and perhaps level-2):
C_ij = A_ij * b_j
c = diag(A*trans?(B)) (c_i = A_ij*B_ji or A_ij*B_ij)
We will add to this list as the discussion unfolds!
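For reference, the two operations listed above can be written out in plain Python as follows. This is only a sketch of the intended semantics, reading the repeated index j in the second operation as an implied sum; the function names are hypothetical and not BLIS APIs. In einsum notation these would correspond to the subscripts "ij,j->ij" and "ij,ji->i" (or "ij,ij->i" with the transpose).

```python
def scale_columns(A, b):
    """C_ij = A_ij * b_j: scale column j of A by b[j]."""
    return [[a_ij * b[j] for j, a_ij in enumerate(row)] for row in A]


def diag_of_product(A, B, trans_b=False):
    """c = diag(A * trans?(B)) without forming the full product.

    trans_b=False: c_i = sum_j A_ij * B_ji  (A is m x n, B is n x m)
    trans_b=True:  c_i = sum_j A_ij * B_ij  (A and B are both m x n)
    """
    c = []
    for i, row in enumerate(A):
        if trans_b:
            c.append(sum(a_ij * B[i][j] for j, a_ij in enumerate(row)))
        else:
            c.append(sum(a_ij * B[j][i] for j, a_ij in enumerate(row)))
    return c
```

Computing only the diagonal costs O(mn) multiply-adds rather than the O(m^2 n) of a full matrix product, which is what would make a dedicated operation worthwhile.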