MatmulMultiCoreMultiDRAMIn0MCastIn1MCast ND hanging on BH #12187
Comments
Usually the hang is seen on a single core, with a writer kernel stuck waiting on data to be populated in its CB. I ran the test without the workaround (reducing the matmul subblocks) after lowering the aiclk from 800 MHz to 400 MHz (using the syseng script to set the PLLs), and now all cores hang in the same spot. This points to the bug not being didt related.
@yugaoTT could you please take a look and see if there's something that is handling non-1 parameters incorrectly?
There's a good chance this is related to #12220, and the problematic change occurred between Aug. 21st 2024 and Sep. 4th 2024.
On my end, SD almost always hangs at 5 iterations, but FD passed for 10K iterations.
This runs 10k iterations on fast dispatch without the workaround. The issue looks to be related to slow dispatch only, so I'm removing the didt suspected label.
On BH, cmd_buffer is known to have issues; cmd_buffer_fifo needs to be turned on.
@abhullar-tt Is there an issue tracking a fix for the problems with the BH cmd_buffer, or will they remain, so that cmd_buffer_fifo always needs to be on?
It is tracked in #5174, which is to verify that cmd buffers are functional on new BH hw.
#11623 has been tracking ND hangs on BH CI.
Ran for a small number of iterations locally on IRD machines without seeing the hang, but increasing the test iterations reproduced the hang on IRD (I used yyz-bh-26 and saw the hang after ~20 iterations).
Reducing the sub-block dimensions (merged to main in #12186) enables the test to run for 1000 iterations (on yyz-bh-26).
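For anyone trying to reproduce this locally, the "increase the iterations until it hangs" approach can be sketched with a small loop that treats a per-run timeout as a hang. This is only an illustrative harness, not something from the repo: `run_until_hang`, the timeout value, and the placeholder command are all assumptions, and the real test command has to be substituted in.

```python
import subprocess
import sys


def run_until_hang(cmd, max_iterations, timeout_s):
    """Run `cmd` repeatedly and treat a timeout as a hang.

    Returns the 1-based iteration that timed out, or None if every run
    finished in time. A non-zero exit code is not a hang, so it is left to
    propagate as subprocess.CalledProcessError.
    """
    for iteration in range(1, max_iterations + 1):
        try:
            subprocess.run(
                cmd,
                check=True,
                timeout=timeout_s,  # the child is killed when this expires
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            return iteration
    return None


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the actual test invocation.
    cmd = [sys.executable, "-c", "pass"]
    hung_at = run_until_hang(cmd, max_iterations=5, timeout_s=30)
    if hung_at is None:
        print("no hang in 5 iterations")
    else:
        print(f"hang detected at iteration {hung_at}")
```

The timeout should be set well above the normal runtime of a single iteration, otherwise a slow run gets misreported as a hang.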
To run the test with the workaround:
Without the workaround: