[Bug Report] Reciprocal unary op causes crash after all_reduce #16646
Comments
@dmakoviichuk-tt @davorchap FYI ^^
@eyonland can you please take a look ASAP
hi @cmaryanTT @eyonland do you have any updates? It blocks our distributed training work.
We will provide an update by COB.
I've checked out ca2c867 as indicated in the bug report, and am unable to reproduce that issue. Can you provide a pipeline with the crash described or a proper reproducible example please? That would include, at the very least, a commit already containing the test you describe in the bug report, and a target to build and run.
@patrickroberts it doesn't build for you? (based on image provided above)
As you can see, it builds, but the JIT compiler fails in a completely unrelated test. Please double check your git commit hash, because other than the change I've shown in the screenshot, my git status is clean.
@patrickroberts please confirm that you are using a machine with an n300 device.
@dmakoviichuk-tt I have an n150 board, so I'll have to reserve a machine with an n300. In the meantime, can you provide more information about the crash? The build configuration (environment variables, CMake cache variables, etc.) would be helpful, as would a screenshot of the crash. Did it have a stack trace?
I've updated the commit (separate branch, includes the test) and provided exact commands to build and run...
I've got a core dump, will rebuild in RelWithDebInfo so I can investigate more thoroughly, but for now here's a backtrace at least:
I might have identified the problem. I'm compiling an attempted fix to test and will let you know if it works.
Hey @patrickroberts, I've noticed a few things:
The overload is resolved against the complex-parameter composite op. As a result, reciprocal is registered without auto launch. Overall, this means we should make sure that no simple ops are registered without auto launch. There might be more ops where people forgot to do it.
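The audit suggested above could look roughly like the sketch below. It is not tt-metal code: `OpRegistration`, its fields, and `simple_ops_missing_auto_launch` are hypothetical names invented for illustration, and the real registration mechanism may expose this information differently. It only shows the idea of scanning registered ops and flagging every non-composite op that lacks auto launch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one registered op (not a tt-metal type).
struct OpRegistration {
    std::string name;
    bool is_composite;  // true for ops built out of other ops
    bool auto_launch;   // true if the op was registered with auto launch
};

// Returns the names of simple (non-composite) ops registered without auto launch.
std::vector<std::string> simple_ops_missing_auto_launch(
    const std::vector<OpRegistration>& ops) {
    std::vector<std::string> missing;
    for (const auto& op : ops) {
        if (!op.is_composite && !op.auto_launch) {
            missing.push_back(op.name);
        }
    }
    return missing;
}
```

Given a list in which only reciprocal is a simple op without auto launch, the function returns just that name, which is the kind of report a registration check could print or fail a build on.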
Okay, so I confirmed what the issue was.
After debugging, the error was triggered by
FWIW this also passes on 434d707 (this is without auto launch)
Describe the bug
Crash while running this test. Substituting ttnn::reciprocal with any other unary op doesn't result in a crash.
Output:
To Reproduce
./build_metal.sh -b Release --build-tt-train
cd build/tt-train/
ctest . (I didn't use ./build/tt-train/tests/ttml_tests)
ctest -R N300UtilsTest.TestXTensorReplicateAllReduce_96_768
Expected behavior
Should run without crashing.
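For reference, here is a host-only sketch of the result one would expect from the failing sequence. The test body is not shown in this report, so the sketch assumes the test replicates one tensor across the devices, sums the copies with all_reduce, and then applies reciprocal elementwise. The helper names are illustrative and do not call ttnn.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side stand-in for a sum all_reduce: adds the per-device copies elementwise.
std::vector<float> all_reduce_sum(const std::vector<std::vector<float>>& shards) {
    assert(!shards.empty());
    std::vector<float> out(shards.front().size(), 0.0f);
    for (const auto& shard : shards) {
        assert(shard.size() == out.size());
        for (std::size_t i = 0; i < out.size(); ++i) {
            out[i] += shard[i];
        }
    }
    return out;
}

// Elementwise reciprocal, 1 / x.
std::vector<float> reciprocal(const std::vector<float>& x) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = 1.0f / x[i];
    }
    return out;
}

// Assumed test flow: replicate the input on every device, all_reduce, then reciprocal.
std::vector<float> expected_reciprocal_after_all_reduce(
    const std::vector<float>& replicated, std::size_t num_devices) {
    std::vector<std::vector<float>> shards(num_devices, replicated);
    return reciprocal(all_reduce_sum(shards));
}
```

With two devices, each element x should come out as 1 / (2x), so a device result could be compared against this reference once the crash is resolved.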
Please complete the following environment information:
Additional context
There might be a deeper problem, so we would like to understand what's going on with this random operation. Thanks in advance.