[Bug Report] Reciprocal unary op causes crash after all_reduce #16646

Closed
rfurko-tt opened this issue Jan 10, 2025 · 16 comments

Comments

@rfurko-tt
Contributor

rfurko-tt commented Jan 10, 2025

Describe the bug

The test below crashes with a segmentation fault. Replacing ttnn::reciprocal with any other unary op does not crash.

TEST_F(N300UtilsTest, TestXTensorReplicateAllReduce_96_768) {
    auto* device = &ttml::autograd::ctx().get_device();
    auto mesh_shape = device->shape();
    std::cout << "I AM HERE IN TEST 0\n";
    xt::xarray<float> xtensor = xt::random::rand({96 * 768}, -0.05, 0.05).reshape({1, 1, 96, 768});

    ttml::core::XTensorToMeshVariant<float> replicate_composer = ttml::core::ReplicateXTensorToMesh<float>(mesh_shape);
    auto tensor = ttml::core::from_xtensor(xtensor, device, replicate_composer);
    std::cout << "I AM HERE IN TEST 1\n";
    tensor = ttnn::add(tensor, 100.0F);
    std::cout << "I AM HERE IN TEST 2\n";
    tensor = ttnn::reciprocal(tensor);
    std::cout << "I AM HERE IN TEST 3\n";
}

Output:

I AM HERE IN TEST 0
I AM HERE IN TEST 1
I AM HERE IN TEST 2
Segmentation fault (core dumped)
Image

.bashrc

export ARCH_NAME=wormhole_b0
export TT_METAL_HOME=/home/ubuntu/tt-metal
export PYTHONPATH=/home/ubuntu/tt-metal
export TT_METAL_ENV=dev

To Reproduce

  1. ./build_metal.sh -b Release --build-tt-train
  2. cd build/tt-train/
  3. ctest . (I didn't use ./build/tt-train/tests/ttml_tests)
  4. You can run only this test with the command ctest -R N300UtilsTest.TestXTensorReplicateAllReduce_96_768
  5. See the crash; the output stops before the last print

Expected behavior
The test should run to completion without crashing.

Please complete the following environment information:

  • OS: Ubuntu 20.04
  • Version of software: 9b84fb8
  • Device: N300

Additional context
There might be a deeper problem, so we would like to understand what's going on with this random operation. Thanks in advance.

@rfurko-tt added the bug, P0, and TT-train labels Jan 10, 2025
@rfurko-tt
Contributor Author

@dmakoviichuk-tt @davorchap FYI ^^

@cmaryanTT

@eyonland can you please take a look ASAP

@dmakoviichuk-tt
Contributor

Hi @cmaryanTT @eyonland, do you have any updates? This blocks our distributed training work.

@eyonland
Contributor

We will provide an update by COB.

@patrickroberts
Contributor

I've checked out ca2c867 as indicated in the bug report, and am unable to reproduce that issue. Can you provide a pipeline with the crash described or a proper reproducible example please? That would include, at the very least, a commit already containing the test you describe in the bug report, and a target to build and run.

Image (4 attached)

@rfurko-tt
Contributor Author

@patrickroberts it doesn't build for you? (based on image provided above)

@patrickroberts
Contributor

patrickroberts commented Jan 14, 2025

@patrickroberts it doesn't build for you? (based on image provided above)

As you can see, it builds, but the JIT compiler fails in a completely unrelated test. Please double check your git commit hash, because other than the change I've shown in the screenshot, my git status is clean.

@dmakoviichuk-tt
Contributor

@patrickroberts please confirm that you are using a machine with an n300 device.

@patrickroberts
Contributor

@dmakoviichuk-tt I have an n150 board, so I'll have to reserve a machine with an n300. In the meantime, can you provide more information about the crash? The build configuration (environment variables, CMake cache variables, etc.) would be helpful, as would a screenshot of the crash. Did it have a stack trace?

@rfurko-tt
Contributor Author

I've updated the commit (separate branch, includes the test) and provided the exact commands to build and run...

@rfurko-tt
Contributor Author

Previously provided commit also works for me.
Image

@eyonland assigned patrickroberts and unassigned eyonland Jan 14, 2025
@patrickroberts
Contributor

I've got a core dump. I'll rebuild in RelWithDebInfo so I can investigate more thoroughly, but for now here's a backtrace at least:

#0  0x00007fd1835d0650 in ttnn::operations::unary::UnaryDeviceOperation::compute_output_specs(ttnn::operations::unary::operation_attributes_t const&, ttnn::operations::unary::tensor_args_t const&) () from /home/proberts/tt-metal/build_Release/ttnn/_ttnn.so
#1  0x00007fd1835d0880 in ttnn::operations::unary::UnaryDeviceOperation::create_output_tensors(ttnn::operations::unary::operation_attributes_t const&, ttnn::operations::unary::tensor_args_t const&) () from /home/proberts/tt-metal/build_Release/ttnn/_ttnn.so
#2  0x000055fd22616436 in ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::launch_on_single_device<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&) ()
#3  0x000055fd22617d63 in ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::launch_on_multi_device<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&) ()
#4  0x000055fd226163fa in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<4ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::{lambda(auto:1&&)#1}>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::{lambda(auto:1&&)#1}>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) ()
#5  0x000055fd2261602f in ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&) ()
#6  0x00007fd182fe8872 in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 17ul>{std::__1::array<char, 17ul>{char [17]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)112, (char)114, (char)105, (char)109, (char)58, (char)58, (char)117, (char)110, (char)97, (char)114, (char)121}}}, ttnn::operations::unary::UnaryDeviceOperation, false>::invoke<tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam> > const&, tt::tt_metal::DataType&, tt::tt_metal::MemoryConfig&, bool&, bool&, bool&, std::__1::optional<tt::tt_metal::Tensor> const&>(unsigned char, tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam> > const&, tt::tt_metal::DataType&, tt::tt_metal::MemoryConfig&, bool&, bool&, bool&, std::__1::optional<tt::tt_metal::Tensor> const&) const () from /home/proberts/tt-metal/build_Release/ttnn/_ttnn.so
#7  0x00007fd182fe8631 in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 17ul>{std::__1::array<char, 17ul>{char [17]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)112, (char)114, (char)105, (char)109, (char)58, (char)58, (char)117, (char)110, (char)97, (char)114, (char)121}}}, ttnn::operations::unary::UnaryDeviceOperation, false>::operator()<unsigned char&, tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam> > const&, tt::tt_metal::DataType&, tt::tt_metal::MemoryConfig&, bool&, bool&, bool&, std::__1::optional<tt::tt_metal::Tensor> const&>(unsigned char&, tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam> > const&, tt::tt_metal::DataType&, tt::tt_metal::MemoryConfig&, bool&, bool&, bool&, std::__1::optional<tt::tt_metal::Tensor> const&) const () from /home/proberts/tt-metal/build_Release/ttnn/_ttnn.so
#8  0x00007fd1835f3a63 in ttnn::operations::unary::detail::unary_impl(unsigned char, tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam> > const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&, std::__1::optional<tt::tt_metal::Tensor> const&) () from /home/proberts/tt-metal/build_Release/ttnn/_ttnn.so
#9  0x00007fd1835f7ea0 in ttnn::operations::unary::ExecuteUnary<(ttnn::operations::unary::UnaryOpType)1>::invoke(tt::tt_metal::Tensor const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&, std::__1::optional<tt::tt_metal::Tensor> const&) () from /home/proberts/tt-metal/build_Release/ttnn/_ttnn.so
#10 0x000055fd2259be14 in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 16ul>{std::__1::array<char, 16ul>{char [16]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)114, (char)101, (char)99, (char)105, (char)112, (char)114, (char)111, (char)99, (char)97, (char)108}}}, ttnn::operations::unary::ExecuteUnary<(ttnn::operations::unary::UnaryOpType)1>, false>::invoke_composite<tt::tt_metal::Tensor&>(tt::tt_metal::Tensor&) const ()
#11 0x000055fd225890f8 in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 16ul>{std::__1::array<char, 16ul>{char [16]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)114, (char)101, (char)99, (char)105, (char)112, (char)114, (char)111, (char)99, (char)97, (char)108}}}, ttnn::operations::unary::ExecuteUnary<(ttnn::operations::unary::UnaryOpType)1>, false>::operator()<tt::tt_metal::Tensor&>(tt::tt_metal::Tensor&) const ()
#12 0x000055fd225858cb in N300UtilsTest_TestXTensorReplicateAllReduce_96_768_Test::TestBody() ()
#13 0x00007fd186326829 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () from /home/proberts/tt-metal/build_Release/lib/libgtest.so.1.13.0
#14 0x00007fd18630bdf8 in testing::Test::Run() () from /home/proberts/tt-metal/build_Release/lib/libgtest.so.1.13.0
#15 0x00007fd18630cedd in testing::TestInfo::Run() () from /home/proberts/tt-metal/build_Release/lib/libgtest.so.1.13.0
#16 0x00007fd18630db44 in testing::TestSuite::Run() () from /home/proberts/tt-metal/build_Release/lib/libgtest.so.1.13.0
#17 0x00007fd18631d6ad in testing::internal::UnitTestImpl::RunAllTests() () from /home/proberts/tt-metal/build_Release/lib/libgtest.so.1.13.0
#18 0x00007fd186327499 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) () from /home/proberts/tt-metal/build_Release/lib/libgtest.so.1.13.0
#19 0x00007fd18631d1bc in testing::UnitTest::Run() () from /home/proberts/tt-metal/build_Release/lib/libgtest.so.1.13.0
#20 0x00007fd18636a24e in main () from /home/proberts/tt-metal/build_Release/lib/libgmock_main.so.1.13.0
#21 0x00007fd18189e083 in __libc_start_main (main=0x7fd18636a210 <main>, argc=3, argv=0x7fffee73d598, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffee73d588) at ../csu/libc-start.c:308
#22 0x000055fd2254349e in _start ()

@patrickroberts
Contributor

💥

I AM HERE IN TEST 0
I AM HERE IN TEST 1
I AM HERE IN TEST 2
=================================================================
==1818403==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7f22fb6a3890 at pc 0x7f2301468a0e bp 0x7ffd06baadf0 sp 0x7ffd06baade8
READ of size 8 at 0x7f22fb6a3890 thread T0
    #0 0x7f2301468a0d in std::__1::shared_ptr<tt::tt_metal::Tensor::TensorAttributes>::operator->[abi:ue170006]() const /usr/lib/llvm-17/bin/../include/c++/v1/__memory/shared_ptr.h:884:16
    #1 0x7f2301468a0d in tt::tt_metal::Tensor::logical_shape() const /home/proberts/tt-metal/ttnn/cpp/ttnn/tensor/tensor.hpp:246:16
    #2 0x7f2301468a0d in ttnn::operations::unary::UnaryDeviceOperation::compute_output_specs(ttnn::operations::unary::operation_attributes_t const&, ttnn::operations::unary::tensor_args_t const&) /home/proberts/tt-metal/ttnn/cpp/ttnn/operations/eltwise/unary/device/unary_device_operation.cpp:176:49
    #3 0x7f2301468e45 in ttnn::operations::unary::UnaryDeviceOperation::create_output_tensors(ttnn::operations::unary::operation_attributes_t const&, ttnn::operations::unary::tensor_args_t const&) /home/proberts/tt-metal/ttnn/cpp/ttnn/operations/eltwise/unary/device/unary_device_operation.cpp:185:33
    #4 0x56228e25550c in ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::launch_on_single_device<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&) /home/proberts/tt-metal/ttnn/cpp/ttnn/device_operation.hpp:354:32
    #5 0x56228e297225 in ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::launch_on_multi_device<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&) /home/proberts/tt-metal/ttnn/cpp/ttnn/device_operation.hpp:463:31
    #6 0x56228e2553f5 in tt::tt_metal::Tensor ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&)::operator()<tt::tt_metal::MultiDeviceStorage const&>(ttnn::operations::unary::UnaryDeviceOperation&&) const /home/proberts/tt-metal/ttnn/cpp/ttnn/device_operation.hpp:494:24
    #7 0x56228e2553f5 in decltype(std::declval<ttnn::operations::unary::UnaryDeviceOperation>()(std::declval<tt::tt_metal::MultiDeviceStorage const&>())) std::__1::__invoke[abi:ue170006]<ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&), tt::tt_metal::MultiDeviceStorage const&>(ttnn::operations::unary::UnaryDeviceOperation&&, tt::tt_metal::MultiDeviceStorage const&) /usr/lib/llvm-17/bin/../include/c++/v1/__type_traits/invoke.h:340:25
    #8 0x56228e2553f5 in decltype(auto) std::__1::__variant_detail::__visitation::__variant::__value_visitor<ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&)>::operator()[abi:ue170006]<std::__1::__variant_detail::__alt<4ul, tt::tt_metal::MultiDeviceStorage> const&>(std::__1::__variant_detail::__alt<4ul, tt::tt_metal::MultiDeviceStorage> const&) const /usr/lib/llvm-17/bin/../include/c++/v1/variant:692:14
    #9 0x56228e2553f5 in decltype(std::declval<ttnn::operations::unary::UnaryDeviceOperation>()(std::declval<std::__1::__variant_detail::__alt<4ul, tt::tt_metal::MultiDeviceStorage> const&>())) std::__1::__invoke[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&)>, std::__1::__variant_detail::__alt<4ul, tt::tt_metal::MultiDeviceStorage> const&>(ttnn::operations::unary::UnaryDeviceOperation&&, std::__1::__variant_detail::__alt<4ul, tt::tt_metal::MultiDeviceStorage> const&) /usr/lib/llvm-17/bin/../include/c++/v1/__type_traits/invoke.h:340:25
    #10 0x56228e2553f5 in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<4ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&)>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>(ttnn::operations::unary::UnaryDeviceOperation, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) /usr/lib/llvm-17/bin/../include/c++/v1/variant:572:16
    #11 0x56228e254897 in decltype(auto) std::__1::__variant_detail::__visitation::__base::__visit_alt[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&)>, std::__1::__variant_detail::__impl<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>(ttnn::operations::unary::UnaryDeviceOperation&&, std::__1::__variant_detail::__impl<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) /usr/lib/llvm-17/bin/../include/c++/v1/variant:535:12
    #12 0x56228e254897 in decltype(auto) std::__1::__variant_detail::__visitation::__variant::__visit_alt[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&)>, std::__1::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>(ttnn::operations::unary::UnaryDeviceOperation&&, std::__1::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) /usr/lib/llvm-17/bin/../include/c++/v1/variant:642:12
    #13 0x56228e254897 in decltype(auto) std::__1::__variant_detail::__visitation::__variant::__visit_value[abi:ue170006]<ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&), std::__1::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>(ttnn::operations::unary::UnaryDeviceOperation&&, std::__1::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) /usr/lib/llvm-17/bin/../include/c++/v1/variant:661:12
    #14 0x56228e254897 in decltype(auto) std::__1::visit[abi:ue170006]<ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&)::'lambda'(ttnn::operations::unary::UnaryDeviceOperation&&), std::__1::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, void>(ttnn::operations::unary::UnaryDeviceOperation&&, std::__1::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) /usr/lib/llvm-17/bin/../include/c++/v1/variant:1759:10
    #15 0x56228e254897 in ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&) /home/proberts/tt-metal/ttnn/cpp/ttnn/device_operation.hpp:488:32
    #16 0x7f23000a3039 in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 17ul>{std::__1::array<char, 17ul>{char [17]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)112, (char)114, (char)105, (char)109, (char)58, (char)58, (char)117, (char)110, (char)97, (char)114, (char)121}}}, ttnn::operations::unary::UnaryDeviceOperation, false>::invoke<tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam>> const&, tt::tt_metal::DataType&, tt::tt_metal::MemoryConfig&, bool&, bool&, bool&, std::__1::optional<tt::tt_metal::Tensor> const&>(unsigned char, tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam>> const&, tt::tt_metal::DataType&, tt::tt_metal::MemoryConfig&, bool&, bool&, bool&, std::__1::optional<tt::tt_metal::Tensor> const&) const /home/proberts/tt-metal/ttnn/cpp/ttnn/decorators.hpp:229:16
    #17 0x7f23000a275f in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 17ul>{std::__1::array<char, 17ul>{char [17]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)112, (char)114, (char)105, (char)109, (char)58, (char)58, (char)117, (char)110, (char)97, (char)114, (char)121}}}, ttnn::operations::unary::UnaryDeviceOperation, false>::operator()<unsigned char&, tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam>> const&, tt::tt_metal::DataType&, tt::tt_metal::MemoryConfig&, bool&, bool&, bool&, std::__1::optional<tt::tt_metal::Tensor> const&>(unsigned char&, tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam>> const&, tt::tt_metal::DataType&, tt::tt_metal::MemoryConfig&, bool&, bool&, bool&, std::__1::optional<tt::tt_metal::Tensor> const&) const /home/proberts/tt-metal/ttnn/cpp/ttnn/decorators.hpp:315:23
    #18 0x7f23014f7352 in ttnn::operations::unary::detail::unary_impl(unsigned char, tt::tt_metal::Tensor const&, std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam>> const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&, std::__1::optional<tt::tt_metal::Tensor> const&) /home/proberts/tt-metal/ttnn/cpp/ttnn/operations/eltwise/unary/unary.cpp:38:12
    #19 0x7f2301506123 in ttnn::operations::unary::ExecuteUnary<(ttnn::operations::unary::UnaryOpType)1>::invoke(tt::tt_metal::Tensor const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&, std::__1::optional<tt::tt_metal::Tensor> const&) /home/proberts/tt-metal/ttnn/cpp/ttnn/operations/eltwise/unary/unary.cpp:67:12
    #20 0x56228e0974a2 in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 16ul>{std::__1::array<char, 16ul>{char [16]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)114, (char)101, (char)99, (char)105, (char)112, (char)114, (char)111, (char)99, (char)97, (char)108}}}, ttnn::operations::unary::ExecuteUnary<(ttnn::operations::unary::UnaryOpType)1>, false>::invoke_composite<tt::tt_metal::Tensor&>(tt::tt_metal::Tensor&) const /home/proberts/tt-metal/ttnn/cpp/ttnn/decorators.hpp:243:16
    #21 0x56228e043abe in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 16ul>{std::__1::array<char, 16ul>{char [16]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)114, (char)101, (char)99, (char)105, (char)112, (char)114, (char)111, (char)99, (char)97, (char)108}}}, ttnn::operations::unary::ExecuteUnary<(ttnn::operations::unary::UnaryOpType)1>, false>::invoke<tt::tt_metal::Tensor&>(tt::tt_metal::Tensor&) const /home/proberts/tt-metal/ttnn/cpp/ttnn/decorators.hpp:308:16
    #22 0x56228e043abe in auto ttnn::decorators::registered_operation_t<reflect::v1_1_1::fixed_string<char, 16ul>{std::__1::array<char, 16ul>{char [16]{(char)116, (char)116, (char)110, (char)110, (char)58, (char)58, (char)114, (char)101, (char)99, (char)105, (char)112, (char)114, (char)111, (char)99, (char)97, (char)108}}}, ttnn::operations::unary::ExecuteUnary<(ttnn::operations::unary::UnaryOpType)1>, false>::operator()<tt::tt_metal::Tensor&>(tt::tt_metal::Tensor&) const /home/proberts/tt-metal/ttnn/cpp/ttnn/decorators.hpp:315:23
    #23 0x56228e03794a in N300UtilsTest_TestXTensorReplicateAllReduce_96_768_Test::TestBody() /home/proberts/tt-metal/tt-train/tests/core/n300_utils_test.cpp:171:14
    #24 0x7f230a1a6828 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:2621:10
    #25 0x7f230a1a6828 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:2657:14
    #26 0x7f230a18bdf7 in testing::Test::Run() /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:2696:5
    #27 0x7f230a18cedc in testing::TestInfo::Run() /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:2845:11
    #28 0x7f230a18db43 in testing::TestSuite::Run() /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:3004:30
    #29 0x7f230a19d6ac in testing::internal::UnitTestImpl::RunAllTests() /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:5890:44
    #30 0x7f230a1a7498 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:2621:10
    #31 0x7f230a1a7498 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:2657:14
    #32 0x7f230a19d1bb in testing::UnitTest::Run() /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest.cc:5455:10
    #33 0x7f230a1ea24d in RUN_ALL_TESTS() /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/include/gtest/gtest.h:2314:73
    #34 0x7f230a1ea24d in main /home/proberts/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googlemock/src/gmock_main.cc:70:10
    #35 0x7f22fd859082 in __libc_start_main /build/glibc-LcI20x/glibc-2.31/csu/../csu/libc-start.c:308:16
    #36 0x56228de5453d in _start (/home/proberts/tt-metal/build_RelWithDebInfo/tt-train/tests/ttml_tests+0x21653d) (BuildId: 7eb7d5068278ad49db94c2808db0df02a9ced09d)

Address 0x7f22fb6a3890 is located in stack of thread T0 at offset 144 in frame
    #0 0x56228e296b6f in ttnn::operations::unary::UnaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::launch_on_multi_device<ttnn::operations::unary::UnaryDeviceOperation>(unsigned char, ttnn::operations::unary::UnaryDeviceOperation::operation_attributes_t const&, ttnn::operations::unary::UnaryDeviceOperation::tensor_args_t const&) /home/proberts/tt-metal/ttnn/cpp/ttnn/device_operation.hpp:423

  This frame has 8 object(s):
    [32, 96) 'ref.tmp.i.i.i.i.i.i.i'
    [128, 192) 'ref.tmp.i.i.i.i.i' <== Memory access at offset 144 is inside this variable
    [224, 232) 'get_shard.i' (line 362)
    [256, 320) 'first_tensor' (line 429)
    [352, 376) 'outputs' (line 435)
    [416, 432) 'ref.tmp57' (line 461)
    [448, 528) 'shard_tensor_args' (line 462)
    [560, 624) 'ref.tmp66' (line 463)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-use-after-scope /usr/lib/llvm-17/bin/../include/c++/v1/__memory/shared_ptr.h:884:16 in std::__1::shared_ptr<tt::tt_metal::Tensor::TensorAttributes>::operator->[abi:ue170006]() const
Shadow bytes around the buggy address:
  0x7f22fb6a3600: 00 00 00 00 f2 f2 f2 f2 f8 f8 f3 f3 00 00 00 00
  0x7f22fb6a3680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f22fb6a3700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f22fb6a3780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f22fb6a3800: f1 f1 f1 f1 f8 f8 f8 f8 f8 f8 f8 f8 f2 f2 f2 f2
=>0x7f22fb6a3880: f8 f8[f8]f8 f8 f8 f8 f8 f2 f2 f2 f2 f8 f2 f2 f2
  0x7f22fb6a3900: 00 00 00 00 00 00 00 00 f2 f2 f2 f2 00 00 00 f2
  0x7f22fb6a3980: f2 f2 f2 f2 f8 f8 f2 f2 00 00 00 00 00 00 00 00
  0x7f22fb6a3a00: 00 00 f2 f2 f2 f2 00 00 00 00 00 00 00 00 f3 f3
  0x7f22fb6a3a80: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f22fb6a3b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1818403==ABORTING
<end of output>
Test time = 360.82 sec
----------------------------------------------------------
Test Failed.
"N300UtilsTest.TestXTensorReplicateAllReduce_96_768" end time: Jan 15 19:11 UTC
"N300UtilsTest.TestXTensorReplicateAllReduce_96_768" time elapsed: 00:06:00
----------------------------------------------------------

I might have identified the problem. I'm compiling an attempted fix to test and will let you know if it works.

@dmakoviichuk-tt
Contributor

Hey @patrickroberts I've noticed a few things:

  1. The reciprocal op is registered slightly differently from the other ops:
REGISTER_UNARY_OPERATION(ltz, LTZ);
--
REGISTER_UNARY_OPERATION(neg, NEG);
REGISTER_UNARY_OPERATION(nez, NEZ);
REGISTER_UNARY_OPERATION_OVERLOAD(reciprocal, RECIP);
REGISTER_UNARY_OPERATION(relu, RELU);
REGISTER_UNARY_OPERATION(relu6, RELU6);

The overload variant overloads it against the composite op that takes complex parameters. As a result, reciprocal is registered without auto launch.
That might be why you get a race condition, such as an access to a tensor that is not yet set.

Overall, it means we should make sure that no simple ops are registered without auto launch. There might be more ops where people forgot to do it.
I am not 100% sure that this is the reason for the crash.
cc @ayerofieiev-tt and @sminakov-tt because it might be related to the compute output specs.

@patrickroberts
Contributor

Okay, so I confirmed what the issue was. ttnn::operations::unary::tensor_args_t stores the input as a reference. Because the tensor is sharded, that reference ended up dangling due to the control flow in launch_on_multi_device that creates shard_tensor_args from the overall tensor_args. Changing const Tensor& input to Tensor input fixed that issue, but then I ran into another one:

I AM HERE IN TEST 0
I AM HERE IN TEST 1
I AM HERE IN TEST 2
unknown file: Failure
C++ exception with description "unordered_map::at: key not found" thrown in the test body.

After debugging, the error was triggered by storage.get_buffer_for_device_id(shard_index) in launch_on_multi_device. I'm not sure how to resolve that, but I'm fairly confident this has nothing to do with the reciprocal op. It could be that the particular choice of reciprocal caused the optimizer to reorder some stack variables in a way that clobbered the dangling shared pointer and led to a segfault, whereas other ops may have arranged stack variables in just the right way to avoid clobbering the dangling reference. That's just a guess though. In any case, the problems start long before the control flow enters any operation-specific functions.
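
For anyone following along, here is a minimal sketch of the reference-versus-value difference described above. It is not the ttnn code: FakeTensor, ArgsByRef, ArgsByValue, make_shard_args and launch_on_shards are hypothetical names made up for illustration, and the real tensor_args_t and launch_on_multi_device are more involved. The hazardous variant is left in a comment so that the sketch itself has no undefined behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor whose state lives behind a shared_ptr.
struct FakeTensor {
    std::shared_ptr<std::string> attributes;
    explicit FakeTensor(std::string name)
        : attributes(std::make_shared<std::string>(std::move(name))) {}
};

// Hazardous shape (illustration only): the args struct does not own its input.
struct ArgsByRef {
    const FakeTensor& input;
};

// Safer shape: the args struct owns a copy, which is cheap here because
// copying FakeTensor only copies a shared_ptr.
struct ArgsByValue {
    FakeTensor input;
};

// Hypothetical per-shard helper. With ArgsByRef the equivalent would be:
//
//   ArgsByRef make_shard_args_bad(const std::vector<FakeTensor>& shards, std::size_t i) {
//       FakeTensor shard = shards.at(i);  // local object
//       return ArgsByRef{shard};          // reference dangles once this returns
//   }
//
// Reading the returned `input` afterwards is a use of a dead stack object.
ArgsByValue make_shard_args(const std::vector<FakeTensor>& shards, std::size_t i) {
    FakeTensor shard = shards.at(i);      // local object
    return ArgsByValue{std::move(shard)}; // ownership moves into the result
}

// Hypothetical launch loop: builds per-shard args and reads the input back.
std::string launch_on_shards(const std::vector<FakeTensor>& shards) {
    std::string seen;
    for (std::size_t i = 0; i < shards.size(); ++i) {
        ArgsByValue args = make_shard_args(shards, i);
        seen += *args.input.attributes;   // safe: args owns its tensor
    }
    return seen;
}
```

Storing by value trades a shared_ptr copy per shard for not having to reason about which scope owns the shard tensor, which is roughly the trade made by changing const Tensor& input to Tensor input.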

@patrickroberts
Contributor

patrickroberts commented Jan 16, 2025

FWIW this also passes 434d707 (this is without auto launch)
