
MatmulMultiCoreMultiDRAMIn0MCastIn1MCast ND hanging on BH #12187

Closed · Fixed by #12475

abhullar-tt opened this issue Sep 3, 2024 · 9 comments
abhullar-tt commented Sep 3, 2024

#11623 has been tracking ND hangs on BH CI.

Ran for a small number of iterations locally on IRD machines without seeing the hang, but increasing the test iterations reproduced the hang on IRD (I used yyz-bh-26 and saw the hang after ~20 iterations).

Reducing the sub-block dimensions (merged to main in #12186) enables the test to run for 1000 iterations (on yyz-bh-26).

To run the test with the workaround:

TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/unit_tests --gtest_filter=CommonFixture.MatmulMultiCoreMultiDRAMIn0MCastIn1MCast --gtest_repeat=<N>

Without the workaround:

TT_METAL_DISABLE_BH_ND_WORKAROUND=1 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/unit_tests --gtest_filter=CommonFixture.MatmulMultiCoreMultiDRAMIn0MCastIn1MCast --gtest_repeat=<N>
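
For repeated local runs, a small driver script can flag a likely hang via a per-run timeout. This is an illustrative sketch only: the binary path, gtest filter, and environment variables are copied from the commands above, while the script itself, the iteration count, and the timeout value are assumptions (it also launches one process per iteration instead of using --gtest_repeat).

import os
import subprocess

CMD = [
    "./build/test/tt_metal/unit_tests",
    "--gtest_filter=CommonFixture.MatmulMultiCoreMultiDRAMIn0MCastIn1MCast",
]

env = os.environ.copy()
env["TT_METAL_SLOW_DISPATCH_MODE"] = "1"
# Uncomment to disable the sub-block workaround from #12186:
# env["TT_METAL_DISABLE_BH_ND_WORKAROUND"] = "1"

for i in range(1000):
    try:
        # Treat a run that exceeds the (arbitrary) timeout as a suspected hang.
        subprocess.run(CMD, env=env, check=True, timeout=300)
    except subprocess.TimeoutExpired:
        print(f"suspected hang on iteration {i}")
        break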
abhullar-tt (author) commented:

Usually the hang is seen on a single core, with a writer kernel stuck waiting for data to be populated in its CB.

Ran the test without the matmul sub-block reduction workaround after reducing the aiclk to 400 MHz (from 800 MHz, using the syseng script to set the PLLs), and now all cores hang in the same spot. This points to the bug not being di/dt related.

bbradelTT commented:

> Usually the hang is seen on a single core, with a writer kernel stuck waiting for data to be populated in its CB.
>
> Ran the test without the matmul sub-block reduction workaround after reducing the aiclk to 400 MHz (from 800 MHz, using the syseng script to set the PLLs), and now all cores hang in the same spot. This points to the bug not being di/dt related.

@yugaoTT could you please take a look and see if there's something that is handling non-1 parameters incorrectly?

bbradelTT commented:

There's a good chance this is related to #12220 and the problematic change occurred between Aug. 21st 2024 and Sep. 4th 2024.

yugaoTT commented Sep 6, 2024

From my end, slow dispatch (SD) hangs almost always by 5 iterations, but fast dispatch (FD) passed for 10K iterations.
From a separate matmul 2D test, the link bits are causing hangs; setting them to false allows the matmul to pass.
Here is the unit test that I ran:


@pytest.mark.parametrize("has_bias", [False], ids=["no_bias"])
@pytest.mark.parametrize(
    "in1_in_dram, out_sharded, in0_sharded, M, K, N, activation, dtype, fidelity",
    [
        # 256 256 256
        (
            False,
            True,
            True,
            2048,
            2048,
            2048,
            None,
            ttnn.bfloat8_b,
            ttnn.MathFidelity.LoFi,
        ),
    ],
)
def test_single_core_matmul(
    device,
    dtype,
    fidelity,
    in0_sharded,
    out_sharded,
    in1_in_dram,
    has_bias,
    M,
    K,
    N,
    activation,
    function_level_defaults,
):
    in0_shape = [1, 1, M, K]
    in1_shape = [1, 1, K, N]
    bias_shape = [1, 1, N]
    grid_size = (8, 8)

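    # per-core block dimensions, expressed in 32x32 tiles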
    in0_block_w = K // grid_size[1] // 32  # 16
    in0_block_h = M // grid_size[0] // 32
    out_block_h = M // grid_size[0] // 32
    out_block_w = N // grid_size[1] // 32

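    # choose output sub-block dims with out_subblock_h * out_subblock_w <= 8 tiles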
    if out_block_w <= 8:
        out_subblock_w = out_block_w
        out_subblock_h = 8 // out_subblock_w
    else:
        out_subblock_h = 1
        out_subblock_w = 8 // out_subblock_h
        while out_block_w % out_subblock_w != 0:
            out_subblock_w = out_block_w // 2

    logger.debug("in0 block w h " + str(in0_block_w * 32) + " " + str(in0_block_h * 32))
    logger.debug("in1 block w h " + str(out_block_w * 32) + " " + str(in0_block_w * 32))
    logger.debug("out block w h " + str(out_block_w * 32) + " " + str(out_block_h * 32))
    logger.debug("out subblock w h " + str(out_subblock_w * 32) + " " + str(out_subblock_h * 32))

    interleaved_mem_config_L1 = ttnn.MemoryConfig(
        memory_layout=ttnn.TensorMemoryLayout.INTERLEAVED,
        buffer_type=ttnn.BufferType.L1,
    )
    interleaved_mem_config_DRAM = ttnn.MemoryConfig(
        memory_layout=ttnn.TensorMemoryLayout.INTERLEAVED,
        buffer_type=ttnn.BufferType.DRAM,
    )
    sharded_mem_config = ttnn.MemoryConfig(
        memory_layout=ttnn.TensorMemoryLayout.BLOCK_SHARDED,
        buffer_type=ttnn.BufferType.L1,
    )

    in0 = torch.randn(in0_shape).bfloat16().float()
    in1 = torch.randn(in1_shape).bfloat16().float()
    bias = torch.randn(bias_shape).bfloat16().float()

    in0_t = torch2tt_tensor(in0, device, tt_memory_config=interleaved_mem_config_L1, tt_dtype=ttnn.bfloat8_b)

    in1_t = torch2tt_tensor(in1, device, tt_memory_config=interleaved_mem_config_L1, tt_dtype=ttnn.bfloat8_b)

    output_mem_config = sharded_mem_config if out_sharded else interleaved_mem_config_L1
    bias_t = pad_by_zero(bias, device, tt_memory_config=interleaved_mem_config_L1, tt_dtype=ttnn.bfloat8_b)[0]

    if in0_sharded:
        in0_t = ttnn.interleaved_to_sharded(
            in0_t,
            grid_size,
            [M // grid_size[0], K // grid_size[1]],
            ttnn.TensorMemoryLayout.BLOCK_SHARDED,
            ttnn.ShardOrientation.COL_MAJOR,
        )

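    # 2D mcast matmul program config; each core produces an out_block_h x out_block_w block of tiles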
    program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=grid_size,
        in0_block_w=in0_block_w,
        out_subblock_h=out_subblock_h,
        out_subblock_w=out_subblock_w,
        per_core_M=out_block_h,
        per_core_N=out_block_w,
        transpose_mcast=True,
        fused_activation=activation,
    )

    if has_bias:
        for _ in range(10000):
            output_t = ttnn.linear(
                in0_t,
                in1_t,
                bias=bias_t,
                program_config=program_config,
                memory_config=output_mem_config,
            )
    else:
        for _ in range(10000):
            output_t = ttnn.matmul(
                in0_t,
                in1_t,
                program_config=program_config,
                memory_config=output_mem_config,
            )

    if out_sharded:
        output_t = ttnn.sharded_to_interleaved(output_t, interleaved_mem_config_L1)

    pt_out = in0 @ in1

    if has_bias:
        pt_out = pt_out + bias

    if activation != None:
        pt_out = torch.nn.functional.gelu(pt_out)
    tt_out = tt2torch_tensor(output_t)

    passing, output = comp_pcc(pt_out, tt_out)
    logger.info(output)
    assert passing

bbradelTT commented:

This is due to #11519 (PR: #11520).

abhullar-tt (author) commented:

This runs 10k iterations on fast dispatch without the workaround, so the issue looks to be related to slow dispatch only. Removing the di/dt-suspected label.

yugaoTT commented Sep 10, 2024

On BH the cmd_buffer is known to have issues, so cmd_buffer_fifo needs to be turned on.
Instead of using noc_cmd_buf_ready and the cmd_buffer for sending out mcast requests (as well as other read/write requests), use the cmd_buffer_fifo and CMD_BUF_AVAIL.

bbradelTT commented:

@abhullar-tt Is there an issue tracking a fix for the problems with the BH cmd_buffer, or will they remain and cmd_buffer_fifo always need to be on?

abhullar-tt (author) commented:

> @abhullar-tt Is there an issue tracking a fix for the problems with the BH cmd_buffer, or will they remain and cmd_buffer_fifo always need to be on?

Verifying that the cmd buffers are functional on new BH hardware is tracked in #5174.
