
MatmulMultiCoreMultiDRAMIn0MCastIn1MCast ND hanging on BH #12187

Closed · Fixed by #12475

abhullar-tt opened this issue Sep 3, 2024 · 9 comments
abhullar-tt commented Sep 3, 2024

#11623 has been tracking ND hangs on BH CI.

Ran for a small number of iterations locally on IRD machines without seeing the hang, but increasing the test iterations reproduced the hang on IRD (I used yyz-bh-26 and saw the hang after ~20 iterations).

Reducing the sub-block dimensions (merged to main in #12186) enables the test to run for 1000 iterations (on yyz-bh-26).

To run the test with the workaround:

TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/unit_tests --gtest_filter=CommonFixture.MatmulMultiCoreMultiDRAMIn0MCastIn1MCast --gtest_repeat=<N>

Without the workaround:

TT_METAL_DISABLE_BH_ND_WORKAROUND=1 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/unit_tests --gtest_filter=CommonFixture.MatmulMultiCoreMultiDRAMIn0MCastIn1MCast --gtest_repeat=<N>
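
For repeated local runs, a small driver script can flag a likely hang via a per-run timeout. This is an illustrative sketch only: the binary path, gtest filter, and environment variables are copied from the commands above, while the script itself, the iteration count, and the timeout value are assumptions (it also launches one process per iteration instead of using --gtest_repeat).

import os
import subprocess

CMD = [
    "./build/test/tt_metal/unit_tests",
    "--gtest_filter=CommonFixture.MatmulMultiCoreMultiDRAMIn0MCastIn1MCast",
]

env = os.environ.copy()
env["TT_METAL_SLOW_DISPATCH_MODE"] = "1"
# Uncomment to disable the sub-block workaround from #12186:
# env["TT_METAL_DISABLE_BH_ND_WORKAROUND"] = "1"

for i in range(1000):
    try:
        # Treat a run that exceeds the (arbitrary) timeout as a suspected hang.
        subprocess.run(CMD, env=env, check=True, timeout=300)
    except subprocess.TimeoutExpired:
        print(f"suspected hang on iteration {i}")
        break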
abhullar-tt (author) commented:

Usually the hang is seen on a single core, with a writer kernel stuck waiting for data to be populated in its CB.

Ran the test without the matmul sub-block reduction workaround after reducing the aiclk to 400 MHz (from 800 MHz, using the syseng script to set the PLLs), and now all cores hang in the same spot. This points to the bug not being di/dt related.

bbradelTT commented:

> Usually the hang is seen on a single core, with a writer kernel stuck waiting for data to be populated in its CB.
>
> Ran the test without the matmul sub-block reduction workaround after reducing the aiclk to 400 MHz (from 800 MHz, using the syseng script to set the PLLs), and now all cores hang in the same spot. This points to the bug not being di/dt related.

@yugaoTT could you please take a look and see if there's something that is handling non-1 parameters incorrectly?

bbradelTT commented:

There's a good chance this is related to #12220 and the problematic change occurred between Aug. 21st 2024 and Sep. 4th 2024.

yugaoTT commented Sep 6, 2024

From my end, slow dispatch (SD) hangs almost always by 5 iterations, but fast dispatch (FD) passed for 10K iterations.
From a separate matmul 2D test, the link bits are causing hangs; setting them to false allows the matmul to pass.
Here is the unit test that I ran:


@pytest.mark.parametrize("has_bias", [False], ids=["no_bias"])
@pytest.mark.parametrize(
    "in1_in_dram, out_sharded, in0_sharded, M, K, N, activation, dtype, fidelity",
    [
        # 256 256 256
        (
            False,
            True,
            True,
            2048,
            2048,
            2048,
            None,
            ttnn.bfloat8_b,
            ttnn.MathFidelity.LoFi,
        ),
    ],
)
def test_single_core_matmul(
    device,
    dtype,
    fidelity,
    in0_sharded,
    out_sharded,
    in1_in_dram,
    has_bias,
    M,
    K,
    N,
    activation,
    function_level_defaults,
):
    in0_shape = [1, 1, M, K]
    in1_shape = [1, 1, K, N]
    bias_shape = [1, 1, N]
    grid_size = (8, 8)

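    # per-core block dimensions, expressed in 32x32 tiles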
    in0_block_w = K // grid_size[1] // 32  # 16
    in0_block_h = M // grid_size[0] // 32
    out_block_h = M // grid_size[0] // 32
    out_block_w = N // grid_size[1] // 32

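    # choose output sub-block dims with out_subblock_h * out_subblock_w <= 8 tiles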
    if out_block_w <= 8:
        out_subblock_w = out_block_w
        out_subblock_h = 8 // out_subblock_w
    else:
        out_subblock_h = 1
        out_subblock_w = 8 // out_subblock_h
        while out_block_w % out_subblock_w != 0:
            out_subblock_w = out_block_w // 2

    logger.debug("in0 block w h " + str(in0_block_w * 32) + " " + str(in0_block_h * 32))
    logger.debug("in1 block w h " + str(out_block_w * 32) + " " + str(in0_block_w * 32))
    logger.debug("out block w h " + str(out_block_w * 32) + " " + str(out_block_h * 32))
    logger.debug("out subblock w h " + str(out_subblock_w * 32) + " " + str(out_subblock_h * 32))

    interleaved_mem_config_L1 = ttnn.MemoryConfig(
        memory_layout=ttnn.TensorMemoryLayout.INTERLEAVED,
        buffer_type=ttnn.BufferType.L1,
    )
    interleaved_mem_config_DRAM = ttnn.MemoryConfig(
        memory_layout=ttnn.TensorMemoryLayout.INTERLEAVED,
        buffer_type=ttnn.BufferType.DRAM,
    )
    sharded_mem_config = ttnn.MemoryConfig(
        memory_layout=ttnn.TensorMemoryLayout.BLOCK_SHARDED,
        buffer_type=ttnn.BufferType.L1,
    )

    in0 = torch.randn(in0_shape).bfloat16().float()
    in1 = torch.randn(in1_shape).bfloat16().float()
    bias = torch.randn(bias_shape).bfloat16().float()

    in0_t = torch2tt_tensor(in0, device, tt_memory_config=interleaved_mem_config_L1, tt_dtype=ttnn.bfloat8_b)

    in1_t = torch2tt_tensor(in1, device, tt_memory_config=interleaved_mem_config_L1, tt_dtype=ttnn.bfloat8_b)

    output_mem_config = sharded_mem_config if out_sharded else interleaved_mem_config_L1
    bias_t = pad_by_zero(bias, device, tt_memory_config=interleaved_mem_config_L1, tt_dtype=ttnn.bfloat8_b)[0]

    if in0_sharded:
        in0_t = ttnn.interleaved_to_sharded(
            in0_t,
            grid_size,
            [M // grid_size[0], K // grid_size[1]],
            ttnn.TensorMemoryLayout.BLOCK_SHARDED,
            ttnn.ShardOrientation.COL_MAJOR,
        )

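    # 2D mcast matmul program config; each core produces an out_block_h x out_block_w block of tiles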
    program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=grid_size,
        in0_block_w=in0_block_w,
        out_subblock_h=out_subblock_h,
        out_subblock_w=out_subblock_w,
        per_core_M=out_block_h,
        per_core_N=out_block_w,
        transpose_mcast=True,
        fused_activation=activation,
    )

    if has_bias:
        for _ in range(10000):
            output_t = ttnn.linear(
                in0_t,
                in1_t,
                bias=bias_t,
                program_config=program_config,
                memory_config=output_mem_config,
            )
    else:
        for _ in range(10000):
            output_t = ttnn.matmul(
                in0_t,
                in1_t,
                program_config=program_config,
                memory_config=output_mem_config,
            )

    if out_sharded:
        output_t = ttnn.sharded_to_interleaved(output_t, interleaved_mem_config_L1)

    pt_out = in0 @ in1

    if has_bias:
        pt_out = pt_out + bias

    if activation != None:
        pt_out = torch.nn.functional.gelu(pt_out)
    tt_out = tt2torch_tensor(output_t)

    passing, output = comp_pcc(pt_out, tt_out)
    logger.info(output)
    assert passing

bbradelTT commented:

This is due to #11519 (PR: #11520).

abhullar-tt (author) commented:

This runs 10k iterations on fast dispatch without the workaround, so the issue looks to be related to slow dispatch only. Removing the di/dt-suspected label.

yugaoTT commented Sep 10, 2024

On BH the cmd_buffer is known to have issues, so cmd_buffer_fifo needs to be turned on.
Instead of using noc_cmd_buf_ready and the cmd_buffer for sending out mcast requests (as well as other read/write requests), use the cmd_buffer_fifo and CMD_BUF_AVAIL.

bbradelTT commented:

@abhullar-tt Is there an issue tracking a fix for the problems with the BH cmd_buffer, or will they remain and cmd_buffer_fifo always need to be on?

abhullar-tt (author) commented:

> @abhullar-tt Is there an issue tracking a fix for the problems with the BH cmd_buffer, or will they remain and cmd_buffer_fifo always need to be on?

Verifying that the cmd buffers are functional on new BH hardware is tracked in #5174.
