Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug Report] Circular buffers fatal error on Wormhole N300s #17541

Closed
jasondavies opened this issue Feb 4, 2025 · 5 comments
Closed

[Bug Report] Circular buffers fatal error on Wormhole N300s #17541

jasondavies opened this issue Feb 4, 2025 · 5 comments
Assignees
Labels
bug Something isn't working community P1

Comments

@jasondavies
Copy link

Fatal error occurs whenever circular buffers are used.

To Reproduce

Run ./build/programming_examples/hello_world_datatypes_kernel

Expected behavior

No fatal error.

Error logs

$ ./build/programming_examples/hello_world_datatypes_kernel
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 2)
2025-02-04 15:26:47.860 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.31.0, IOMMU: disabled
2025-02-04 15:26:47.863 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-04 15:26:47.863 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {1}
2025-02-04 15:26:47.939 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.10.0 (Device 0)
2025-02-04 15:26:47.940 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.10.0 (Device 1)
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                 Always | FATAL    | Circular buffer indices overlap for KernelGroup 0 on programmable core type 0. Local end index 1, Remote start index 0
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_FATAL @ /home/jason/tt-metal/tt_metal/impl/program/program.cpp:635: max_local_cb_end_index <= min_remote_cb_start_index
info:
Circular buffer indices overlap for KernelGroup 0 on programmable core type 0. Local end index 1, Remote start index 0
backtrace:
 --- /home/jason/tt-metal/build_Release/tt_metal/libtt_metal.so(+0x14c905) [0x7fd1e6da8905]
 --- tt::tt_metal::detail::Program_::update_kernel_groups(unsigned int)
 --- tt::tt_metal::v0::Program::get_kernel_groups(unsigned int)
 --- void tt::tt_metal::program_dispatch::finalize_program_offsets<tt::tt_metal::v0::Program>(tt::tt_metal::v0::Program&, tt::tt_metal::v0::IDevice*)
 --- tt::tt_metal::HWCommandQueue::enqueue_program(tt::tt_metal::v0::Program&, bool)
 --- tt::tt_metal::v0::EnqueueProgram(tt::tt_metal::CommandQueue&, tt::tt_metal::v0::Program&, bool)
 --- ./build/programming_examples/hello_world_datatypes_kernel(+0x31e6) [0x558f9e8c51e6]
 --- /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fd1e645a083]
 --- ./build/programming_examples/hello_world_datatypes_kernel(+0x2b6e) [0x558f9e8c4b6e]

Aborted (core dumped)

Environment information:

  • OS: Ubuntu 20.04.06 LTS
  • Version of software: ba41836 (note: I've tried various older tagged releases but to no avail)
{
    "time": "2025-02-04T15:28:10.227349",
    "host_info": {
        "OS": "Linux",
        "Distro": "Ubuntu 20.04.6 LTS",
        "Kernel": "5.4.0-205-generic",
        "Hostname": "tt",
        "Platform": "x86_64",
        "Python": "3.8.10",
        "Memory": "125.77 GB",
        "Driver": "TT-KMD 1.31"
    },
    "host_sw_vers": {
        "tt_smi": "3.0.6",
        "pyluwen": "0.6.1"
    },
    "device_info": [
        {
            "smbus_telem": {
                "BOARD_ID": "0x10001451182214e",
                "ENUM_VERSION": "0xba5e0001",
                "DEVICE_ID": "0x401e1e52",
                "ASIC_RO": "0x2ddea",
                "ASIC_IDD": "0xa88",
                "BOARD_ID_HIGH": "0x1000145",
                "BOARD_ID_LOW": "0x1182214e",
                "ARC0_FW_VERSION": "0x21d0000",
                "ARC1_FW_VERSION": "0x21d0000",
                "ARC2_FW_VERSION": null,
                "ARC3_FW_VERSION": "0x21d0000",
                "SPIBOOTROM_FW_VERSION": "0x30b0000",
                "ETH_FW_VERSION": "0x6a000",
                "M3_BL_FW_VERSION": "0x81020000",
                "M3_APP_FW_VERSION": "0x50a0000",
                "DDR_SPEED": null,
                "DDR_STATUS": "0x2222222",
                "ETH_STATUS0": "0x11111111",
                "ETH_STATUS1": "0x11111133",
                "PCIE_STATUS": "0x10110000",
                "FAULTS": null,
                "ARC0_HEALTH": "0xe1ba99",
                "ARC1_HEALTH": "0x55b5c6",
                "ARC2_HEALTH": null,
                "ARC3_HEALTH": "0x8cf1",
                "FAN_SPEED": "0xffffffff",
                "AICLK": "0x3e803e8",
                "AXICLK": "0x384",
                "ARCCLK": "0x21c",
                "THROTTLER": null,
                "VCORE": "0x389",
                "ASIC_TEMPERATURE": "0x2050200",
                "VREG_TEMPERATURE": null,
                "BOARD_TEMPERATURE": "0x1a1f1e",
                "TDP": "0x550012",
                "TDC": "0xa00014",
                "VDD_LIMITS": "0x3e802d0",
                "THM_LIMITS": "0x53004b",
                "WH_FW_DATE": "0x4b01121f",
                "ASIC_TMON0": "0x1e232223",
                "ASIC_TMON1": "0x1823",
                "MVDDQ_POWER": "0x190000",
                "GDDR_TRAIN_TEMP0": null,
                "GDDR_TRAIN_TEMP1": null,
                "BOOT_DATE": "0x52040e1c",
                "RT_SECONDS": "0xe51",
                "AUX_STATUS": null,
                "ETH_DEBUG_STATUS0": "0xccddddcc",
                "ETH_DEBUG_STATUS1": "0xccdddd00",
                "TT_FLASH_VERSION": "0x30101",
                "FW_BUNDLE_VERSION": "0x500e0000"
            },
            "board_info": {
                "bus_id": "0000:03:00.0",
                "board_type": "n300 L",
                "board_id": "10001451182214e",
                "coords": "(0, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": 2,
                "pcie_width": "16"
            },
            "telemetry": {
                "voltage": "0.91",
                "current": " 20.0",
                "power": " 18.0",
                "aiclk": "1000",
                "asic_temperature": "32.0"
            },
            "firmwares": {
                "fw_bundle_version": "80.14.0.0",
                "tt_flash_version": "0.3.1.1",
                "cm_fw": "2.29.0.0",
                "cm_fw_date": "2024-11-01",
                "eth_fw": "6.10.0",
                "bm_bl_fw": "129.2.0.0",
                "bm_app_fw": "5.10.0.0"
            },
            "limits": {
                "vdd_min": "0.72",
                "vdd_max": "1.00",
                "tdp_limit": " 85",
                "tdc_limit": "160",
                "asic_fmax": "1000",
                "therm_trip_l1_limit": "83",
                "thm_limit": "75",
                "bus_peak_limit": null
            }
        },
        {
            "smbus_telem": {
                "BOARD_ID": "0x10001451182214e",
                "ENUM_VERSION": "0xba5e0001",
                "DEVICE_ID": null,
                "ASIC_RO": "0x2c66c",
                "ASIC_IDD": "0x7d0",
                "BOARD_ID_HIGH": "0x1000145",
                "BOARD_ID_LOW": "0x1182214e",
                "ARC0_FW_VERSION": "0x21d0000",
                "ARC1_FW_VERSION": "0x21d0000",
                "ARC2_FW_VERSION": null,
                "ARC3_FW_VERSION": "0x21d0000",
                "SPIBOOTROM_FW_VERSION": "0x30b0000",
                "ETH_FW_VERSION": "0x6a000",
                "M3_BL_FW_VERSION": "0x81020000",
                "M3_APP_FW_VERSION": "0x50a0000",
                "DDR_SPEED": null,
                "DDR_STATUS": "0x2222222",
                "ETH_STATUS0": "0x11111122",
                "ETH_STATUS1": "0x11111111",
                "PCIE_STATUS": null,
                "FAULTS": null,
                "ARC0_HEALTH": "0xe1b9b7",
                "ARC1_HEALTH": "0x55b56a",
                "ARC2_HEALTH": null,
                "ARC3_HEALTH": "0x8cf1",
                "FAN_SPEED": "0xffffffff",
                "AICLK": "0x3e803e8",
                "AXICLK": "0x384",
                "ARCCLK": "0x21c",
                "THROTTLER": null,
                "VCORE": "0x39d",
                "ASIC_TEMPERATURE": "0x1e401e1",
                "VREG_TEMPERATURE": null,
                "BOARD_TEMPERATURE": "0x1a1f1e",
                "TDP": "0x550012",
                "TDC": "0xa00013",
                "VDD_LIMITS": "0x3e802d0",
                "THM_LIMITS": "0x53004b",
                "WH_FW_DATE": "0x4b01121f",
                "ASIC_TMON0": "0x2117211c",
                "ASIC_TMON1": "0x2021",
                "MVDDQ_POWER": "0x190000",
                "GDDR_TRAIN_TEMP0": null,
                "GDDR_TRAIN_TEMP1": null,
                "BOOT_DATE": "0x52040e1c",
                "RT_SECONDS": "0xe51",
                "AUX_STATUS": null,
                "ETH_DEBUG_STATUS0": "0xccdddd00",
                "ETH_DEBUG_STATUS1": "0xdddddddd",
                "TT_FLASH_VERSION": "0x30101",
                "FW_BUNDLE_VERSION": "0x500e0000"
            },
            "board_info": {
                "bus_id": "N/A",
                "board_type": "n300 R",
                "board_id": "10001451182214e",
                "coords": "(1, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": "N/A",
                "pcie_width": "N/A"
            },
            "telemetry": {
                "voltage": "0.93",
                "current": " 19.0",
                "power": " 18.0",
                "aiclk": "1000",
                "asic_temperature": "30.1"
            },
            "firmwares": {
                "fw_bundle_version": "80.14.0.0",
                "tt_flash_version": "0.3.1.1",
                "cm_fw": "2.29.0.0",
                "cm_fw_date": "2024-11-01",
                "eth_fw": "6.10.0",
                "bm_bl_fw": "129.2.0.0",
                "bm_app_fw": "5.10.0.0"
            },
            "limits": {
                "vdd_min": "0.72",
                "vdd_max": "1.00",
                "tdp_limit": " 85",
                "tdc_limit": "160",
                "asic_fmax": "1000",
                "therm_trip_l1_limit": "83",
                "thm_limit": "75",
                "bus_peak_limit": null
            }
        }
    ]
}

Additional context

Running ./build/programming_examples/hello_world_compute_kernel works fine.

I'm using a relatively old Xeon CPU with only AVX1 support, hence I've commented out AVX2 with the following patch:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7693419b4c..5351c62c08 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -196,7 +196,7 @@ add_compile_options(
     -Wswitch
     -Wuninitialized
     -Wno-unused-parameter
-    -mavx2
+    -mavx
     -fPIC
     -fvisibility-inlines-hidden
     -fno-lto # FIXME: This seems to be here for ttnn; it should go to TTNN, then.
diff --git a/tt_metal/api/tt-metalium/bfloat4.hpp b/tt_metal/api/tt-metalium/bfloat4.hpp
index 601a779618..54f20f85df 100644
--- a/tt_metal/api/tt-metalium/bfloat4.hpp
+++ b/tt_metal/api/tt-metalium/bfloat4.hpp
@@ -39,6 +39,8 @@ inline std::vector<float> unpack_bfp4_tiles_into_float_vec(
     bool row_major_output,
     bool is_exp_a,
     const std::optional<tt::tt_metal::Tile>& tile = std::nullopt) {
+    std::vector<float> float_vec;
+    /*
     ZoneScoped;
 
     uint32_t l1_alignment = tt::tt_metal::hal.get_alignment(tt::tt_metal::HalMemType::L1);
@@ -243,6 +245,7 @@ inline std::vector<float> unpack_bfp4_tiles_into_float_vec(
             }
         }
     }
+*/
     return float_vec;
 }
 
diff --git a/tt_metal/api/tt-metalium/bfloat8.hpp b/tt_metal/api/tt-metalium/bfloat8.hpp
index fb3f288d90..4da7f4da0b 100644
--- a/tt_metal/api/tt-metalium/bfloat8.hpp
+++ b/tt_metal/api/tt-metalium/bfloat8.hpp
@@ -112,6 +112,8 @@ inline std::vector<float> unpack_bfp8_tiles_into_float_vec(
     bool row_major_output,
     bool is_exp_a,
     const std::optional<tt::tt_metal::Tile>& tile = std::nullopt) {
+    std::vector<float> float_vec;
+    /*
     ZoneScoped;
 
     uint32_t l1_alignment = tt::tt_metal::hal.get_alignment(tt::tt_metal::HalMemType::L1);
@@ -255,6 +257,7 @@ inline std::vector<float> unpack_bfp8_tiles_into_float_vec(
             }
         }
     }
+*/
     return float_vec;
 }
@jasondavies jasondavies added the bug Something isn't working label Feb 4, 2025
@ayerofieiev-tt
Copy link
Member

@jasondavies please check on the latest main. We can’t reproduce the issue locally and I remember seeing a similar issue fixed recently.

@jasondavies
Copy link
Author

@jasondavies please check on the latest main. We can’t reproduce the issue locally and I remember seeing a similar issue fixed recently.

I checked using ba41836 (a commit on main 2 hours ago) and still no joy.

@tt-aho
Copy link
Contributor

tt-aho commented Feb 4, 2025

Hey @jasondavies , thanks for flagging. We think the issue is due to us calling some intrinsics with certain inputs that would result in undefined behaviour, which is why we weren't able to reproduce, and since you were using an older cpu, the resulting behaviour/return value could differ. Could you try out the changes in this PR and let us know if this resolves the issue for you?

#17549

@jasondavies
Copy link
Author

Excellent, that fixed it. Thanks for looking into it so quickly!

@tt-aho
Copy link
Contributor

tt-aho commented Feb 4, 2025

I've merged the fix to main, so closing the issue. Let me know if the issue still persists and we can reopen.

@tt-aho tt-aho closed this as completed Feb 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working community P1
Projects
None yet
Development

No branches or pull requests

4 participants