Check context switch pointer for eth cores before resetting #17212

Merged
merged 1 commit into from
Jan 28, 2025

Contributor

@aliuTT aliuTT commented Jan 28, 2025

Ticket

Link to Github Issue

Problem description

Today we assume that tt-smi randomizes L1 on ethernet cores, so if we read the launch msg and it isn't RUN_MSG_GO, we take that to mean the machine has been reset and that erisc app FW is not live. But tt-smi doesn't guarantee randomization on future systems, and a reboot does not touch L1, so I want to move the "just reset/rebooted" check to read a flag from eth L1 instead. On WH we can use LAUNCH_ERISC_APP_FLAG, which is always set to zero on reset.

What's changed

Change reset_cores to check the erisc app flag, which is always cleared on reset/reboot.
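
To illustrate the difference between the two checks, here is a minimal host-side sketch. It is not the code in this PR: `EthL1Snapshot`, `fw_live_by_launch_msg`, `fw_live_by_flag`, `stale_l1_after_reboot`, and the value of `kRunMsgGo` are hypothetical names and placeholders chosen for the example, and the "stale GO value survives a reboot" scenario is an assumption based on the description above.

```cpp
#include <cstdint>

// Hypothetical snapshot of the two eth L1 words the host might inspect.
struct EthL1Snapshot {
    std::uint32_t launch_msg_run;     // run field of the launch message
    std::uint32_t launch_erisc_flag;  // LAUNCH_ERISC_APP_FLAG, zero after reset
};

// Placeholder value for illustration only; not taken from the firmware headers.
constexpr std::uint32_t kRunMsgGo = 0x80;

// Old heuristic: only trustworthy if L1 was randomized since the last run.
inline bool fw_live_by_launch_msg(const EthL1Snapshot& s) {
    return s.launch_msg_run == kRunMsgGo;
}

// New check: the flag is nonzero only if app FW was launched after the last reset.
inline bool fw_live_by_flag(const EthL1Snapshot& s) {
    return s.launch_erisc_flag != 0;
}

// Assumed scenario: a reboot leaves L1 untouched, so an old GO value is still
// present in the launch message while the flag reads back as zero.
inline EthL1Snapshot stale_l1_after_reboot() {
    return EthL1Snapshot{kRunMsgGo, 0};
}
```

In that scenario the launch-message heuristic would report live firmware while the flag check would not, which is the case the flag-based check is meant to cover.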

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • (For models and ops writers) Full new models tests pass
  • New/Existing tests provide coverage for changes

(tt::Cluster::instance().is_ethernet_core(virtual_core, this->id())),
"Invalid core type for context switch check");
auto core_type_idx = hal.get_programmable_core_type_index(HalProgrammableCoreType::ACTIVE_ETH);
std::uint32_t launch_erisc_addr = tt::tt_metal::hal.get_jit_build_config(core_type_idx, 0, 0).fw_launch_addr;
Contributor

Can we just drop a comment here describing why it's this specific address and what value we're looking for? Also, just to confirm - this flag is 0 after a reset, but upon running it at some point becomes nonzero?

Contributor Author

Yes, I can add a comment. The flag is 0 after reset; when we launch the erisc.cc app FW, we write 1 to it.
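
The lifecycle described in this reply can be sketched as a tiny model. This is only an illustration under stated assumptions: `EriscLaunchFlag` and its methods are hypothetical names, not part of tt-metal, and the initial garbage value just stands in for whatever L1 held before the reset.

```cpp
#include <cstdint>

// Hypothetical model of the single flag word in eth L1.
class EriscLaunchFlag {
public:
    // Reset/reboot leaves the flag at zero.
    void on_reset() { value_ = 0; }

    // Launching the erisc app FW writes 1 to the flag.
    void on_app_fw_launch() { value_ = 1; }

    std::uint32_t read() const { return value_; }

    // Host-side interpretation: nonzero means app FW was launched since reset.
    bool app_fw_launched() const { return value_ != 0; }

private:
    std::uint32_t value_ = 0xDEADBEEF;  // arbitrary contents before the first reset
};
```

A host check built on this model would treat a read of 0 as "just reset, app FW not live" and any nonzero read as "app FW has been launched".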


// We should be able to remove the check for WORMHOLE_B0 once we assert eth cores for BH
if ((is_cooperative_eth and erisc_app_still_running(virtual_core)) or
    (not is_cooperative_eth and kernel_still_running(launch_msg, go_signal))) {
Contributor

I don't think we need to do any of this logic for BH eth cores

Contributor Author

I removed the logic here. Added a TODO to add BH eth cores to the assert reset in this function.

@aliuTT aliuTT force-pushed the aliu/check-cs-flag-for-eth-reset branch from ea1464a to eb29cdd Compare January 28, 2025 16:53
@aliuTT aliuTT merged commit 76aa97d into main Jan 28, 2025
193 of 195 checks passed
@aliuTT aliuTT deleted the aliu/check-cs-flag-for-eth-reset branch January 28, 2025 19:29
3 participants