Check context switch pointer for eth cores before resetting #17212
(tt::Cluster::instance().is_ethernet_core(virtual_core, this->id())),
"Invalid core type for context switch check");
auto core_type_idx = hal.get_programmable_core_type_index(HalProgrammableCoreType::ACTIVE_ETH);
std::uint32_t launch_erisc_addr = tt::tt_metal::hal.get_jit_build_config(core_type_idx, 0, 0).fw_launch_addr;
Can we just drop a comment here describing why it's this specific address and what value we're looking for? Also, just to confirm - this flag is 0 after a reset, but upon running it at some point becomes nonzero?
Yes, I can add a comment. The flag is 0 after reset; when we launch the erisc.cc app FW, we write 1 to it.
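For illustration, the flag check discussed in this thread can be sketched as below. This is a minimal, self-contained sketch and not the code in this PR: FakeEthL1, kLaunchFlagAddr, and erisc_app_flag_is_set are hypothetical names, the address value is a placeholder, and the in-memory map only stands in for a real read of ethernet-core L1.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for ethernet-core L1. The real code reads device memory.
struct FakeEthL1 {
    std::unordered_map<std::uint32_t, std::uint32_t> words;

    // Addresses that were never written read back as 0. This models only the
    // launch flag being cleared on reset, not the contents of the rest of L1.
    std::uint32_t read32(std::uint32_t addr) const {
        auto it = words.find(addr);
        return it == words.end() ? 0u : it->second;
    }

    void write32(std::uint32_t addr, std::uint32_t value) { words[addr] = value; }
};

// Placeholder address (assumption). In the PR the address is taken from
// fw_launch_addr in the HAL JIT build config.
constexpr std::uint32_t kLaunchFlagAddr = 0x1000;

// The erisc app FW is treated as live when the launch flag is nonzero:
// the flag is 0 after reset and the app FW writes 1 to it when launched.
bool erisc_app_flag_is_set(const FakeEthL1& l1) {
    return l1.read32(kLaunchFlagAddr) != 0;
}
```

In this sketch a freshly constructed FakeEthL1 plays the role of a core that has just been reset, and calling write32(kLaunchFlagAddr, 1) plays the role of the app FW launching.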
tt_metal/impl/device/device.cpp
// We should be able to remove the check for WORMHOLE_B0 once we assert eth cores for BH
if ((is_cooperative_eth and erisc_app_still_running(virtual_core) or
(not is_cooperative_eth and kernel_still_running(launch_msg, go_signal)))) {
I don't think we need to do any of this logic for BH eth cores
I removed the logic here. Added a TODO to add BH eth cores to the assert reset in this function.
ea1464a to eb29cdd
Ticket
Link to Github Issue
Problem description
Today, we're assuming that tt-smi will randomize L1 for ethernet cores, so generally if we read that the launch msg isn't RUN_MSG_GO, it should tell us that the machine has been reset and that it is safe to assume erisc app FW is not live. But tt-smi doesn't guarantee randomization for future systems, and reboot does not do anything to L1, so I want to move the "just reset/reboot" check to read a flag from eth L1. On WH we can use the LAUNCH_ERISC_APP_FLAG, which is always set to zero on reset.
What's changed
Change reset_cores to check the erisc app flag, which is always cleared on reset/reboot.
Checklist