Investigate why certain WH VMs are not resettable, go down, and are not coming back up #12626
Some more info: quick sampling suggests this isn't host-specific; some VMs on the same host are still up. Note that @TT-billteng says there could be multiple issues at play.
Checking out WH first.
Although vm-84, which is on the same host, is also dead. Probably a coincidence.
We seem a little more stable now. Downgrading to P2.
Filed a cloud issue to host-reboot the remaining machines that are still broken: https://github.com/tenstorrent/cloud/issues/3179
Next step for this is to
For example, on the reset step of subsequent jobs: https://github.com/tenstorrent/tt-metal/actions/runs/10996802864/job/30531878948
Another example, vm-46: https://github.com/tenstorrent/tt-metal/actions/runs/11110261767/job/30868260202. We need to find what was running before. Similar case to the one above, but annotations may not help much. Logs are available in the job above, but not in the lower one.
Taking out the VMs on f13cs07 as it keeps dying: https://github.com/tenstorrent/tt-metal/actions/runs/11140410581. The symptoms are a little different from before: the VMs seem to just die in the middle of a run, seemingly while doing nothing besides downloading artifacts.
@ttmchiou @bkeith-TT @TT-billteng I merged an initial patch to the data pipeline to surface errors at the "Set up runner" level, which should capture what we're seeing right now. I've asked the data team to help edit the dataset so we can see these errors: https://tenstorrent.atlassian.net/browse/DATA-269
Still waiting on the data team to get back to our ticket. After talking with @TTDRosen, we have some new leads on this. I described the symptoms to him, which were:
He proposed that, since this has been happening on VMs specifically, it could be more instability with the IOMMU. The theory is that the IOMMU gets overloaded or ends up in some bad state. How it gets into this state isn't clear, but it's plausible, since the bus used to communicate with the device via the IOMMU was described as "sensitive". In that bad state, the IOMMU could sever the connection between the host and the device, invalidating the memory mapping. If there is no longer a proper memory mapping between the device and the host, and either side keeps trying to communicate with the other through the IOMMU, the IOMMU will issue a fault. An IOMMU fault could cause KVM to error out and crash the VM. To verify this, we need to check the KVM host logs for a machine that died this way; a sketch of what to look for is below.
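A minimal sketch of that log check, assuming a Linux KVM host where `journalctl -k` is available; the fault-marker strings are common Intel VT-d / AMD-Vi kernel messages, not output confirmed from our hosts:

```python
#!/usr/bin/env python3
# Hypothetical sketch: scan a KVM host's kernel log for IOMMU fault signatures.
# Run on the hypervisor host (the machine hosting the dead VM), not inside the guest.
import subprocess

# Strings the kernel commonly emits on IOMMU/DMAR faults (Intel VT-d and AMD-Vi).
FAULT_MARKERS = [
    "DMAR: DRHD: handling fault",
    "AMD-Vi: Event logged [IO_PAGE_FAULT",
]

def scan_host_log():
    """Return kernel-log lines that look like IOMMU faults."""
    out = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True, text=True, check=False,
    ).stdout
    return [line for line in out.splitlines()
            if any(marker in line for marker in FAULT_MARKERS)]

if __name__ == "__main__":
    hits = scan_host_log()
    print(f"{len(hits)} suspicious kernel-log lines")
    for line in hits[-20:]:  # show the most recent matches
        print(line)
```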
Cloud issue to investigate KVM: https://github.com/tenstorrent/cloud/issues/3270
Two jobs before it died: https://github.com/tenstorrent/tt-metal/actions/runs/11276227480/job/31360585517. Confirmed via the CI dashboard by looking for an infra error on tt-metal-ci-vm-33. @asrinivasanTT and I tried looking at the KVM host logs for this VM on cloud host e12cs03. Besides some messages saying the host is trying a hard reboot of a VM, there's nothing specific about PCIe devices or any other useful logs. It could still be the IOMMU. One thing we could do is run the test in link 1, followed by a tt-smi reset, in a loop, and see how it goes.
Custom dispatch with one set of eager tests, an smi reset, another set of eager tests, another smi reset, looping forever, on vm-33: https://github.com/tenstorrent/tt-metal/actions/runs/11295895146. Hopefully it reproduces the issue so it can help us.
Looks like it died in the way we expected, in about 7 hours. I'm going to re-run to see if we can reproduce it. I also kicked off a run with sleeps and lsofs in between, to see if ARC un-screws itself and to check that nothing is actually running on the card when resetting.
I have also updated the reset scripts with some extra sleeps and logging to see if we can sniff out this issue earlier; a rough sketch of the loop is below.
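For reference, a rough sketch of the kind of loop being run here, with the extra sleeps, logging, and lsof checks. The test command, the `tt-smi -r 0` reset invocation, the `/dev/tenstorrent/0` device path, and the sleep durations are placeholders/assumptions, not the actual CI scripts:

```python
#!/usr/bin/env python3
# Hypothetical stress-loop sketch: alternate eager test runs with tt-smi resets,
# with sleeps and logging so a hang or fault can be spotted earlier.
import datetime
import subprocess
import time

TEST_CMD = ["pytest", "tests/ttnn/unit_tests"]  # placeholder for the eager test suite
RESET_CMD = ["tt-smi", "-r", "0"]               # assumed tt-smi reset invocation
SLEEP_SECONDS = 30                              # extra settle time between steps

def log(msg: str) -> None:
    print(f"[{datetime.datetime.now().isoformat()}] {msg}", flush=True)

iteration = 0
while True:
    iteration += 1
    log(f"iteration {iteration}: running eager tests")
    subprocess.run(TEST_CMD, check=False)

    log("checking that nothing still holds the device before resetting")
    subprocess.run(["lsof", "/dev/tenstorrent/0"], check=False)  # assumed device node

    time.sleep(SLEEP_SECONDS)
    log("resetting the card")
    reset = subprocess.run(RESET_CMD, check=False)
    if reset.returncode != 0:
        log(f"reset failed with exit code {reset.returncode}; likely the bad state")
        break
    time.sleep(SLEEP_SECONDS)
```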
Got a failure after 19 hours with some sleeps in between: https://github.com/tenstorrent/tt-metal/actions/runs/11331083259. Attempting again: https://github.com/tenstorrent/tt-metal/actions/runs/11295895146/attempts/2. This looks promising as a deterministic way to reproduce. Repeating on GS as a control: https://github.com/tenstorrent/tt-metal/actions/runs/11347005175
We may have some more information. vm-96 died this way, surfaced via the CI/CD analytics dashboard: https://github.com/tenstorrent/tt-metal/actions/runs/11362041107/job/31604079159
However, this doesn't seem to be the TT card; it's some other port.
This has also happened to vm-40 via the tt-smi reset stress test: https://github.com/tenstorrent/tt-metal/actions/runs/11356031852/attempts/1
@TTDRosen is in the loop.
@TTDRosen will take a closer look, and we will grant syseng access to CI systems if they need it. THANK YOU
Another tt-smi stress test died in 20 minutes: https://github.com/tenstorrent/tt-metal/actions/runs/11356031852/attempts/2. It seems to follow the same symptom: the VM dies, comes back up, and the underlying hypervisor bare-metal host is OK.
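As a side note, one quick way to sanity-check that symptom is to compare boot times on the guest and the hypervisor. A minimal sketch, assuming passwordless SSH from the guest (or a jump box) to the KVM host; the hostname is a placeholder:

```python
#!/usr/bin/env python3
# Hypothetical check for the "VM dies and comes back, host stays up" symptom:
# compare the guest's boot time against the hypervisor host's boot time.
import subprocess

KVM_HOST = "e12cs03"  # placeholder hypervisor hostname

guest_boot = subprocess.run(["uptime", "-s"], capture_output=True, text=True).stdout.strip()
host_boot = subprocess.run(["ssh", KVM_HOST, "uptime", "-s"],
                           capture_output=True, text=True).stdout.strip()

print(f"guest boot time: {guest_boot}")
print(f"host boot time:  {host_boot}")
if guest_boot > host_boot:
    print("guest rebooted more recently than the host, matching the symptom")
```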
Talked a bit with Andrew and we need some more info. Can you provide the VM config? I'm specifically looking for the guest OS, host OS, hypervisor, and VM arguments (if using libvirt, this would be the XML).
Cross-repo issue: https://github.com/tenstorrent/cloud/issues/3319
Please check the issue above for updates; I will copy-paste here once we have a response.
Otherwise, for the guest OS: Ubuntu 20.04
Reposting the comment from here: https://github.com/tenstorrent/cloud/issues/3319#issuecomment-2418413621
Host hostname:
VM hostname:
XML:
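For completeness, a minimal sketch of how that information could be gathered on the KVM host, assuming libvirt/virsh is in use; the domain name is a placeholder:

```python
#!/usr/bin/env python3
# Hypothetical sketch: collect host OS, hypervisor version, and libvirt domain XML.
# Run on the KVM host; the guest OS can be read from /etc/os-release inside the VM.
import subprocess

DOMAIN = "tt-metal-ci-vm-33"  # placeholder libvirt domain name

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout

print("--- Host OS ---")
print(run(["cat", "/etc/os-release"]))
print("--- Hypervisor / libvirt version ---")
print(run(["virsh", "version"]))
print("--- Domain XML (VM arguments) ---")
print(run(["virsh", "dumpxml", DOMAIN]))
```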
Syseng ticket: https://tenstorrent.atlassian.net/browse/SYS-950. We have another issue where boards may fail to power back up. This example happened during a test: https://github.com/tenstorrent/cloud/issues/3323#issuecomment-2427039533. From Yuqing: it's known to us that this will cause a power stage failure (v80.10.0.0 FW). cc: @TT-billteng @ttmchiou wtfffffff
@TT-billteng suggested tt-modding a board down to 500 MHz to help rule out high-clock-speed di/dt. Custom test dispatch to test the above: https://github.com/tenstorrent/tt-metal/actions/runs/11561350277
She crashed! Attempt 2: https://github.com/tenstorrent/tt-metal/actions/runs/11561907008
This is the root issue, with steps to reproduce: tenstorrent/tt-smi#54
2024-11-18: (as mentioned by RK/MC in the linked ticket above) It looks like the FW update might have fixed things. Aiming for a midweek deploy to try to resolve this issue.
No, unfortunately, in that ticket we're referring to the mistral7b-blowing-up-boards problem; this issue still exists. The SMI issue is still being looked at by @TTDRosen and team. We've confirmed that they should be able to reproduce it on a VM similar to the cloud setup.
Hi @TTDRosen, can you help unblock the Metal team with this? Thanks.
The issue has been assigned; I think the priority got lost in the GitHub -> Jira friction. I bumped the priority again, and it should hit the top of the stack next week.
We still see symptoms of this. However, we recently have not been able to reproduce the symptom, which suggests a SW stack problem. This hasn't been as big of an issue lately; we will keep this open.
We see CI runs die on tt-smi reset, for example: https://github.com/tenstorrent/tt-metal/actions/runs/10836165941/job/30070633219
However, it seems to only affect WH.
From John H, it looks like DKMS is properly recompiling the module upon reboot and kernel update.
There was a kernel update on Sep 12, and we can see that DKMS handled it properly; a rough way to verify this is sketched below.
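A minimal sketch of that check, assuming the driver is registered with DKMS under the module name "tenstorrent":

```python
#!/usr/bin/env python3
# Hypothetical sketch: confirm DKMS rebuilt the Tenstorrent kernel module
# for the currently running kernel.
import subprocess

kernel = subprocess.run(["uname", "-r"], capture_output=True, text=True).stdout.strip()
status = subprocess.run(["dkms", "status"], capture_output=True, text=True).stdout

rebuilt = any(
    "tenstorrent" in line and kernel in line and "installed" in line
    for line in status.splitlines()
)
print(f"running kernel: {kernel}")
print(status)
print(f"module installed for running kernel: {rebuilt}")
```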
This means the driver is likely not the problem. However, per @ttmchiou, it does seem that reset is not working, and it specifically seems to be hitting WH machines.