Investigate why certain WH VMs are not resettable, go down, and are not coming back up #12626
Some more info: quick sampling suggests this isn't host-specific; some VMs on the same host are still up. Note that @TT-billteng says there could be multiple issues at play.
Checking out WH first.
Although vm-84, which is on the same host, is also dead. Probably a coincidence.
We seem a little more stable now. Downgrading to P2.
Filed a cloud issue to host-reboot the remaining machines that are still broken: https://github.com/tenstorrent/cloud/issues/3179
Next step for this is to
For example, on the reset step of subsequent jobs: https://github.com/tenstorrent/tt-metal/actions/runs/10996802864/job/30531878948
Another example, vm-46: https://github.com/tenstorrent/tt-metal/actions/runs/11110261767/job/30868260202. We need to find what was running before. Similar case to the one above, but annotations may not help much. Logs are available in the job above, but not in the lower one.
Taking out the VMs on f13cs07 as it keeps dying: https://github.com/tenstorrent/tt-metal/actions/runs/11140410581. The symptoms are a little different from before: the VMs seem to just die in the middle of a run, seemingly while doing nothing besides downloading artifacts.
@ttmchiou @bkeith-TT @TT-billteng I merged an initial patch to the data pipeline to surface errors at the "Set up runner" level, which should capture what we're seeing right now. I've asked the data team to help edit the dataset so we can see these errors: https://tenstorrent.atlassian.net/browse/DATA-269
Still waiting on the data team to get back to our ticket. After talking with @TTDRosen, we have some new leads on this. I described the symptoms to him, which were:
He proposed that, since this has been happening on VMs specifically, it could be more instability with the IOMMU. The theory is that the IOMMU gets overloaded or ends up in some bad state. How it gets into this state isn't clear, but it's plausible, since the bus used to communicate with the device via the IOMMU was described as "sensitive". In that bad state, the IOMMU could sever the connection between the host and the device, invalidating the memory mapping. If there is no longer a proper memory mapping between the device and the host, and either side keeps trying to communicate with the other through the IOMMU, the IOMMU will issue a fault. An IOMMU fault could cause KVM to error out and crash the VM. To verify this, we need to check the KVM host logs for a machine that died this way; a sketch of what to look for is below.
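A minimal sketch of that log check, assuming a Linux KVM host where `journalctl -k` is available; the fault-marker strings are common Intel VT-d / AMD-Vi kernel messages, not output confirmed from our hosts:

```python
#!/usr/bin/env python3
# Hypothetical sketch: scan a KVM host's kernel log for IOMMU fault signatures.
# Run on the hypervisor host (the machine hosting the dead VM), not inside the guest.
import subprocess

# Strings the kernel commonly emits on IOMMU/DMAR faults (Intel VT-d and AMD-Vi).
FAULT_MARKERS = [
    "DMAR: DRHD: handling fault",
    "AMD-Vi: Event logged [IO_PAGE_FAULT",
]

def scan_host_log():
    """Return kernel-log lines that look like IOMMU faults."""
    out = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True, text=True, check=False,
    ).stdout
    return [line for line in out.splitlines()
            if any(marker in line for marker in FAULT_MARKERS)]

if __name__ == "__main__":
    hits = scan_host_log()
    print(f"{len(hits)} suspicious kernel-log lines")
    for line in hits[-20:]:  # show the most recent matches
        print(line)
```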
Cloud issue to investigate KVM: https://github.com/tenstorrent/cloud/issues/3270
Two jobs before it died: https://github.com/tenstorrent/tt-metal/actions/runs/11276227480/job/31360585517. Confirmed via the CI dashboard by looking for an infra error on tt-metal-ci-vm-33. @asrinivasanTT and I tried looking at the KVM host logs for this VM on cloud host e12cs03. Besides some messages saying the host is trying a hard reboot of a VM, there's nothing specific about PCIe devices or any other useful logs. It could still be the IOMMU. One thing we could do is run the test in link 1, followed by a tt-smi reset, in a loop, and see how it goes.
Custom dispatch with one set of eager tests, an smi reset, another set of eager tests, another smi reset, looping forever, on vm-33: https://github.com/tenstorrent/tt-metal/actions/runs/11295895146. Hopefully it reproduces the issue so it can help us.
Looks like it died in the way we expected, in about 7 hours. I'm going to re-run to see if we can reproduce it. I also kicked off a run with sleeps and lsofs in between, to see if ARC un-screws itself and to check that nothing is actually running on the card when resetting.
I have also updated the reset scripts with some extra sleeps and logging to see if we can sniff out this issue earlier; a rough sketch of the loop is below.
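For reference, a rough sketch of the kind of loop being run here, with the extra sleeps, logging, and lsof checks. The test command, the `tt-smi -r 0` reset invocation, the `/dev/tenstorrent/0` device path, and the sleep durations are placeholders/assumptions, not the actual CI scripts:

```python
#!/usr/bin/env python3
# Hypothetical stress-loop sketch: alternate eager test runs with tt-smi resets,
# with sleeps and logging so a hang or fault can be spotted earlier.
import datetime
import subprocess
import time

TEST_CMD = ["pytest", "tests/ttnn/unit_tests"]  # placeholder for the eager test suite
RESET_CMD = ["tt-smi", "-r", "0"]               # assumed tt-smi reset invocation
SLEEP_SECONDS = 30                              # extra settle time between steps

def log(msg: str) -> None:
    print(f"[{datetime.datetime.now().isoformat()}] {msg}", flush=True)

iteration = 0
while True:
    iteration += 1
    log(f"iteration {iteration}: running eager tests")
    subprocess.run(TEST_CMD, check=False)

    log("checking that nothing still holds the device before resetting")
    subprocess.run(["lsof", "/dev/tenstorrent/0"], check=False)  # assumed device node

    time.sleep(SLEEP_SECONDS)
    log("resetting the card")
    reset = subprocess.run(RESET_CMD, check=False)
    if reset.returncode != 0:
        log(f"reset failed with exit code {reset.returncode}; likely the bad state")
        break
    time.sleep(SLEEP_SECONDS)
```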
Got a failure after 19 hours with some sleeps in between: https://github.com/tenstorrent/tt-metal/actions/runs/11331083259. Attempting again: https://github.com/tenstorrent/tt-metal/actions/runs/11295895146/attempts/2. This looks promising as a deterministic way to reproduce. Repeating on GS as a control: https://github.com/tenstorrent/tt-metal/actions/runs/11347005175
We may have some more information. vm-96 died this way, surfaced via the CI/CD analytics dashboard: https://github.com/tenstorrent/tt-metal/actions/runs/11362041107/job/31604079159
However, this doesn't seem to be the TT card; it's some other port.
This has also happened to vm-40 via the tt-smi reset stress test: https://github.com/tenstorrent/tt-metal/actions/runs/11356031852/attempts/1
@TTDRosen is in the loop.
@TTDRosen will take a closer look, and we will grant syseng access to CI systems if they need it. THANK YOU
Another tt-smi stress test died in 20 minutes: https://github.com/tenstorrent/tt-metal/actions/runs/11356031852/attempts/2. It seems to follow the same symptom: the VM dies, comes back up, and the underlying hypervisor bare-metal host is OK.
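As a side note, one quick way to sanity-check that symptom is to compare boot times on the guest and the hypervisor. A minimal sketch, assuming passwordless SSH from the guest (or a jump box) to the KVM host; the hostname is a placeholder:

```python
#!/usr/bin/env python3
# Hypothetical check for the "VM dies and comes back, host stays up" symptom:
# compare the guest's boot time against the hypervisor host's boot time.
import subprocess

KVM_HOST = "e12cs03"  # placeholder hypervisor hostname

guest_boot = subprocess.run(["uptime", "-s"], capture_output=True, text=True).stdout.strip()
host_boot = subprocess.run(["ssh", KVM_HOST, "uptime", "-s"],
                           capture_output=True, text=True).stdout.strip()

print(f"guest boot time: {guest_boot}")
print(f"host boot time:  {host_boot}")
if guest_boot > host_boot:
    print("guest rebooted more recently than the host, matching the symptom")
```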
Talked a bit with Andrew and we need some more info. Can you provide the VM config? I'm specifically looking for the guest OS, host OS, hypervisor, and VM arguments (if using libvirt, this would be the XML).
Cross-repo issue: https://github.com/tenstorrent/cloud/issues/3319
Please check the issue above for updates; I will copy-paste here once we have a response.
Otherwise, for the guest OS: Ubuntu 20.04
Reposting the comment from here: https://github.com/tenstorrent/cloud/issues/3319#issuecomment-2418413621
Host hostname:
VM hostname:
XML:
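For completeness, a minimal sketch of how that information could be gathered on the KVM host, assuming libvirt/virsh is in use; the domain name is a placeholder:

```python
#!/usr/bin/env python3
# Hypothetical sketch: collect host OS, hypervisor version, and libvirt domain XML.
# Run on the KVM host; the guest OS can be read from /etc/os-release inside the VM.
import subprocess

DOMAIN = "tt-metal-ci-vm-33"  # placeholder libvirt domain name

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout

print("--- Host OS ---")
print(run(["cat", "/etc/os-release"]))
print("--- Hypervisor / libvirt version ---")
print(run(["virsh", "version"]))
print("--- Domain XML (VM arguments) ---")
print(run(["virsh", "dumpxml", DOMAIN]))
```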
Syseng ticket: https://tenstorrent.atlassian.net/browse/SYS-950. We have another issue where boards may fail to power back up. This example happened during a test: https://github.com/tenstorrent/cloud/issues/3323#issuecomment-2427039533. From Yuqing: it's known to us that this will cause a power stage failure (v80.10.0.0 FW). cc: @TT-billteng @ttmchiou wtfffffff
@TT-billteng suggested tt-modding a board down to 500 MHz to help rule out high-clock-speed di/dt. Custom test dispatch to test the above: https://github.com/tenstorrent/tt-metal/actions/runs/11561350277
She crashed! Attempt 2: https://github.com/tenstorrent/tt-metal/actions/runs/11561907008
This is the root issue, with steps to reproduce: tenstorrent/tt-smi#54
2024-11-18: (as mentioned by RK/MC in the linked ticket above) It looks like the FW update might have fixed things. Aiming for a midweek deploy to try to resolve this issue.
No, unfortunately, in that ticket we're referring to the mistral7b-blowing-up-boards problem; this issue still exists. The SMI issue is still being looked at by @TTDRosen and team. We've confirmed that they should be able to reproduce it on a VM similar to the cloud setup.
Hi @TTDRosen, can you help unblock the Metal team with this? Thanks.
The issue has been assigned; I think the priority got lost in the GitHub -> Jira friction. I bumped the priority again, and it should hit the top of the stack next week.
We still see symptoms of this. However, we recently have not been able to reproduce the symptom, which suggests a SW stack problem. This hasn't been as big of an issue lately; we will keep this open.
We see CI runs die on tt-smi reset, for example: https://github.com/tenstorrent/tt-metal/actions/runs/10836165941/job/30070633219
However, it seems to only affect WH.
From John H, it looks like DKMS is properly recompiling the module upon reboot and kernel update.
There was a kernel update on Sep 12, and we can see that DKMS handled it properly; a rough way to verify this is sketched below.
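A minimal sketch of that check, assuming the driver is registered with DKMS under the module name "tenstorrent":

```python
#!/usr/bin/env python3
# Hypothetical sketch: confirm DKMS rebuilt the Tenstorrent kernel module
# for the currently running kernel.
import subprocess

kernel = subprocess.run(["uname", "-r"], capture_output=True, text=True).stdout.strip()
status = subprocess.run(["dkms", "status"], capture_output=True, text=True).stdout

rebuilt = any(
    "tenstorrent" in line and kernel in line and "installed" in line
    for line in status.splitlines()
)
print(f"running kernel: {kernel}")
print(status)
print(f"module installed for running kernel: {rebuilt}")
```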
This means the driver is likely not the problem. However, per @ttmchiou, it does seem that reset is not working, and it specifically seems to be hitting WH machines.