
Investigate why certain WH VMs are not resettable, go down, and are not coming back up #12626

Open
tt-rkim opened this issue Sep 13, 2024 · 33 comments
Labels: bug (Something isn't working), ci-bug (bugs found in CI), infra-ci (infrastructure and/or CI changes), machine-management, P1

Comments

tt-rkim commented Sep 13, 2024

We're seeing CI runs die on tt-smi reset, for example: https://github.com/tenstorrent/tt-metal/actions/runs/10836165941/job/30070633219
However, it seems to only affect WH.

From John H, it looks like DKMS is properly recompiling the module upon reboot and kernel update:

root@tt-metal-ci-vm-35:~# zgrep 'install ' /var/log/dpkg.log | sort | cut -f1,2,4 -d' ' | grep 'linux-image'
2024-09-12 16:23:47 linux-image-5.4.0-195-generic:amd64

...

root@tt-metal-ci-vm-35:~# ls -lah /lib/modules/*/updates/dkms/tenstorrent.ko
-rw-r--r-- 1 root root 62K Jul 26 06:37 /lib/modules/5.4.0-190-generic/updates/dkms/tenstorrent.ko
-rw-r--r-- 1 root root 62K Aug 10 06:31 /lib/modules/5.4.0-192-generic/updates/dkms/tenstorrent.ko
-rw-r--r-- 1 root root 62K Aug 23 06:22 /lib/modules/5.4.0-193-generic/updates/dkms/tenstorrent.ko
-rw-r--r-- 1 root root 62K Sep 12 16:27 /lib/modules/5.4.0-195-generic/updates/dkms/tenstorrent.ko
root@tt-metal-ci-vm-35:~#

There was a kernel update on Sep 12, and we can see that DKMS rebuilt the module for it properly.

This means the driver is likely not the problem. However, per @ttmchiou, reset is not working, and it specifically seems to be hitting WH machines.
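
For future checks, a quick way to confirm the DKMS module is built for every installed kernel (a minimal sketch, assuming the module is registered with DKMS under the name tenstorrent, matching the paths above):

# List DKMS build/install state for the tenstorrent module across kernels
dkms status tenstorrent

# Cross-check: every installed kernel should have a matching module file
ls -d /lib/modules/*/
ls -lah /lib/modules/*/updates/dkms/tenstorrent.ko

# Confirm the running kernel can resolve the module
uname -r && modinfo -n tenstorrent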

tt-rkim added the infra-ci, P1, ci-bug, and machine-management labels Sep 13, 2024
tt-rkim commented Sep 13, 2024

Some more info:

Quick sampling suggests this isn't host-specific: some VMs on the same host are still up.

Note that @TT-billteng says there could be multiple issues at play.

tt-rkim commented Sep 16, 2024

Checking out WH first.

ubuntu@tt-metal-dev-mcw-wh-31:~$ uptime
 15:51:47 up 36 days, 15:24,  1 user,  load average: 3.00, 3.01, 3.00

Although vm-84, which is on the same host, is dead. Probably a coincidence.

  • The job before the death job linked above was eager unit tests 1 N150:
[2024-09-13 13:39:27Z INFO MessageListener] _getMessagesTokenSource is already disposed.
[2024-09-13 13:39:27Z INFO JobDispatcher] Job request 4d4c4b9d-d48a-5354-9237-88dd16147fe4 processed succeed.
[2024-09-13 13:39:27Z INFO Terminal] WRITE LINE: 2024-09-13 13:39:27Z: Running job: fast-dispatch-unit-tests (wormhole_b0, N150) / eager unit tests 1 wormhole_b0 N150
[2024-09-13 13:39:27Z INFO JobDispatcher] Start renew job request 1033638 for job 5f473b7a-2640-5a10-54f6-3fe0565860c0.
[2024-09-13 13:39:27Z INFO JobDispatcher] Successfully renew job request 1033638, job is valid till 09/13/2024 13:49:27
[2024-09-13 13:39:27Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/actions-runner/bin.2.319.1'

tt-rkim commented Sep 16, 2024

We seem a little more stable now. Downgrading to P2.

tt-rkim added P2 and removed P1 labels Sep 16, 2024
TT-billteng commented:

Filed a cloud issue to host-reboot the remaining machines that are still broken: https://github.com/tenstorrent/cloud/issues/3179

tt-rkim commented Sep 30, 2024

Next steps for this are to:

  • capture a potentially bad VM
  • host reboot to clear it
  • find a workload which ran before it died
  • stress test that workload on that VM

For example, tt-metal-ci-vm-83 started dying after this job: https://github.com/tenstorrent/tt-metal/actions/runs/10995699152/job/30527901858

on the reset step on subsequent jobs: https://github.com/tenstorrent/tt-metal/actions/runs/10996802864/job/30531878948

tt-rkim commented Sep 30, 2024

another example vm-46: https://github.com/tenstorrent/tt-metal/actions/runs/11110261767/job/30868260202

Need to find what was running before.

Similar case as above, but annotations may not help too much. Logs are available in the above job, but not in the lower one.

tt-rkim commented Oct 2, 2024

Taking out VMs on f13cs07 as it keeps dying: https://github.com/tenstorrent/tt-metal/actions/runs/11140410581

The symptoms are a little different from before. VMs seem to just die in the middle of a run, seemingly not doing anything besides downloading artifacts.

tt-rkim commented Oct 2, 2024

@ttmchiou @bkeith-TT @TT-billteng I merged an initial patch to the data pipeline to surface errors at the "Set up runner" level, which should capture what we're seeing right now.

I've requested data team to help edit the dataset so we can see: https://tenstorrent.atlassian.net/browse/DATA-269

tt-rkim commented Oct 3, 2024

Still waiting on data team to get back to our ticket.

Talking with @TTDRosen , we have some new leads on this. I described the symptoms to him, which were:

  • all kinds of tests seem to run successfully on a machine during a CI job.
  • we do not check the health of the card post-job.
  • when the runner picks up a new job, it fails on the tt-smi reset portion of the startup reset script for the job, and after trying for a while it craps out.
  • the VM then shuts off

He proposed that since this has been happening on VMs specifically, it could be instability with the iommu. The theory is that the iommu is overloaded or gets into some bad state. The way it gets into this state isn't clear, but it's possible, as the bus used to communicate with the device via the iommu was described as "sensitive". If it gets into this bad state, the iommu could sever the connection between the host and device, invalidating the memory mapping. If there is no longer a proper memory mapping between the device and host, and either side keeps trying to communicate with the other via the iommu, the iommu will issue a fault. An iommu fault could cause KVM to error out and crash the VM.

To verify this, we need to check the KVM logs for a machine that died this way. I will:

  • open an issue on cloud team
  • send them some examples by looking at VMs that recently restarted via the cloud restart script that @teijo set up for us
  • find the corresponding jobs by trying to match timestamps with jobs recorded on the CI dashboard for those specific machines
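
A sketch of what that host-side check could look like (the host name, time window, and instance name are illustrative, and the exact log locations may differ on our OpenStack hosts):

# On the KVM host (e.g. e12cs03): look for IOMMU/VFIO faults around the time the VM died.
# AMD hosts report these as AMD-Vi / IO_PAGE_FAULT events in the kernel log.
sudo journalctl -k --since "2024-10-10 15:00" --until "2024-10-10 17:00" | grep -iE 'iommu|amd-vi|io_page_fault|vfio'

# The per-instance QEMU log may also record why the guest was torn down.
sudo tail -n 100 /var/log/libvirt/qemu/instance-XXXXXXXX.log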

tt-rkim commented Oct 3, 2024

cloud issue to investigate KVM: https://github.com/tenstorrent/cloud/issues/3270

cc: @TT-billteng @ttmchiou

tt-rkim commented Oct 11, 2024

[Screenshot: 2024-10-11 at 11:59:39 AM]

tt-metal-ci-vm-33 died around October 10, 12:12 Toronto time.

2 jobs before it died: https://github.com/tenstorrent/tt-metal/actions/runs/11276227480/job/31360585517
Job before it died (hung on first test): https://github.com/tenstorrent/tt-metal/actions/runs/11275420020/job/31360873527 (link 1)
Job where it died: https://github.com/tenstorrent/tt-metal/actions/runs/11276538312/job/31362256439 (link 2)

Confirmed via CI dashboard by looking for Infra error for tt-metal-ci-vm-33.

@asrinivasanTT and I tried looking at the KVM host logs for this VM on cloud host e12cs03. Besides some messages saying the host is trying a hard reboot on a VM, there isn't anything specific about PCIe devices or any other useful logs.

It could still be the iommu. One thing we could do is run the test in link 1 followed by a tt-smi reset, in a loop, and see how it goes.

tt-rkim commented Oct 11, 2024

Custom dispatch with one set of eager tests, smi reset, another set of eager tests, smi reset, loop forever

https://github.com/tenstorrent/tt-metal/actions/runs/11295895146

on vm-33. Hopefully it reproduces so it can help us.
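
Roughly, the loop does something like this (a hand-written sketch, not the actual workflow definition; the test scripts and the tt-smi reset invocation are placeholders for whatever the CI reset script actually runs):

# Alternate eager unit tests with tt-smi resets until something falls over.
i=0
while true; do
  i=$((i + 1))
  echo "=== iteration $i: $(date -u) ==="
  ./run_eager_unit_tests_set_1.sh || break   # placeholder for the first eager test set
  tt-smi -r 0 || break                       # placeholder reset invocation for card 0
  ./run_eager_unit_tests_set_2.sh || break   # placeholder for the second eager test set
  tt-smi -r 0 || break
done
echo "stopped at iteration $i"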

tt-rkim commented Oct 14, 2024

Looks like it died in the way we expected in about 7h. I'm gonna re-run to see if we can reproduce

I also kicked off a run with sleeps and lsofs in between to see if ARC un-screws itself + to check that nothing is actually running on the card when resetting
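
For reference, the kind of check that run does between tests and resets (a sketch; the device node path assumes the standard tt-kmd layout under /dev/tenstorrent):

# Give ARC a chance to settle, then confirm nothing still holds the device open before resetting.
sleep 30
if lsof /dev/tenstorrent/* 2>/dev/null; then
  echo "WARNING: device still in use before reset"
else
  echo "device idle, safe to reset"
fi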

tt-rkim commented Oct 14, 2024

I have also updated the reset scripts with some extra sleeps and logging to see if we can sniff out this issue earlier
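
The shape of the extra sleeps and logging is roughly this (a sketch, not the actual script; 1e52 is the Tenstorrent PCI vendor ID, and the reset command is a placeholder):

# Log PCIe link status for the Tenstorrent device before and after the reset,
# with sleeps in between, so a wedged link shows up in the job log.
log_link_status() {
  echo "[reset] $(date -u) link status:"
  sudo lspci -d 1e52: -vv | grep -i 'lnksta' || echo "[reset] no Tenstorrent device visible on the bus"
}

log_link_status
sleep 15
tt-smi -r 0            # placeholder for the actual reset command used by the script
sleep 15
log_link_status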

tt-rkim commented Oct 15, 2024

got a failure after 19 hours with some sleeps in between: https://github.com/tenstorrent/tt-metal/actions/runs/11331083259
got a failure after 6 hours with no sleeps: https://github.com/tenstorrent/tt-metal/actions/runs/11295895146/attempts/1

attempting again https://github.com/tenstorrent/tt-metal/actions/runs/11295895146/attempts/2

but this looks promising as a deterministic way to reproduce it.

repeating on GS as a control: https://github.com/tenstorrent/tt-metal/actions/runs/11347005175

tt-rkim commented Oct 16, 2024

We may have some more information.

vm-96 died this way, surfaced via ci/cd analytics dashboard: https://github.com/tenstorrent/tt-metal/actions/runs/11362041107/job/31604079159
Checking the time of death via Prometheus and uptime, we see the following error messages from journalctl:

ubuntu@tt-metal-ci-vm-96:~$ sudo journalctl -k -p err
-- Logs begin at Sun 2024-10-13 04:21:30 UTC, end at Wed 2024-10-16 15:09:33 UTC. --
Oct 16 09:06:32 tt-metal-ci-vm-96 kernel: pcieport 0000:00:03.1: pciehp: Failed to check link status
Oct 16 12:25:53 tt-metal-ci-vm-96 kernel: WEKAFS: Finished FE-0 state update with generation 1, err=0. Encountered 0 inodes, 0 open inodes. Time consumed: 0 ms.
Oct 16 13:35:58 tt-metal-ci-vm-96 kernel: WEKAFS: Finished FE-0 state update with generation 1, err=0. Encountered 0 inodes, 0 open inodes. Time consumed: 0 ms.

However, this doesn't seem to be the TT card, but some other port:

ubuntu@tt-metal-ci-vm-96:~$ lspci | grep 1e
07:00.0 Processing accelerators: Device 1e52:401e (rev 01)
ubuntu@tt-metal-ci-vm-96:~$ lspci | grep "03.1"
00:03.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
ubuntu@tt-metal-ci-vm-96:~$ uptime
 15:10:12 up  6:03,  1 user,  load average: 5.02, 4.85, 5.26
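
To map the complaining root port to whatever sits behind it (and double-check it really isn't the accelerator at 07:00.0), the PCI tree view is useful, roughly:

# Show the PCI topology as a tree; the device behind root port 00:03.1 (if any) shows up under it.
lspci -tv

# Alternatively, list the functions enumerated directly below that bridge via sysfs.
ls /sys/bus/pci/devices/0000:00:03.1/ | grep -E '^0000:'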

This has happened to vm-40 via tt-smi reset stress test: https://github.com/tenstorrent/tt-metal/actions/runs/11356031852/attempts/1

ubuntu@tt-metal-ci-vm-40:~$ sudo journalctl -k -p err
-- Logs begin at Sun 2024-10-13 04:21:10 UTC, end at Wed 2024-10-16 15:03:50 UTC. --
Oct 16 01:08:33 tt-metal-ci-vm-40 kernel: pcieport 0000:00:03.1: pciehp: Failed to check link status

@TTDRosen is in the loop

tt-rkim commented Oct 16, 2024

@TTDRosen will take a closer look, and we will grant syseng access to CI systems if they need. THANK YOU

tt-rkim commented Oct 16, 2024

https://github.com/tenstorrent/tt-metal/actions/runs/11356031852/attempts/2

Another tt-smi stress test, died in 20 minutes. Seems to follow the same symptom: VM dies, comes back up, underlying hypervisor BM is OK.
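
A quick way to confirm that pattern from the hypervisor side (a sketch; the instance name is illustrative):

# On the bare-metal host: uptime should be long even though the guest bounced.
uptime

# libvirt should show the instance running again after its restart, and the
# state reason can hint at whether it crashed or was destroyed.
sudo virsh list --all
sudo virsh domstate instance-XXXXXXXX --reason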

TTDRosen commented:

Talked a bit with Andrew and we need some more info. Can you provide the VM config? I'm specifically looking for the Guest OS, Host OS, hypervisor, and VM arguments (if using libvirt, this would be the XML).

tt-rkim commented Oct 16, 2024

Cross-repo issue: https://github.com/tenstorrent/cloud/issues/3319

tt-rkim commented Oct 16, 2024

Please check above for updates, will copy-paste here once we have a response

tt-rkim commented Oct 16, 2024

Otherwise, for Guest OS: Ubuntu 20.04

hmohiuddinTT commented:

Reposting comment from here: https://github.com/tenstorrent/cloud/issues/3319#issuecomment-2418413621

Host

Hostname: f13cs02
OS: Ubuntu 20.04.4
Kernel Version: 5.4.0-166-generic
TTKMD: 1.28

VM

Hostname: tt-metal-ci-vm-49
OS: Ubuntu 20.04.4
TTKMD: 1.28

XML

<domain type='kvm' id='51'>
  <name>instance-00001621</name>
  <uuid>7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="22.3.1"/>
      <nova:name>tt-metal-ci-vm-49</nova:name>
      <nova:creationTime>2024-10-16 15:48:09</nova:creationTime>
      <nova:flavor name="1WH-14vCPU-48GB_RAM-200GB_DISK-N2">
        <nova:memory>49152</nova:memory>
        <nova:disk>200</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>14</nova:vcpus>
      </nova:flavor>
      <nova:owner>
        <nova:user uuid="da69cc4ff9844e9293e52bdafa463532">cloud_production</nova:user>
        <nova:project uuid="ff377f0073de4486a37477f266767054">cloud_production</nova:project>
      </nova:owner>
      <nova:root type="image" uuid="5784d96c-eff5-46a7-aa50-d11bf4f23a2b"/>
    </nova:instance>
  </metadata>
  <memory unit='KiB'>50331648</memory>
  <currentMemory unit='KiB'>50331648</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB' nodeset='0'/>
      <page size='1048576' unit='KiB' nodeset='1'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>14</vcpu>
  <cputune>
    <shares>14336</shares>
    <vcpupin vcpu='0' cpuset='11'/>
    <vcpupin vcpu='1' cpuset='59'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='51'/>
    <vcpupin vcpu='4' cpuset='2'/>
    <vcpupin vcpu='5' cpuset='50'/>
    <vcpupin vcpu='6' cpuset='21'/>
    <vcpupin vcpu='7' cpuset='35'/>
    <vcpupin vcpu='8' cpuset='83'/>
    <vcpupin vcpu='9' cpuset='46'/>
    <vcpupin vcpu='10' cpuset='94'/>
    <vcpupin vcpu='11' cpuset='26'/>
    <vcpupin vcpu='12' cpuset='74'/>
    <vcpupin vcpu='13' cpuset='44'/>
    <emulatorpin cpuset='2-3,11,21,26,35,44,46,50-51,59,74,83,94'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memnode cellid='1' mode='strict' nodeset='1'/>
  </numatune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>OpenStack Foundation</entry>
      <entry name='product'>OpenStack Nova</entry>
      <entry name='version'>22.3.1</entry>
      <entry name='serial'>7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c</entry>
      <entry name='uuid'>7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c</entry>
      <entry name='family'>Virtual Machine</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
    <boot dev='hd'/>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-Rome</model>
    <topology sockets='14' cores='1' threads='1'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='xsaves'/>
    <feature policy='require' name='topoext'/>
    <numa>
      <cell id='0' cpus='0-6' memory='25165824' unit='KiB' memAccess='shared'/>
      <cell id='1' cpus='7-13' memory='25165824' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
  <clock offset='utc'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' discard='unmap'/>
      <source file='/var/lib/nova/instances/7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c/disk' index='2'/>
      <backingStore type='file' index='3'>
        <format type='raw'/>
        <source file='/var/lib/nova/instances/_base/ad559263c98227e4642981b2511e8b32c6950da9'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw' cache='none' discard='unmap'/>
      <source file='/var/lib/nova/instances/7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c/disk.config' index='1'/>
      <backingStore/>
      <target dev='sda' bus='sata'/>
      <readonly/>
      <alias name='sata0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x18'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x19'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
    </controller>
    <interface type='hostdev' managed='yes'>
      <mac address='fa:16:3e:72:26:03'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0xc2' slot='0x02' function='0x7'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </interface>
    <interface type='hostdev' managed='yes'>
      <mac address='fa:16:3e:ed:7f:8f'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0xc2' slot='0x02' function='0x6'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </interface>
    <interface type='hostdev' managed='yes'>
      <mac address='fa:16:3e:94:ac:64'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0xc2' slot='0x02' function='0x5'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </interface>
    <interface type='hostdev' managed='yes'>
      <mac address='fa:16:3e:88:ba:d1'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0xc2' slot='0x02' function='0x4'/>
      </source>
      <alias name='hostdev3'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/2'/>
      <log file='/var/lib/nova/instances/7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c/console.log' append='off'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <log file='/var/lib/nova/instances/7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c/console.log' append='off'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='vnc' port='5903' autoport='yes' listen='172.27.28.103'>
      <listen type='address' address='172.27.28.103'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0xc1' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev4'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <stats period='10'/>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='rng0'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c</label>
    <imagelabel>libvirt-7c1d7eba-aef6-4c70-9f57-4f9b3fdb312c</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+64055:+108</label>
    <imagelabel>+64055:+108</imagelabel>
  </seclabel>
</domain>

tt-rkim commented Oct 28, 2024

Syseng ticket: https://tenstorrent.atlassian.net/browse/SYS-950

We have another issue where boards may fail to power back up. This example happened during a test: https://github.com/tenstorrent/cloud/issues/3323#issuecomment-2427039533

From Yuqing: it's known to us that this will cause a power stage failure (v80.10.0.0 FW).

cc: @TT-billteng @ttmchiou wtfffffff

tt-rkim commented Oct 28, 2024

@TT-billteng suggested we tt-mod a board to 500 MHz to help rule out high-clock-speed di/dt: https://github.com/tenstorrent/tt-metal/actions/runs/11561350277

custom test dispatch to test the above^

tt-rkim commented Oct 28, 2024

bkeith-TT commented:

This is the root issue with steps to reproduce: tenstorrent/tt-smi#54

bkeith-TT commented Nov 19, 2024

2024-11-18: (as mentioned by RK/MC in linked ticket above) Looks like the FW update might have fixed things. Aiming for a midweek deploy to try to resolve this issue.

tt-rkim commented Nov 19, 2024

No, unfortunately, in that ticket we're referring to the mistral7b-blowing-up-boards problem. This issue still exists.

The SMI issue is still being looked at by @TTDRosen and team. We've confirmed that they should be able to reproduce on a VM similar to the cloud setup.

smehtaTT assigned TTDRosen and unassigned ttmchiou Nov 26, 2024
smehtaTT commented:

Hi @TTDRosen - can you help unblock the Metal team with this? Thanks.

TTDRosen commented:

The issue has been assigned; I think the priority may have got lost in the GitHub -> Jira friction. I bumped the priority again, and it should hit the top of the stack next week.

prajaramanTT commented:

@tt-rkim @TTDRosen Is this still an open issue? If not, can you please close it? Thanks.

tt-rkim commented Jan 8, 2025

We still see symptoms of this.

However, we recently have not been able to reproduce the symptom, which indicates a SW stack problem.

This hasn't been as big of an issue recently. We will keep this open.
