Upgrade WH fleet to 80.13.0.0, update on INSTALLING.md, and re-enable mistral7b test #15174

Open
tt-rkim opened this issue Nov 18, 2024 · 10 comments
Labels
infra-ci infrastructure and/or CI changes LLM_bug machine-management P2

Comments

@tt-rkim
Collaborator

tt-rkim commented Nov 18, 2024

As a result of #14440 / #12626

A small pool has already been upgraded: #14440 (comment)

@tt-rkim tt-rkim changed the title Upgrade WH fleet to 80.13.0.0 and update on INSTALLING.md Upgrade WH fleet to 80.13.0.0, update on INSTALLING.md, and re-enable mistral7b test Nov 18, 2024
@tt-rkim tt-rkim added infra-ci infrastructure and/or CI changes P2 machine-management LLM_bug labels Nov 18, 2024
@ttmchiou
Contributor

ttmchiou commented Nov 18, 2024

I haven’t observed any new issues with the new fw yet.
Seems it's been stable for 2 weeks now?
Should we upgrade the whole fleet now?
@tt-rkim

@tt-rkim
Collaborator Author

tt-rkim commented Nov 18, 2024

I'd be down to schedule a run to happen on some weekday at like 12am Pacific/3am Eastern

@ttmchiou
Contributor

Wednesday deployment?

@tt-rkim
Collaborator Author

tt-rkim commented Dec 2, 2024

I did the following in groups:

  • Turned off the actions runner using our playbook
  • Flashed with 80.13.0.0
  • Rebooted to update the kernel and immediately turned off the actions runner again
  • Installed the newer version of TTKMD
  • Checked with "ls -hal /dev/tenstorrent/" that the device is available and the driver is detecting it
  • Verified the installation by running a tt-metal test container: ansible -i INVENTORY GROUP_NAME -a "sudo bash /opt/tt_metal_infra/scripts/ci/wormhole_b0/reset.sh && docker run --log-driver none -v /dev/hugepages-1G:/dev/hugepages-1G --device /dev/tenstorrent --cap-add ALL ghcr.io/tt-rkim/sw-hello-world/tt-metal-hello-world:latest-wormhole_b0"
    • The reset script runs first to ensure hugepages and other settings are in place, because our version of Weka will overwrite the hugepage settings
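
The device check in the list above could also be scripted. The following is only a rough Python sketch of that step, not something from our playbooks: the function name `list_device_nodes` is hypothetical, and it assumes the driver exposes its device nodes as entries under /dev/tenstorrent, as the ls check implies.

```python
from pathlib import Path


def list_device_nodes(dev_dir="/dev/tenstorrent"):
    """Return the sorted entry names found under dev_dir.

    An empty list means the directory is missing or empty, which
    suggests the driver is not loaded or found no device.
    """
    path = Path(dev_dir)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())


if __name__ == "__main__":
    nodes = list_device_nodes()
    if nodes:
        print("device nodes found:", ", ".join(nodes))
    else:
        print("no device nodes under /dev/tenstorrent")
```

An empty result here corresponds to ls showing nothing useful, so the host would need another look before re-enabling the actions runner.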

Round 1

These VMs were upgraded:

tt-metal-ci-vm-109         : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-110         : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-111         : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
 
tt-metal-ci-vm-113         : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-115         : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-116         : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-120         : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
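
Recap lines like the ones above can be scanned with a small script to spot hosts that need a second look. This is a minimal sketch rather than part of the playbook: `parse_recap` and `hosts_needing_attention` are hypothetical names, and it assumes the `host : key=value ...` recap layout shown above.

```python
import re

# One recap line: "<host>   : ok=7    changed=5    unreachable=0 ..."
RECAP_RE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Map each host in a recap block to its integer counters."""
    summary = {}
    for line in text.splitlines():
        match = RECAP_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a recap line
        pairs = (pair.split("=") for pair in match.group("counters").split())
        summary[match.group("host")] = {key: int(value) for key, value in pairs}
    return summary


def hosts_needing_attention(summary):
    """Return hosts with a non-zero failed or unreachable counter."""
    return sorted(
        host
        for host, counters in summary.items()
        if counters.get("failed", 0) or counters.get("unreachable", 0)
    )
```

Feeding the pasted recap into `parse_recap` and then calling `hosts_needing_attention` would give the list of hosts to retry, which is easier than eyeballing the counters for a large group.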

Looks like these VMs did not go through:

fatal: [tt-metal-ci-vm-124]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.62 port 22: No route to host", "unreachable": true}
fatal: [tt-metal-ci-vm-112]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.105 port 22: No route to host", "unreachable": true}
fatal: [tt-metal-ci-vm-126]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.176 port 22: No route to host", "unreachable": true}

Round 2

The following were upgraded to both 80.13.0.0 FW and 1.29 TTKMD:

tt-metal-ci-vm-127         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

tt-metal-ci-vm-132         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-133         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-134         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-136         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-137         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-138         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

tt-metal-ci-vm-14          : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-140         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-141         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-142         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-143         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-144         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-145         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-146         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-149         : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-15          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-150         : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-19          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-20          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-22          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-23          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-24          : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-25          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-26          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

tt-metal-ci-vm-28          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-29          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-31          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-32          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-33          : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-34          : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

The following could not be updated:

fatal: [tt-metal-ci-vm-129]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.206 port 22: No route to host", "unreachable": true}
fatal: [tt-metal-ci-vm-128]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.145 port 22: No route to host", "unreachable": true}
fatal: [tt-metal-ci-vm-27]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.44.116 port 22: No route to host", "unreachable": true}
fatal: [tt-metal-ci-vm-13]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.148 port 22: Connection timed out", "unreachable": true}
fatal: [tt-metal-ci-vm-130]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.58 port 22: Connection timed out", "unreachable": true}
fatal: [tt-metal-ci-vm-139]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.44.50 port 22: Connection timed out", "unreachable": true}
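
Since the failures above fall into two kinds ("No route to host" and "Connection timed out"), it can help to group them before filing a ticket. The sketch below is one hedged way to do that; the function names are hypothetical, and it assumes the single-line `fatal: [host]: UNREACHABLE! => {...}` format shown above.

```python
import json
import re
from collections import defaultdict

# "fatal: [<host>]: UNREACHABLE! => {<json payload>}"
FATAL_RE = re.compile(
    r"^fatal: \[(?P<host>[^\]]+)\]: UNREACHABLE! => (?P<payload>\{.*\})\s*$"
)
# "... connect to host <ip> port <port>: <reason>"
MSG_RE = re.compile(r"connect to host (?P<ip>\S+) port (?P<port>\d+): (?P<reason>.+)$")


def parse_unreachable(lines):
    """Extract host, IP, and failure reason from UNREACHABLE lines."""
    records = []
    for line in lines:
        match = FATAL_RE.match(line.strip())
        if not match:
            continue
        message = json.loads(match.group("payload")).get("msg", "")
        detail = MSG_RE.search(message)
        records.append(
            {
                "host": match.group("host"),
                "ip": detail.group("ip") if detail else None,
                # Fall back to the whole message if it has an unexpected shape.
                "reason": detail.group("reason") if detail else message,
            }
        )
    return records


def group_by_reason(records):
    """Group host names by their failure reason."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["reason"]].append(record["host"])
    return dict(grouped)
```

The grouped result gives one host list per failure reason, which can be pasted straight into a ticket for the cloud team.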

Round 3

We upgraded:

tt-metal-ci-vm-35          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-36          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-37          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-38          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-40          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-41          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-42          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-43          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-44          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-45          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-46          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-47          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-49          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-50          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-51          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-52          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-53          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-57          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-58          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-68          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-69          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-70          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-79          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-80          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-84          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-85          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
 
tt-metal-ci-vm-87          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-88          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-95          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-96          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-97          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tt-metal-ci-vm-98          : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0  

Looks like the only one that didn't go through is:

tt-metal-ci-vm-86 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.165 port 22: No route to host",
    "unreachable": true
}

I haven't done the T3Ks yet

@tt-rkim
Collaborator Author

tt-rkim commented Dec 2, 2024

Submitted ticket to check on these VMs: https://github.com/tenstorrent/cloud/issues/3486

@tt-rkim
Collaborator Author

tt-rkim commented Dec 3, 2024

Upgraded:

[tt-metal-ci-vm-124]
[tt-metal-ci-vm-112]
[tt-metal-ci-vm-126]
[tt-metal-ci-vm-129]
[tt-metal-ci-vm-128]

@tt-rkim
Collaborator Author

tt-rkim commented Dec 3, 2024

Upgraded:

tt-metal-ci-vm-86 | CHANGED | rc=0 >>
"TTKMD 1.29"
tt-metal-ci-vm-13 | CHANGED | rc=0 >>
"TTKMD 1.29"
tt-metal-ci-vm-27 | CHANGED | rc=0 >>
"TTKMD 1.29"
tt-metal-ci-vm-139 | CHANGED | rc=0 >>
"TTKMD 1.29"
tt-metal-ci-vm-130 | CHANGED | rc=0 >>
"TTKMD 1.29"
tt-metal-ci-vm-86 | CHANGED | rc=0 >>
"2024-11-01"
tt-metal-ci-vm-27 | CHANGED | rc=0 >>
"2024-11-01"
tt-metal-ci-vm-13 | CHANGED | rc=0 >>
"2024-11-01"
tt-metal-ci-vm-139 | CHANGED | rc=0 >>
"2024-11-01"
tt-metal-ci-vm-130 | CHANGED | rc=0 >>
"2024-11-01"

Thank you to our wonderful cloud team

@tt-rkim
Collaborator Author

tt-rkim commented Dec 8, 2024

HUZZAH!

All single-card WH machines (besides new ones that @ttmchiou may have provisioned) have been upgraded to 80.13.0.0, except for these stragglers:

tt-metal-ci-vm-101 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.212 port 22: Network is unreachable",
    "unreachable": true
}
tt-metal-ci-vm-107 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.189 port 22: Network is unreachable",
    "unreachable": true
}
tt-metal-ci-vm-145 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.44.254 port 22: Network is unreachable",
    "unreachable": true
}
tt-metal-ci-vm-121 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.45.63 port 22: Operation timed out",
    "unreachable": true
}

@tt-rkim
Collaborator Author

tt-rkim commented Dec 18, 2024

BTW, I also upgraded f10cs07 (but not its driver).
We should symlink the t3000 flash playbook to the common one.

tt-rkim added a commit that referenced this issue Jan 3, 2025
@tt-rkim
Collaborator Author

tt-rkim commented Jan 6, 2025

  • Single card BMs have been upgraded
(internal_python_env) rkim@e12cs07:~/metal-internal-workflows$ ansible -i inventory/ci/ci.yaml -a "cat /sys/class/tenstorrent/tenstorrent\!0/tt_fw_bundle_ver"  single_card_bms
e13cs03 | CHANGED | rc=0 >>
0.0.0.0
e13cs01 | CHANGED | rc=0 >>
0.0.0.0
e09cs01 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-large-bm-e04cs05 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-large-bm-e09cs02 | CHANGED | rc=0 >>
80.13.0.0
  • All T3Ks have been upgraded
(internal_python_env) rkim@e12cs07:~/metal-internal-workflows$ ansible -i inventory/ci/ci.yaml -a "cat /sys/class/tenstorrent/tenstorrent\!0/tt_fw_bundle_ver"  all_t3ks 
tt-metal-ci-vm-t3k-01 | CHANGED | rc=0 >>
80.13.0.0
f12cs02 | CHANGED | rc=0 >>
80.13.1.0
f10cs07 | CHANGED | rc=0 >>
80.13.0.0
f10cs08 | CHANGED | rc=0 >>
80.13.0.0
f10cs06 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-02 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-03 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-05 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-04 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-06 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-07 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-10 | FAILED | rc=1 >>
cat: '/sys/class/tenstorrent/tenstorrent!0/tt_fw_bundle_ver': No such file or directorynon-zero return code
tt-metal-ci-vm-t3k-08 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-11 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-12 | FAILED | rc=1 >>
cat: '/sys/class/tenstorrent/tenstorrent!0/tt_fw_bundle_ver': No such file or directorynon-zero return code
tt-metal-ci-vm-t3k-13 | CHANGED | rc=0 >>
80.13.0.0
tt-metal-ci-vm-t3k-15 | FAILED | rc=1 >>
cat: '/sys/class/tenstorrent/tenstorrent!0/tt_fw_bundle_ver': No such file or directorynon-zero return code
tt-metal-ci-vm-t3k-14 | FAILED | rc=1 >>
cat: '/sys/class/tenstorrent/tenstorrent!0/tt_fw_bundle_ver': No such file or directorynon-zero return code
tt-metal-ci-vm-t3k-09 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: connect to host 172.27.44.185 port 22: No route to host",
    "unreachable": true
}

Only Galaxies remain.
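
For future checks, output from the ad-hoc `cat .../tt_fw_bundle_ver` runs above can be compared against the target version automatically. The following is a minimal sketch under stated assumptions: `parse_adhoc_output` and `hosts_not_on` are hypothetical helpers, and the parser assumes the `host | STATUS | rc=N >>` header layout seen in the pasted output.

```python
import re

# Header of each per-host result, e.g. "f10cs07 | CHANGED | rc=0 >>"
# or "tt-metal-ci-vm-t3k-09 | UNREACHABLE! => {".
HEADER_RE = re.compile(
    r"^(?P<host>\S+) \| (?P<status>CHANGED|SUCCESS|FAILED|UNREACHABLE!)"
)


def parse_adhoc_output(text):
    """Map each host to its status and the output lines that follow it."""
    results = {}
    current = None
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            current = match.group("host")
            results[current] = {"status": match.group("status"), "output": []}
        elif current is not None and line.strip():
            results[current]["output"].append(line.strip())
    return results


def hosts_not_on(results, expected_version):
    """Return hosts that failed, were unreachable, or report another version."""
    return sorted(
        host
        for host, info in results.items()
        if info["status"] != "CHANGED" or info["output"][:1] != [expected_version]
    )
```

Calling `hosts_not_on(parse_adhoc_output(output), "80.13.0.0")` on a saved copy of the command output would list every host that still needs attention, including ones reporting a different version string and ones where the sysfs file was missing.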
