Unable to upgrade from 1.29.x to 1.30.x #11500
I am unable to reproduce this - I just went through a full upgrade of a cluster from 1.29.3 and 1.30.2 to v1.30.4 without any issues. There is one major difference on my cluster, though - I am running etcd on the control plane nodes (not on dedicated machines). The step you mentioned actually calls kubeadm upgrade under the hood. Was the cluster created using kubespray? Were there any significant configuration changes to the cluster (and/or etcd) between cluster creation and the upgrade attempt?
@bogd Thank you for your answer - yes, the cluster was indeed created via kubespray (we had not had any issues until now). The last thing I remember is that we tried to upgrade from 1.29 to 1.30, but we hit kubernetes/kubeadm#3084 (which was merged recently and is now fixed). Our issue was that when we tried to upgrade, it went from 1.29 to 1.30 (on the first master), then failed, and we had to downgrade back to 1.29 - since then we have been getting the error I mentioned above. I read the playbooks and saw that it calls kubeadm upgrade. What more logs shall I provide? I can run ansible with debug -v5 and post the result if that helps. Update: I am now running the upgrade on another cluster (same version) which had not been impacted by kubernetes/kubeadm#3084, and will soon try the affected one and post more of the output.
@mrBlackhat - that is exactly the same issue I encountered (#11350 references kubernetes/kubeadm#3084), and the reason why my cluster was stuck on two versions (v1.30.2 on the first control plane, and v1.29.3 on the other control plane and worker nodes). However, in my case the upgrade actually completed successfully using the most recent kubespray version. A debug log from ansible will definitely help - and so will the results of running the upgrade on the other cluster. [Edit, because I just noticed - how did you perform the downgrade? In my case, I just left the cluster on multiple versions - a difference of one minor version is fine, as per the K8s version skew documentation.]
@bogd Hi again - the downgrade was via kubespray: I just changed kube_version 1.30 -> 1.29 and it downgraded fine. I then tried to run the upgrade from 1.29 to 1.30 (on a cluster which had not been updated while the --config flag was a problem), and the issue still persists: https://pastebin.com/6gSMjswY . I ran the upgrade (my master nodes are master01-03-test), and in the output of the playbook I still see that it tries to search for master03.pem (while searching a directory located on master01). The only thing I found is that after --config was removed from the kubeadm task, someone said it becomes interactive, so --yes has to be passed to it to continue its normal behavior - do I edit the task in the playbook to add the --yes flag? Edit: I left the cluster in a weird state - after the failure, master01 was 1.30 and the others were 1.29, so I saw a KubeVersionMismatch alarm in prometheus; it cleared once I downgraded. I then tried the upgrade again and it still fails. As I said, I just ran the upgrade on a fresh (1.29) cluster which had not been upgraded while the --config option was problematic (so it never saw a failed upgrade), but it is still failing, as shown in the task output I posted.
@bogd I am confused by this error: [master01-test] FATAL: failed to create etcd client for external etcd: open /etc/ssl/etcd/ssl/node-master03-test.pem: no such file or directory. Why does it search for the master03 file on master01, in /etc/ssl/etcd/ssl/node-master03-test.pem? When I search the directory I can see that there is only node-master01-test.pem (which seems correct) - so why is it complaining about master03 when it is currently working on master01? Edit: -v5 output of the task: https://pastebin.com/5WseLz7R
There is something strange there indeed.
On another kubespray-provisioned cluster that is still on [...], the issue was not visible on my end because my [...]
I can afford to do this, since it is only a test cluster, and the worst case scenario is "I just have to delete it and recreate it". I am not sure how safe this workaround is for a production cluster, so proceed with caution... (BTW, was the API server on that node still running?)
@bogd Thanks for the reply. On my [master01-test] I see the correct files in less /etc/kubernetes/manifests/kube-apiserver.yaml. Yes, the API server on master01 is running - I checked, and everything is running well. kube-apiserver.yaml is also correct on master02/03; I see no wrong files there. The only confusion comes from kubespray searching for the master03 cert while it is working on master01. In kube-apiserver.yaml on master01 I can see the listed files, but only for master01 (which I assume is correct, since the same file on master02 lists paths to master02's files); there are no master03 references anywhere on master01 - which confuses me. EDIT: The only difference between my setup and the default is that I have etcd on separate machines (not on the master nodes). When I list [...] Edit 2: not sure if it is related in this case, but my etcd deployment type is etcd_deployment_type: host (as I understand it, host means etcd is deployed on a different host, not on the master).
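For what it's worth, etcd_deployment_type in kubespray selects how etcd is run (as a systemd service on whatever hosts are in the [etcd] group, rather than as kubeadm-managed static pods); whether it lands on separate machines is decided by the inventory, not by this variable. A minimal sketch of the relevant setting, with the value from this thread (file path is illustrative):

```yaml
# group_vars/all/etcd.yml (kubespray)
etcd_deployment_type: host   # run etcd as a systemd unit on the [etcd] hosts,
                             # not as kubeadm-managed static pods
```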
One more possible source for the config (see here for details):
Yep, I just checked - it does seem wrong, maybe? Shouldn't it be master01.pem and master01-key.pem? (Checked via: kubectl -n kube-system get cm kubeadm-config -o yaml)
Edit: After a few more checks, I see the following:
And in this case I assume it is some loop going through the etcd group, and since master03 is the last one in the group, that is how it ended up there. If I understand correctly, shouldn't this kube_etcd_cert_file: node-{{ inventory_hostname }}.pem be changed to something like [...]
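For illustration, the external etcd stanza that such a loop would leave in the kubeadm-config ConfigMap would look roughly like this (paths reconstructed from the error message in this thread; the endpoints and the ca.pem name are assumptions):

```yaml
# Fragment of data.ClusterConfiguration in the kubeadm-config ConfigMap,
# rendered with inventory_hostname of the LAST control plane node processed
etcd:
  external:
    endpoints:
      - https://10.10.20.50:2379   # etcd01 from the inventory; illustrative
    caFile: /etc/ssl/etcd/ssl/ca.pem
    certFile: /etc/ssl/etcd/ssl/node-master03.pem      # last node "wins"
    keyFile: /etc/ssl/etcd/ssl/node-master03-key.pem
```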
Not really - unfortunately, the cluster-wide config (stored in the ConfigMap) is... cluster-wide - and as far as I can tell, K8s has not yet implemented per-node configs. You are absolutely right in assuming that the config looks this way because of the last control plane node being processed. What I cannot figure out is how kubespray normally works around this problem, and why I have all the certs for all the control plane nodes in my [...]
Edit: I think I have the answer to the second question - I am running etcd on the control-plane nodes, and all the certificates are copied to all the etcd nodes here. Unfortunately, I do not have a cluster with external etcd that I can test with... |
I have exactly the same files (on my etcd01-test node) - which seems fine to me, but my etcd is external. I am still not sure why kubespray is searching for those files on master01 when etcd is external; by my logic it should search for them there. On my master01-test I have [...]
But since etcd is deployed on different machines (not on the master nodes), I am assuming that kubespray is searching for the file in the wrong place? When I have external etcd, shouldn't it check the etcd group and therefore their /etc/ssl/etcd/ssl folder?
Not kubespray - this seems to be [...]
So when I inspect all of the masters:
Master01: [...]
Master02: [...]
Master03: [...]
I am not exactly sure why they look like this - master01 has its own certificate, master02 has its own plus master01's, and so on. EDIT: If I understand correctly, in manifests.go I can see [...] If I am using external etcd (as in my case), it will use whatever certificate is provided, since I think it comes from the config itself - and since in my case the last node processed is what got written into the config, it tries to use that certificate (which is missing). So isn't this a kubespray issue, since kubespray is creating the config? My point is: if kubespray knows that master01 is my first master (I think I saw some set_fact that records which node the first master is), shouldn't it ensure that node-first-master.pem is written into the config, and not the last one of the loop? And in this case, if I edit the configmap to point it at master01 like this [...], I guess it may do the upgrade.
@bogd Some updates - I have inspected my master nodes and found this: [...]
So in kube-apiserver, every master is using its own certificate (which seems right to me), but if I do kubectl get cm kubeadm-config -n kube-system -o yaml, I can see: [...]
So my obvious solution was to do kubectl edit cm kubeadm-config -n kube-system and change the certFile/keyFile entries to point at master01, since the master01 certificate is present on every other master.
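The edit described here, sketched as a fragment of the ClusterConfiguration (the surrounding keys are reconstructed from this thread, so treat the exact layout as an assumption):

```yaml
# kubectl -n kube-system edit cm kubeadm-config
# Under data.ClusterConfiguration:
etcd:
  external:
    caFile: /etc/ssl/etcd/ssl/ca.pem
    certFile: /etc/ssl/etcd/ssl/node-master01.pem      # was node-master03.pem
    keyFile: /etc/ssl/etcd/ssl/node-master01-key.pem   # was node-master03-key.pem
```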
After that, the upgrade went as expected - I successfully upgraded to 1.30.4. On a clean install, the ConfigMap was right (it was pointing to master01).
Indeed - the problem was that the ConfigMap was somehow wrong. I am not sure how exactly it happened, but it did: master03 ended up in there, and since that certificate file is missing from the other masters, the upgrade failed. Copying the cert from master03 to master02 and master01 is not recommended; the best solution, which worked, was to edit the kubeadm-config ConfigMap in kube-system to point it at master01.
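The fix described above can be sketched as a small shell sequence. This is hypothetical: the heredoc stands in for the real dump you would get from kubectl -n kube-system get cm kubeadm-config -o yaml, and the host names are the ones from this thread.

```shell
# Stand-in for the real ConfigMap dump (in practice:
#   kubectl -n kube-system get cm kubeadm-config -o yaml > kubeadm-config.yaml)
cat > kubeadm-config.yaml <<'EOF'
etcd:
  external:
    caFile: /etc/ssl/etcd/ssl/ca.pem
    certFile: /etc/ssl/etcd/ssl/node-master03.pem
    keyFile: /etc/ssl/etcd/ssl/node-master03-key.pem
EOF

# Point the client cert/key at the first control plane's files instead
sed -i 's/node-master03/node-master01/g' kubeadm-config.yaml

cat kubeadm-config.yaml
# Re-apply with: kubectl apply -f kubeadm-config.yaml  (or use kubectl edit
# directly and make the same substitution by hand)
```

The same substitution can of course be made interactively with kubectl edit; the sed form is just easier to review before applying.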
What happened?
Running the upgrade (upgrade-cluster.yaml)
Failing at task:
Upgrade first master (1 retries left)
[upgrade/apply] Kubeadm | Upgrade first master FATAL: failed to create etcd client for external etcd: open /etc/ssl/etcd/ssl/node-master03.pem
What did you expect to happen?
To upgrade the master from 1.29 to 1.30
How can we reproduce it (as minimally and precisely as possible)?
run upgrade-cluster.yaml from 1.29 to 1.30
OS
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Version of Ansible
ansible-playbook [core 2.16.10]
jinja version = 3.0.3
Version of Python
3.10.12
Version of Kubespray (commit)
e744a11
Network plugin used
weave
Full inventory with variables
[all]
etcd01 ansible_host=10.10.20.50 etcd_member_name=etcd01
etcd02 ansible_host=10.10.20.51 etcd_member_name=etcd02
etcd03 ansible_host=10.10.20.52 etcd_member_name=etcd03
master01 ansible_host=10.10.30.10
master02 ansible_host=10.10.30.11
master03 ansible_host=10.10.30.12
worker1 ansible_host=10.10.30.21
worker2 ansible_host=10.10.30.22
worker3 ansible_host=10.10.30.23
worker4 ansible_host=10.10.30.24
worker5 ansible_host=10.10.30.25
[kube_control_plane]
master01
master02
master03
[etcd]
etcd01
etcd02
etcd03
[kube_node]
worker1
worker2
worker3
worker4
worker5
[calico_rr]
[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
Command used to invoke ansible
ansible-playbook upgrade-cluster.yaml -i /path/to/my/inventory -e ansible_user=root
Output of ansible run
Upgrade first master (1 retries left)
[upgrade/apply] FATAL: failed to create etcd client for external etcd: open /etc/ssl/etcd/ssl/node-master03.pem
Anything else we need to know
Not sure if related, but I see that when trying to upgrade master01 it searches for the master03 certificate in /etc/ssl/etcd/ssl; since it is master01 and only the master01 cert is present there, it fails. Any help would be appreciated.