Run all the commands listed below from the root of this repository.
This is a known issue for AMI instances on v4.0.0 and v4.0.1, as an updated configMap is required:
# Step 1: Install the configMap override file for prometheus
kubectl apply -f ./install/prometheus-override.ConfigMap.yaml
# Step 2: Delete the existing prometheus pod so that a new one will be created with the new configMap
kubectl delete pod <prometheus pod>
# Step 3: Verify that a new prometheus pod has been created and is running
kubectl get pods -A
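If you do not know the exact pod name for step 2, you can look it up first, or delete the pod by label; the sketch below assumes the pod carries the app=prometheus label used by the Sourcegraph Helm chart (verify with kubectl get pods --show-labels):
# Find the prometheus pod name for step 2
kubectl get pods -A | grep prometheus
# Or delete it by label in one step (assumes the app=prometheus label)
kubectl delete pod -l app=prometheus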
If you are facing networking challenges, it can be helpful to run diagnostics such as ping from inside a container in the cluster rather than on the host machine:
kubectl run -it --rm busybox --image=busybox -- sh
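Once inside the busybox shell, a few illustrative checks can narrow the problem down; the service name and port in the last line are examples only, so substitute values from kubectl get svc:
# Check in-cluster DNS resolution
nslookup kubernetes.default.svc.cluster.local
# Check outbound connectivity from the pod network
ping -c 3 8.8.8.8
# Probe an in-cluster service over HTTP (example service name and port)
wget -qO- http://sourcegraph-frontend:30080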
If there are disk / persistent volume issues, you can check the storage provisioner like so:
kubectl -n local-path-storage logs -f -l app=local-path-provisioner
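It can also help to look at the claims and volumes themselves; these are standard kubectl commands, and the PVC name in the last line is a placeholder:
# List persistent volume claims and volumes to see their binding status
kubectl get pvc -A
kubectl get pv
# Inspect a specific claim; provisioning errors show up under Events
kubectl describe pvc <pvc-name>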
# Stop
sudo systemctl stop k3s
# Start
sudo systemctl start k3s
# Restart
sudo systemctl restart k3s
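After stopping or restarting, the service and node state can be checked with standard commands:
# Check the k3s service status
sudo systemctl status k3s
# Confirm the node reports Ready again
kubectl get nodes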
By default, k3s allows high availability during upgrades, so the K3s containers continue running when the K3s service is stopped/restarted via systemctl.
To stop all of the K3s containers and reset the containerd state, the k3s-killall.sh script (already on the PATH) can be used:
bash k3s-killall.sh
Once run, sudo systemctl start k3s can bring K3s back (no Sourcegraph data will be lost).
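After starting K3s again, it can take a few minutes for all pods to become Ready; one way to watch them come back up:
# Watch pods across all namespaces until they return to Running
kubectl get pods -A -w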
If you encounter an error about cluster outages:
FATA[0000] /var/lib/rancher/k3s/server/tls/client-ca.key, /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, /var/lib/rancher/k3s/server/tls/etcd/peer-ca.key, /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt, /var/lib/rancher/k3s/server/tls/etcd/server-ca.key, /var/lib/rancher/k3s/server/tls/request-header-ca.crt, /var/lib/rancher/k3s/server/tls/server-ca.key, /var/lib/rancher/k3s/server/tls/client-ca.crt, /var/lib/rancher/k3s/server/tls/request-header-ca.key, /var/lib/rancher/k3s/server/tls/server-ca.crt, /var/lib/rancher/k3s/server/tls/service.key, /var/lib/rancher/k3s/server/cred/ipsec.psk newer than datastore and could cause a cluster outage. Remove the file(s) from disk and restart to be recreated from datastore.
To fix this, remove the TLS certs and creds that were used by the old instance:
sudo rm -rf /var/lib/rancher/k3s/server/cred/ /var/lib/rancher/k3s/server/tls/
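Then, as the error message suggests, restart K3s so the certificates are recreated from the datastore:
sudo systemctl restart k3s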
Unable to perform initial Kubernetes service initialization: Service "kubernetes" is invalid: spec.clusterIPs: Invalid value: []string{"10.43.0.1"}: failed to allocate IP 10.43.0.1: cannot allocate resources of type serviceipallocations at this time
A: Remove all the cluster-related data with the command below. Data inside the volumes will not be removed.
sh /usr/local/bin/k3s-killall.sh
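K3s can then be started again, as described in the k3s-killall.sh section above:
sudo systemctl start k3s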
A: Check the logs for clues:
sudo tail -f /var/log/cloud-init-output.log
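The K3s service logs can also contain clues; this is a standard systemd journal query:
# Follow the k3s service logs
sudo journalctl -u k3s -f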
# Replace 4.0.0 with your version number as needed
helm upgrade -i -f /home/ec2-user/deploy/install/override.yaml --version 4.0.0 sourcegraph /home/ec2-user/deploy/install/sourcegraph-charts.tgz --kubeconfig /etc/rancher/k3s/k3s.yaml
# vs. using the remote Helm charts
helm upgrade -i -f /home/ec2-user/deploy/install/override.yaml --version 4.0.0 sourcegraph sourcegraph/sourcegraph --kubeconfig /etc/rancher/k3s/k3s.yaml
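After either command, the release and pod status can be verified with standard helm and kubectl commands:
# Confirm the deployed release revision and chart version
helm --kubeconfig /etc/rancher/k3s/k3s.yaml history sourcegraph
# Watch the pods roll over to the new version
kubectl get pods -A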
# 1: Port forward the Prometheus deployment to port 9090
kubectl port-forward deploy/prometheus 9090:9090
# 2: Connect the AWS VM port 9090 to your localhost port 9090
ssh -i ~/.ssh/<ssh-key> -L 9090:localhost:9090 ec2-user@<instance-ip>
You now have access to Prometheus at http://localhost:9090 in your browser.
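As a quick check that the tunnel works, the Prometheus HTTP API can be queried from your local machine; the up query below is just an example:
# Query target health through the forwarded port (example PromQL query)
curl 'http://localhost:9090/api/v1/query?query=up'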
# sourcegraph-0 is the name of the node
kubectl get --raw /api/v1/nodes/sourcegraph-0/proxy/metrics/cadvisor --kubeconfig /etc/rancher/k3s/k3s.yaml
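The raw output is large; it can be piped through grep to pull out a single metric (the metric name below is one example that cAdvisor exposes):
# Filter for one cAdvisor metric, e.g. per-container working set memory
kubectl get --raw /api/v1/nodes/sourcegraph-0/proxy/metrics/cadvisor --kubeconfig /etc/rancher/k3s/k3s.yaml | grep container_memory_working_set_bytes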
helm --kubeconfig /etc/rancher/k3s/k3s.yaml upgrade -i -f /home/sourcegraph/deploy/install/override.yaml sourcegraph sourcegraph/sourcegraph