[BUG] - Nodes don't scale down on GKE and AKS #2507

Open
Adam-D-Lewis opened this issue Jun 11, 2024 · 6 comments · May be fixed by #2605

@Adam-D-Lewis
Member

Adam-D-Lewis commented Jun 11, 2024

Describe the bug

I noticed that GKE won't autoscale all nodes down to 0 in some cases. I saw that the nodeSelector on the metrics-server deployment and the event-exporter-gke replicaset only has

nodeSelector:
    kubernetes.io/os: linux

meaning those pods can be scheduled on any of the nodes, which prevents the nodes from scaling down.

Options to fix this might be

  1. Disable metrics collection - https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#enable_components
  2. Set a taint on the user and worker nodes (and any custom node groups created) to force the metrics-server pod to run on the general node group
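
For option 2, a rough sketch of what such a taint looks like on a Kubernetes Node object. The node name and the `dedicated=user` key/value pair are assumptions chosen for illustration, not something decided in this issue:

```yaml
# Fragment of a Node object. The name and the key/value pair are illustrative only.
apiVersion: v1
kind: Node
metadata:
  name: example-user-node   # hypothetical node name
spec:
  taints:
    - key: dedicated
      value: user
      effect: NoSchedule    # pods without a matching toleration are not scheduled here
```

Since the metrics-server pod carries no toleration for this taint, the scheduler would not place it on the tainted nodes, leaving the untainted general node group as its only option.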

I think AWS doesn't have metrics-server enabled by default so I think it's reasonable to disable it.

Expected behavior

nodes should autoscale down

OS and architecture in which you are running Nebari

Linux x86-64

How to Reproduce the problem?

see above

Command output

No response

Versions and dependencies used.

No response

Compute environment

GCP

Integrations

No response

Anything else?

No response

@Adam-D-Lewis Adam-D-Lewis added type: bug 🐛 Something isn't working needs: triage 🚦 Someone needs to have a look at this issue and triage labels Jun 11, 2024
@Adam-D-Lewis
Member Author

While I don't think this is the issue, it occurs to me that the other nodes might be scaling up b/c we have more pods than cpu/memory on the general node.

@Adam-D-Lewis Adam-D-Lewis removed the needs: triage 🚦 Someone needs to have a look at this issue and triage label Jun 18, 2024
@viniciusdc
Contributor

viniciusdc commented Jun 21, 2024

> While I don't think this is the issue, it occurs to me that the other nodes might be scaling up b/c we have more pods than cpu/memory on the general node.

That's a good point, we really need to check those taints.

@viniciusdc
Contributor

I think, as an overall change, your option 2 seems more reasonable (for all providers). For AWS specifically, I think metrics is a service that you need to enable if you want to use it, and it costs extra to keep. I also agree with disabling it in that case, or making it optional.

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Jun 27, 2024

Also, I think the GKE-deployed kubedns replicaset has the same issue. I think the solution is to put taints on the user nodes and worker nodes.

@Adam-D-Lewis
Member Author

I also saw the metrics server and jupyterhub's user scheduler cause the same problem on AKS.

@Adam-D-Lewis Adam-D-Lewis linked a pull request Aug 1, 2024 that will close this issue
11 tasks
@Adam-D-Lewis Adam-D-Lewis changed the title [BUG] - Nodes don't scale down on GKE [BUG] - Nodes don't scale down on GKE and AKS Aug 1, 2024
@Adam-D-Lewis Adam-D-Lewis self-assigned this Aug 1, 2024
@Adam-D-Lewis Adam-D-Lewis moved this from New 🚦 to In progress 🏗 in 🪴 Nebari Project Management Aug 1, 2024
@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Aug 16, 2024

The solution I propose is to add a taints section to each node group class. You could then specify a taint on the user node group via something like the following:

  node_groups:
    user:
      instance: Standard_D4_v3
      taints:
        - dedicated=user:NoSchedule

Then we make sure the corresponding toleration is added to the jupyterhub user pod so that those pods are able to run on the user node group. This should also work with pods started via argo-jupyter-scheduler.
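
As a minimal sketch, the toleration matching a `dedicated=user:NoSchedule` taint would look roughly like this in a pod spec. This is the generic Kubernetes form; how it ends up wired into the jupyterhub user pod is not shown here:

```yaml
# Pod spec fragment: tolerates the taint dedicated=user:NoSchedule.
spec:
  tolerations:
    - key: dedicated
      operator: Equal       # key, value, and effect must all match the taint
      value: user
      effect: NoSchedule
```

Note that a toleration only permits scheduling onto the tainted nodes. It does not pin the pod there, so any existing node selector or affinity for the user node group is still what steers the pod to it.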

This would not be supported for local deployments, since local deployments only deploy a single-node cluster atm. For existing deployments, it wouldn't affect the node group, but we would apply the specified toleration to the jupyterlab user pod.
