dask workers can be scheduled on hub pods with default config #59

Open
scottyhq opened this issue Jul 15, 2019 · 9 comments

@scottyhq (Member) commented Jul 15, 2019

Our current setup allows dask worker pods to be scheduled on hub nodes:
https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/dask_config.yaml

This seems to be due to 'prefer' rather than 'require' when scheduling:
https://github.com/dask/dask-kubernetes/blob/ec4666a4af5acad03c24b84aca4fcf8ccd791b4f/dask_kubernetes/objects.py#L177

which results in the following for pods:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker
        weight: 100

I'm not sure how we modify the config file to get the stricter 'require' condition that we have for notebook pods:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker

@jhamman, @TomAugspurger

@jhamman (Member) commented Jul 16, 2019

If you want to keep non-core pods off your core (hub) pool, you need to add a taint that only core pods can tolerate. I tend to just size the core pool to the smallest size that fits the hub pods; if you don't leave space, things won't try to schedule there. You can also tighten the node-purpose scheduling requirements for dask pods, but in my experience this is unnecessary.
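A minimal sketch of that taint/toleration setup (the taint key below is illustrative, not something prescribed in this thread):

# Taint the core nodes so only pods with a matching toleration can land there, e.g.:
#   kubectl taint nodes <core-node> hub.jupyter.org/dedicated=core:NoSchedule
# ...then give the hub (and other core) pods the matching toleration:
spec:
  tolerations:
  - key: hub.jupyter.org/dedicated
    operator: Equal
    value: core
    effect: NoSchedule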

For posterity, I should also link to this blog post that describes all of this in more detail: https://medium.com/pangeo/pangeo-cloud-cluster-design-9d58a1bf1ad3

@scottyhq (Member Author)

@jhamman - I'm thinking we might want the core pool to autoscale eventually if we try to consolidate multiple hubs on a single EKS cluster. If we add a taint to the core pool, it seems like pods in the kube-system namespace (for example aws-node, tiller-deploy, cluster-autoscaler) might have trouble scheduling.

Another approach is to expose match_node_purpose="require" in https://github.com/dask/dask-kubernetes/blob/ec4666a4af5acad03c24b84aca4fcf8ccd791b4f/dask_kubernetes/objects.py#L177

@TomAugspurger (Member)

@jhamman is there a downside to the hard affinity (at least optionally)? It couldn't be the default, but it seems useful as an option.

@TomAugspurger (Member)

FYI, rather than exposing it as a config option / parameter in KubeCluster, we could document how to achieve it with a custom worker pod template:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '2', --no-bokeh, --memory-limit, 6GB, --death-timeout, '60']
    name: dask
    resources:
      limits:
        cpu: "2"
        memory: 6G
      requests:
        cpu: "2"
        memory: 6G
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker

On master, that'll result in both the preferred and required affinity types being applied.

>>> a.pod_template.spec.affinity.node_affinity
{'preferred_during_scheduling_ignored_during_execution': [{'preference': {'match_expressions': [{'key': 'k8s.dask.org/node-purpose',
                                                                                                 'operator': 'In',
                                                                                                 'values': ['worker']}],
                                                                          'match_fields': None},
                                                           'weight': 100}],
 'required_during_scheduling_ignored_during_execution': {'node_selector_terms': [{'match_expressions': None,
                                                                                  'match_fields': None}]}}

I'm not sure how Kubernetes will handle that (presumably it's fine, just not the cleanest). Right now my preference would be to add a config option / argument to KubeCluster that's passed through to clean_pod_template, but I may be missing some context.
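
If the documented route is taken, a sketch of how a template like the one above could be supplied through the dask config (assuming the kubernetes.worker-template key that dask-kubernetes reads; only the affinity portion is shown, the rest of the pod template is elided):

kubernetes:
  worker-template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: k8s.dask.org/node-purpose
                operator: In
                values:
                - worker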

@jhamman (Member) commented Jul 16, 2019

@jhamman is there a downside to the hard affinity (at least optionally)?

Not really. I think this is a fine approach. Of course, there is no way to enforce that users follow this pattern, so dask workers may still end up in your core pool with this approach.

@jhamman (Member) commented Aug 2, 2019

In thinking about this a little more, it may be easier for some to simply add a taint to the core pool that the hub and ingress pods can tolerate.

@scottyhq (Member Author)

In thinking about this a little more, it may be easier for some to simply add a taint to the core pool that the hub and ingress pods can tolerate.

@jhamman, are you doing this now on the Google clusters?

@jhamman (Member) commented Sep 17, 2019

No. Not yet, but we could.

@bgroenks96 commented Dec 8, 2019

If you don't feel like modifying all of the JupyterHub services' configurations to include the toleration, this can also be accomplished by 1) adding a taint to the worker pools to keep core services off them, with corresponding tolerations added to the worker pods, and 2) adding a node selector to the worker pods with corresponding labels on the worker nodes. This pretty much guarantees that everything ends up on the right nodes without having to taint/tolerate the core services.
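
A rough sketch of that combination, with illustrative taint/label names (not taken from this thread):

# 1) Taint and label the worker nodes, e.g.:
#   kubectl taint nodes <worker-node> k8s.dask.org/dedicated=worker:NoSchedule
#   kubectl label nodes <worker-node> k8s.dask.org/node-purpose=worker
# 2) In the dask worker pod template, tolerate the taint and select the labeled nodes:
spec:
  nodeSelector:
    k8s.dask.org/node-purpose: worker
  tolerations:
  - key: k8s.dask.org/dedicated
    operator: Equal
    value: worker
    effect: NoSchedule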
