GPUMounter-worker error in k8s v1.23.1 #19
Comments
Thanks for your feedback. I will try to fix it.
OK, this bug is solved. My environment and version:

I use the NVIDIA k8s-device-plugin, and my Docker daemon.json is:

{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2",
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}

In the code of "/pkg/util/util.go", "cgroupfs" is always passed as the cgroupDriver when calling GetCgroupName, so it fails on nodes that actually use the systemd cgroup driver. The code needs to detect which cgroup driver is in use. Also, should the issue title be edited? If so, I can edit it directly.
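For reference, a minimal Go sketch of how the worker could pick the cgroup driver instead of hard-coding "cgroupfs". It assumes the environment-variable approach described in the fix notes below; the function name is illustrative and not the actual GPUMounter implementation:

```go
package main

import (
	"fmt"
	"os"
)

// cgroupDriver returns the cgroup driver the worker should assume.
// It reads CGROUP_DRIVER and falls back to "cgroupfs", which matches
// the previously hard-coded behaviour.
func cgroupDriver() string {
	if driver := os.Getenv("CGROUP_DRIVER"); driver != "" {
		return driver // expected values: "cgroupfs" or "systemd"
	}
	return "cgroupfs"
}

func main() {
	fmt.Println("using cgroup driver:", cgroupDriver())
}
```

The returned value could then be passed to GetCgroupName in place of the hard-coded "cgroupfs".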
@cool9203 Happy Spring Festival! Thanks for your efforts, and sorry for keeping you waiting so long.
@pokerfaceSad Happy Spring Festival!!
I ran into another problem.
#19 (comment) https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/
Updating my test result: as can be seen in this test, maybe don't use …
* add environment variable `CGROUP_DRIVER` (default: cgroupfs) to set the cgroup driver
* fix log format error in allocator.go L84
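The log format error mentioned in the second bullet is visible in the worker log further down: the line from allocator.go:84 prints literal %s verbs and concatenates the two pod names. Below is a sketch of the likely fix, assuming the project logs through zap's SugaredLogger (an assumption based on the log layout; the variable names are illustrative):

```go
package main

import "go.uber.org/zap"

func main() {
	logger := zap.NewExample().Sugar()
	slavePodName, ownerPodName := "test-slave-pod-2f66ed", "test"

	// Before the fix: Info uses fmt.Sprint semantics, so the %s verbs are
	// printed literally and the pod names are concatenated at the end,
	// producing a garbled line like the one seen at allocator.go:84.
	logger.Info("Successfully create Slave Pod: %s, for Owner Pod: %s ", slavePodName, ownerPodName)

	// After the fix: Infof applies printf-style formatting, so the names
	// are substituted into the message as intended.
	logger.Infof("Successfully create Slave Pod: %s, for Owner Pod: %s", slavePodName, ownerPodName)
}
```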
@cool9203
…Sad#19#issuecomment-1033637663
* add environment variable `GPU_POOL_NAMESPACE` (no default value; this env var must be set) to set the namespace in which the worker creates slave pods
@pokerfaceSad
Thanks for your fix; passing an environment variable in worker.yaml is a good idea!
I showed one possible solution in #19 (comment).
@cool9203 However, in a multi-tenant cluster scenario, the cluster administrator may use the resource quota feature to limit users' resource usage. If GPUMounter created the slave pods in the owner pod's namespace, the slave pods would consume the user's resource quota.
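A minimal sketch of how the worker could read the slave-pod namespace from the `GPU_POOL_NAMESPACE` variable described in the commit notes above, with no default so the worker fails fast when it is unset (the function name and error message are illustrative, not the actual GPUMounter code):

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// slavePodNamespace returns the namespace in which slave pods are created.
// GPU_POOL_NAMESPACE has no default and must be set on the worker, so that
// slave pods do not consume the resource quota of the owner pod's namespace.
func slavePodNamespace() (string, error) {
	ns := os.Getenv("GPU_POOL_NAMESPACE")
	if ns == "" {
		return "", fmt.Errorf("environment variable GPU_POOL_NAMESPACE must be set")
	}
	return ns, nil
}

func main() {
	ns, err := slavePodNamespace()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("creating slave pods in namespace:", ns)
}
```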
GPUMounter-master.log:
2022-01-16T11:24:14.610Z INFO GPUMounter-master/main.go:25 access add gpu service
2022-01-16T11:24:14.610Z INFO GPUMounter-master/main.go:30 Pod: test Namespace: default GPU Num: 1 Is entire mount: false
2022-01-16T11:24:14.627Z INFO GPUMounter-master/main.go:66 Found Pod: test in Namespace: default on Node: rtxws
2022-01-16T11:24:14.634Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-7dsdf Node: rtxws
2022-01-16T11:24:19.648Z ERROR GPUMounter-master/main.go:98 Failed to call add gpu service
2022-01-16T11:24:19.648Z ERROR GPUMounter-master/main.go:99 rpc error: code = Unknown desc = Service Internal Error
GPUMounter-worker.log:
2022-01-16T11:24:14.635Z INFO gpu-mount/server.go:35 AddGPU Service Called
2022-01-16T11:24:14.635Z INFO gpu-mount/server.go:36 request: pod_name:"test" namespace:"default" gpu_num:1
2022-01-16T11:24:14.645Z INFO gpu-mount/server.go:55 Successfully get Pod: default in cluster
2022-01-16T11:24:14.645Z INFO allocator/allocator.go:159 Get pod default/test mount type
2022-01-16T11:24:14.645Z INFO collector/collector.go:91 Updating GPU status
2022-01-16T11:24:14.646Z INFO collector/collector.go:136 GPU status update successfully
2022-01-16T11:24:14.657Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: test-slave-pod-2f66ed for Owner Pod: test
2022-01-16T11:24:14.657Z INFO allocator/allocator.go:238 Checking Pods: test-slave-pod-2f66ed state
2022-01-16T11:24:14.661Z INFO allocator/allocator.go:264 Pod: test-slave-pod-2f66ed creating
2022-01-16T11:24:19.442Z INFO allocator/allocator.go:277 Pods: test-slave-pod-2f66ed are running
2022-01-16T11:24:19.442Z INFO allocator/allocator.go:84 Successfully create Slave Pod: %s, for Owner Pod: %s test-slave-pod-2f66edtest
2022-01-16T11:24:19.442Z INFO collector/collector.go:91 Updating GPU status
2022-01-16T11:24:19.444Z DEBUG collector/collector.go:130 GPU: /dev/nvidia0 allocated to Pod: test-slave-pod-2f66ed in Namespace gpu-pool
2022-01-16T11:24:19.444Z INFO collector/collector.go:136 GPU status update successfully
2022-01-16T11:24:19.444Z INFO gpu-mount/server.go:81 Start mounting, Total: 1 Current: 1
2022-01-16T11:24:19.444Z INFO util/util.go:19 Start mount GPU: {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-7fe47fc1-b21e-e675-f6ff-edd91910f8a7","State":"GPU_ALLOCATED_STATE","PodName":"test-slave-pod-2f66ed","Namespace":"gpu-pool"} to Pod: test
2022-01-16T11:24:19.444Z INFO util/util.go:24 Pod :test container ID: e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3
2022-01-16T11:24:19.444Z INFO util/util.go:30 Successfully get cgroup path: /kubepods/burstable/podc815ee4b-bea0-44ed-8ef4-239e69516ba2/e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3 for Pod: test
2022-01-16T11:24:19.445Z ERROR cgroup/cgroup.go:140 Exec "echo 'c 195:0 rw' > /sys/fs/cgroup/devices/kubepods/burstable/podc815ee4b-bea0-44ed-8ef4-239e69516ba2/e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3/devices.allow" failed
2022-01-16T11:24:19.445Z ERROR cgroup/cgroup.go:141 Output: sh: 1: cannot create /sys/fs/cgroup/devices/kubepods/burstable/podc815ee4b-bea0-44ed-8ef4-239e69516ba2/e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3/devices.allow: Directory nonexistent
2022-01-16T11:24:19.445Z ERROR cgroup/cgroup.go:142 exit status 2
2022-01-16T11:24:19.445Z ERROR util/util.go:33 Add GPU {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-7fe47fc1-b21e-e675-f6ff-edd91910f8a7","State":"GPU_ALLOCATED_STATE","PodName":"test-slave-pod-2f66ed","Namespace":"gpu-pool"}failed
2022-01-16T11:24:19.445Z ERROR gpu-mount/server.go:84 Mount GPU: {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-7fe47fc1-b21e-e675-f6ff-edd91910f8a7","State":"GPU_ALLOCATED_STATE","PodName":"test-slave-pod-2f66ed","Namespace":"gpu-pool"} to Pod: test in Namespace: default failed
2022-01-16T11:24:19.445Z ERROR gpu-mount/server.go:85 exit status 2
Environment and version
In k8s v1.23, "/sys/fs/cgroup/devices/kubepods/burstable/pod[pod-id]/[container-id]/devices.allow" has changed to "/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod[pod-id]/docker-[container-id].scope/devices.allow".
So the current GPUMounter cannot work properly on v1.23.
Could it be updated to support k8s v1.23? Thanks.
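To make the path difference concrete, here is a small illustrative Go program (not the GPUMounter code; the helper name is made up, and a Docker runtime is assumed) that builds the devices.allow path for the container from the worker log above under each cgroup driver. Note that kubelet's systemd naming typically replaces dashes in the pod UID with underscores, which the bracketed placeholders above do not show:

```go
package main

import (
	"fmt"
	"strings"
)

// devicesCgroupPath sketches the container devices-cgroup path for the two
// cgroup drivers (Docker runtime assumed; helper name is illustrative).
func devicesCgroupPath(driver, qosClass, podUID, containerID string) string {
	if driver == "systemd" {
		// kubelet's systemd naming typically replaces '-' with '_' in the pod UID.
		uid := strings.ReplaceAll(podUID, "-", "_")
		return fmt.Sprintf(
			"/sys/fs/cgroup/devices/kubepods.slice/kubepods-%[1]s.slice/kubepods-%[1]s-pod%[2]s.slice/docker-%[3]s.scope/devices.allow",
			qosClass, uid, containerID)
	}
	// cgroupfs naming, which is what the current GPUMounter code writes to.
	return fmt.Sprintf("/sys/fs/cgroup/devices/kubepods/%s/pod%s/%s/devices.allow",
		qosClass, podUID, containerID)
}

func main() {
	podUID := "c815ee4b-bea0-44ed-8ef4-239e69516ba2"                                  // from the worker log above
	containerID := "e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3" // from the worker log above

	fmt.Println(devicesCgroupPath("cgroupfs", "burstable", podUID, containerID))
	fmt.Println(devicesCgroupPath("systemd", "burstable", podUID, containerID))
}
```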