-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for allocating GPUs in Passthrough-Mode #183
base: main
Are you sure you want to change the base?
Conversation
/cc @klueska |
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
0fc7174
to
0978db9
Compare
type GpuDriver string | ||
|
||
const ( | ||
NvidiaDriver GpuDriver = "nvidia" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another name for this driver that isn't nvidia
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. The driver is simply called nvidia
} | ||
|
||
// Validate ensures that GpuDriverConfig has a valid set of values. | ||
func (c *GpuDriverConfig) Validate() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is c == nil
valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. gpuConfig.Normalize()
would always ensure GpuDriverConfig
is set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question should probably have been: "Should we add an explict nil
check and retunr an error if this is the case?"
case VfioPciDriver: | ||
break | ||
default: | ||
return fmt.Errorf("invalid driver specified in gpu driver configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include the invalid value here?
if c.DriverConfig.Driver != NvidiaDriver { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does adding a SupportsSharing()
function to the GpuDriver
type make this clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is the expectation that we always call Validate()
after Normalize()
so that we check whether there is no sharing actually configured? Does it make sense to remove this check here entirely so that we only need to update logic in one place with regards to the driver type and sharing interaction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to remove this check here entirely so that we only need to update logic in one place with regards to the driver type and sharing interaction?
Removing the check here won't work. There's 2 possibilities here:
- On NVIDIA driver, normalize sharing by initializing it if its nil, and then set config based on strategy.
- On VFIO-PCI driver, don't initialize it.
If we allow normalization of c.Sharing
in case (2) by relaxing the check, then we're going to fail subsequent validation.
allocatable: allocatable, | ||
config: config, | ||
nvdevlib: nvdevlib, | ||
checkpointManager: checkpointManager, | ||
} | ||
|
||
// Initialize the vfio-pci driver manager. | ||
vfioPciManager.Init() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we init this here? Where are other managers initialised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tsManager/mpsManager don't have any "initializations". But checkpointManager is initialized underneath: L117-L131.
What's the concern here?
switch castConfig := config.(type) { | ||
case *configapi.GpuConfig: | ||
return s.applySharingConfig(ctx, castConfig.Sharing, claim, results) | ||
configState.GpuConfig = castConfig | ||
err = s.applyGpuConfig(ctx, castConfig, claim, results, &configState) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not still return if err != nil
in each of the branches? alternatively, does it make sense to have s.applyGpuConfig
return (*DeviceConfigState, error)
like the other apply functions and keep the other function unchanged?
if config.Sharing != nil { | ||
err := s.applySharingConfig(ctx, config.Sharing, claim, results, configState) | ||
if err != nil { | ||
return err | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: We unconditionally call applySharingConfig
for mig devices. Since we now have the case where config.Sharing
can be nil, could we either update applySharingConfig
to ignore nil configs, or introduce a Disabled
config so that we can set that for cases where we don't expect sharing to be allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For MIG devices, config.Sharing
can never be nil right - MIGs would always support Sharing
so Normalize
would initialize that config.
} | ||
|
||
func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) { | ||
func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, I think we should keep this signature the same as before. I would rather avoid passing an argument by reference for modification.
return nil | ||
} | ||
|
||
func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error { | |
func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) { |
cmd/nvidia-dra-plugin/nvlib.go
Outdated
@@ -199,6 +199,10 @@ func (l deviceLib) enumerateImexChannels(config *Config) (AllocatableDevices, er | |||
return devices, nil | |||
} | |||
|
|||
func getPciAddressFromNvmlPciInfo(info nvml.PciInfo) string { | |||
return fmt.Sprintf("%04x:%02x:%02x.0", info.Domain, info.Bus, info.Device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be the same if its returning the address in the format 0000:0a:00.0
. Wasn't aware of that implementation. I'll switch to that
cmd/nvidia-dra-plugin/nvlib.go
Outdated
if ret != nvml.SUCCESS { | ||
return nil, fmt.Errorf("error getting PCI info for device %d: %w", index, err) | ||
} | ||
pciAddress := getPciAddressFromNvmlPciInfo(pciInfo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we replace this with device.GetBusID()
?
cmd/nvidia-dra-plugin/vfio-device.go
Outdated
} | ||
|
||
func (vm *VfioPciManager) loadVfioPciModule() error { | ||
cmd := exec.Command("modprobe", vm.vfioPciModule) //nolint:gosec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we chroot
to the host root mounted into the container instead of expecting /sys/
to be writable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elezar Do you mean something like: exec.Command("nsenter", "--mount=/proc/1/ns/mnt", "--", "modprobe", vm.vfioPciModule)
?
} | ||
|
||
func (vm *VfioPciManager) getIommuGroupForVfioPciDevice(pciAddress string) string { | ||
iommuGroup, err := os.Readlink(filepath.Join(vm.pciDevicesRoot, pciAddress, "iommu_group")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have similar functionality in go-nvlib
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we have it tracked in NvidiaPciDevice
here: https://github.com/NVIDIA/go-nvlib/blob/main/pkg/nvpci/nvpci.go. I'm guessing nvpci
is the sysfs way of discovering GPUs. But nvlib
doesn't have it.
Btw one of the implicit assumptions I havent captured here is that iommu
needs to be enabled in the kernel for us to be able to do gpu-passthrough. In that case we would hit the err in L171 and return empty iommu group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reuse the NvidiaPciDevice
from go-nvlib/pkg/nvpci
? It seems as if it is set here https://github.com/NVIDIA/go-nvlib/blob/1482a942fb6d52a023cff85c2d76ed4127af661a/pkg/nvpci/nvpci.go#L292-L304.
- name: sysfs | ||
mountPath: /sys | ||
readOnly: false | ||
- name: dev-vfio | ||
mountPath: /dev/vfio | ||
readOnly: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be optionally mounted instead?
# Usage: ./bind_to_driver.sh <ssss:bb:dd.f> <driver> | ||
# Bind the GPU specified by the PCI_ID=ssss:bb:dd.f to the given driver. | ||
|
||
bind_to_driver() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not implement this in go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Atleast for unbind
, in case unbindLock
is in use, we'd need the same process that acquires the unbindLock
to be the one doing the driver unbind
. I want to scope that unbindLock
to just the task that's doing the unbind
rather than the whole kubeletplugin
binary.
(FYI unbindLock
is in play when the NVIDIA Grid-vGPU driver is used on the node)
bind
is a script just to be consistent with unbind
.
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
0978db9
to
49bda9f
Compare
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
49bda9f
to
aa3c2bf
Compare
@@ -66,3 +66,5 @@ nodes: | |||
# on the kind nodes. | |||
- hostPath: /usr/bin/nvidia-ctk | |||
containerPath: /usr/bin/nvidia-ctk | |||
- hostPath: /sys | |||
containerPath: /sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline
- name: dev-vfio | ||
mountPath: /dev/vfio | ||
readOnly: false | ||
{{- end -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline
fi | ||
} | ||
|
||
bind_to_driver "$1" "$2" || exit 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline.
return 0 | ||
} | ||
|
||
unbind_from_driver "$1" || exit 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline
This PR introduces a new DeviceClass
vfiopci.nvidia.com
that will allocate a full GPU in PassThrough-mode (PT) by binding the GPU to vfio-pci driver.The primary usecase for this new DeviceClass are Kata containers and KubeVirt VMs that require the gpu to be in PT-mode and made available to a pod which then would spin up a guest with the gpu.
Note: Regular pod workloads will not benefit from this DeviceClass and shouldn't try to use this.
As part of this change, I've introduced some (but not all) modifications to the kind cluster config that are needed for this DeviceClass to work. Host-level modifications needed:
Validated on a kind cluster with a Quattro P2000 GPU:
Open items: