| title | authors | reviewers | approvers | api-approvers | creation-date | last-updated | tracking-link |
|---|---|---|---|---|---|---|---|
| hide-container-mountpoints | | | | | 2021-01-18 | 2021-05-19 | |
The current implementations of Kubelet and CRI-O both use the top-level mount namespace for all container and Kubelet mountpoints. However, moving these container-specific mountpoints into a private namespace reduces systemd overhead with no difference in functionality.
systemd scans and re-scans mountpoints many times, adding significantly to the CPU utilization of systemd and to the overall overhead of the host OS running OpenShift. Changing systemd to reduce its scanning overhead is tracked in BZ 1819868, but we can work around this entirely within OpenShift: using a separate mount namespace for both CRI-O and Kubelet completely segregates all container-specific mounts away from any systemd or other host OS interaction.
As an OpenShift system administrator, I want systemd to consume less resources so that I can run more workloads on my system.
As an OpenShift system administrator, I want to disable the mount-namespace-hiding feature so that I can fall back to the previous system behavior.
As an OpenShift developer or support engineer, I want to inspect kubernetes-specific mountpoints as part of debugging issues.
- Mounts originating in CRI-O, OCI hooks, Kubelet, and container volumeMounts with `mountPropagation: Bidirectional` are no longer visible to systemd or the host OS
- Mounts originating in the host OS are still visible to CRI-O, OCI hooks, Kubelet, and container volumeMounts with `mountPropagation: HostToContainer` (or `Bidirectional`)
- Mounts originating in CRI-O, OCI hooks, Kubelet, and container volumeMounts with `mountPropagation: Bidirectional` are still visible to each other and to container volumeMounts with `mountPropagation: HostToContainer`
- Restarting either `crio.service` or `kubelet.service` does not result in the mount visibility getting out-of-sync
- Fixing systemd's mountpoint scanning overhead itself (tracked separately in BZ 1819868) is out of scope for this proposal
Generally speaking, the end-user experience should not be affected in any way by this proposal, as there are no outward API changes. There is some supportability difference, since anyone attempting to inspect the CRI-O or Kubelet mountpoints externally would need to be aware that these are now available in a different namespace than the default top-level systemd mount namespace.
For Tech Preview, the feature must be enabled by adding a MachineConfig which enables the new `kubens.service` systemd unit that drives the whole feature.
For GA, the feature would be enabled by default, and may be disabled by adding a MachineConfig that disables the `kubens.service` systemd unit.
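For example, whether the feature is active on a given node can be checked with standard systemd and util-linux tooling (a hypothetical spot-check; the unit name and pin location come from this proposal):

```bash
# Spot-check on a node (e.g. via 'oc debug node/<node>' and 'chroot /host'):
# is the namespace-pinning unit from this proposal enabled?
systemctl is-enabled kubens.service

# The pinned namespace itself shows up as a mount on the well-known pin file.
findmnt /run/kubens/mnt
```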
For any containers running in the system, there should be no observable difference in the behavior of the system.
For any administrative shells or processes running outside of containers on the host, the Kubernetes-specific mountpoints will no longer be visible by default. Entering the new mount namespace via the `kubensenter` script will make these mountpoints visible again.
No API changes required.
The existing APIs available within MachineConfig objects can be used to enable/disable the `kubens.service` unit, which in turn enables/disables this feature.
We will create a separate mount namespace and cause both CRI-O and Kubelet to launch within it, hiding their many mounts from systemd, by:
- Selecting a well-known location to pin a Kubernetes-specific mount namespace: `/run/kubens/mnt`
- Adding a mechanism to CRI-O to enter a pre-existing mount namespace if pinned in the well-known location.
  - This can be overridden from the command line or the environment (`$KUBENSMNT`) to opt out or to use a different mount namespace location.
- Adding a mechanism to Kubelet to enter a pre-existing mount namespace if pinned in the well-known location.
  - This can be overridden from the command line or the environment (`$KUBENSMNT`) to opt out or to use a different mount namespace location.
- Adding a systemd service called `kubens.service` which spawns a separate namespace and pins it to this well-known location.
  - This can be overridden from the environment (`$KUBENSMNT`) to use a different mount namespace location.
  - We don't want to create the namespace in `crio.service` or `kubelet.service`, since if either one restarts they would lose each other's namespaces.
  - Implemented in this way, disabling `kubens.service` (and restarting both Kubelet and CRI-O) fully disables this proposed feature, falling back to the current not-hidden mode of operation.
- A convenience wrapper, `kubensenter`, to enter this well-known mount namespace, for other tools and for administrative and support actions which need access to this namespace.
  - This will operate identically to `nsenter`, except that it defaults to entering this well-known namespace location (if present).
- An update to the debug container's MOTD that mentions the new need for `kubensenter`, in addition to the current chroot instructions, so it is clear how the debug shell can gain access to the hidden Kubernetes mountpoints.
- Both the new systemd service and the convenience wrapper will be installed as part of Kubelet.
With this proposal in place, both Kubelet and CRI-O create their mounts in the new namespace, which they share with each other but which is private from systemd, and the feature can easily be enabled or disabled by enabling or disabling a single systemd service.
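A rough shell sketch of the mechanism follows. It is illustrative only (the exact commands in `kubens.service` and the CRI-O/Kubelet integration may differ), but it shows how a namespace can be pinned at the well-known location and joined later:

```bash
# Illustrative sketch: pin a 2nd-level mount namespace at the well-known
# location and join it later. Not the literal kubens.service implementation.

PIN="${KUBENSMNT:-/run/kubens/mnt}"
PINDIR="$(dirname "$PIN")"

# A mount-namespace reference can only be bind-mounted onto a file that lives
# on a non-shared mount, so give the pin directory a private bind mount first.
mkdir -p "$PINDIR"
findmnt "$PINDIR" >/dev/null || {
    mount --bind "$PINDIR" "$PINDIR"
    mount --make-private "$PINDIR"
}

# Create a new mount namespace with 'slave' propagation (host mounts continue
# to propagate down into it) and persist it by bind-mounting onto $PIN.
touch "$PIN"
unshare --mount="$PIN" --propagation slave true

# CRI-O, Kubelet, or an administrator can later join the pinned namespace;
# this is essentially what the kubensenter wrapper does by default.
nsenter --mount="$PIN" mount | grep kubelet
```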
The current OpenShift and Kubernetes implementations guarantee 3 things about mountpoint visibility:
- Mounts originating in the host OS are visible to CRI-O, OCI hooks, Kubelet, and container volumeMounts with `mountPropagation: HostToContainer` (or `Bidirectional`)
- Mounts originating in CRI-O, OCI hooks, Kubelet, and container volumeMounts with `mountPropagation: Bidirectional` are visible to each other and to container volumeMounts with `mountPropagation: HostToContainer`
- Mounts originating in CRI-O, OCI hooks, Kubelet, and container volumeMounts with `mountPropagation: Bidirectional` are visible to the host OS
The first 2 guarantees are not changed by this proposal:
- The new mount namespace uses 'slave' propagation, so any mounts originating in the host OS top-level mount namespace are still propagated down into the new 2nd-level namespace, where they are visible to CRI-O, OCI hooks, Kubelet, and container volumeMounts with `mountPropagation: HostToContainer` (or `Bidirectional`), just as before.
- CRI-O, OCI hooks, Kubelet, and any containers created by CRI-O are all within the same 2nd-level namespace, so any mountpoints created by any of these entities are visible to all others within that same mount namespace. Additionally, any 3rd-level namespaces created below this point will have the same relationship with the 2nd-level namespace that they previously had with the higher-level namespace.
The 3rd guarantee is explicitly removed by this proposal.
This means that:
- Administrators who have connected to the host OS and want to inspect the mountpoints originating from CRI-O, OCI hooks, Kubelet, or containers will not be able to see them unless they explicitly enter the 2nd-level namespace.
- Any external or 3rd-party tools which run in the host mount namespace but
expect to see mountpoints created by CRI-O, OCI hooks, Kubelet, or containers
would need to be changed to enter the specific container namespace in order
to see them.
- This could mean that security scanning solutions that expect to see Kubernetes mounts will no longer see them. They can be modified to join the new mount namespace. We have verified that StackRox is not substantially affected by this change.
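To make the visibility change concrete, a hypothetical manual demonstration (assuming the namespace is pinned at `/run/kubens/mnt` with 'slave' propagation, as proposed) looks like this:

```bash
# Hypothetical demonstration of the changed visibility.
mkdir -p /var/tmp/host-demo /var/tmp/kube-demo

# Guarantee 1 (unchanged): a mount made in the host namespace propagates down
# and is visible inside the 2nd-level namespace.
mount -t tmpfs tmpfs /var/tmp/host-demo
nsenter --mount=/run/kubens/mnt findmnt /var/tmp/host-demo    # visible

# Guarantee 3 (removed): a mount made inside the 2nd-level namespace does NOT
# propagate back up, so it is invisible in the host namespace.
nsenter --mount=/run/kubens/mnt mount -t tmpfs tmpfs /var/tmp/kube-demo
findmnt /var/tmp/kube-demo                                    # not found on the host
nsenter --mount=/run/kubens/mnt findmnt /var/tmp/kube-demo    # visible inside
```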
We will mitigate this by adding a helper application to easily enter the right mount namespace, and by adding an easy mechanism to disable this feature and fall back to the original mode of operation.
If the namespace service restarts and then either CRI-O or Kubelet restarts, there will be a mismatch between the mount namespaces, and containers will start to fail. This could be mitigated by changing the namespace service to NOT clean up its pinned namespace, but instead idempotently re-use a previously-created namespace. However, given that the namespace service as implemented today is a systemd `oneshot` with no actual process that needs keeping alive, the risk of it terminating unexpectedly is very low.
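A sketch of what such idempotent pinning could look like (illustrative only; the shipped oneshot may not need this):

```bash
# Illustrative idempotent pinning: re-use an already-pinned namespace instead
# of replacing it, so a restart of the pinning service cannot leave CRI-O and
# Kubelet attached to an orphaned namespace.

PIN="${KUBENSMNT:-/run/kubens/mnt}"

if findmnt "$PIN" >/dev/null 2>&1; then
    echo "Mount namespace already pinned at $PIN; re-using it"
else
    touch "$PIN"
    unshare --mount="$PIN" --propagation slave true
fi
```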
Hiding the Kubernetes mounts from systemd may confuse administrators and support personnel who are used to seeing them.
- With the feature enabled:
  - Ensure that running 'mount' in the default mount namespace does not show any of the Kubernetes-specific mountpoints.
  - Ensure that entering the mount namespace and running 'mount' shows all of the Kubernetes-specific mountpoints.
  - All existing e2e tests pass at a similar rate.
- With the feature disabled:
  - Ensure that running 'mount' in the default mount namespace shows all of the Kubernetes-specific mountpoints.
  - All existing e2e tests pass at a similar rate.
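The manual checks above could be scripted roughly as follows (a sketch, assuming the pin location from this proposal and that Kubernetes-specific mounts contain `kubelet` or `containers` in their paths):

```bash
# Rough sketch of the manual checks with the feature enabled.

# No Kubernetes-specific mounts should be visible in the default namespace...
mount | grep -E 'kubelet|containers' && echo "FAIL: kube mounts visible on the host"

# ...but all of them should be visible inside the pinned namespace.
nsenter --mount=/run/kubens/mnt mount | grep -E 'kubelet|containers' >/dev/null \
    || echo "FAIL: kube mounts missing inside the namespace"
```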
The main graduation consideration for this feature is when it is enabled by default.
This feature is already in Dev Preview: the current MachineConfig-based proof-of-concept solution is part of the telco-specific DU profile installed by ZTP, and is also available here.
- Reimplement according to this proposal, but the feature is disabled by default
- Add a CI lane that runs all current e2e tests with this feature enabled
- Update the ZTP DU profile to use the new mechanism instead of the current MachineConfig-based proof-of-concept
- User-facing documentation:
- Feature overview (what it is and what happens when it's enabled)
- How to enable the feature
- How to inspect the container mounts when the feature is enabled
- How to set up a system service to enter the mount namespace
- Enable the feature by default
- Remove the CI lane that enables the feature, as it is enabled by default
- User-facing documentation changes:
- Mention the feature is on by default
- Change "how to enable" instructions to "how to disable"
Not applicable.
Not applicable.
Not applicable. The mount namespace is fully-contained and isolated within each node of a cluster. There is no impact of having the feature enabled on some nodes and disabled on others.
Not applicable: there are no API extensions. However, there are operational impacts of this change, detailed in the 'Risks and Mitigations' section above.
- If the Kubelet and CRI-O services end up in different mount namespaces from one another, containers started by CRI-O will not see mounts made by Kubelet, such as secrets or configmaps.
- If the namespace is not configured correctly to allow mounts from the OS to be shared into Kubelet or CRI-O, system mountpoints will not be visible to Kubelet, CRI-O, or running containers.
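A hypothetical diagnostic for the first failure mode is to compare the mount namespace of both daemons against the pinned one; when the feature is healthy, all three should reference the same namespace inode:

```bash
# Hypothetical diagnostic for a namespace mismatch between CRI-O and Kubelet.
# All three commands should print the same mnt:[...] value.
readlink /proc/"$(pidof crio)"/ns/mnt
readlink /proc/"$(pidof kubelet)"/ns/mnt
nsenter --mount=/run/kubens/mnt readlink /proc/self/ns/mnt
```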
When this feature is enabled, a shell on a node and the 'oc debug' container will not have visibility of the Kubernetes-specific mountpoints by default.

- To start a shell within the container mount namespace, execute the `kubensenter` script (see the debug-shell example after this list).
- To disable this feature, inject a MachineConfig that disables `kubens.service`:
  ```yaml
  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    labels:
      machineconfiguration.openshift.io/role: worker
    name: 99-custom-disable-kubens-worker
  spec:
    config:
      ignition:
        version: 2.2.0
      systemd:
        units:
          - enabled: false
            name: kubens.service
  ```
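For reference, a debug-shell session with the feature enabled might look roughly like this (illustrative; `kubensenter` behaves like `nsenter` as described above):

```bash
# Illustrative debug-shell session with the feature enabled.
oc debug node/<node-name>
chroot /host

mount | grep kubelet             # no output: kube mounts are hidden by default

kubensenter mount | grep kubelet # visible again inside the pinned namespace
kubensenter bash                 # or start a whole shell inside the namespace
```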
This proposal differs from the original proof-of-concept by:
- Moving the responsibility of entering the namespace to the tools that run in it (CRI-O and Kubelet), instead of a fragile systemd ExecStart-patching drop-in
- Building in the simple off/on switch of enabling/disabling a single systemd service, instead of having it tied to a monolithic MachineConfig object.
Original work here.
It has MC objects that create:
- The new `container-mount-namespace.service` service
- Override files for both `crio.service` and `kubelet.service` which add the appropriate systemd dependencies upon `container-mount-namespace.service` and wrap ExecStart inside of `nsenter`
- A convenience utility called `/usr/local/bin/nsenterCmns` which can be used by administrators or other software on the host to enter the new namespace.
It also passed e2e tests at a fairly high rate on a 4.6.4 cluster.
This was then productized as a dev preview for the Telco RAN installations here. It uses the same MachineConfig-based drop-in mechanism as the original proof-of-concept.
This is installed and enabled by the ZTP DU profile, and is used in production on many Telco customers' systems, both for SingleNode OpenShift and standard clusters, with no reported issues.
- Enhance systemd to support unsharing namespaces at the slice level, then put `crio.service` and `kubelet.service` in the same slice