
Update k8s max version #2903

Merged: 11 commits merged into main from update-k8s-max-version on Jan 16, 2025

Conversation

dcmcand
Contributor

@dcmcand dcmcand commented Jan 9, 2025

Reference Issues or PRs

closes #2870

What does this implement/fix?

Raises the maximum Kubernetes version to 1.31, which is the highest version that cloud providers currently support.

Upgrades the Terraform Kubernetes provider.

Changes the maximum Kubernetes version to use major and minor versions only, instead of major, minor, and patch.
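As a rough illustration of that format change (the provider block follows the AWS examples later in this thread, and the old patch-level value shown is only indicative), the maximum supported version is now expressed as a major.minor pair:

# previously the maximum was pinned down to a patch release (e.g. '1.29.x');
# it is now tracked as major.minor only
amazon_web_services:
  kubernetes_version: '1.31'
  ...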

Put an x in the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

How to test this PR?

Deploy to one of the cloud providers using Kubernetes version 1.29 (the previous maximum version).

Change your Kubernetes version to 1.30 and redeploy.

Change your Kubernetes version to 1.31 and redeploy.

Test functionality.
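A condensed sketch of that upgrade path, using the amazon_web_services block shown in the examples later in this thread:

# step 1: deploy with the previous maximum version
amazon_web_services:
  kubernetes_version: '1.29'
  ...
# step 2: bump kubernetes_version to '1.30' and redeploy
# step 3: bump kubernetes_version to '1.31', redeploy, and test functionality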

Any other comments?

@dcmcand dcmcand added the needs: review 👀, area: dependencies 📦, area: k8s ⎈, and area: tech-debt ⛓️ labels on Jan 9, 2025
Member

@marcelovilla marcelovilla left a comment


Thanks @dcmcand, changes look good! 🚀

You tested this on the three cloud providers and it worked on all of them, so I'm approving the PR.

However, I ran the cloud integration tests from this branch and AWS is failing, though I haven't checked why yet. It might not be related to these changes, as I saw it fail from another branch too, but I haven't checked whether the errors are the same. We can double-check before merging this PR.

@marcelovilla
Member

@dcmcand I'm running into different issues when testing on AWS.

Deploying from scratch

When deploying from scratch using:

amazon_web_services:
  kubernetes_version: '1.31'
  ...

I'm getting the following error during the deployment:

[tofu]: │ Error: Waiting for rollout to finish: 1 replicas wanted; 0 replicas Ready
[tofu]: │
[tofu]: │   with module.kubernetes-ingress.kubernetes_deployment.main,
[tofu]: │   on modules/kubernetes/ingress/main.tf line 190, in resource "kubernetes_deployment" "main":
[tofu]: │  190: resource "kubernetes_deployment" "main" {
[tofu]: │
[tofu]: ╵
╭─────────────────────── Traceback (most recent call last) ────────────────────────╮
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/subcommands/depl │
│ oy.py:92 in deploy                                                               │
│                                                                                  │
│   89 │   │   │   msg = "Digital Ocean support is currently being deprecated and  │
│   90 │   │   │   typer.confirm(msg)                                              │
│   91 │   │                                                                       │
│ ❱ 92 │   │   deploy_configuration(                                               │
│   93 │   │   │   config,                                                         │
│   94 │   │   │   stages,                                                         │
│   95 │   │   │   disable_prompt=disable_prompt,                                  │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/deploy.py:55 in  │
│ deploy_configuration                                                             │
│                                                                                  │
│   52 │   │   │   │   s: hookspecs.NebariStage = stage(                           │
│   53 │   │   │   │   │   output_directory=pathlib.Path.cwd(), config=config      │
│   54 │   │   │   │   )                                                           │
│ ❱ 55 │   │   │   │   stack.enter_context(s.deploy(stage_outputs, disable_prompt) │
│   56 │   │   │   │                                                               │
│   57 │   │   │   │   if not disable_checks:                                      │
│   58 │   │   │   │   │   s.check(stage_outputs, disable_prompt)                  │
│                                                                                  │
│ /nix/store/03q8gn91mj95y5bqbcl90hyvmpqpz738-python3-3.11.7/lib/python3.11/contex │
│ tlib.py:517 in enter_context                                                     │
│                                                                                  │
│   514 │   │   except AttributeError:                                             │
│   515 │   │   │   raise TypeError(f"'{cls.__module__}.{cls.__qualname__}' object │
│   516 │   │   │   │   │   │   │   f"not support the context manager protocol") f │
│ ❱ 517 │   │   result = _enter(cm)                                                │
│   518 │   │   self._push_cm_exit(cm, _exit)                                      │
│   519 │   │   return result                                                      │
│   520                                                                            │
│                                                                                  │
│ /nix/store/03q8gn91mj95y5bqbcl90hyvmpqpz738-python3-3.11.7/lib/python3.11/contex │
│ tlib.py:137 in __enter__                                                         │
│                                                                                  │
│   134 │   │   # they are only needed for recreation, which is not possible anymo │
│   135 │   │   del self.args, self.kwds, self.func                                │
│   136 │   │   try:                                                               │
│ ❱ 137 │   │   │   return next(self.gen)                                          │
│   138 │   │   except StopIteration:                                              │
│   139 │   │   │   raise RuntimeError("generator didn't yield") from None         │
│   140                                                                            │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/stages/base.py:2 │
│ 98 in deploy                                                                     │
│                                                                                  │
│   295 │   │   │   deploy_config["tofu_import"] = True                            │
│   296 │   │   │   deploy_config["state_imports"] = state_imports                 │
│   297 │   │                                                                      │
│ ❱ 298 │   │   self.set_outputs(stage_outputs, opentofu.deploy(**deploy_config))  │
│   299 │   │   self.post_deploy(stage_outputs, disable_prompt)                    │
│   300 │   │   yield                                                              │
│   301                                                                            │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/opentof │
│ u.py:71 in deploy                                                                │
│                                                                                  │
│    68 │   │   │   │   )                                                          │
│    69 │   │                                                                      │
│    70 │   │   if tofu_apply:                                                     │
│ ❱  71 │   │   │   apply(directory, var_files=[f.name])                           │
│    72 │   │                                                                      │
│    73 │   │   if tofu_destroy:                                                   │
│    74 │   │   │   destroy(directory, var_files=[f.name])                         │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/opentof │
│ u.py:152 in apply                                                                │
│                                                                                  │
│   149 │   │   + ["-var-file=" + _ for _ in var_files]                            │
│   150 │   )                                                                      │
│   151 │   with timer(logger, "tofu apply"):                                      │
│ ❱ 152 │   │   run_tofu_subprocess(command, cwd=directory, prefix="tofu")         │
│   153                                                                            │
│   154                                                                            │
│   155 def output(directory=None):                                                │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/opentof │
│ u.py:120 in run_tofu_subprocess                                                  │
│                                                                                  │
│   117 │   logger.info(f" tofu at {tofu_path}")                                   │
│   118 │   exit_code, output = run_subprocess_cmd([tofu_path] + processargs, **kw │
│   119 │   if exit_code != 0:                                                     │
│ ❱ 120 │   │   raise OpenTofuException("OpenTofu returned an error")              │
│   121 │   return output                                                          │
│   122                                                                            │
│   123                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────╯
OpenTofuException: OpenTofu returned an error

It seems to be the same error as in the AWS integration tests I ran from this branch.

Upgrading an existing cluster

I successfully deployed Nebari using:

amazon_web_services:
  kubernetes_version: '1.29'
  ...

I then changed it to:

amazon_web_services:
  kubernetes_version: '1.30'
  ...

and when re-deploying, I ran into:

[tofu]: Planning failed. OpenTofu encountered an error while generating this plan.
[tofu]:
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.ingress_route,
[tofu]: │   on modules/traefik_crds/main.tf line 1, in resource "kubernetes_manifest" "ingress_route":
[tofu]: │    1: resource "kubernetes_manifest" "ingress_route" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: failed to look up GVK
[tofu]: │ [apiextensions.k8s.io/v1, Kind=CustomResourceDefinition] among available
[tofu]: │ CRDs: Unauthorized
[tofu]: ╵
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.ingress_route_tcp,
[tofu]: │   on modules/traefik_crds/main.tf line 200, in resource "kubernetes_manifest" "ingress_route_tcp":
[tofu]: │  200: resource "kubernetes_manifest" "ingress_route_tcp" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: failed to look up GVK
[tofu]: │ [apiextensions.k8s.io/v1, Kind=CustomResourceDefinition] among available
[tofu]: │ CRDs: Unauthorized
[tofu]: ╵
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.ingress_route_udp,
[tofu]: │   on modules/traefik_crds/main.tf line 348, in resource "kubernetes_manifest" "ingress_route_udp":
[tofu]: │  348: resource "kubernetes_manifest" "ingress_route_udp" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: failed to look up GVK
[tofu]: │ [apiextensions.k8s.io/v1, Kind=CustomResourceDefinition] among available
[tofu]: │ CRDs: Unauthorized
[tofu]: ╵
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.middleware,
[tofu]: │   on modules/traefik_crds/main.tf line 423, in resource "kubernetes_manifest" "middleware":
[tofu]: │  423: resource "kubernetes_manifest" "middleware" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: failed to look up GVK
[tofu]: │ [apiextensions.k8s.io/v1, Kind=CustomResourceDefinition] among available
[tofu]: │ CRDs: Unauthorized
[tofu]: ╵
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.middlewaretcp,
[tofu]: │   on modules/traefik_crds/main.tf line 1073, in resource "kubernetes_manifest" "middlewaretcp":
[tofu]: │ 1073: resource "kubernetes_manifest" "middlewaretcp" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: failed to look up GVK
[tofu]: │ [apiextensions.k8s.io/v1, Kind=CustomResourceDefinition] among available
[tofu]: │ CRDs: Unauthorized
[tofu]: ╵
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.serverstransports,
[tofu]: │   on modules/traefik_crds/main.tf line 1132, in resource "kubernetes_manifest" "serverstransports":
[tofu]: │ 1132: resource "kubernetes_manifest" "serverstransports" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: cannot get OpenAPI foundry: failed
[tofu]: │ get OpenAPI spec: the server has asked for the client to provide
[tofu]: │ credentials
[tofu]: ╵
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.tls_option,
[tofu]: │   on modules/traefik_crds/main.tf line 1209, in resource "kubernetes_manifest" "tls_option":
[tofu]: │ 1209: resource "kubernetes_manifest" "tls_option" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: failed to look up GVK
[tofu]: │ [apiextensions.k8s.io/v1, Kind=CustomResourceDefinition] among available
[tofu]: │ CRDs: Unauthorized
[tofu]: ╵
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.tls_stores,
[tofu]: │   on modules/traefik_crds/main.tf line 1287, in resource "kubernetes_manifest" "tls_stores":
[tofu]: │ 1287: resource "kubernetes_manifest" "tls_stores" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: failed to look up GVK
[tofu]: │ [apiextensions.k8s.io/v1, Kind=CustomResourceDefinition] among available
[tofu]: │ CRDs: Unauthorized
[tofu]: ╵
[tofu]: ╷
[tofu]: │ Error: Plugin error
[tofu]: │
[tofu]: │   with module.traefik-crds.kubernetes_manifest.traefik_service,
[tofu]: │   on modules/traefik_crds/main.tf line 1334, in resource "kubernetes_manifest" "traefik_service":
[tofu]: │ 1334: resource "kubernetes_manifest" "traefik_service" {
[tofu]: │
[tofu]: │ The plugin returned an unexpected error from
[tofu]: │ plugin.(*GRPCProvider).UpgradeResourceState: rpc error: code = Unknown desc
[tofu]: │ = failed to determine resource type ID: failed to look up GVK
[tofu]: │ [apiextensions.k8s.io/v1, Kind=CustomResourceDefinition] among available
[tofu]: │ CRDs: Unauthorized
[tofu]: ╵
╭─────────────────────── Traceback (most recent call last) ────────────────────────╮
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/subcommands/depl │
│ oy.py:92 in deploy                                                               │
│                                                                                  │
│   89 │   │   │   msg = "Digital Ocean support is currently being deprecated and  │
│   90 │   │   │   typer.confirm(msg)                                              │
│   91 │   │                                                                       │
│ ❱ 92 │   │   deploy_configuration(                                               │
│   93 │   │   │   config,                                                         │
│   94 │   │   │   stages,                                                         │
│   95 │   │   │   disable_prompt=disable_prompt,                                  │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/deploy.py:55 in  │
│ deploy_configuration                                                             │
│                                                                                  │
│   52 │   │   │   │   s: hookspecs.NebariStage = stage(                           │
│   53 │   │   │   │   │   output_directory=pathlib.Path.cwd(), config=config      │
│   54 │   │   │   │   )                                                           │
│ ❱ 55 │   │   │   │   stack.enter_context(s.deploy(stage_outputs, disable_prompt) │
│   56 │   │   │   │                                                               │
│   57 │   │   │   │   if not disable_checks:                                      │
│   58 │   │   │   │   │   s.check(stage_outputs, disable_prompt)                  │
│                                                                                  │
│ /nix/store/03q8gn91mj95y5bqbcl90hyvmpqpz738-python3-3.11.7/lib/python3.11/contex │
│ tlib.py:517 in enter_context                                                     │
│                                                                                  │
│   514 │   │   except AttributeError:                                             │
│   515 │   │   │   raise TypeError(f"'{cls.__module__}.{cls.__qualname__}' object │
│   516 │   │   │   │   │   │   │   f"not support the context manager protocol") f │
│ ❱ 517 │   │   result = _enter(cm)                                                │
│   518 │   │   self._push_cm_exit(cm, _exit)                                      │
│   519 │   │   return result                                                      │
│   520                                                                            │
│                                                                                  │
│ /nix/store/03q8gn91mj95y5bqbcl90hyvmpqpz738-python3-3.11.7/lib/python3.11/contex │
│ tlib.py:137 in __enter__                                                         │
│                                                                                  │
│   134 │   │   # they are only needed for recreation, which is not possible anymo │
│   135 │   │   del self.args, self.kwds, self.func                                │
│   136 │   │   try:                                                               │
│ ❱ 137 │   │   │   return next(self.gen)                                          │
│   138 │   │   except StopIteration:                                              │
│   139 │   │   │   raise RuntimeError("generator didn't yield") from None         │
│   140                                                                            │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/stages/base.py:2 │
│ 98 in deploy                                                                     │
│                                                                                  │
│   295 │   │   │   deploy_config["tofu_import"] = True                            │
│   296 │   │   │   deploy_config["state_imports"] = state_imports                 │
│   297 │   │                                                                      │
│ ❱ 298 │   │   self.set_outputs(stage_outputs, opentofu.deploy(**deploy_config))  │
│   299 │   │   self.post_deploy(stage_outputs, disable_prompt)                    │
│   300 │   │   yield                                                              │
│   301                                                                            │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/opentof │
│ u.py:71 in deploy                                                                │
│                                                                                  │
│    68 │   │   │   │   )                                                          │
│    69 │   │                                                                      │
│    70 │   │   if tofu_apply:                                                     │
│ ❱  71 │   │   │   apply(directory, var_files=[f.name])                           │
│    72 │   │                                                                      │
│    73 │   │   if tofu_destroy:                                                   │
│    74 │   │   │   destroy(directory, var_files=[f.name])                         │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/opentof │
│ u.py:152 in apply                                                                │
│                                                                                  │
│   149 │   │   + ["-var-file=" + _ for _ in var_files]                            │
│   150 │   )                                                                      │
│   151 │   with timer(logger, "tofu apply"):                                      │
│ ❱ 152 │   │   run_tofu_subprocess(command, cwd=directory, prefix="tofu")         │
│   153                                                                            │
│   154                                                                            │
│   155 def output(directory=None):                                                │
│                                                                                  │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/opentof │
│ u.py:120 in run_tofu_subprocess                                                  │
│                                                                                  │
│   117 │   logger.info(f" tofu at {tofu_path}")                                   │
│   118 │   exit_code, output = run_subprocess_cmd([tofu_path] + processargs, **kw │
│   119 │   if exit_code != 0:                                                     │
│ ❱ 120 │   │   raise OpenTofuException("OpenTofu returned an error")              │
│   121 │   return output                                                          │
│   122                                                                            │
│   123                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────╯
OpenTofuException: OpenTofu returned an error

Regardless, I went to the EKS console and upgraded the node groups manually, as outlined in https://www.nebari.dev/docs/how-tos/kubernetes-version-upgrade/

Then, I updated the config to:

amazon_web_services:
  kubernetes_version: '1.31'
  ...

and when re-deploying, I ran into:

Attempt 1 failed connecting to keycloak master realm
Attempt 2 failed connecting to keycloak master realm
Attempt 3 failed connecting to keycloak master realm
Attempt 4 failed connecting to keycloak master realm
Attempt 5 failed connecting to keycloak master realm
Attempt 6 failed connecting to keycloak master realm
Attempt 7 failed connecting to keycloak master realm
Attempt 8 failed connecting to keycloak master realm
Attempt 9 failed connecting to keycloak master realm
Attempt 10 failed connecting to keycloak master realm
ERROR: unable to connect to keycloak master realm at url=https://mvilla.quansight.dev/auth/ with root credentials

I had to manually upgrade the node groups again, and only after that was Keycloak accessible.

@dcmcand
Contributor Author

dcmcand commented Jan 16, 2025

For some reason the PV for the ingress cert isn't being created.

@dcmcand
Contributor Author

dcmcand commented Jan 16, 2025

Starting with 1.30, Amazon EKS no longer includes the default annotation on the gp2 StorageClass resource applied to newly created clusters. This has no impact if you are referencing this storage class by name. You must take action if you were relying on having a default StorageClass in the cluster. You should reference the StorageClass by the name gp2. Alternatively, you can deploy the Amazon EBS recommended default storage class by setting the defaultStorageClass.enabled parameter to true when installing v1.31.0 or later of the aws-ebs-csi-driver add-on.

The minimum required IAM policy for the Amazon EKS cluster IAM role has changed. The action ec2:DescribeAvailabilityZones is required. For more information, see Amazon EKS cluster IAM role.

https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions-standard.html
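For context, and not necessarily the fix adopted in this PR, the first option above (referencing gp2 by name) would look roughly like the following for a claim such as the ingress certificate volume mentioned earlier; the claim name and size are hypothetical:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ingress-cert            # hypothetical name, for illustration only
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2         # explicit, since gp2 is no longer the default on EKS 1.30+
  resources:
    requests:
      storage: 1Gi              # hypothetical size

Alternatively, as the AWS note above suggests, re-enabling a cluster default via the aws-ebs-csi-driver add-on's defaultStorageClass.enabled parameter avoids having to touch individual claims.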

@dcmcand
Contributor Author

dcmcand commented Jan 16, 2025

@marcelovilla the AWS deploy worked with 1.31 for me locally, and the deployment test now passes: https://github.com/nebari-dev/nebari/actions/runs/12810668708/job/35718247163. Do you want to retest before I merge?

@marcelovilla
Member

Nice find @dcmcand! I'm fine with merging it now.

@dcmcand dcmcand merged commit 06f2030 into main Jan 16, 2025
27 checks passed
@dcmcand dcmcand deleted the update-k8s-max-version branch January 16, 2025 15:21
Labels
area: dependencies 📦 · area: k8s ⎈ · area: tech-debt ⛓️ · needs: review 👀
Projects
Status: Done 💪🏾
Development

Successfully merging this pull request may close these issues.

[ENH] - update supported versions of K8s
2 participants