Jhony/mcdaniel integrating Karpenter #41

Merged
merged 45 commits into from
Dec 5, 2023
Commits
421cc03
feat: add aws vpc and k8s reference resources for Karpenter
lpm0073 Mar 6, 2023
f30937e
docs: add infra-examples/aws/README.md
lpm0073 Mar 6, 2023
fdac8b7
docs: edit infra-examples/aws/README.md
lpm0073 Mar 6, 2023
e04a28f
docs: edit infra-examples/aws/README.md
lpm0073 Mar 6, 2023
0d8b6cf
docs: edit infra-examples/aws/k8s-cluster/README.rst
lpm0073 Mar 6, 2023
ec58ddd
refactor: input variables should match those of Terraform example for…
lpm0073 Mar 6, 2023
cbda630
docs: add update-kubeconfig to README.md
lpm0073 Mar 6, 2023
9330588
feat: add a randomized hash to AmazonEKS_EBS_CSI_DriverRole to avoid …
lpm0073 Jun 7, 2023
2782f08
feat: add an aws provider so that local kubeconfig is not explicitly …
lpm0073 Jun 7, 2023
bf161fb
feat: add an aws provider so that local kubeconfig is not explicitly …
lpm0073 Jun 7, 2023
97cffc3
style: removing per suggestion from Gabor
lpm0073 Jun 7, 2023
3e2fe75
feat: ensure that aws availability zone is actually available
lpm0073 Jun 7, 2023
8abad5d
refactor: add missing defaults and types
lpm0073 Jun 7, 2023
915bc43
feat: add multi-charts
lpm0073 Jun 7, 2023
83b43fc
refactor: add a complete block to define 'local'
lpm0073 Jun 7, 2023
d2f2fdc
chore: bump hashicorp/local to 2.4
lpm0073 Jun 7, 2023
9aed15b
chore: bump hashicorp/aws to 4.65
lpm0073 Jun 7, 2023
bb2cc89
chore: bump terraform-aws-modules/vpc/aws to 4.0
lpm0073 Jun 7, 2023
a9f74a9
style: remove a blank line
lpm0073 Jun 7, 2023
9682b8c
chore: bump terraform-aws-modules/eks/aws to 19.13
lpm0073 Jun 7, 2023
ad07f2a
chore: bump all add-ons to latest stable
lpm0073 Jun 7, 2023
8e118e5
style: remove a blank line
lpm0073 Jun 7, 2023
37e9809
chore: bump k8s defalt version to 1.27
lpm0073 Jun 7, 2023
2134602
chore: bump hashicorp/random to 3.5
lpm0073 Jun 7, 2023
fb003ce
chore: bump hashicorp/aws to 4.65
lpm0073 Jun 7, 2023
ec1ff53
chore: bump hashicorp/helm to 2.9
lpm0073 Jun 7, 2023
a986b12
chore: bump hashicorp/kubernetes to 2.20
lpm0073 Jun 7, 2023
63dfab7
feat: adding files to chart from old chart folder as they are
jfavellar90 Jul 10, 2023
2c54780
chore: adjusting chart changes
jfavellar90 Jul 11, 2023
6abf8d1
chore: addressing some changes from old PR
jfavellar90 Jul 11, 2023
3989c5a
feat: adjustments to get Terraform modules working properly
jfavellar90 Aug 30, 2023
feec55a
chore: relying on EKS terraform module to set add-on versions
jfavellar90 Sep 19, 2023
e245c7d
chore: removing Prometheus from Karpenter scope
jfavellar90 Sep 19, 2023
1e39785
feat: adding karpenter module to create necessary infra resources
jfavellar90 Oct 3, 2023
1a8da69
chore: adding an updated version of the Karpenter chart
jfavellar90 Oct 3, 2023
4d7bb52
chore: adding provisioner and nodetemplate resources for Karpenter
jfavellar90 Oct 3, 2023
de72fdf
chore: adding new hook for Karpenter extra resources
jfavellar90 Oct 3, 2023
e9744c5
chore: updating helm lock file
jfavellar90 Oct 3, 2023
9ebd9e4
chore: removing not needed opensearch Helm artifact
jfavellar90 Oct 3, 2023
4fa0ed3
chore: cleaning a bit the Helm chart values file
jfavellar90 Oct 3, 2023
69b312d
chore: adding relevant Karpenter outputs to K8S module
jfavellar90 Oct 3, 2023
d488a8c
fix: variable interpolation
jfavellar90 Oct 3, 2023
7649372
chore: adding docs on how to install Karpenter
jfavellar90 Oct 17, 2023
1b7ea1d
fix: not using problematic template provider
jfavellar90 Nov 14, 2023
92fe852
docs: fix typo
gabor-boros Dec 5, 2023
5 changes: 5 additions & 0 deletions .gitignore
@@ -6,4 +6,9 @@ infra-*/terraform.tfstate
infra-*/terraform.tfstate*
infra-*/.terraform*
infra-*/secrets.auto.tfvars
*kubeconfig
*terraform.tfstate*
*terraform.lock.*
.terraform
*secrets.auto.tfvars
my-notes
105 changes: 101 additions & 4 deletions README.md
@@ -25,7 +25,7 @@ In particular, this project aims to provide the following benefits to Open edX o
## Technology stack and architecture

1. At the base is a Kubernetes cluster, which you must provide (e.g. using Terraform to provision Amazon EKS).
* Any cloud provider such as AWS or Digital Ocean should work. There is an example Terraform setup in `infra-example` but it is just a starting point and not recommended for production use.
* Any cloud provider such as AWS or Digital Ocean should work. There are Terraform examples in the `infra-examples` folder, but they are just a starting point and are not recommended for production use.
2. On top of that, this project's helm chart will install the shared resources you need - an ingress controller, monitoring, database clusters, etc. The following are included but can be disabled/replaced if you prefer an alternative:
* Ingress controller: [ingress-nginx](https://kubernetes.github.io/ingress-nginx/)
* Automatic HTTPS cert provisioning: [cert-manager](https://cert-manager.io/)
@@ -89,6 +89,75 @@ still present in your cluster.
[pod-autoscaling plugin](https://github.com/eduNEXT/tutor-contrib-pod-autoscaling) enables the implementation of HPA and
VPA to start scaling an installation workloads. Variables for the plugin configuration are documented there.

#### Node-autoscaling with Karpenter in EKS Clusters.

This section provides a guide on how to install and configure [Karpenter](https://karpenter.sh/) in an EKS cluster. We'll use
the infrastructure examples included in this repo for this purpose.

> Prerequisites:
- An AWS account ID
- kubectl 1.27
- Terraform 1.5.x or higher
- Helm

1. Clone this repository and navigate to `./infra-examples/aws`. You'll find Terraform modules for `vpc` and `k8s-cluster`
resources. Create the `vpc` resources first, followed by the `k8s-cluster` resources. Make sure to have the target
AWS account ID available, and then execute the following commands in each folder:

```
terraform init
terraform plan
terraform apply -auto-approve
```

It will create an EKS cluster in the new VPC. Required Karpenter resources will also be created.

2. Once the `k8s-cluster` is created, run the `terraform output` command on that module and copy the following output variables:

- cluster_name
- karpenter_irsa_role_arn
- karpenter_instance_profile_name

These variables will be required in the next steps.
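
As a rough sketch of step 2, the three outputs could be captured into shell variables with `terraform output -raw`. The function names and the `TERRAFORM_BIN` override below are assumptions made for illustration and are not part of this repository; the output names are the ones listed above.

```shell
# Sketch only: read the three k8s-cluster outputs into environment variables.
# Run from infra-examples/aws/k8s-cluster after `terraform apply` has finished.
# TERRAFORM_BIN is a hypothetical override (handy for dry runs); it defaults to `terraform`.

read_tf_output() {
  # `terraform output -raw NAME` prints a single string output without quotes.
  "${TERRAFORM_BIN:-terraform}" output -raw "$1"
}

export_karpenter_outputs() {
  CLUSTER_NAME="$(read_tf_output cluster_name)" || return 1
  KARPENTER_IRSA_ROLE_ARN="$(read_tf_output karpenter_irsa_role_arn)" || return 1
  KARPENTER_INSTANCE_PROFILE_NAME="$(read_tf_output karpenter_instance_profile_name)" || return 1
  export CLUSTER_NAME KARPENTER_IRSA_ROLE_ARN KARPENTER_INSTANCE_PROFILE_NAME
}
```

After sourcing this snippet, calling `export_karpenter_outputs` leaves the values in `CLUSTER_NAME`, `KARPENTER_IRSA_ROLE_ARN`, and `KARPENTER_INSTANCE_PROFILE_NAME` for the current shell session.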

3. Karpenter is an optional dependency of the Harmony Chart. To include Karpenter in the Harmony Chart,
**you must** configure these variables in your `values.yaml` file:

- `karpenter.enabled`: true
- `karpenter.serviceAccount.annotations.eks\.amazonaws\.com/role-arn`: "<`karpenter_irsa_role_arn` value from module>"
- `karpenter.settings.aws.defaultInstanceProfile`: "<`karpenter_instance_profile_name` value from module>"
- `karpenter.settings.aws.clusterName`: "<`cluster_name` value from module>"

Find below an example of the Karpenter section in the `values.yaml` file:

```
karpenter:
  enabled: true
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "<karpenter_irsa_role_arn>"
  settings:
    aws:
      # -- Cluster name.
      clusterName: "<cluster_name>"
      # -- Cluster endpoint. If not set, will be discovered during startup (EKS only)
      # From version 0.25.0, Karpenter helm chart allows the discovery of the cluster endpoint. More details in
      # https://github.com/aws/karpenter/blob/main/website/content/en/docs/upgrade-guide.md#upgrading-to-v0250
      # clusterEndpoint: "https://XYZ.eks.amazonaws.com"
      # -- The default instance profile name to use when launching nodes
      defaultInstanceProfile: "<karpenter_instance_profile_name>"
```
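
If you prefer to generate this fragment rather than edit it by hand, a small shell function along these lines could render it from the three Terraform values. This is only a sketch: `render_karpenter_values` and the `karpenter-values.yaml` file name mentioned below are hypothetical and not part of this repository.

```shell
# Sketch only: print the Karpenter section of values.yaml.
# Usage: render_karpenter_values CLUSTER_NAME IRSA_ROLE_ARN INSTANCE_PROFILE_NAME
render_karpenter_values() {
  if [ "$#" -ne 3 ]; then
    echo "usage: render_karpenter_values CLUSTER_NAME IRSA_ROLE_ARN INSTANCE_PROFILE_NAME" >&2
    return 2
  fi
  # Unquoted heredoc delimiter, so $1..$3 are expanded into the YAML.
  cat <<EOF
karpenter:
  enabled: true
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "$2"
  settings:
    aws:
      clusterName: "$1"
      defaultInstanceProfile: "$3"
EOF
}
```

The output can be redirected to a file such as `karpenter-values.yaml` and merged into, or passed alongside, your main `values.yaml` when installing the chart.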

4. Now, install the Harmony Chart in the new EKS cluster using [these instructions](#usage-instructions). This will provide a
very basic Karpenter configuration with one [provisioner](https://karpenter.sh/docs/concepts/provisioners/) and one
[node template](https://karpenter.sh/docs/concepts/node-templates/). Please refer to the official documentation to
get further details.

> **NOTE:**
> This Karpenter installation does not support multiple provisioners or node templates for now.

5. To test Karpenter, you can proceed with the instructions included in the
[official documentation](https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/#first-use).
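
Before running the official first-use test, it may help to confirm that the controller and the two custom resources exist. The sketch below assumes the chart was installed in the `harmony` namespace and that the Karpenter pods carry the `app.kubernetes.io/name=karpenter` label; `check_karpenter` and the `KUBECTL_BIN` override are hypothetical names used only for illustration.

```shell
# Sketch only: list the Karpenter objects the Harmony Chart is expected to create.
# Usage: check_karpenter [NAMESPACE]   (NAMESPACE defaults to "harmony", an assumption)
# KUBECTL_BIN is a hypothetical override; it defaults to `kubectl`.
check_karpenter() {
  kubectl_bin="${KUBECTL_BIN:-kubectl}"
  ns="${1:-harmony}"
  # Controller pods (label selector is an assumption about the upstream chart).
  "$kubectl_bin" get pods --namespace "$ns" -l app.kubernetes.io/name=karpenter &&
    # Cluster-scoped custom resources created by the post-install hooks.
    "$kubectl_bin" get provisioners.karpenter.sh &&
    "$kubectl_bin" get awsnodetemplates.karpenter.k8s.aws
}
```

With the default values you would expect one provisioner and one node template, both named `default`.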


<br><br><br>
@@ -238,18 +307,46 @@ Just run `helm uninstall --namespace harmony harmony` to uninstall this.
### How to create a cluster for testing on DigitalOcean

If you use DigitalOcean, you can use Terraform to quickly spin up a cluster, try this out, then shut it down again.
Here's how. First, put the following into `infra-tests/secrets.auto.tfvars` including a valid DigitalOcean access token:
Here's how. First, put the following into `infra-examples/secrets.auto.tfvars` including a valid DigitalOcean access token:
```
cluster_name = "harmony-test"
do_token = "digital-ocean-token"
```
Then run:
```
cd infra-example
cd infra-examples/digitalocean
terraform init
terraform apply
cd ..
export KUBECONFIG=`pwd`/infra-example/kubeconfig
export KUBECONFIG=`pwd`/infra-examples/kubeconfig
```
Then follow steps 1-4 above. When you're done, run `terraform destroy` to clean
up everything.

## Appendix C: how to create a cluster for testing on AWS

Similarly, if you use AWS, you can use Terraform to spin up a cluster, try this out, then shut it down again.
Here's how. First, put the following into `infra-examples/aws/vpc/secrets.auto.tfvars` and `infra-examples/aws/k8s-cluster/secrets.auto.tfvars`:

```terraform
account_id = "012345678912"
aws_region = "us-east-1"
name = "tutor-multi-test"
```

Then run:

```bash
aws sts get-caller-identity # to verify that awscli is properly configured
cd infra-examples/aws/vpc
terraform init
terraform apply # run time is approximately 1 minute
cd ../k8s-cluster
terraform init
terraform apply # run time is approximately 30 minutes

# to configure kubectl
aws eks --region us-east-1 update-kubeconfig --name tutor-multi-test --alias tutor-multi-test
```

Then follow steps 1-4 above. When you're done, run `terraform destroy` first in the `k8s-cluster` module and then in the `vpc` module to clean up everything.
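
The teardown can be sketched as a small helper that destroys the modules in reverse order of creation, since the cluster lives inside the VPC. `destroy_aws_example` and the `TERRAFORM_BIN` override are hypothetical names for illustration; `-chdir` is Terraform's global option for running a command in another directory.

```shell
# Sketch only: destroy the AWS example in reverse order of creation.
# Run from the repository root. Usage: destroy_aws_example [BASE_DIR]
# TERRAFORM_BIN is a hypothetical override; it defaults to `terraform`.
destroy_aws_example() {
  base_dir="${1:-infra-examples/aws}"
  # The EKS cluster depends on the VPC, so it has to be removed first.
  for module in k8s-cluster vpc; do
    "${TERRAFORM_BIN:-terraform}" -chdir="$base_dir/$module" destroy || return 1
  done
}
```

Terraform still asks for confirmation in each module, which is a reasonable safety net for a destructive command.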
7 changes: 5 additions & 2 deletions charts/harmony-chart/Chart.lock
@@ -17,5 +17,8 @@ dependencies:
- name: opensearch
repository: https://opensearch-project.github.io/helm-charts
version: 2.13.3
digest: sha256:11b69b1ea771337b1e7cf8497ee342a25b095b86899b8cee716be8cc9f955559
generated: "2023-07-01T19:23:29.18815+03:00"
- name: karpenter
repository: oci://public.ecr.aws/karpenter
version: v0.29.2
digest: sha256:453b9f734e2d770948d3cbd36529d98da284b96de051581ea8d11a3c05e7a78e
generated: "2023-10-03T10:52:43.453442762-05:00"
7 changes: 6 additions & 1 deletion charts/harmony-chart/Chart.yaml
@@ -5,7 +5,7 @@ type: application
# This is the chart version. This version number should be incremented each time you make changes to the chart and its
# templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.2.0
version: 0.3.0
# This is the version number of the application being deployed. This version number should be incremented each time you
# make changes to the application. Versions are not expected to follow Semantic Versioning. They should reflect the
# version the application is using. It is recommended to use it with quotes.
@@ -47,3 +47,8 @@ dependencies:
version: "2.13.3"
condition: opensearch.enabled
repository: https://opensearch-project.github.io/helm-charts

- name: karpenter
version: "v0.29.2"
repository: oci://public.ecr.aws/karpenter
condition: karpenter.enabled
15 changes: 15 additions & 0 deletions charts/harmony-chart/templates/karpenter/node-template.yaml
@@ -0,0 +1,15 @@
{{- if .Values.karpenter.enabled -}}
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: {{ .Values.karpenter.nodeTemplate.name }}
  annotations:
    "helm.sh/hook": post-install,post-upgrade
spec:
  subnetSelector:
    karpenter.sh/discovery: {{ .Values.karpenter.settings.aws.clusterName }}
  securityGroupSelector:
    karpenter.sh/discovery: {{ .Values.karpenter.settings.aws.clusterName }}
  tags:
    karpenter.sh/discovery: {{ .Values.karpenter.settings.aws.clusterName }}
{{- end }}
23 changes: 23 additions & 0 deletions charts/harmony-chart/templates/karpenter/provisioner.yaml
@@ -0,0 +1,23 @@
{{- if .Values.karpenter.enabled -}}
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: {{ .Values.karpenter.provisioner.name }}
  annotations:
    "helm.sh/hook": post-install,post-upgrade
spec:
  {{- if .Values.karpenter.provisioner.spec.requirements }}
  requirements: {{ toYaml .Values.karpenter.provisioner.spec.requirements | nindent 4 }}
  {{- end }}
  {{- if .Values.karpenter.provisioner.spec.limits.resources }}
  limits:
    resources:
    {{- range $key, $value := .Values.karpenter.provisioner.spec.limits.resources }}
      {{ $key }}: {{ $value | quote }}
    {{- end }}
  {{- end }}
  providerRef:
    name: {{ .Values.karpenter.nodeTemplate.name }}
  ttlSecondsUntilExpired: {{ .Values.karpenter.provisioner.spec.ttlSecondsUntilExpired }}
  ttlSecondsAfterEmpty: {{ .Values.karpenter.provisioner.spec.ttlSecondsAfterEmpty }}
{{- end }}
53 changes: 53 additions & 0 deletions charts/harmony-chart/values.yaml
@@ -183,3 +183,56 @@ opensearch:
".opendistro-notebooks",
".opendistro-asynchronous-search-response*",
]

karpenter:
  # add Karpenter node management for AWS EKS clusters. See: https://karpenter.sh/
  enabled: false
  serviceAccount:
    name: "karpenter"
    annotations:
      eks.amazonaws.com/role-arn: ""
  settings:
    aws:
      # -- Cluster name.
      clusterName: ""
      # -- Cluster endpoint. If not set, will be discovered during startup (EKS only)
      # From version 0.25.0, Karpenter helm chart allows the discovery of the cluster endpoint. More details in
      # https://github.com/aws/karpenter/blob/main/website/content/en/docs/upgrade-guide.md#upgrading-to-v0250
      # clusterEndpoint: ""
      # -- The default instance profile name to use when launching nodes
      defaultInstanceProfile: ""
      # -- interruptionQueueName is disabled if not specified. Enabling interruption handling may
      # require additional permissions on the controller service account.
      interruptionQueueName: ""
  # ---------------------------------------------------------------------------
  # Provide sensible defaults for resource provisioning and lifecycle
  # ---------------------------------------------------------------------------
  # Requirements for the provisioner API.
  # More details in https://karpenter.sh/docs/concepts/provisioners/
  provisioner:
    name: "default"
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["t3.large", "t3.xlarge", "t3.2xlarge", "t2.xlarge", "t2.2xlarge"]
        # - key: kubernetes.io/arch
        #   operator: In
        #   values: ["amd64"]
      # The limits section controls the maximum amount of resources that the provisioner will manage.
      # More details in https://karpenter.sh/docs/concepts/provisioners/#speclimitsresources
      limits:
        resources:
          cpu: "200" # 50 nodes * 4 cpu
          memory: "800Gi" # 50 nodes * 16Gi
      # TTL in seconds. If nil, the feature is disabled, nodes will never terminate
      ttlSecondsUntilExpired: 2592000
      # TTL in seconds. If nil, the feature is disabled, nodes will never scale down
      # due to low utilization.
      ttlSecondsAfterEmpty: 30
  # Node template reference. More details in https://karpenter.sh/docs/concepts/node-templates/
  nodeTemplate:
    name: "default"
Binary file removed harmony-chart/charts/opensearch-2.11.4.tgz
Binary file not shown.
33 changes: 33 additions & 0 deletions infra-examples/aws/README.md
@@ -0,0 +1,33 @@
# Reference Architecture for AWS

This folder includes Terraform modules to create AWS reference resources that are preconfigured to support Open edX as well as [Karpenter](https://karpenter.sh/) for management of [AWS EC2 spot-priced](https://aws.amazon.com/ec2/spot/) compute nodes and enhanced pod bin packing.

## Virtual Private Cloud (VPC)

There are no explicit requirements for Karpenter within this VPC definition. However, there *are* several requirements for EKS which might vary from the VPC module defaults now or in the future. These include:

- defined sets of subnets for both private and public networks
- a NAT gateway
- enabling DNS host names
- custom resource tags for public and private subnets
- explicit assignments of AWS region and availability zones

See additional details here: [AWS VPC README](./vpc/README.rst)

## Elastic Kubernetes Service (EKS)

AWS EKS has grown more complex over time. This reference implementation is preconfigured as necessary to ensure that a.) you and others on your team can access the Kubernetes cluster both from the AWS Console as well as from kubectl, b.) it will work for an Open edX deployment, and c.) it will work with Karpenter. With these goals in mind, please note the following configuration details:

- requirements detailed in the VPC section above are explicitly passed in to this module as inputs
- cluster endpoints for private and public access are enabled
- IAM Roles for Service Accounts (IRSA) is enabled
- Key Management Service (KMS) is enabled, encrypting all Kubernetes Secrets
- cluster access via aws-auth/configMap is enabled
- a karpenter.sh/discovery resource tag is added to the EKS instance
- various AWS EKS add-ons that are required by Open edX and/or Karpenter and/or its supporting systems (metrics-server, vpa) are included
- additional cluster node security configuration is added to allow node-to-node and pod-to-pod communication using internal DNS resolution
- a managed node group is added containing custom labels, IAM roles, and resource tags; all of which are required by Karpenter
- additional resources are added that are required by the AWS EBS CSI Driver add-on, itself required by EKS since 1.22
- additional EC2 security groups are added to enable pod shell access from kubectl

See additional details here: [AWS EKS README](./k8s-cluster/README.rst)