eks-prow-build-cluster: Reconsider instance type selection #5066
Comments
/sig k8s-infra |
I'm going to transfer this issue to k/k8s.io as other issues related to this cluster are already there. |
/assign @xmudrii @pkprzekwas |
One thing to consider: because Kubernetes doesn't have I/O or IOPS isolation, sizing really large nodes changes the CPU : I/O ratio (though this will not be 1:1 between GCP and AWS anyhow). Really large nodes allow high core count jobs OR bin packing more jobs per node, but the latter can cause issues by over-packing relative to I/O throughput. This is less of an issue today than when we ran bazel builds widely, but it can still cause performance problems. The existing size is semi-arbitrary and may be somewhat GCP-specific, but right now tests that are likely to be I/O heavy sometimes reserve that I/O by reserving ~all of the CPU at our current node sizes. |
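A rough sketch of the bin-packing concern above, in Go. The `node` type, the helper names, and the disk throughput figures are all hypothetical and chosen only for illustration; they are not measurements from either cluster. It assumes jobs are packed purely by CPU request and that disk throughput is shared evenly among the jobs on a node.

```go
package main

import "fmt"

// node describes a build node. diskMBps is a made-up throughput figure.
type node struct {
	name     string
	cpus     int
	diskMBps float64
}

// jobsPerNode returns how many jobs fit when packing by CPU request only.
func jobsPerNode(nodeCPUs, jobCPUs int) int {
	if jobCPUs <= 0 {
		return 0
	}
	return nodeCPUs / jobCPUs
}

// perJobDiskMBps returns the disk throughput each job gets if the node is
// fully packed and throughput is split evenly (no I/O isolation).
func perJobDiskMBps(n node, jobCPUs int) float64 {
	jobs := jobsPerNode(n.cpus, jobCPUs)
	if jobs == 0 {
		return 0
	}
	return n.diskMBps / float64(jobs)
}

func main() {
	// Hypothetical nodes: doubling the cores does not double the disk throughput.
	nodes := []node{
		{name: "small", cpus: 8, diskMBps: 300},
		{name: "large", cpus: 16, diskMBps: 400},
	}
	for _, n := range nodes {
		// A 2-CPU job shares the disk; a job reserving ~all CPUs gets it alone.
		fmt.Printf("%s: 2-CPU jobs=%d (%.1f MB/s each), %d-CPU job gets %.1f MB/s\n",
			n.name, jobsPerNode(n.cpus, 2), perJobDiskMBps(n, 2),
			n.cpus-1, perJobDiskMBps(n, n.cpus-1))
	}
}
```

Under these assumptions, a job that requests nearly all of a node's CPUs is the only one scheduled there, which is how reserving CPU ends up reserving I/O.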
xref #4686 |
To add to what @BenTheElder said: we already had issues with GOMAXPROCS for unit tests. We "migrated" 5 jobs so far and one was affected (potentially one more). To avoid such issues, we might want to have instances close to what we have on GCP. We can't have a 1:1 mapping, but we can try using similar instances based on what AWS offers. Not having to deal with things such as GOMAXPROCS will make the migration smoother and save us a lot of time debugging such issues. |
@dims Thanks for driving this forward. But just to note, this fixes it only for k/k; other subprojects might be affected by it and would need to apply a similar patch. |
Go is expected to solve GOMAXPROCS upstream: detecting this in the stdlib has been accepted. In the meantime, GOMAXPROCS can also be set in CI. As-is, jobs already have this wrong, and we should resolve that independently of selecting node size. |
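For illustration, a minimal Go sketch of pinning GOMAXPROCS to a job's CPU limit rather than the host's core count. The helper `procsFromMilliCPU` and the 4000m limit are hypothetical; how a job learns its limit (for example from its pod spec) is left out. Setting the `GOMAXPROCS` environment variable on the job achieves the same result without code changes.

```go
package main

import (
	"fmt"
	"runtime"
)

// procsFromMilliCPU converts a CPU limit in millicores into a GOMAXPROCS
// value: rounded up to a whole CPU, at least 1, and never above hostCPUs.
// A non-positive limit means "no limit", so hostCPUs is returned.
func procsFromMilliCPU(milliCPU, hostCPUs int) int {
	if milliCPU <= 0 {
		return hostCPUs
	}
	procs := (milliCPU + 999) / 1000 // round up to whole CPUs
	if procs > hostCPUs {
		procs = hostCPUs
	}
	if procs < 1 {
		procs = 1
	}
	return procs
}

func main() {
	// Hypothetical limit: a job with a 4-CPU limit on a larger node.
	const limitMilliCPU = 4000

	procs := procsFromMilliCPU(limitMilliCPU, runtime.NumCPU())
	prev := runtime.GOMAXPROCS(procs) // returns the previous setting
	fmt.Printf("host CPUs=%d, GOMAXPROCS %d -> %d\n", runtime.NumCPU(), prev, procs)
}
```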
+1 for setting this on existing jobs. I have a secret hope that it might generally reduce flakiness a bit. |
Maybe try some bare metal node like an m5.2xlarge or m6g.2xlarge? |
@TerryHowe We need to use memory optimized instances because our jobs tend to use a lot of memory. |
Update: we decided to go with a 3 step phased approach:
Note: the order of phases might get changed. Each phase should last at least 24 hours to ensure that tests are stable. I just started the first phase and I think we should leave it on until Wednesday morning CEST. |
Update: we tried r6id.2xlarge but it seems that 8 vCPUs are not enough:
I'm trying |
/retitle eks-prow-build-cluster: Reconsider instance type selection |
@xmudrii are we still doing this? Do we want to use an instance type with fewer resources? |
Blocked by #5168 |
@ameukam Yes, let's figure this out after the 1.32 release. |
What should be cleaned up or changed:
Some changes were made to the EKS cluster to attempt to resolve an issue with test flakes. These changes also increased the per-node cost. We should consider reverting these changes to reduce cost.
a) Changing to an instance type without instance storage.
b) Changing back to an AMD CPU type
c) Changing to a roughly 8 CPU / 64GB type to more closely match the existing GCP cluster nodes
The cluster currently uses an r5d.4xlarge (16 CPU / 128 GB) with an on-demand cost of 1.152 per hour
An r5a.4xlarge (16 CPU / 128 GB) has an on-demand cost of 0.904 per hour
An r5a.2xlarge (8 CPU / 64 GB) has an on-demand cost of 0.45 per hour
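To compare the three options on equal footing, here is a small Go sketch that normalizes the on-demand prices quoted above to cost per vCPU-hour. The `instance` type and helper names are illustrative only; the prices are the ones listed in this issue and may differ by region or over time.

```go
package main

import "fmt"

// instance holds the figures quoted in this issue.
type instance struct {
	name   string
	vcpus  int
	memGB  int
	hourly float64 // on-demand price per hour, as listed above
}

// perVCPUHour returns the hourly price divided by the vCPU count.
func perVCPUHour(i instance) float64 {
	return i.hourly / float64(i.vcpus)
}

// savingsPercent returns how much cheaper candidate is than current,
// per vCPU-hour, as a percentage of current's per-vCPU price.
func savingsPercent(current, candidate instance) float64 {
	cur := perVCPUHour(current)
	return (cur - perVCPUHour(candidate)) / cur * 100
}

func main() {
	current := instance{"r5d.4xlarge", 16, 128, 1.152}
	candidates := []instance{
		{"r5a.4xlarge", 16, 128, 0.904},
		{"r5a.2xlarge", 8, 64, 0.45},
	}
	fmt.Printf("%s: %.5f per vCPU-hour\n", current.name, perVCPUHour(current))
	for _, c := range candidates {
		fmt.Printf("%s: %.5f per vCPU-hour (%.1f%% cheaper per vCPU)\n",
			c.name, perVCPUHour(c), savingsPercent(current, c))
	}
}
```

Per vCPU, the two r5a sizes come out nearly identical (0.0565 vs 0.05625), so most of the saving comes from dropping instance storage and switching CPU type; the smaller size mainly affects packing rather than unit price.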
Provide any links for context: