Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregator by Default #2886

Closed
wants to merge 53 commits into from
Closed

Aggregator by Default #2886

wants to merge 53 commits into from

Conversation

michaelmdresser
Copy link
Contributor

@michaelmdresser michaelmdresser commented Dec 19, 2023

What does this PR change?

Does this PR relate to any other PRs?

How does this PR impact users? (This is the kind of thing that goes in release notes!)

  • Kubecost's new query backend "Aggregator" now serves API requests in all configurations. You will notice a new container running in the cost-analyzer Pod by default. If you are an Enterprise Thanos user, follow this guide TODO for necessary configuration changes. If you are an Enterprise Federated ETL user, follow this guide TODO for necessary configuration changes. If you are a Free user, no configuration change is required. This does not introduce a new container image.
  • Cloud Cost data gathering now runs in a separate container (within cost-analyzer) by default. If using an Enterprise Aggregator configuration, Cloud Cost will run in a one-replica Deployment. No configuration or value change is required. This does not introduce a new container image.
  • Removes the "HA" (and associated StatefulSet) mode for cost-analyzer. To deploy a highly-available backend, contact Kubecost Support about using the Enterprise Aggregator configuration with replicas > 1. If the cost-model's low-footprint metric building has an HA requirement in your environment, contact Kubecost Enterprise Support so we can learn more.
  • The federatedETL.primaryCluster value has been removed. In an Enterprise Federated ETL configuration all cost-model containers should behave as if they were secondaries. Aggregator now serves combined queries. If you are an Enterprise Federated ETL user, follow this guide TODO for necessary configuration changes.
  • The Federator component (enabled via federatedETL.federator.enabled) has been removed. In an Enterprise Federated ETL configuration all queries and data-combination is handled by Aggregator. If you are an Enterprise Federated ETL user, follow this guide TODO for necessary configuration changes.
  • The kubecostModel.warmSavingsCache value has been removed. If you need to disable the savings cache in Aggregator, set the env var SAVINGS_ENABLED to false.
  • Query Service (Replicas) has been removed. Aggregator now takes its place, serving queries faster with a lower resource footprint. If you are an Enterprise Federated ETL user, follow this guide TODO for necessary configuration changes.
  • Unsupported Helm values for an old metric storage method have been removed. No impact is expected.

Links to Issues or tickets this PR addresses or fixes

What risks are associated with merging this PR? What is required to fully test this PR?

How was this PR tested?

Have you made an update to documentation? If so, please provide the corresponding PR.

WIP

Copy link

gitguardian bot commented Dec 19, 2023

⚠️ GitGuardian has uncovered 4 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id Secret Commit Filename
7414 Google API Key fb28d67 cost-analyzer/templates/_helpers.tpl View secret
7414 Google API Key 3c2979e cost-analyzer/templates/query-service-deployment-template.yaml View secret
7414 Google API Key c9e8622 cost-analyzer/templates/_helpers.tpl View secret
7414 Google API Key c9e8622 cost-analyzer/templates/aggregator-statefulset.yaml View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Our GitHub checks need improvements? Share your feedbacks!

@chipzoller
Copy link
Collaborator

Please also remember to check/update the files required for CI processes:

@michaelmdresser michaelmdresser changed the title Aggregator by Default HOLD UNTIL AFTER HOLIDAY (for nightly stability) | Aggregator by Default Dec 19, 2023
@michaelmdresser
Copy link
Contributor Author

I'm still working out some things on the backend code which is causing chart tests to fail. While I'm working through testing, @chipzoller are you willing to give this PR a review while it is in a draft state? I'd like some feedback from your Helm expertise on things in here, particularly my use of the _helpers.tpl file to establish some guards and to make a common ContainerSpec which is being shared by multiple deployment methods. Are those bad practices? Is there a better way?

Copy link
Collaborator

@chipzoller chipzoller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These helper template changes seem extremely over-the-top. What problem are you looking to solve here that you think isn't appropriate for the templates?

@michaelmdresser
Copy link
Contributor Author

What problem are you looking to solve here that you think isn't appropriate for the templates?

There are two containers, Aggregator and Cloud Cost, which must be deployable either within the cost-analyzer Pod or under their own controllers (a StatefulSet and Deployment, respectively). The design is diagrammed out in this doc if you'd like to have a look.

I'm using the templates aggregator.containerTemplate and aggregator.cloudCost.containerTemplate defined in _helpers.tpl to centralize the configuration of these containers to simplify development and hopefully stave off drift. Is there a better way to do this?

@nik-kc
Copy link
Contributor

nik-kc commented Dec 20, 2023

Looking good so far! Fills me with joy to see such a high deletion to addition ratio in a PR.

While you're in here performing this overhaul, what are your thoughts on more thorough automated testing of the helm chart? I know we perform linting, but I feel it would be useful to have the ability to run automated validations on the chart with a wide array of values, in order to verify template correctness for a variety of install configurations.

@chipzoller
Copy link
Collaborator

@nik-kc, I have already set this up and it is being done on all PRs, including deployment and basic e2e smoke tests on real EKS and OpenShift clusters. See here.

@michaelmdresser, ok I understand the goals. I will give this a good look soon. I think the idea of creating a standard container template and stamping it out in the necessary Pod controller form factor is a good practice to reduce/eliminate drift. The challenge is ensuring that stays universal enough so controller-specific configs don't creep in.

Copy link
Contributor

@ameijer ameijer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so much tech debt paid, bravo. that dead postgres config search and destroy was 💯 . I don't think the reliance on helpers.tpl is too onerous, especially considering it buys us something approaching 1k lines of avoided code duplication?

cost-analyzer/values.yaml Show resolved Hide resolved
cost-analyzer/values.yaml Outdated Show resolved Hide resolved
cost-analyzer/values.yaml Show resolved Hide resolved
cost-analyzer/templates/_helpers.tpl Show resolved Hide resolved
@michaelmdresser michaelmdresser force-pushed the mmd/default-aggregator branch 2 times, most recently from c5b447b to 5cc809e Compare December 21, 2023 18:37
Copy link
Contributor

@ameijer ameijer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this LGTM at this point. Will leave it to @chipzoller to approve after his review since he said he wanted to put eyes on

Copy link
Collaborator

@chipzoller chipzoller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michael, I've given this a good look through (yet have not verified the end results myself...will wait for final review), and you've done some really nice work here. I know how much time this took. Thank you.

I have added some relatively minor comments, requests, and questions throughout, but here are some of the more general thoughts/requests:

  1. Also, assuming you have not done so, we'd appreciate it if you go through all the values files present in this repo and ensure they're in good shape based on these changes. I know you did that for the CI files (thanks!) but please check all others if not done.

  2. Looks like you're doing it, but because I haven't done a thorough survey, please just double-check and make sure you've removed all defined templates corresponding to excised sections of the chart (ex., query service and postgres).

  3. We need to include annotations and labels defined under global.podAnnotations and global.additionalLabels. We're terribly inconsistent about that, but "global" means "everywhere" and not selectively.

  4. Any new fields added into the values file we (and, specifically, users) would appreciate comments for every one explaining what they do. We will eventually use this to build docs from this values file (a later initiative).

  5. etlBucketConfigSecret is not defined in values.yaml yet is still referenced by templates in this PR implying it's still necessary. Kindly document.

cost-analyzer/values.yaml Outdated Show resolved Hide resolved
cost-analyzer/values.yaml Outdated Show resolved Hide resolved
cost-analyzer/templates/_helpers.tpl Show resolved Hide resolved
cost-analyzer/templates/_helpers.tpl Outdated Show resolved Hide resolved
cost-analyzer/templates/_helpers.tpl Outdated Show resolved Hide resolved
cost-analyzer/templates/_helpers.tpl Outdated Show resolved Hide resolved
cost-analyzer/templates/_helpers.tpl Outdated Show resolved Hide resolved
cost-analyzer/templates/_helpers.tpl Show resolved Hide resolved
@michaelmdresser michaelmdresser changed the title HOLD UNTIL AFTER HOLIDAY (for nightly stability) | Aggregator by Default Aggregator by Default Jan 2, 2024
@michaelmdresser michaelmdresser marked this pull request as ready for review January 2, 2024 17:29
QSR was an optimization attempt made before Aggregator for high-scale
environments. Aggregator's existence obsoletes QSR, so for this next
release of Kubecost which changes the default install config we should
remove QSR completely.

Signed-off-by: Michael Dresser <[email protected]>
With the advent of Aggregator as the only query handler, there is no
more need for the Federator component to combine federated bingen files.

Signed-off-by: Michael Dresser <[email protected]>
Postgres as a code path has been obsolete for years. It is not
supported.

Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Kubecost's Aggregator component is now the officially-supported method
for highly-available query serving. It has the ability to be split into
its own StatefulSet for managing multiple replica situations.

The cost-analyzer Pod is now dedicated almost wholly to building metric
data and thus no longer needs the highly-complex leader/follower
configuration.

Discussed with Bolt, Ajay, Alex

Signed-off-by: Michael Dresser <[email protected]>
Because of file-based data sharing in the default free config, running
Kubecost fully in-memory cannot be supported.

Signed-off-by: Michael Dresser <[email protected]>
Moved Aggregator container config to helpers so it can be shared by the
cost-analyzer Pod and the dedicated StatefulSet.

Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
Signed-off-by: Michael Dresser <[email protected]>
@michaelmdresser
Copy link
Contributor Author

@chipzoller Thank you for the thorough review, it's exactly what I was looking for. I've addressed as much as I could and left questions on the rest of the feedback. Please let me know if I've misinterpreted any of your ideas.

@chipzoller
Copy link
Collaborator

Chart build is failing across all instances.

@michaelmdresser
Copy link
Contributor Author

@chipzoller The failing tests are failing because https://github.com/kubecost/kubecost-cost-model/pull/2002 is required for this Helm change to work. Without specifically setting values, the Helm chart uses a v1.108 backend image which does not function in this configuration of the Helm chart, causing all tests to fail.

I believe it is still reviewable.

@chipzoller
Copy link
Collaborator

Michael, I think the best place to accept this PR is in the 2.0-helm-changes branch we have been working on since this is a 2.0 change. There have already been significant changes to the chart reflected in that branch. Can you rebase on top of it and merge there instead?

@michaelmdresser
Copy link
Contributor Author

Closing in favor of #2898 which targets 2.0-helm-changes. Cleaning that up before review.

@michaelmdresser michaelmdresser deleted the mmd/default-aggregator branch January 3, 2024 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants