[Bug]: Increased and erratic memory in the Nginx Pods leading to OOM Kills - appears to be introduced by v3.7.x #6860

Open
MarkTopping opened this issue Nov 25, 2024 · 13 comments
Labels
bug (An issue reporting a potential bug), in review (Gathering information)

Comments

@MarkTopping

Version

3.7.0

What Kubernetes platforms are you running on?

AKS Azure

Steps to reproduce

I believe that changes in version 3.7.0 or 3.7.1 have introduced a memory consumption issue.

We had to roll back a version bump from v3.6.2 to v3.7.1 today after our Nginx IC Pods all crashed due to OOM Kills. To make matters worse, due to Bug 4604 the Pods then failed to restart without manual intervention, which caused an outage.

Our subsequent investigation after our outage revealed that the memory consumption on the Nginx Pods changed quite dramatically after the release as shown by the following 2 charts.

1st Example
In our least used environment we didn't incur any OOM Kills, but today's investigation revealed that memory usage has both increased and become more 'spiky' since we performed the upgrade:

Image

2nd Example
This screenshot shows the IC Pods' memory consumption after a release of v3.7.1 into a more busy environment and a subsequent rollback this morning.

Image

What this graph doesn't capture is that memory went above the 1500MiB line for all Pods in the deployment, and they were therefore OOM Killed. This isn't shown because the metrics are exported only once a minute, so we just have the last datapoint that happened to be collected before each OOM Kill.
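To illustrate the sampling gap described above, here is a minimal sketch with synthetic numbers. Only the 1500 MiB limit and the one-minute export interval come from the report; the baseline, the spike start time, and the ramp rate are hypothetical values chosen to show how a fast spike can fall entirely between two scrapes.

```python
LIMIT_MIB = 1500          # memory limit from the report
SCRAPE_INTERVAL_S = 60    # metrics exported once a minute, per the report


def simulate(spike_start_s=130, ramp_mib_per_s=40, baseline_mib=600):
    """Walk a synthetic memory curve second by second.

    All three parameters are hypothetical. Returns a tuple of
    (last value a scraper recorded, usage at the moment the limit
    was exceeded, time of the kill in seconds).
    """
    last_scraped = None
    t = 0
    while True:
        usage = baseline_mib + max(0, t - spike_start_s) * ramp_mib_per_s
        if usage > LIMIT_MIB:
            # The container would be OOM killed here, before the next scrape.
            return last_scraped, usage, t
        if t % SCRAPE_INTERVAL_S == 0:
            last_scraped = usage
        t += 1


last_scraped, usage_at_kill, kill_time = simulate()
print(f"last scraped: {last_scraped} MiB, at kill: {usage_at_kill} MiB, t={kill_time}s")
```

With these made-up inputs the last recorded datapoint is still the baseline, even though usage crossed the limit a few seconds later. A chart built from such scrapes would never show the line being crossed.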

I guess it's worth noting that we also bumped our Helm Chart (not just the image version) with our release. The only notable change with that chart was the explicit creation of the Leader Election resource which I think Nginx used to just create by itself after deployment.

Some environment notes:

  • Azure AKS - 1.30.5
  • Using feature: Mergeable Ingress Types
  • Ingress resource count: 516
  • IC Pod Count: 6
  • Memory Request & Limit: 1500MiB per pod
  • ReadOnlyRootFileSystem: true
@MarkTopping MarkTopping added the bug and needs triage labels Nov 25, 2024

Hi @MarkTopping thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂

Cheers!

@MarkTopping MarkTopping changed the title [Bug]: [Bug]: Increased and erratic memory in the Nginx Pods leading to OOM Kills - appears to be introduced by v3.7.x Nov 25, 2024
@vepatel
Contributor

vepatel commented Nov 25, 2024

Hi @MarkTopping, thanks for opening the issue. The 3.7.1 release uses Nginx 1.27.2 (https://forum.nginx.org/read.php?27,300237,300237), which now caches SSL certificates, secret keys, and CRLs on start or during reconfiguration.
Can you please confirm:

  • whether you see the same issue in 3.7.0
  • whether you're using limits in deployments

thanks

@MarkTopping
Author

Hi @vepatel

Thank you for your response.

I haven't tested 3.7.0, and sadly it's not just a matter of having a go for you: today's outage caused quite a bit of disruption, so it isn't something I can replicate as and when I see fit. That is, unless you have a particular reason to believe that 3.7.0 would avoid an issue introduced specifically in 3.7.1?

Re limits: yes, we are. The graphs somewhat hide it, but requests and limits are both set, and for memory they are equal to one another. We set that limit to be 4x higher than what we typically see each Nginx Pod consuming, hence a lot of headroom.

A question for you please... thanks for the link... but it doesn't state the implications of the changes. Would I be right in assuming that an increase in memory consumption is expected due to the caching behaviour introduced? Certs aren't exactly big - so I'd assume that would only result in a fairly small memory increase anyway?

@MarkTopping
Author

I'm just following up with another chart and depiction of how the memory usage has adversely changed in v3.7.

I've redeployed version 3.7.1 with a Request and Limit of 3000MiB. This was in the hope of seeing just how high the memory would spike but without incurring the OOM Kills that happened earlier in the week.

Here is the result and a view over the past 5 days:
Image

The blue line doesn't worry me. It shows an approximate 40% increase in memory which consumers might need to account for but it's pretty stable at least.

The orange line however shows just how spiky the memory usage has become between versions 3.6 and 3.7.

In my case those spikes have become ~3x larger and they surpassed the memory limits (which were quite generous IMO). It would be worth understanding whether this was known to the contributors and is by design. That at least would confirm whether or not this should be considered a bug.

I for one certainly find the memory profile of 3.6 far more pleasing and easier to right-size my environment for.
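The headroom figures quoted so far in this thread can be worked through as simple arithmetic. This is only a sketch: it assumes the "typical" per-Pod usage behind the 4x headroom statement is the same quantity as the stable (blue) line that rose by roughly 40%, which the thread does not confirm.

```python
LIMIT_AT_OUTAGE_MIB = 1500       # request and limit when the OOM Kills occurred
LIMIT_AFTER_REDEPLOY_MIB = 3000  # request and limit used for the follow-up test
HEADROOM_FACTOR = 4              # limit described as 4x the typical usage on 3.6
BASELINE_INCREASE_PCT = 40       # approximate rise of the stable line on 3.7

# Assumption: "typical usage" and the stable baseline are the same quantity.
typical_v36_mib = LIMIT_AT_OUTAGE_MIB / HEADROOM_FACTOR                    # 375.0
baseline_v37_mib = typical_v36_mib * (100 + BASELINE_INCREASE_PCT) / 100   # 525.0


def headroom(limit_mib, usage_mib):
    """Return how many times the usage fits into the limit."""
    return limit_mib / usage_mib


print(f"3.6 typical usage:        {typical_v36_mib:.0f} MiB")
print(f"3.7 baseline:             {baseline_v37_mib:.0f} MiB")
print(f"headroom at 1500 MiB:     {headroom(LIMIT_AT_OUTAGE_MIB, baseline_v37_mib):.2f}x")
print(f"headroom at 3000 MiB:     {headroom(LIMIT_AFTER_REDEPLOY_MIB, baseline_v37_mib):.2f}x")
```

Under that assumption, the 40% baseline increase alone still leaves nearly 3x headroom at the original limit, which is consistent with the observation that the spikes, rather than the baseline, caused the OOM Kills.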

@jjngx
Contributor

jjngx commented Nov 27, 2024

Thank you @MarkTopping for providing details. We are investigating the memory spikes.

@jjngx
Contributor

jjngx commented Nov 29, 2024

Regarding "memory consumption after a release of v3.7.1 into a more busy environment":

@MarkTopping, could you please provide more detailed information about the busy environment? What traffic pattern do you observe in the affected cluster? Do the changes in traffic trigger autoscaling?

@MarkTopping
Author

Hi,

Sure.

There are many deployments configured with an HPA, but in reality it's a pretty stable cluster. Most deployments do not scale; we only bounce between 700 and 710 Pods on a typical day.

Scaling activity is generally concentrated in a 4-hour window during which several scheduled processes run, causing a spike in resource utilization and network traffic. The spike is large compared with the baseline activity of the cluster, but the volumes themselves are still small and predictable.

At its busiest there are approximately 12,000 inbound HTTPS requests per minute coming through the Ingress Controller. This is handled by a static deployment of 6 Nginx Pods, which collectively handle that sort of volume for around 30 minutes on any given day. The baseline is much lower, at around 250 requests per minute.

Around 95% of requests result in small response payloads and complete in < 1s. The largest requests complete in <= 5m (with timeouts increased to allow this). There are also several applications that hold long-lived connections between the frontend and backend (with Nginx in the middle) as a channel for push notifications.

I think these are pretty modest numbers of requests and scale. I'd say most pod-churn in this environment actually comes from regular application deployments rather than scaling activity.

Let me know if you were after some particular information which I've not provided.

@vepatel
Contributor

vepatel commented Dec 2, 2024

@MarkTopping is your deployment an Nginx OSS or Plus Ingress Controller?

@MarkTopping
Author

Nginx OSS

@MarkTopping
Author

Hello

It might be of little use by this point, but I figured that now that we've had 3.7.1 running within two clusters for a longer duration of time I'd provide a couple of updated graphs of the before vs after. They both depict a 31 day duration with a 1h aggregation interval. I've not indicated where the upgrade from 3.6.2 took place - I think that's fairly apparent though.

First cluster (which handles fewer requests):
Image

Second cluster (where the OOMKills occurred):
Image

On this 2nd graph you can see that I further increased the memory limit for the Pods about 1 week ago - this was in direct response to further OOMKills which occurred when the limit was 3000MiB. It's been stable since in terms of no crashes or restarts.

I also enabled container log scraping on Nginx pods recently to help measure the volume of requests that the 'busier' (but still not busy) environment was handling - my previous approximation turned out to be pretty spot on... the most requests running through the IC during any given minute hasn't surpassed 12,000. Here is a chart of the past 5 days to give you a feel:

Image

The vertical axis shows Requests per Minute.
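For anyone wanting to derive a similar requests-per-minute figure from scraped container logs, the following is a rough sketch. It assumes each access log line contains a bracketed `$time_local` timestamp such as `[25/Nov/2024:10:15:32 +0000]`; the actual log format in a given deployment may differ, and the function names and sample lines are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Matches a bracketed $time_local field, e.g. [25/Nov/2024:10:15:32 +0000].
# Assumption: the access log format includes this field.
TIME_LOCAL = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def requests_per_minute(lines):
    """Count access log lines per minute, skipping lines without a timestamp."""
    counts = Counter()
    for line in lines:
        match = TIME_LOCAL.search(line)
        if match is None:
            continue  # e.g. error log or controller log lines
        ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.replace(second=0)] += 1
    return counts


def busiest_minute(counts):
    """Return (minute, request_count) for the busiest minute, or None if empty."""
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


# Hypothetical sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [25/Nov/2024:10:15:02 +0000] "GET / HTTP/1.1" 200 12',
    '10.0.0.2 - - [25/Nov/2024:10:15:40 +0000] "GET /a HTTP/1.1" 200 12',
    '10.0.0.3 - - [25/Nov/2024:10:16:01 +0000] "GET /b HTTP/1.1" 404 0',
    "not an access log line",
]
print(busiest_minute(requests_per_minute(sample)))
```

Summing the per-minute counts across all Pods gives the total volume through the Ingress Controller for each minute.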

That traffic is then spread across 6 Nginx instances, so that's broadly speaking around 2k requests per minute per instance: peanuts compared with what Nginx is known to be capable of.
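The per-instance figure works out as follows. This sketch assumes traffic is spread evenly across the 6 Pods, which real load balancing will only approximate; the peak and baseline rates are the ones reported in this thread.

```python
PEAK_RPM = 12_000    # peak requests per minute through the Ingress Controller
BASELINE_RPM = 250   # baseline requests per minute
PODS = 6             # static number of Nginx Pods


def per_pod(total_rpm, pods=PODS):
    """Return (requests per minute, requests per second) for one Pod.

    Assumes an even spread of requests across Pods.
    """
    per_minute = total_rpm / pods
    return per_minute, per_minute / 60


peak_per_min, peak_per_sec = per_pod(PEAK_RPM)
base_per_min, base_per_sec = per_pod(BASELINE_RPM)
print(f"peak:     {peak_per_min:.0f} req/min per Pod ({peak_per_sec:.1f} req/s)")
print(f"baseline: {base_per_min:.1f} req/min per Pod ({base_per_sec:.2f} req/s)")
```

At peak that is roughly 33 requests per second per Pod, and well under one request per second at baseline.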

I'm curious and surprised that nobody else has raised similar issues here on GitHub... I wonder if others have reported significant memory use changes via any other forums?

@jjngx
Contributor

jjngx commented Dec 11, 2024

Thank you @MarkTopping for providing valuable insights. Yes, you are the first to report the memory spikes. The memory utilisation is likely due to the caching introduced in NGINX v1.27.2:

...
Changes with nginx 1.27.2                                        02 Oct 2024

    *) Feature: SSL certificates, secret keys, and CRLs are now cached on
       start or during reconfiguration.
...

We are investigating this behaviour.

@shaun-nx shaun-nx added the in review label Jan 13, 2025
@pdabelf5
Collaborator

Hi @MarkTopping, with regard to #4604, we have a fix in progress which should be available in an upcoming release.

Regarding the OOM issues, we have not yet been able to reproduce the problem and are continuing to investigate.

@pdabelf5 pdabelf5 removed the needs triage label Jan 13, 2025
@MarkTopping
Author

MarkTopping commented Jan 14, 2025

Thank you wrt #4604!

Interesting that you've been unable to replicate the memory issue. That raises the question of whether it's something about our particular use of Nginx that v3.7.1 did not take well to: possibly the use of mergeable ingress types, which I imagine fewer people make use of, or maybe the couple of annotations we use here and there. I could share more details of our usage with you outside of this issue if that would be of value?

As an aside, a quick update from my end: I've recently deployed v4.0.0 into a couple of clusters. It's a bit too early to draw any conclusions, but initial indications are that the memory usage and spikes have reduced somewhat, despite there being no intentional (documented) change to address what I'm reporting. I'll report back next week with an updated graph.
