[Bug]: Increased and erratic memory in the Nginx Pods leading to OOM Kills - appears to be introduced by v3.7.x #6860
Comments
Hi @MarkTopping thanks for reporting! Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂 Cheers!
Hi @MarkTopping thanks for opening the issue. 3.7.1 release uses Nginx 1.27.2 https://forum.nginx.org/read.php?27,300237,300237, which now caches SSL certificates, secret keys, and CRLs on start or during reconfiguration.
thanks |
Hi @vepatel Thank you for your response. I haven't tested 3.7.0, and sadly it's not just a matter of having a go for you: today's outage caused quite a bit of disruption, so it isn't something I can replicate as and when I see fit. Do you have a particular reason to believe that 3.7.0 would avoid an issue that was introduced specifically in 3.7.1? Re limits: yes, indeed we are. The graphs kind of hide it, but requests and limits are both set, and for memory they are equal to one another. We have set that limit to be 4x higher than what we typically see each Nginx Pod consuming, hence a lot of headroom. A question for you please: thanks for the link, but it doesn't state the implications of the changes. Would I be right in assuming that an increase in memory consumption is expected due to the caching behaviour introduced? Certs aren't exactly big, so I'd assume that would only result in a fairly small memory increase anyway?
Thank you @MarkTopping for providing details. We are investigating the memory spikes. |
@MarkTopping could you please provide more detailed information about the |
Hi, Sure. There are many deployments which are configured with an HPA, but in reality it's a pretty stable cluster. Most deployments do not scale; we only bounce between 700-710 Pods on a typical day. Scaling activity is generally confined to a 4 hour window in which several scheduled processes run and cause a spike in resource utilization and network traffic. The spike is large when compared with the baseline activity of the cluster, but the volumes themselves are still small and predictable. At its busiest there are approximately 12000 inbound HTTPS requests per minute coming through the Ingress Controller. This is handled by a static deployment of 6 Nginx Pods, which collectively handle that sort of volume for around 30 minutes during any given day. The baseline is much lower, around 250 requests per minute. Around 95% of requests result in small response payloads completing in < 1s. The largest requests complete in <= 5m (with timeouts increased to allow this). There are also several applications which hold long-lived connections between the front and backend (with Nginx in the middle) as a channel for push notifications. I think these are pretty modest numbers of requests and scale. I'd say most pod churn in this environment actually comes from regular application deployments rather than scaling activity. Let me know if you were after some particular information which I've not provided.
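For a rough sense of scale, the figures quoted above can be turned into per-Pod request rates. This is only a back-of-the-envelope sketch: it assumes traffic is spread evenly across the 6 Pods, which the thread does not confirm, and the function name is made up for illustration.

```python
# Figures taken from the comment above.
PEAK_REQ_PER_MIN = 12_000
BASELINE_REQ_PER_MIN = 250
PODS = 6


def per_pod_per_second(req_per_min: float, pods: int) -> float:
    """Requests per second per Pod, assuming an even spread (an assumption)."""
    return req_per_min / 60 / pods


peak_rps = per_pod_per_second(PEAK_REQ_PER_MIN, PODS)
baseline_rps = per_pod_per_second(BASELINE_REQ_PER_MIN, PODS)
peak_to_baseline = PEAK_REQ_PER_MIN / BASELINE_REQ_PER_MIN

print(f"peak: {peak_rps:.1f} req/s per Pod")
print(f"baseline: {baseline_rps:.2f} req/s per Pod")
print(f"peak is {peak_to_baseline:.0f}x the baseline")
```

Under that assumption the peak works out to roughly 33 requests per second per Pod, against well under 1 request per second at baseline.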
@MarkTopping is your deployment a Nginx OSS or Plus Ingress Controller? |
Nginx OSS |
Thank you @MarkTopping for providing valuable insights. Yes, you are the first to report the memory spikes. The memory utilisation is likely due to the caching introduced in NGINX v1.27.2.
We are investigating this behaviour.
Hi @MarkTopping in regards to 4604 we have a fix in progress which should be available in an upcoming release. Regarding the OOM issues, we have not yet been able to reproduce the problem and are continuing to investigate. |
Thank you wrt #4604! Interesting that you've been unable to replicate the memory issue. That raises the question of whether it's something about our particular use of Nginx which v3.7.1 did not take well to: possibly the use of mergeable ingress types, which I imagine fewer people make use of, or maybe the couple of annotations we use here and there. I could share more details of our usage with you outside of this Issue if that would be of value? As an aside, and a quick update from my end, I've recently deployed v4.0.0 into a couple of clusters. It's a bit too early to draw any conclusions, but initial indications are that the memory usage/spikes have reduced somewhat, despite there being no intentional (documented) change to address what I'm reporting. I'll report back next week with an updated graph.
Version
3.7.0
What Kubernetes platforms are you running on?
AKS Azure
Steps to reproduce
I believe that changes in version 3.7.0 or 3.7.1 have introduced a memory consumption issue.
We had to roll back a version bump from v3.6.2 to v3.7.1 today after our Nginx IC Pods all crashed due to OOM Kills. To make matters worse, due to Bug 4604 the Pods then failed to restart (without manual intervention), leading to obvious impact.
Our investigation after the outage revealed that memory consumption on the Nginx Pods changed quite dramatically after the release, as shown by the following 2 charts.
1st Example
In our least used environment we didn't incur any OOM Kills, but today's investigation revealed that memory usage has both increased and become more 'spiky' since we performed the upgrade:
2nd Example
This screenshot shows the IC Pods' memory consumption after a release of v3.7.1 into a busier environment and a subsequent rollback this morning.
What this graph doesn't capture is that memory went above the 1500MiB line for all Pods in the deployment, and so they were OOM Killed. This isn't shown because the metrics are exported every minute, so we only have the last datapoint that happened to be collected before each OOM Kill.
I guess it's worth noting that we also bumped our Helm Chart (not just the image version) with our release. The only notable change in that chart was the explicit creation of the Leader Election resource, which I think Nginx used to just create by itself after deployment.
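To illustrate why a one-minute export interval can hide the final climb, here is a minimal sketch. The trace is entirely made up (a flat 400 MiB baseline, then a 20 MiB/s climb) and does not come from the real Pods; the 1500 MiB limit is assumed from the line mentioned above, and the kill is modelled simply as the first second at which usage exceeds it.

```python
LIMIT_MIB = 1500       # assumed to match the 1500MiB line in the graph
SCRAPE_INTERVAL_S = 60  # metrics exported once per minute


def build_trace(baseline_mib, limit_mib, spike_start, duration):
    """Hypothetical per-second memory trace, cut off at the moment of the kill."""
    trace = []
    for t in range(duration):
        if t < spike_start:
            usage = baseline_mib
        else:
            # Made-up growth rate: 20 MiB per second once the spike begins.
            usage = baseline_mib + 20 * (t - spike_start + 1)
        trace.append(usage)
        if usage > limit_mib:
            break  # container is OOM killed; no further datapoints exist
    return trace


def sample_every(trace, interval):
    """Values a scraper would record when sampling every `interval` seconds."""
    return trace[::interval]


trace = build_trace(baseline_mib=400, limit_mib=LIMIT_MIB,
                    spike_start=70, duration=180)
samples = sample_every(trace, SCRAPE_INTERVAL_S)

print("true peak (MiB):", max(trace))
print("highest exported sample (MiB):", max(samples))
```

In this toy trace the true peak crosses the limit, while the highest exported sample stays below it, which is the same shape as the gap described above.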
Some environment notes: