Fix whatever is causing the pg_wal folder to grow so large in prod cluster! #338
Update: After I fixed the 2nd crash (mentioned in the original post above) by increasing the PVC size by ~500mb, the next day (July 26th) the pgbackrest backups appear to have started up again. My current understanding of the problem is the following: (edited as I've learned more)
So the biggest red flag to resolve atm seems to be that the "repo" PVC (repo1) is taking up way more storage space than the main db PVC itself (even after allowing plenty of time for it to "do its thing" and clear out unneeded data).
More relevant links:
Update: I tried adding a valid backup-schedule and retention-policy for the in-cluster "repo1", and this stopped its PVC from ballooning in size! (Within a minute of the new full-backup completing [which I triggered manually using Lens], the PVC dropped from ~11gb to 625.5mb! This roughly matches what I would expect, since a [less compact] pgdump is ~800mb atm.) Now I'm curious why the main db pvc is 2.4gb while the repo1 pvc is so much lower... (This is within expectations though, since the db may need extra space for storing indexes, have other WAL keep-alive requirements, etc.) I think this means this thread's issue is now fixed, but I will keep it open for several more months first, to see if it happens again.
…orage usage of the db itself's pvc), by setting a valid backup-schedule and retention-policy for it. See: #338 (comment). This probably fixes issue 338, but I'll wait to mark it as fixed until some more time has passed without incident.
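For reference, a minimal sketch of what that schedule/retention config can look like, assuming the cluster is managed by Crunchy PGO v5 (the cron expressions and retention count below are illustrative placeholders, not necessarily the exact values used); it would be merged into the PostgresCluster spec via `kubectl edit` or the cluster's manifest:

```yaml
# Hypothetical excerpt of a PostgresCluster spec (Crunchy PGO v5 field layout assumed):
# giving repo1 a backup schedule plus a retention policy lets pgbackrest expire old
# backups (and their archived WAL) instead of letting the repo PVC grow unbounded.
spec:
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "2"         # keep the 2 most recent full backups
        repo1-retention-full-type: count  # expire by backup count rather than by age
      repos:
        - name: repo1
          schedules:
            full: "0 1 * * 0"             # weekly full backup (Sunday 01:00)
            differential: "0 1 * * 1-6"   # daily differentials the rest of the week
          # (existing volume/volumeClaimSpec settings stay as they are)
```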
For now, when the WAL folder grows too large, use the following fix route (sketched as commands just below this list):
1. Increase the PVC's requested storage slightly (e.g. if it was `10000Mi`, set it to `10100Mi` or something).
2. Confirm the resize actually took effect, either by running `df` in the two pods mentioned above, or more easily, by running `describe` on the PVCs (like seen here, except done for the purpose of seeing the events rather than the pods using the PVC; can also just observe the events in the Lens UI).

Also see: #331 (comment)
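A concrete sketch of that route, with placeholder PVC/pod/namespace names (this also assumes the storage class supports online volume expansion):

```sh
# 1) Bump the PVC's requested size slightly (e.g. 10000Mi -> 10100Mi); names are placeholders.
kubectl patch pvc my-db-pvc -n my-namespace \
  -p '{"spec":{"resources":{"requests":{"storage":"10100Mi"}}}}'

# 2) Confirm the extra space arrived, either from inside the database pod...
kubectl exec -n my-namespace my-db-pod -c database -- df -h /pgdata

# ...or by looking at the resize events recorded against the PVC itself.
kubectl describe pvc my-db-pvc -n my-namespace
kubectl get events -n my-namespace --field-selector involvedObject.name=my-db-pvc
```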
Other misc. info from DM
Btw: The issue of the PVC getting to 100% space usage happened again. Thankfully this time it did not corrupt the database, so I was able to fix it by simply increasing the PVC's size, restarting the database pod, then restarting the app-server. After that, the 100% usage (from the pg_wal folder like before) went down to ~20%, presumably since the cause of the WAL sticking around was disconnected, letting the WAL segments get cleaned up.
However, this is of course a terrible thing to keep happening.
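For completeness, the restart portion of that manual fix, sketched with placeholder resource names (the app-server is assumed here to be a Deployment, and the database pod is assumed to be recreated automatically by the operator once deleted):

```sh
# Restart the database pod so it comes back up on the resized volume.
kubectl delete pod my-db-pod -n my-namespace

# Restart the app-server so its logical-replication connection is re-established
# (per the notes above, this is presumably what let the stuck WAL segments get cleaned up).
kubectl rollout restart deployment/app-server -n my-namespace
```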
Some remediation plans:
- Possibly the cause is my logical-replication slot holding onto WAL, e.g. after an app-server crash or something.
- But possibly it's some side-effect of the pgbackrest backups getting broken. (I discovered that after we restored from backup on June 25th, the next day the pgbackrest backups started working like normal. They kept working until July 20th. Maybe that's the point where postgres knew the backups were failing, and so started keeping all WAL segments until the pgbackrest backups could complete, similar to here: https://www.crunchydata.com/blog/postgres-is-out-of-disk-and-how-to-recover-the-dos-and-donts#broken-archives)
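To help tell these two hypotheses apart, a couple of read-only checks can be run against the database; this is a sketch in which the namespace, the primary-pod label, and the `database` container name are assumptions based on a typical Crunchy PGO deployment:

```sh
NS=my-namespace
PG_POD=$(kubectl get pods -n "$NS" \
  -l postgres-operator.crunchydata.com/role=master -o name | head -n 1)

# 1) Is a (possibly inactive) replication slot pinning WAL, and how much is it retaining?
kubectl exec -n "$NS" "$PG_POD" -c database -- psql -c "
  SELECT slot_name, slot_type, active, restart_lsn,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;"

# 2) Is WAL archiving (to pgbackrest) failing? A recent last_failed_wal / growing failed_count
#    means postgres will keep WAL segments around until archiving succeeds again.
kubectl exec -n "$NS" "$PG_POD" -c database -- psql -c "
  SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
  FROM pg_stat_archiver;"
```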
Other notes:
(The excess WAL appears to be accumulating under /pgbackrest/archive, i.e. on the pgbackrest repo side, rather than under /pgdata/pg_wal on the main postgres database pod.)
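A quick way to check which of those two locations is actually consuming the space (pod names are placeholders; on PGO setups the /pgbackrest path normally lives on the repo-host pod, and the exact subdirectory may differ):

```sh
# On the pgbackrest repo-host pod: size of the archived-WAL area.
kubectl exec -n my-namespace my-repo-host-pod -- du -sh /pgbackrest/archive

# On the main database pod: size of the live WAL directory.
kubectl exec -n my-namespace my-db-pod -c database -- du -sh /pgdata/pg_wal
```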