Fix whatever is causing the pg_wal folder to grow so large in prod cluster! #338
Update: After I fixed the 2nd crash (mentioned in the original post above) by increasing the PVC size by ~500mb, the next day (July 26th) the pgbackrest backups appear to have started up again. My current understanding of the problem is the following: (edited as I've learned more)
So the biggest red flag to resolve atm seems to be that the "repo" PVC (repo1) is taking up way more storage space than the main db PVC itself (even after allowing plenty of time for it to "do its thing" and clear out unneeded data).
More relevant links:
Update: I tried adding a valid backup-schedule and retention-policy for the in-cluster "repo1", and this stopped its PVC from ballooning in size! (Within a minute of the new full-backup completing [which I triggered manually using Lens], the PVC dropped from ~11gb to 625.5mb! This roughly matches what I would expect, since a [less compact] pgdump is ~800mb atm.) Now I'm curious why the main db pvc is 2.4gb while the repo1 pvc is so much lower... (This is within expectations though, since the db may need extra space for storing indexes, have other WAL keep-alive requirements, etc.) I think this means this thread's issue is now fixed, but I will keep it open for several more months first, to see if it happens again.
…orage usage of the db itself's pvc), by setting a valid backup-schedule and retention-policy for it. See: #338 (comment). This probably fixes issue 338, but I'll wait to mark it as fixed until some more time has passed without incident.
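For reference, a minimal sketch of what that schedule/retention config can look like, assuming the cluster is managed by Crunchy PGO v5 (the cron expressions and retention count below are illustrative placeholders, not necessarily the exact values used); it would be merged into the PostgresCluster spec via `kubectl edit` or the cluster's manifest:

```yaml
# Hypothetical excerpt of a PostgresCluster spec (Crunchy PGO v5 field layout assumed):
# giving repo1 a backup schedule plus a retention policy lets pgbackrest expire old
# backups (and their archived WAL) instead of letting the repo PVC grow unbounded.
spec:
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "2"         # keep the 2 most recent full backups
        repo1-retention-full-type: count  # expire by backup count rather than by age
      repos:
        - name: repo1
          schedules:
            full: "0 1 * * 0"             # weekly full backup (Sunday 01:00)
            differential: "0 1 * * 1-6"   # daily differentials the rest of the week
          # (existing volume/volumeClaimSpec settings stay as they are)
```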
For now, when the WAL folder grows too large, use the following fix route (sketched as commands just below this list):
1. Increase the PVC's requested storage slightly (e.g. if it was `10000Mi`, set it to `10100Mi` or something).
2. Confirm the resize actually took effect, either by running `df` in the two pods mentioned above, or more easily, by running `describe` on the PVCs (like seen here, except done for the purpose of seeing the events rather than the pods using the PVC; can also just observe the events in the Lens UI).

Also see: #331 (comment)
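A concrete sketch of that route, with placeholder PVC/pod/namespace names (this also assumes the storage class supports online volume expansion):

```sh
# 1) Bump the PVC's requested size slightly (e.g. 10000Mi -> 10100Mi); names are placeholders.
kubectl patch pvc my-db-pvc -n my-namespace \
  -p '{"spec":{"resources":{"requests":{"storage":"10100Mi"}}}}'

# 2) Confirm the extra space arrived, either from inside the database pod...
kubectl exec -n my-namespace my-db-pod -c database -- df -h /pgdata

# ...or by looking at the resize events recorded against the PVC itself.
kubectl describe pvc my-db-pvc -n my-namespace
kubectl get events -n my-namespace --field-selector involvedObject.name=my-db-pvc
```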
Other misc. info from DM
Btw: The issue of the PVC getting to 100% space usage happened again. Thankfully this time it did not corrupt the database, so I was able to fix it by simply increasing the PVC's size, restarting the database pod, then restarting the app-server. After that, the 100% usage (from the pg_wal folder like before) went down to ~20%, presumably since the cause of the WAL sticking around was disconnected, letting the WAL segments get cleaned up.
However, this is of course a terrible thing to keep happening.
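For completeness, the restart portion of that manual fix, sketched with placeholder resource names (the app-server is assumed here to be a Deployment, and the database pod is assumed to be recreated automatically by the operator once deleted):

```sh
# Restart the database pod so it comes back up on the resized volume.
kubectl delete pod my-db-pod -n my-namespace

# Restart the app-server so its logical-replication connection is re-established
# (per the notes above, this is presumably what let the stuck WAL segments get cleaned up).
kubectl rollout restart deployment/app-server -n my-namespace
```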
Some remediation plans:
- Possibly the cause is my logical-replication slot holding onto WAL, e.g. after an app-server crash or something.
- But possibly it's some side-effect of the pgbackrest backups getting broken. (I discovered that after we restored from backup on June 25th, the next day the pgbackrest backups started working like normal. They kept working until July 20th. Maybe that's the point where postgres knew the backups were failing, and so started keeping all WAL segments until the pgbackrest backups could complete, similar to here: https://www.crunchydata.com/blog/postgres-is-out-of-disk-and-how-to-recover-the-dos-and-donts#broken-archives)
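To help tell these two hypotheses apart, a couple of read-only checks can be run against the database; this is a sketch in which the namespace, the primary-pod label, and the `database` container name are assumptions based on a typical Crunchy PGO deployment:

```sh
NS=my-namespace
PG_POD=$(kubectl get pods -n "$NS" \
  -l postgres-operator.crunchydata.com/role=master -o name | head -n 1)

# 1) Is a (possibly inactive) replication slot pinning WAL, and how much is it retaining?
kubectl exec -n "$NS" "$PG_POD" -c database -- psql -c "
  SELECT slot_name, slot_type, active, restart_lsn,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;"

# 2) Is WAL archiving (to pgbackrest) failing? A recent last_failed_wal / growing failed_count
#    means postgres will keep WAL segments around until archiving succeeds again.
kubectl exec -n "$NS" "$PG_POD" -c database -- psql -c "
  SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
  FROM pg_stat_archiver;"
```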
Other notes:
(The excess WAL appears to be accumulating under /pgbackrest/archive, i.e. on the pgbackrest repo side, rather than under /pgdata/pg_wal on the main postgres database pod.)
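A quick way to check which of those two locations is actually consuming the space (pod names are placeholders; on PGO setups the /pgbackrest path normally lives on the repo-host pod, and the exact subdirectory may differ):

```sh
# On the pgbackrest repo-host pod: size of the archived-WAL area.
kubectl exec -n my-namespace my-repo-host-pod -- du -sh /pgbackrest/archive

# On the main database pod: size of the live WAL directory.
kubectl exec -n my-namespace my-db-pod -c database -- du -sh /pgdata/pg_wal
```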