
p2k16 - sometimes the disk fills up, and p2k16 stops working #138

Open
tingox opened this issue Apr 26, 2020 · 28 comments

@tingox (Contributor) commented Apr 26, 2020

Sometimes (not very often) the disk drive of the p2k16 server fills up. This is bad, because then the p2k16 web app stops working.

The server p2k16 runs two services related to the PostgreSQL database:
[email protected]
[email protected]
and also this one:
[email protected]
We also have monitoring (via riemann), but nobody watches it on a regular basis (I'm not sure if anyone gets alarm notifications).
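
A quick way to check the state of things when this happens (just a sketch; the unit names are the ones listed above):

# how full is the root filesystem?
df -h /
# what is taking the space under the postgres directory?
du -sh /var/lib/postgresql/* | sort -h
# state of the postgres-related units
systemctl status [email protected] [email protected] [email protected]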

tingox added the bug label Apr 26, 2020
@tingox (Contributor, Author) commented Apr 26, 2020

The directory /var/lib/postgresql/backups/ was filling up with db backups, causing the disk to fill. I cleaned out a few files, and the disk looks better now:

root@p2k16:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.9G  9.8G  48% /

and then I restarted postgres via systemctl restart [email protected].
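
For the record, the manual fix boils down to roughly this (a sketch; the find arguments are an assumption about how long backups need to be kept):

# as root on p2k16: drop base backups older than ~14 days
find /var/lib/postgresql/backups/ -type f -mtime +14 -delete
# verify that space was freed
df -h /
# restart postgres so it can write again
systemctl restart [email protected]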

@tingox (Contributor, Author) commented Apr 26, 2020

There is a backup service for postgres; I haven't restarted it:

root@p2k16:~# systemctl status [email protected]
[email protected] - PostgreSQL base backup
   Loaded: loaded (/etc/systemd/system/[email protected]; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2020-04-26 04:00:51 CEST; 7h ago
  Process: 19137 ExecStart=/usr/bin/env bash -c i="10-main"; i=${i/-//}; bin/envdir /etc/wal-e/10-main-env.
 Main PID: 19137 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Should this service be running, or should we stop it?
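
One way to answer that is to check whether a timer schedules it and what the last run actually logged (a sketch; assumes the unit is driven by a matching systemd timer):

systemctl list-timers | grep basebackup
journalctl -u [email protected] --since yesterday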

@tingox (Contributor, Author) commented Apr 26, 2020

p2k16-staging also had the same problem, so I did the same there: cleaned out most files from the db backups, then restarted postgresql.

@rkarlsba (Contributor) commented:

Perhaps it would be nice to have some monitoring on that, with email alerts to those who run the system - something like zabbix? I have a zabbix VM running…

@tingox (Contributor, Author) commented May 13, 2020

Monitoring is in place; what we're missing is somewhere good to send the alerts. Our "IT operations group" runs on a volunteer basis...

@rkarlsba (Contributor) commented:

Where can I see this monitoring status?

@tingox (Contributor, Author) commented May 14, 2020

monitoring is at riemann.bitraf.no

@tingox (Contributor, Author) commented Jun 7, 2020

It happened again; the disk of p2k16 filled up with postgres database backups, and the postgres service failed, causing p2k16 to fail. The disk-full error was dutifully recorded by riemann.bitraf.no, but nobody looked at it.
Fix: the usual one - cleaned out database backups from /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart [email protected].

@tingox (Contributor, Author) commented Jun 7, 2020

Perhaps we should add a separate (virtual) disk drive for database backups to the server p2k16. Or better: make sure the backups go to another server instead. Hmm.
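
If we go for the separate-disk variant, the idea would be to mount a dedicated volume on the backup directory, so a runaway backup can no longer fill the root filesystem and take postgres down with it. A rough sketch, where the device name and filesystem are pure assumptions:

# assuming the new virtual disk shows up as /dev/vdb
mkfs.ext4 /dev/vdb
# move any existing backups aside first, then mount on top
echo '/dev/vdb /var/lib/postgresql/backups ext4 defaults 0 2' >> /etc/fstab
mount /var/lib/postgresql/backups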

@tingox (Contributor, Author) commented Jul 23, 2020

Cleaned the backup directory on p2k16-staging; better now:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.9G  9.8G  48% /

Then I restarted postgres with
sudo systemctl restart [email protected]
and that's all.

@tingox (Contributor, Author) commented Aug 16, 2020

Cleaned the backup directory on p2k16-staging again:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   11G  7.9G  58% /

and restarted postgres

@tingox (Contributor, Author) commented Sep 9, 2020

Another cleaning of the postgres backup directory on p2k16-staging today:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  9.9G  8.7G  54% /

plus a restart of postgres.

@tingox (Contributor, Author) commented Dec 13, 2020

p2k16 had a full disk again. As usual, I cleaned out /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart [email protected]. Better now:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.3G   11G  45% /

@omega (Contributor) commented Dec 13, 2020

We could probably extend the backup service to retain only a set number of base backups, or have a different timer that retains only N base backups.

- name: systemd base backup service

This seems to be the code that installs the wal-e backup service; doing something similar for delete might be good enough:

ExecStart=/usr/bin/env bash -c 'i="%i"; i=${i/-//}; bin/envdir /etc/wal-e/%i-env.d bin/wal-e delete --retain 5 --confirm'

omega added a commit that referenced this issue Dec 13, 2020
This will add a service alongside the base-backup service, and a timer that will use wal-e to delete the oldest base backups after new base backups are made each Sunday night.

The number 5 is picked a bit at random; it doesn't seem we run out of disk that often.

There might be a better way to trigger this than a timer, but I am not that experienced with systemd services.

Attempts to fix #138
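
For context, roughly what such a pair of units could look like (a sketch only, not the exact committed code; User= and WorkingDirectory= would have to match the existing [email protected] unit, and if I read the wal-e docs right the retention count is given as a positional operator, delete --confirm retain 5, rather than a --retain flag):

# /etc/systemd/system/[email protected]
[Unit]
Description=Prune old wal-e base backups for %i

[Service]
Type=oneshot
# same User= and WorkingDirectory= as [email protected]
ExecStart=/usr/bin/env bash -c 'i="%i"; i=${i/-//}; bin/envdir /etc/wal-e/%i-env.d bin/wal-e delete --confirm retain 5'

# /etc/systemd/system/[email protected]
[Unit]
Description=Run wal-e base backup pruning weekly

[Timer]
# a few hours after the Sunday-night base backup
OnCalendar=Mon *-*-* 06:00:00
Persistent=true

[Install]
WantedBy=timers.target

It would be enabled with systemctl enable --now [email protected].
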
@flexd commented Dec 15, 2020

How about we just send notifications to Slack? https://riemann.io/api/riemann.slack.html
Easy enough, and lots of people around to see it.

@rkarlsba (Contributor) commented:

It has been mentioned before. This won't help with the fact that few people have access to fix the issues when they happen, but at least it will alert people earlier, so that whoever can fix it has time to do so. Just, please, make sure it won't trigger an alarm for everything. Too many false positives will ruin the whole project.

@flexd commented Dec 16, 2020

It has been mentioned before. This won't help with the fact that few people have access to fix the issues when they happen, but at least it will alert people earlier, so that whoever can fix it has time to do so. Just, please, make sure it won't trigger an alarm for everything. Too many false positives will ruin the whole project.

Better that everyone gets notified, so that someone takes action or mentions it to someone who can, than that it fails silently because nobody manually checked the monitoring.

@tingox (Contributor, Author) commented Feb 7, 2021

p2k16 - full disk again today. Cleaned out /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart [email protected] as usual. Good to go for a while again:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.6G   11G  46% /

@tingox (Contributor, Author) commented May 18, 2021

Maintenance this evening: cleaned out /var/lib/postgresql/backups/ on p2k16 before it gets full again.

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   12G  6.7G  65% /

That should keep us going for a few weeks, I think.

@haavares (Member) commented May 18, 2021 via email

@haavares (Member) commented May 18, 2021 via email

@jenschr (Member) commented May 19, 2021

But what is it we are actually logging so aggressively? This can't just be about normal use of the system (logging in/checking out). There must be something more being logged to reach many gigabytes in just a couple of weeks? I have never looked at the logs, but I suspect more hardware isn't needed here - rather an optimization of what gets logged, so that what ends up in the logs is useful.

@rkarlsba (Contributor) commented:

It should just be a matter of setting up log rotation, or possibly sending the logs to another server first. Or maybe even better - log in parallel to another server and keep a short rotation locally. @jenschr my guess is that it could be the web server log.

@tingox (Contributor, Author) commented May 19, 2021

Before this goes completely off track: what is filling up the disk is database backups - it has nothing to do with logs. As far as I know, p2k16 uses the database in a completely ordinary way - not particularly intensively.

@rkarlsba (Contributor) commented:

Sorry, but then it should surely be possible to send that backup to another server and just overwrite old backups locally instead?

@tingox (Contributor, Author) commented May 19, 2021

Sorry, but then it should surely be possible to send that backup to another server and just overwrite old backups locally instead?

Of course it's possible - but it requires that people with the right skills (and spare time) sit down and actually do the job. Some of us have tried to build a solution to limit the number of local backups (see #144), without quite getting it over the finish line. I am definitely no postgresql expert, so I don't have more to contribute there.

@rkarlsba (Contributor) commented:

Suggestion: take a dump regularly to a directory and you have a single file, typically pg_dump -Fc dbname > dbname.dump. -Fc is --format=custom, which lets you restore individual tables or the like without much hassle. In addition, it compresses (gzips) the data for you. Running a dump without -Fc works too and makes no difference in this context, but I just wanted to mention it. This is typically run once a day, so set up logrotate to simply rotate this file (without compressing it further) as if it were a log file. logrotate doesn't look at the contents anyway, and the configuration is simple.
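
A sketch of what that could look like (the database name, paths and schedule are placeholders, not something that exists today):

# crontab entry for the postgres user, nightly dump
30 03 * * * pg_dump -Fc p2k16 > /var/backups/p2k16/p2k16.dump

# /etc/logrotate.d/p2k16-dbdump
/var/backups/p2k16/p2k16.dump {
    daily
    rotate 7
    nocompress
    missingok
}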

@tingox (Contributor, Author) commented May 30, 2021

It didn't last as long as I had hoped; today the disk was full again, so p2k16 stopped letting people open the door. Cleaned up and restarted postgresql. Disk space looks better now:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  7.7G   11G  42% /

but that probably won't last two weeks.
