
p2k16 - sometimes the disk fills up, and p2k16 stops working #138

Open
tingox opened this issue Apr 26, 2020 · 28 comments

@tingox (Contributor) commented Apr 26, 2020

Sometimes (not very often) the disk drive of the p2k16 server fills up. This is bad, because then the p2k16 web app stops working.

The server p2k16 runs two services related to the PostgreSQL database:
[email protected]
[email protected]
and also this one:
[email protected]
We also have monitoring (via riemann), but nobody watches it on a regular basis (I'm not sure if anyone gets alarm notifications).
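
A quick way to check the state of things when this happens (just a sketch; the unit names are the ones listed above):

# how full is the root filesystem?
df -h /
# what is taking the space under the postgres directory?
du -sh /var/lib/postgresql/* | sort -h
# state of the postgres-related units
systemctl status [email protected] [email protected] [email protected]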

tingox added the bug label Apr 26, 2020
@tingox (Contributor, Author) commented Apr 26, 2020

The directory /var/lib/postgresql/backups/ was filling up with db backups, causing the disk to fill. I cleaned out a few files, and the disk looks better now:

root@p2k16:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.9G  9.8G  48% /

and then I restarted postgres via systemctl restart [email protected].
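
For the record, the manual fix boils down to roughly this (a sketch; the find arguments are an assumption about how long backups need to be kept):

# as root on p2k16: drop base backups older than ~14 days
find /var/lib/postgresql/backups/ -type f -mtime +14 -delete
# verify that space was freed
df -h /
# restart postgres so it can write again
systemctl restart [email protected]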

@tingox (Contributor, Author) commented Apr 26, 2020

There is a backup service for postgres; I haven't restarted it:

root@p2k16:~# systemctl status [email protected]
[email protected] - PostgreSQL base backup
   Loaded: loaded (/etc/systemd/system/[email protected]; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2020-04-26 04:00:51 CEST; 7h ago
  Process: 19137 ExecStart=/usr/bin/env bash -c i="10-main"; i=${i/-//}; bin/envdir /etc/wal-e/10-main-env.
 Main PID: 19137 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Should this service be running, or should we stop it?
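
One way to answer that is to check whether a timer schedules it and what the last run actually logged (a sketch; assumes the unit is driven by a matching systemd timer):

systemctl list-timers | grep basebackup
journalctl -u [email protected] --since yesterday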

@tingox (Contributor, Author) commented Apr 26, 2020

p2k16-staging also had the same problem, so I did the same there: cleaned out most files from the db backups, then restarted postgresql.

@rkarlsba (Contributor) commented:

Perhaps it would be nice to have some monitoring on that, with email alerts to those who run the system - something like zabbix? I have a zabbix VM running…

@tingox (Contributor, Author) commented May 13, 2020

Monitoring is in place; what we're missing is somewhere good to send the alerts. Our "IT operations group" runs on a volunteer basis...

@rkarlsba (Contributor) commented:

Where can I see this monitoring status?

@tingox (Contributor, Author) commented May 14, 2020

monitoring is at riemann.bitraf.no

@tingox (Contributor, Author) commented Jun 7, 2020

It happened again; the disk of p2k16 filled up with postgres database backups, and the postgres service failed, causing p2k16 to fail. The disk-full error was dutifully recorded by riemann.bitraf.no, but nobody looked at it.
Fix: the usual one - cleaned out database backups from /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart [email protected].

@tingox (Contributor, Author) commented Jun 7, 2020

Perhaps we should add a separate (virtual) disk drive for database backups to the server p2k16. Or better: make sure the backups go to another server instead. Hmm.
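
If we go for the separate-disk variant, the idea would be to mount a dedicated volume on the backup directory, so a runaway backup can no longer fill the root filesystem and take postgres down with it. A rough sketch, where the device name and filesystem are pure assumptions:

# assuming the new virtual disk shows up as /dev/vdb
mkfs.ext4 /dev/vdb
# move any existing backups aside first, then mount on top
echo '/dev/vdb /var/lib/postgresql/backups ext4 defaults 0 2' >> /etc/fstab
mount /var/lib/postgresql/backups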

@tingox (Contributor, Author) commented Jul 23, 2020

Cleaned the backup directory on p2k16-staging; better now:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.9G  9.8G  48% /

Then I restarted postgres with
sudo systemctl restart [email protected]
and that's all.

@tingox (Contributor, Author) commented Aug 16, 2020

Cleaned the backup directory on p2k16-staging again:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   11G  7.9G  58% /

and restarted postgres

@tingox (Contributor, Author) commented Sep 9, 2020

Another cleaning of the postgres backup directory on p2k16-staging today:

tingo@p2k16-staging:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  9.9G  8.7G  54% /

plus a restart of postgres.

@tingox (Contributor, Author) commented Dec 13, 2020

p2k16 had a full disk again. As usual, I cleaned out /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart [email protected]. Better now:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.3G   11G  45% /

@omega (Contributor) commented Dec 13, 2020

We could probably extend the backup service to retain only a set number of base backups, or have a different timer that retains only N base backups.

- name: systemd base backup service

This seems to be the code that installs the wal-e backup service; doing something similar for delete might be good enough:

ExecStart=/usr/bin/env bash -c 'i="%i"; i=${i/-//}; bin/envdir /etc/wal-e/%i-env.d bin/wal-e delete --retain 5 --confirm'

omega added a commit that referenced this issue Dec 13, 2020
This will add a service alongside the base-backup service, and a timer that will use wal-e to delete the oldest base backups after new base backups are made each Sunday night.

The number 5 is picked a bit at random; it doesn't seem we run out of disk that often.

There might be a better way to trigger this than a timer, but I am not that experienced with systemd services.

Attempts to fix #138
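
For context, roughly what such a pair of units could look like (a sketch only, not the exact committed code; User= and WorkingDirectory= would have to match the existing [email protected] unit, and if I read the wal-e docs right the retention count is given as a positional operator, delete --confirm retain 5, rather than a --retain flag):

# /etc/systemd/system/[email protected]
[Unit]
Description=Prune old wal-e base backups for %i

[Service]
Type=oneshot
# same User= and WorkingDirectory= as [email protected]
ExecStart=/usr/bin/env bash -c 'i="%i"; i=${i/-//}; bin/envdir /etc/wal-e/%i-env.d bin/wal-e delete --confirm retain 5'

# /etc/systemd/system/[email protected]
[Unit]
Description=Run wal-e base backup pruning weekly

[Timer]
# a few hours after the Sunday-night base backup
OnCalendar=Mon *-*-* 06:00:00
Persistent=true

[Install]
WantedBy=timers.target

It would be enabled with systemctl enable --now [email protected].
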
@flexd commented Dec 15, 2020

How about we just send notifications to Slack? https://riemann.io/api/riemann.slack.html
Easy enough, and lots of people around to see it.

@rkarlsba (Contributor) commented:

It has been mentioned before. This won't help with the fact that few people have access to fix the issues when they happen, but at least it will alert people earlier, so that whoever can fix it has time to do so. Just, please, make sure it won't trigger an alarm for everything. Too many false positives will ruin the whole project.

@flexd commented Dec 16, 2020

It has been mentioned before. This won't help with the fact that few people have access to fix the issues when they happen, but at least it will alert people earlier, so that whoever can fix it has time to do so. Just, please, make sure it won't trigger an alarm for everything. Too many false positives will ruin the whole project.

Better that everyone gets notified, so that someone takes action or mentions it to someone who can, than that it fails silently because nobody manually checked the monitoring.

@tingox (Contributor, Author) commented Feb 7, 2021

p2k16 - full disk again today. Cleaned out /var/lib/postgresql/backups/, then restarted postgres with sudo systemctl restart [email protected] as usual. Good to go for a while again:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  8.6G   11G  46% /

@tingox (Contributor, Author) commented May 18, 2021

Maintenance this evening: cleaned out /var/lib/postgresql/backups/ on p2k16 before it gets full again.

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   12G  6.7G  65% /

That should keep us going for a few weeks, I think.

@haavares (Member) commented May 18, 2021 via email

@haavares (Member) commented May 18, 2021 via email

@jenschr (Member) commented May 19, 2021

But what is it we are actually logging so aggressively? This can't just be about normal use of the system (logging in/checking out). There must be something more being logged to reach many gigabytes in just a couple of weeks? I have never looked at the logs, but I suspect more hardware isn't needed here - rather an optimization of what gets logged, so that what ends up in the logs is useful.

@rkarlsba (Contributor) commented:

It should just be a matter of setting up log rotation, or possibly sending the logs to another server first. Or maybe even better - log in parallel to another server and keep a short rotation locally. @jenschr my guess is that it could be the web server log.

@tingox (Contributor, Author) commented May 19, 2021

Before this goes completely off track: what is filling up the disk is database backups - it has nothing to do with logs. As far as I know, p2k16 uses the database in a completely ordinary way - not particularly intensively.

@rkarlsba (Contributor) commented:

Sorry, but then it should surely be possible to send that backup to another server and just overwrite old backups locally instead?

@tingox (Contributor, Author) commented May 19, 2021

Sorry, but then it should surely be possible to send that backup to another server and just overwrite old backups locally instead?

Of course it's possible - but it requires that people with the right skills (and spare time) sit down and actually do the job. Some of us have tried to build a solution to limit the number of local backups (see #144), without quite getting it over the finish line. I am definitely no postgresql expert, so I don't have more to contribute there.

@rkarlsba (Contributor) commented:

Suggestion: take a dump regularly to a directory and you have a single file, typically pg_dump -Fc dbname > dbname.dump. -Fc is --format=custom, which lets you restore individual tables or the like without much hassle. In addition, it compresses (gzips) the data for you. Running a dump without -Fc works too and makes no difference in this context, but I just wanted to mention it. This is typically run once a day, so set up logrotate to simply rotate this file (without compressing it further) as if it were a log file. logrotate doesn't look at the contents anyway, and the configuration is simple.
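
A sketch of what that could look like (the database name, paths and schedule are placeholders, not something that exists today):

# crontab entry for the postgres user, nightly dump
30 03 * * * pg_dump -Fc p2k16 > /var/backups/p2k16/p2k16.dump

# /etc/logrotate.d/p2k16-dbdump
/var/backups/p2k16/p2k16.dump {
    daily
    rotate 7
    nocompress
    missingok
}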

@tingox (Contributor, Author) commented May 30, 2021

It didn't last as long as I had hoped; today the disk was full again, so p2k16 stopped letting people open the door. Cleaned up and restarted postgresql. Disk space looks better now:

tingo@p2k16:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  7.7G   11G  42% /

but that probably won't last two weeks.
