From 49a8bbf8a4e001c46537c080586f662fca7895e7 Mon Sep 17 00:00:00 2001
From: Alex Garel
Date: Thu, 12 Dec 2024 19:10:33 +0100
Subject: [PATCH] docs: started a doc on handling alerts (#445)

---
 docs/how-to-handle-alerts.md | 80 ++++++++++++++++++++++++++++++++++++
 docs/index.md                |  3 ++
 2 files changed, 83 insertions(+)
 create mode 100644 docs/how-to-handle-alerts.md

diff --git a/docs/how-to-handle-alerts.md b/docs/how-to-handle-alerts.md
new file mode 100644
index 00000000..65e159ee
--- /dev/null
+++ b/docs/how-to-handle-alerts.md
@@ -0,0 +1,80 @@
# How to handle alerts

This document describes, for each alert, what you should do to diagnose and fix it.


## Postfix mail messages queue is high (slack)

This alert is fired by Prometheus.

It means that the Proxmox Mail Gateway deferred mail queue contains many messages.

This usually happens because of bad email addresses that can't be delivered.

- [Connect to the Proxmox Mail Gateway interface](./mail.md#administration),
- go to *Administration* / *Queues* and check the deferred mails.
  You can see them grouped by target domain.

It might be that:
* we have email addresses of users whose inbox is full or which do not exist anymore.
  In this case the best is to remove those users (also remove them from the newsletter, in Brevo).
* if you have emails addressed to `root@<host>.openfoodfacts.org`,
  either the server mail relay is not configured correctly,
  or the `mailx` program is not installed, or is not the one from the `bsd-mailx` package.
  See [mail configuration for servers](./mail.md#servers).

Finally, when you are done, you can empty the queues.

*Tip*: To extract the email addresses from the details of the deferred mails,
you can use the browser developer tools console with this expression:

```javascript
$x('//div[contains(@id, "pmgPostfixMailQueue")]//div[@class="x-grid-item-container"]//table//td[4]//text()').map(x => x.nodeValue).join("\n")
```

## sanoid_check.sh error on `<host>` (email)

This alert is fired by the `sanoid_check.sh` script, which is regularly run by systemd
on every host using ZFS.

There are two possible alerts for each dataset.

### Last snapshot `<dataset>` is too old

This alert can fire for different reasons:
* if the dataset is a replication of a dataset on another host,
  it is not synchronized anymore.
  This might be transient when there is a large volume of data to transfer,
  until syncoid catches up.

  To diagnose, go on the server and:
  * list the snapshots with `zfs list -t snapshot path/of/dataset` and check the last one
  * if needed, look at the syncoid logs with `journalctl -xe -u syncoid`, searching for your dataset.

  Sometimes the synchronization is not working because the last common snapshot has been removed on the source. See [How to resync ZFS replication](./how-to-resync-zfs-replication.md).

* if the dataset is local,
  it might be that you didn't configure sanoid (`sanoid.conf`) correctly for this dataset:
  either you forgot to [add a snapshot specification](./sanoid.md#sanoid-snapshot-configuration) for it,
  or to add it to the [`no_sanoid_checks` directives](./sanoid.md#sanoid-checks).

  To diagnose, go on the server and:
  * list the snapshots with `zfs list -t snapshot path/of/dataset` and check the last one
  * if needed, look at the sanoid logs with `journalctl -xe -u sanoid`, searching for your dataset.


### `<dataset>` has too many snapshots

This alert fires because snapshots are accumulating on a dataset
(which can lead to increased disk usage).

On the server, list the snapshots with `zfs list -t snapshot path/of/dataset`.
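
For instance, this minimal sketch (using `path/of/dataset` as a placeholder; replace it with your dataset) lists the snapshots oldest first, then counts snapshots per dataset across all pools to spot where they accumulate:

```bash
# List the snapshots of one dataset, oldest first
zfs list -t snapshot -o name,creation -s creation path/of/dataset

# Count snapshots per dataset across all pools, biggest offenders first
zfs list -H -t snapshot -o name | cut -d'@' -f1 | sort | uniq -c | sort -rn | head
```

Both commands only read ZFS metadata, so they are safe to run on a production host.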

Accumulation can be due to a few reasons:
* you forgot to configure the retention policy in `sanoid.conf` for this dataset,
  and sanoid is not removing old snapshots.
  In this case you will see a lot of hourly / daily snapshots.
* there is another source adding snapshots to this dataset, and they are not removed by sanoid.
* sanoid might not be able to clean up the snapshots;
  use `journalctl -xe -u sanoid` to try to see why.

diff --git a/docs/index.md b/docs/index.md
index 79d0da1e..c67c31a9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -40,6 +40,9 @@ Observability (monitoring, alerts) allows us to monitor the health and performan
 We also have a [status page](https://status.openfoodfacts.org/), driven by
 [openfoodfacts-upptime](https://github.com/openfoodfacts/openfoodfacts-upptime)
 and a [specific repository regarding monitoring](https://github.com/openfoodfacts/openfoodfacts-monitoring).

+The [How to handle alerts](./how-to-handle-alerts.md) guide helps you diagnose and fix the issues
+behind an alert.
+
 ### ZFS