From 49a8bbf8a4e001c46537c080586f662fca7895e7 Mon Sep 17 00:00:00 2001
From: Alex Garel
Date: Thu, 12 Dec 2024 19:10:33 +0100
Subject: [PATCH] docs: started a doc on handling alerts (#445)

---
 docs/how-to-handle-alerts.md | 80 ++++++++++++++++++++++++++++++++++++
 docs/index.md                |  3 ++
 2 files changed, 83 insertions(+)
 create mode 100644 docs/how-to-handle-alerts.md

diff --git a/docs/how-to-handle-alerts.md b/docs/how-to-handle-alerts.md
new file mode 100644
index 00000000..65e159ee
--- /dev/null
+++ b/docs/how-to-handle-alerts.md
@@ -0,0 +1,80 @@
# How to handle alerts

This document describes, for each alert, what you should do to diagnose and fix it.


## Postfix mail messages queue is high (slack)

This alert is fired by Prometheus.

It means that the Proxmox Mail Gateway deferred mail queue contains many messages.

This usually happens because of bad email addresses that can't be delivered.

- [Connect to the Proxmox Mail Gateway interface](./mail.md#administration),
- go to *Administration* / *Queues* and check the deferred mails.
  You can see them grouped by target domain.

It might be that:
* we have email addresses of users whose inbox is full or which do not exist anymore.
  In this case the best is to remove those users (also remove them from the newsletter, in Brevo).
* if you have emails addressed to `root@<host>.openfoodfacts.org`,
  either the server mail relay is not configured correctly,
  or the `mailx` program is not installed, or is not the one from the `bsd-mailx` package.
  See [mail configuration for servers](./mail.md#servers).

Finally, when you are done, you can empty the queues.

*Tip*: To extract the email addresses from the details of the deferred mails,
you can use the browser developer tools console with this expression:

```javascript
$x('//div[contains(@id, "pmgPostfixMailQueue")]//div[@class="x-grid-item-container"]//table//td[4]//text()').map(x => x.nodeValue).join("\n")
```

## sanoid_check.sh error on `<host>` (email)

This alert is fired by the `sanoid_check.sh` script, which is regularly run by systemd
on every host using ZFS.

There are two possible alerts for each dataset.

### Last snapshot `<dataset>` is too old

This alert can fire for different reasons:
* if the dataset is a replication of a dataset on another host,
  it is not synchronized anymore.
  This might be transient when there is a large volume of data to transfer,
  until syncoid catches up.

  To diagnose, go on the server and:
  * list the snapshots with `zfs list -t snapshot path/of/dataset` and check the last one
  * if needed, look at the syncoid logs with `journalctl -xe -u syncoid`, searching for your dataset.

  Sometimes the synchronization is not working because the last common snapshot has been removed on the source. See [How to resync ZFS replication](./how-to-resync-zfs-replication.md).

* if the dataset is local,
  it might be that you didn't configure sanoid (`sanoid.conf`) correctly for this dataset:
  either you forgot to [add a snapshot specification](./sanoid.md#sanoid-snapshot-configuration) for it,
  or to add it to the [`no_sanoid_checks` directives](./sanoid.md#sanoid-checks).

  To diagnose, go on the server and:
  * list the snapshots with `zfs list -t snapshot path/of/dataset` and check the last one
  * if needed, look at the sanoid logs with `journalctl -xe -u sanoid`, searching for your dataset.


### `<dataset>` has too many snapshots

This alert fires because snapshots are accumulating on a dataset
(which can lead to increased disk usage).

On the server, list the snapshots with `zfs list -t snapshot path/of/dataset`.
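
For instance, this minimal sketch (using `path/of/dataset` as a placeholder; replace it with your dataset) lists the snapshots oldest first, then counts snapshots per dataset across all pools to spot where they accumulate:

```bash
# List the snapshots of one dataset, oldest first
zfs list -t snapshot -o name,creation -s creation path/of/dataset

# Count snapshots per dataset across all pools, biggest offenders first
zfs list -H -t snapshot -o name | cut -d'@' -f1 | sort | uniq -c | sort -rn | head
```

Both commands only read ZFS metadata, so they are safe to run on a production host.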

Accumulation can be due to a few reasons:
* you forgot to configure the retention policy in `sanoid.conf` for this dataset,
  and sanoid is not removing old snapshots.
  In this case you will see a lot of hourly / daily snapshots.
* there is another source adding snapshots to this dataset, and they are not removed by sanoid.
* sanoid might not be able to clean up the snapshots;
  use `journalctl -xe -u sanoid` to try to see why.

diff --git a/docs/index.md b/docs/index.md
index 79d0da1e..c67c31a9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -40,6 +40,9 @@ Observability (monitoring, alerts) allows us to monitor the health and performan
 We also have a [status page](https://status.openfoodfacts.org/), driven by
 [openfoodfacts-upptime](https://github.com/openfoodfacts/openfoodfacts-upptime)
 and a [specific repository regarding monitoring](https://github.com/openfoodfacts/openfoodfacts-monitoring).

+The [How to handle alerts](./how-to-handle-alerts.md) guide helps you diagnose and fix the issues
+behind an alert.
+
 ### ZFS