# 2024-09-30 OVH3 backups

We need an intervention to change a disk on ovh3.

We still have very few backups for OVH services.

Before the operation, I want to at least have replication of OVH backups on the new MOJI server.

Ideally I would like to use sanoid for every kind of backup, but I don't want to disrupt the current setup, as I don't have the time to do so.

What I wanted to do:
* add sanoid-managed snapshots to the current volume replications of ovh1 / ovh2 containers and VMs
* add sanoid-managed snapshots to the current volumes of ovh3 containers, and add a backup of the ovh3 system (it's not on ZFS)
* synchronize all those ZFS datasets to MOJI; this may need some tweaking because of the replication snapshot, which should not be replicated

This is not feasible, because of the replication! The replication must start from the last replication snapshot and cannot be done in reverse.


What we can do instead:
* add sanoid snapshots on the ovh1 / ovh2 servers
* let the replication of the containers sync those snapshots to ovh3
* keep very few snapshots (1 or 2) on the ovh1 / ovh2 side (we have very little space left)
* keep snapshots for longer on the ovh3 side (a quick check of the result is sketched just below)
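As a quick check that this works end to end, the sanoid snapshots should show up on ovh3 next to the pve replication snapshots. A minimal sketch, where the container ID is an assumption:
```bash
# on ovh3: list the snapshots of a replicated container volume (101 is a hypothetical ID)
zfs list -t snapshot -o name,creation rpool/subvol-101-disk-0
# sanoid snapshots are named like @autosnap_2024-09-30_12:00:01_hourly,
# pve replication snapshots like @__replicate_101-0_1727690400__
```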

## Moving replication to a pve sub dataset (abandoned)

**NOTE:** finally **not done**, because I didn't succeed in making it work, and was not really confident in the procedure.

Currently we have the replications landing directly in `rpool`.
This is annoying because it does not let us configure sanoid
using its recursive property (which would also ensure new volumes are put under sanoid control).
So I would like to move them to a `rpool/pve` sub dataset.


To do this:
- First, the replication of container 106 had been stalled for a long time, so I deleted the replication job and re-created it.
- Using the interface, I then disabled the replication of all VMs and containers to ovh3.
  It can also be done using `pvesr disable <id>`.
- I also stopped the two containers on ovh3 (100 (Munin) and 150 (gdrive-backup)).

- I then created a new dataset: `zfs create rpool/pve`
- Then I edited `/etc/pve/storage.cfg` to change the pool and mountpoint of the rpool storage.
- I tried to move a first replication by using `zfs rename` to move a subvol from `rpool` to `rpool/pve`, and then re-enabled the replication… but it failed with a zfs allow/unallow error (see the sketch after this list).
- As I was not able to understand the error (there was no particular allowed user before, as shown by `zfs allow rpool`), **I stepped back**:
  - disabled replication on the container where I had re-enabled it
  - renamed the volume back to a child of `rpool`
  - restored `/etc/pve/storage.cfg` to its original state
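For the record, a minimal sketch of the attempted move, reconstructed from the steps above; the container ID, job ID and storage name are assumptions:
```bash
# assumptions: container 101, replication job 101-0, storage named "local-zfs"
pvesr disable 101-0                # stop replication for the container
zfs create rpool/pve               # new parent dataset
zfs rename rpool/subvol-101-disk-0 rpool/pve/subvol-101-disk-0

# in /etc/pve/storage.cfg, point the storage at the new dataset:
#   zfspool: local-zfs
#       pool rpool/pve
#       mountpoint /rpool/pve

pvesr enable 101-0                 # this is where it failed with a zfs allow/unallow error
```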



## Adding sanoid snapshots to replicated volumes

### Adding sanoid on ovh1 and ovh2

I installed sanoid using the .deb that was on ovh3:
```bash
# install sanoid dependencies, then the package itself
apt install libcapture-tiny-perl libconfig-inifiles-perl pv lzop mbuffer
dpkg -i /opt/sanoid_2.2.0_all.deb
```

I then:
* created the email-on-failure unit
* personalized the sanoid systemd unit

```bash
cd /opt/openfoodfacts-infrastructure/confs/$HOSTNAME
mkdir -p systemd/system
cd systemd/system
# link the shared unit definitions into this host's configuration
ln -s ../../../common/systemd/system/email-failures\@.service .
ln -s ../../../common/systemd/system/sanoid.service.d .

# expose them to systemd and reload
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/systemd/system/email-failures\@.service /etc/systemd/system
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/systemd/system/sanoid.service.d /etc/systemd/system
systemctl daemon-reload
```

Then I added the `sanoid.conf`, telling sanoid to snapshot the volumes once an hour
but to keep only 2 snapshots.
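The actual file is in the infrastructure repository; a minimal sketch of such a configuration, where the dataset name and template name are assumptions:
```ini
; hypothetical volume; the real conf lists the replicated volumes on this host
[rpool/subvol-101-disk-0]
	use_template = replicated

[template_replicated]
	frequently = 0
	hourly = 2
	daily = 0
	monthly = 0
	yearly = 0
	autosnap = yes
	autoprune = yes
```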

Then we activate it:
```bash
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/sanoid/sanoid.conf /etc/sanoid/
systemctl enable --now sanoid.timer
```

### Configuring sanoid on ovh3

On ovh3 we want to keep more snapshots than on ovh1 and ovh2,
so we configure sanoid accordingly.
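A minimal sketch of the ovh3 side, assuming sanoid only prunes the snapshots that arrive through replication; the retention numbers here are assumptions:
```ini
[rpool/subvol-101-disk-0]
	use_template = backup

[template_backup]
	frequently = 0
	hourly = 24
	daily = 30
	monthly = 6
	yearly = 0
	; snapshots are created on ovh1/ovh2 and arrive via replication
	autosnap = no
	; apply the longer retention on this side
	autoprune = yes
```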

## Syncing to MOJI

On MOJI, we don't currently sync data from ovh3.

I [setup an operator account](../sanoid.md#how-to-setup-synchronization-without-using-root) on ovh3 for moji.
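The linked doc has the full procedure; at its core it delegates the needed ZFS permissions to a non-root user with `zfs allow`. A minimal sketch, where the user name and the exact permission set are assumptions:
```bash
# on ovh3 (send side): allow a dedicated user to send snapshots (user name is an assumption)
zfs allow backup send,snapshot,hold rpool
```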

I created the `syncoid-args.conf` file, with one sync job per line.
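The actual file is in the infrastructure repository; each non-comment line holds the arguments of one syncoid run. A hypothetical line, where the flags, user and dataset paths are assumptions:
```
--no-sync-snap --no-privilege-elevation backup@ovh3.openfoodfacts.org:rpool/subvol-100-disk-0 rpool/off-backups/ovh3/subvol-100-disk-0
```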

I did a first sync using:
```bash
# for each non-comment line of syncoid-args.conf, run one timed syncoid call;
# </dev/null keeps ssh from consuming the loop's stdin
grep -v "^#" syncoid-args.conf | while read -a sync_args; do
  [[ -n "$sync_args" ]] && time syncoid "${sync_args[@]}" </dev/null
done
```

I set up the syncoid service and timer, and enabled them.
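A minimal sketch, assuming the units are stored in the infrastructure repository and linked like the sanoid ones above:
```bash
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/systemd/system/syncoid.service /etc/systemd/system
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/systemd/system/syncoid.timer /etc/systemd/system
systemctl daemon-reload
systemctl enable --now syncoid.timer
```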

## Side fix: fixing vm 200 replication

The replication of VM 200 (docker staging) was stalled on ovh1.

I tried to remove the replication job, but it failed.
To remove it I ran
`pvesr delete 200-0 -force`
and it worked.

I then recreated the replication job.
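This can be done through the interface or from the command line; a sketch, where the schedule is an assumption:
```bash
# recreate the replication of VM 200 towards ovh3 (the schedule is an assumption)
pvesr create-local-job 200-0 ovh3 --schedule '*/15'
```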

## Side fix: removing old volumes

There are volumes remaining on ovh3 from containers that were deleted.

To get an idea of which container a volume came from, we can cat its `/etc/hostname`.
For example, for container 112:
```bash
cat /rpool/subvol-112-disk-0/etc/hostname
mongo2
```

```bash
for num in 109 115 116 117 119 120 122;do echo $num; cat /rpool/subvol-$num-disk-0/etc/hostname;done
109
slack
115
robotoff-dev
116
mongo-dev
117
tensorflow-xp
119
robotoff-net
120
impact-estimator
122
off-net2
```

I then destroyed the following volumes:
```bash
# slack
zfs destroy -r rpool/subvol-109-disk-0
# mongo2
zfs destroy -r rpool/subvol-112-disk-0
# robotoff-dev
zfs destroy -r rpool/subvol-115-disk-0
# mongo-dev
zfs destroy -r rpool/subvol-116-disk-0
# tensorflow-xp
zfs destroy -r rpool/subvol-117-disk-0
# robotoff-net
zfs destroy -r rpool/subvol-119-disk-0
# impact-estimator
zfs destroy -r rpool/subvol-120-disk-0
# off-net2
zfs destroy -r rpool/subvol-122-disk-0
```
