Improve various PostgreSQL runbooks #3

Merged (5 commits) on Nov 17, 2023
4 changes: 2 additions & 2 deletions content/runbooks/postgresql/PostgreSQLExporterDown.md
@@ -14,7 +14,7 @@ The monitoring system is degraded. PostgreSQL exporter does not collect PostgreS

## Diagnosis

1. Look at Prometheus PostgreSQL exporter logs. The error messages (`level=error`) should explain why the exporter can't scrape metrics.
1. **Look at Prometheus PostgreSQL exporter logs**. The error messages (`level=error`) should explain why the exporter can't scrape metrics.

Usually, the exporter can't connect to the PostgreSQL server due to network restrictions, authentication failure, missing permissions, or a timeout.

@@ -24,7 +24,7 @@ The monitoring system is degraded. PostgreSQL exporter does not collect PostgreS
The PostgreSQL exporter needs the `pg_monitoring` role and the `LOGIN` option (see the sketch below).
{{< /hint >}}

1. Look at PostgreSQL connection logs
1. **Look at PostgreSQL connection logs**

You'll get an error message if PostgreSQL exporter connections are rejected by PostgreSQL.
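
As a hedged illustration of the most common cause (authentication failure or missing permissions), the sketch below creates a dedicated login role for the exporter and grants it the `pg_monitoring` role mentioned in the hint above. The role name `postgres_exporter`, the password placeholder, and the target database `postgres` are assumptions, not part of this runbook.

```sql
-- Minimal sketch: create a login role for the exporter (name and password are examples).
CREATE ROLE postgres_exporter WITH LOGIN PASSWORD 'change-me';

-- Grant the monitoring role mentioned in the hint above.
GRANT pg_monitoring TO postgres_exporter;

-- Allow the exporter to connect to the target database (example: postgres).
GRANT CONNECT ON DATABASE postgres TO postgres_exporter;
```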

@@ -18,7 +18,9 @@ The monitoring system is degraded. PostgreSQL exporter does not collect PostgreS

An overloaded server may have difficulty collecting metrics.

1. Check `prometheus-postgresql-exporter` logs
2. Look at PostgreSQL server logs to identify long-running queries.

You may need to enable [`log_min_duration_statement`](https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-MIN-DURATION-STATEMENT) to identify which queries take a long time to execute.
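
A hedged sketch for the step above: enable slow-query logging and list the queries currently running the longest. The thresholds (1 second, 1 minute) are example values, not recommendations from this runbook.

```sql
-- Log every statement that takes longer than 1 second (example threshold).
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();

-- List queries that have been running for more than 1 minute.
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;
```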

## Mitigation

@@ -20,25 +20,30 @@ Logical replication slots are used by applications for [Change Data Capture](htt
Examples of services that may use CDC: Kafka Connect, AWS DMS, ...
{{< /hint >}}

1. Prioritize. Look at the replication slot disk space consumption trend in `Replication slot available storage` panel of the `Replication slot dashboard` to estimate the delay before reaching storage space saturation

2. Identify the non-running logical replication slot

The `database` and `slot_name` information provide elements to identify the slot replication client.

If the `wal_status` is `lost`, you may need to recreate the slot.
1. **Prioritize**. Look at the replication slot disk space consumption trend in the `Replication slot available storage` panel of the `Replication slot dashboard` to estimate the delay before reaching storage space saturation.

2. **Identify** the non-running logical replication slot
<details>
<summary>SQL</summary>
<summary>List of Replication Slots</summary>

{{% sql "sql/list-replication-slots.sql" %}}

</details>

The inactive slot can be identified by `active=false` in the output of the SQL query above.

The `database` and `slot_name` columns help identify the replication slot client.

If the `wal_status` is `lost`, you may need to recreate the slot.
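
Building on the prioritization and identification steps above, the hedged sketch below lists inactive logical slots together with their `wal_status` and an estimate of the WAL retained on disk (it assumes PostgreSQL 13 or later for `wal_status`).

```sql
-- Inactive logical replication slots and the WAL they retain on the primary.
SELECT
  slot_name,
  database,
  active,
  wal_status,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND NOT active;
```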

## Mitigation

The replication slot client is not consuming its replication slot. Investigate and fix the replication slot client.
The replication slot client is not consuming its replication slot. Investigate and fix the replication slot client:

- Ensure the client consuming the replication slot (Kafka Connect, AWS DMS, etc.) is up and running
- Check logs of the client to determine if it is producing an error
- Check logs of the PostgreSQL server to determine if there is an error related to the client
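
If the client cannot be recovered and the slot's `wal_status` is `lost`, the slot has to be recreated. A hedged sketch follows; the slot name `cdc_slot` and the `pgoutput` plugin are examples, and dropping a slot discards the client's replication position, so the CDC client must be re-initialized afterwards.

```sql
-- Drop the broken slot to release the retained WAL (example slot name).
SELECT pg_drop_replication_slot('cdc_slot');

-- Recreate it with the output plugin expected by the client (example: pgoutput).
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
```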

## Additional resources

- <https://www.postgresql.org/docs/15/view-pg-replication-slots.html>
- <https://www.postgresql.org/docs/current/view-pg-replication-slots.html>
@@ -15,40 +15,49 @@ Alert is triggered when a PostgreSQL physical replication slot is inactive.
## Diagnosis

{{< hint info >}}
Physical replication is only used by AWS to replicate RDS instances.
Most cloud providers use physical replication for replicas. This is the case for AWS RDS.
{{< /hint >}}

{{< hint info >}}
A newly created RDS instance may need time to replay WAL files since the last full backup. The replication slot will not be used until the replicas have replayed all the WAL files.
{{< /hint >}}

1. Prioritize. Look at the replication slot disk space consumption trend in `Replication slot available storage` panel of the `Replication slot dashboard` to estimate the delay before reaching storage space saturation
1. **Prioritize**. Look at the replication slot disk space consumption trend in the `Replication slot available storage` panel of the `Replication slot dashboard` to estimate the delay before reaching storage space saturation.

<details>
<summary>Find the RDS instance that uses the physical replication slot</summary>
<ol>
<li>Identify which replication slot is consuming disk space (see the query below)</li>
<li>Extract the AWS RDS <i>resource_id</i> from the slot name (<i>rds_[aws_region]_db_[resource_id]</i>)</li>
<li>Found the RDS instance in <b>RDS instances dashboard</b></li>
<li>Extract the AWS RDS <code>resource_id</code> from the slot name (<code>rds_[aws_region]_db_[resource_id]</code>)</li>
<li>Search the RDS instance in <b>RDS instances dashboard</b></li>
</ol>
</details>

2. Check lag of RDS replica in `RDS instance details dashboard`

3. Check replica instance logs in AWS Cloudwatch
3. Check replica instance logs

Logs of RDS instances can be seen in AWS CloudWatch if they are exported to it.

You may see replaying WAL file messages
If the replica logs show it is still replaying WAL files, inactivity of the physical replication slot is expected; the slot should become active again once all WAL files have been replayed.
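
A hedged sketch covering step 1 above (find the slot that retains WAL, run on the primary) and step 3 (check replay progress, run on the replica once it accepts connections); it assumes PostgreSQL 13 or later for `wal_status`.

```sql
-- On the primary: physical slots and the WAL they retain.
SELECT
  slot_name,
  active,
  wal_status,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'physical';

-- On the replica: confirm it is still in recovery and check replay progress.
SELECT
  pg_is_in_recovery(),
  pg_last_wal_receive_lsn(),
  pg_last_wal_replay_lsn(),
  pg_last_xact_replay_timestamp();
```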

## Mitigation

1. If an RDS replica instance was just created
1. If the replica instance was just created

- If the primary instance doesn't risk disk space saturation, wait until instance initialization is finished.

The initialization phase can take several hours, especially on AWS RDS, as the WAL replay process is single-threaded.

- If there is a risk of saturating disk space, delete the replica that owns the non-running replication slot.

- If the RDS primary instance doesn't risk disk space saturation, wait until RDS initialization is finished
- Otherwise, delete the RDS replica that owns the non-running replication slot
If you are on AWS RDS, recreate the RDS replica after a full RDS snapshot and during a low-activity period to limit the number of WAL files to replay.

Recreate the RDS replica after a full RDS snapshot and in a low activity period to limit WAL files to replay
If you manage the replication yourself, you may need to delete the physical replication slot (see the sketch after this list).

1. Increase disk space on the primary instance
2. Increase disk space on the primary instance

1. Open AWS support case to report non-running physical replication
3. On AWS RDS, open an AWS support case to report non-running physical replication
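
For the self-managed case mentioned in step 1 (deleting the physical replication slot), a hedged sketch follows. The slot name is an example, and dropping the slot is only safe once the standby that owned it has been decommissioned or re-created.

```sql
-- Confirm the slot is inactive before removing it.
SELECT slot_name, active
FROM pg_replication_slots
WHERE slot_type = 'physical' AND NOT active;

-- Drop the orphaned physical slot so the primary can recycle WAL (example name).
SELECT pg_drop_replication_slot('standby_1_slot');
```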

## Additional resources

1 change: 1 addition & 0 deletions content/runbooks/postgresql/sql/list-replication-slots.sql
@@ -1,4 +1,5 @@
SELECT
slot_type,
database,
slot_name,
active::TEXT,