Improve various PostgreSQL runbooks #3

Merged (5 commits) on Nov 17, 2023
4 changes: 2 additions & 2 deletions content/runbooks/postgresql/PostgreSQLExporterDown.md
@@ -14,7 +14,7 @@ The monitoring system is degraded. PostgreSQL exporter does not collect PostgreS

## Diagnosis

1. Look at Prometheus PostgreSQL exporter logs. The error messages (`level=error`) should explain why the exporter can't scrape metrics.
1. **Look at Prometheus PostgreSQL exporter logs**. The error messages (`level=error`) should explain why the exporter can't scrape metrics.

Usually, the exporter can't connect to the PostgreSQL server due to network restrictions, authentication failure, missing permissions, or a timeout.

@@ -24,7 +24,7 @@ The monitoring system is degraded. PostgreSQL exporter does not collect PostgreS
The PostgreSQL exporter needs the `pg_monitoring` role and the `LOGIN` option (see the sketch below).
{{< /hint >}}

1. Look at PostgreSQL connection logs
1. **Look at PostgreSQL connection logs**

You'll get an error message if PostgreSQL exporter connections are rejected by PostgreSQL.
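
As a hedged illustration of the most common cause (authentication failure or missing permissions), the sketch below creates a dedicated login role for the exporter and grants it the `pg_monitoring` role mentioned in the hint above. The role name `postgres_exporter`, the password placeholder, and the target database `postgres` are assumptions, not part of this runbook.

```sql
-- Minimal sketch: create a login role for the exporter (name and password are examples).
CREATE ROLE postgres_exporter WITH LOGIN PASSWORD 'change-me';

-- Grant the monitoring role mentioned in the hint above.
GRANT pg_monitoring TO postgres_exporter;

-- Allow the exporter to connect to the target database (example: postgres).
GRANT CONNECT ON DATABASE postgres TO postgres_exporter;
```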

@@ -18,7 +18,9 @@ The monitoring system is degraded. PostgreSQL exporter does not collect PostgreS

An overloaded server may have difficulty collecting metrics.

1. Check `prometheus-postgresql-exporter` logs
2. Look at PostgreSQL server logs to identify long-running queries.

You may need to enable [`log_min_duration_statement`](https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-MIN-DURATION-STATEMENT) to identify which queries take a long time to execute.
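
A hedged sketch for the step above: enable slow-query logging and list the queries currently running the longest. The thresholds (1 second, 1 minute) are example values, not recommendations from this runbook.

```sql
-- Log every statement that takes longer than 1 second (example threshold).
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();

-- List queries that have been running for more than 1 minute.
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;
```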

## Mitigation

@@ -20,25 +20,30 @@ Logical replication slots are used by applications for [Change Data Capture](htt
Examples of services that may use CDC: Kafka Connect, AWS DMS, ...
{{< /hint >}}

1. Prioritize. Look at the replication slot disk space consumption trend in `Replication slot available storage` panel of the `Replication slot dashboard` to estimate the delay before reaching storage space saturation

2. Identify the non-running logical replication slot

The `database` and `slot_name` information provide elements to identify the slot replication client.

If the `wal_status` is `lost`, you may need to recreate the slot.
1. **Prioritize**. Look at the replication slot disk space consumption trend in the `Replication slot available storage` panel of the `Replication slot dashboard` to estimate the delay before reaching storage space saturation.

2. **Identify** the non-running logical replication slot
<details>
<summary>SQL</summary>
<summary>List of Replication Slots</summary>

{{% sql "sql/list-replication-slots.sql" %}}

</details>

The inactive slot can be identified by `active=false` in the output of the SQL query above.

The `database` and `slot_name` columns help identify the replication slot client.

If the `wal_status` is `lost`, you may need to recreate the slot.
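
Building on the prioritization and identification steps above, the hedged sketch below lists inactive logical slots together with their `wal_status` and an estimate of the WAL retained on disk (it assumes PostgreSQL 13 or later for `wal_status`).

```sql
-- Inactive logical replication slots and the WAL they retain on the primary.
SELECT
  slot_name,
  database,
  active,
  wal_status,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND NOT active;
```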

## Mitigation

The replication slot client is not consuming its replication slot. Investigate and fix the replication slot client.
The replication slot client is not consuming its replication slot. Investigate and fix the replication slot client:

- Ensure the client consuming the replication slot (Kafka Connect, AWS DMS, etc.) is up and running
- Check logs of the client to determine if it is producing an error
- Check logs of the PostgreSQL server to determine if there is an error related to the client
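
If the client cannot be recovered and the slot's `wal_status` is `lost`, the slot has to be recreated. A hedged sketch follows; the slot name `cdc_slot` and the `pgoutput` plugin are examples, and dropping a slot discards the client's replication position, so the CDC client must be re-initialized afterwards.

```sql
-- Drop the broken slot to release the retained WAL (example slot name).
SELECT pg_drop_replication_slot('cdc_slot');

-- Recreate it with the output plugin expected by the client (example: pgoutput).
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
```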

## Additional resources

- <https://www.postgresql.org/docs/15/view-pg-replication-slots.html>
- <https://www.postgresql.org/docs/current/view-pg-replication-slots.html>
@@ -15,40 +15,49 @@ Alert is triggered when a PostgreSQL physical replication slot is inactive.
## Diagnosis

{{< hint info >}}
Physical replication is only used by AWS to replicate RDS instances.
Most cloud providers use physical replication for replicas. This is the case for AWS RDS.
{{< /hint >}}

{{< hint info >}}
A newly created RDS instance may need time to replay WAL files since the last full backup. The replication slot will not be used until the replicas have replayed all the WAL files.
{{< /hint >}}

1. Prioritize. Look at the replication slot disk space consumption trend in `Replication slot available storage` panel of the `Replication slot dashboard` to estimate the delay before reaching storage space saturation
1. **Prioritize**. Look at the replication slot disk space consumption trend in the `Replication slot available storage` panel of the `Replication slot dashboard` to estimate the delay before reaching storage space saturation.

<details>
<summary>Find the RDS instance that uses the physical replication slot</summary>
<ol>
<li>Identify which replication slot is consuming disk space (see the query below)</li>
<li>Extract the AWS RDS <i>resource_id</i> from the slot name (<i>rds_[aws_region]_db_[resource_id]</i>)</li>
<li>Found the RDS instance in <b>RDS instances dashboard</b></li>
<li>Extract the AWS RDS <code>resource_id</code> from the slot name (<code>rds_[aws_region]_db_[resource_id]</code>)</li>
<li>Search the RDS instance in <b>RDS instances dashboard</b></li>
</ol>
</details>

2. Check lag of RDS replica in `RDS instance details dashboard`

3. Check replica instance logs in AWS Cloudwatch
3. Check replica instance logs

Logs of RDS instances can be seen in AWS CloudWatch if they are exported to it.

You may see replaying WAL file messages
If the replica logs show it is still replaying WAL files, inactivity of the physical replication slot is expected; the slot should become active again once all WAL files have been replayed.
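
A hedged sketch covering step 1 above (find the slot that retains WAL, run on the primary) and step 3 (check replay progress, run on the replica once it accepts connections); it assumes PostgreSQL 13 or later for `wal_status`.

```sql
-- On the primary: physical slots and the WAL they retain.
SELECT
  slot_name,
  active,
  wal_status,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'physical';

-- On the replica: confirm it is still in recovery and check replay progress.
SELECT
  pg_is_in_recovery(),
  pg_last_wal_receive_lsn(),
  pg_last_wal_replay_lsn(),
  pg_last_xact_replay_timestamp();
```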

## Mitigation

1. If an RDS replica instance was just created
1. If the replica instance was just created

- If the primary instance doesn't risk disk space saturation, wait until instance initialization is finished.

The initialization phase can take several hours, especially on AWS RDS, as the WAL replay process is single-threaded.

- If there is a risk of saturating disk space, delete the replica that owns the non-running replication slot.

- If the RDS primary instance doesn't risk disk space saturation, wait until RDS initialization is finished
- Otherwise, delete the RDS replica that owns the non-running replication slot
If you are on AWS RDS, recreate the RDS replica after a full RDS snapshot and during a low-activity period to limit the number of WAL files to replay.

Recreate the RDS replica after a full RDS snapshot and in a low activity period to limit WAL files to replay
If you manage the replication yourself, you may need to delete the physical replication slot (see the sketch after this list).

1. Increase disk space on the primary instance
2. Increase disk space on the primary instance

1. Open AWS support case to report non-running physical replication
3. On AWS RDS, open an AWS support case to report non-running physical replication
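
For the self-managed case mentioned in step 1 (deleting the physical replication slot), a hedged sketch follows. The slot name is an example, and dropping the slot is only safe once the standby that owned it has been decommissioned or re-created.

```sql
-- Confirm the slot is inactive before removing it.
SELECT slot_name, active
FROM pg_replication_slots
WHERE slot_type = 'physical' AND NOT active;

-- Drop the orphaned physical slot so the primary can recycle WAL (example name).
SELECT pg_drop_replication_slot('standby_1_slot');
```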

## Additional resources

1 change: 1 addition & 0 deletions content/runbooks/postgresql/sql/list-replication-slots.sql
@@ -1,4 +1,5 @@
SELECT
slot_type,
database,
slot_name,
active::TEXT,