Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Update docs with new Validator Dashboard template #309

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 26 additions & 7 deletions docs/operate/validators/monitoring-alerting.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

Validators expose metrics on the port number specified by the argument `--metrics`. Port `9090` is the default, though any valid port can be chosen.

We recommend using Prometheus to scrape these metrics and Grafana to visualize them, using [this dashboard JSON template](https://github.com/hyperlane-xyz/hyperlane-monorepo/blob/2f7f714c5b698fedcead9825f52da6ef95f96be1/tools/grafana/validator-dashboard-template.json), as seen below. The benefit of this dashboard is that it displays multiple validators at the same time. The screenshot shows metrics grouped by chain or by kubernetes pod.
We recommend using Prometheus to scrape these metrics and Grafana to visualize them, using [this dashboard JSON template](https://github.com/hyperlane-xyz/hyperlane-monorepo/blob/84e8c5c218ac5d211c766cf4d5a135aee10e508c/tools/grafana/validator-dashboard-template.json), as seen below. The benefit of this dashboard is that it displays multiple validators at the same time. The screenshot shows metrics grouped by chain or by kubernetes pod.

![Validator Grafana Dashboard Template](/img/dashboard-template-validator.png)

Expand All @@ -15,12 +15,31 @@ If running as a Docker image, make sure to port-forward the metrics endpoint por
## Metrics

The dashboard template includes the following metrics.
| Metric | Description |
|---------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `hyperlane_latest_checkpoint` | The latest checkpoint observed by the validator, visualized as 30 min increases ("Signed checkpoint diffs"). If this metric is not positive in spite of `hyperlane_block_height` increasing, it could mean that either there are no new messages, or that the validator stopped signing checkpoints. |
| `hyperlane_block_height` | The block height of the RPC node the agent is connected to, visualized as 30 min increases ("Latest block diffs"). If this metric is not increasing, the RPC may be unhealthy and need changing. |
| `hyperlane_span_events_total{agent="validator", event_level="error"}` | The total number of errors logged, visualized as 30 min increases ("Error log count diffs"). If the derivative of this metric exceeds 1 over the last hour, at least a low-severity alert is warranted. Note that the dashboard query groups metrics by kubernetes pod name, so you may need to adjust this query if you are not running in a kubernetes environment. |
| `hyperlane_span_events_total{agent="validator", event_level="warn"}` | The total number of warnings logged, visualized as 30 min increases ("Warning log count diffs"). If the derivative of this metric exceeds 1 over the last hour, at least a low-severity alert is warranted. Note that the dashboard query groups metrics by kubernetes pod name, so you may need to adjust this query if you are not running in a kubernetes environment. |

| Metric | Description |
|-----------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `hyperlane_latest_checkpoint` | The latest checkpoint observed by the validator |
| `hyperlane_block_height` | The block height of the RPC node the agent is connected to. This metric is updated periodically independently from the checkpoint indexing functionality of validator. |
| `hyperlane_contract_sync_liveness` | The timestamp which is reported on every iteration of the indexing loop of validator. |
| `hyperlane_contract_sync_block_height` | The block height observed by validator as part of indexing loop. |
| `hyperlane_cursor_current_block` | The block height observed by internal cursor of the validator. |
| `hyperlane_span_events_total{agent="validator", event_level="error"}` | The total number of errors logged, visualized as 30 min increases ("Error log count diffs"). If the derivative of this metric exceeds 1 over the last hour, at least a low-severity alert is warranted. Note that the dashboard query groups metrics by kubernetes pod name, so you may need to adjust this query if you are not running in a kubernetes environment. |
| `hyperlane_span_events_total{agent="validator", event_level="warn"}` | The total number of warnings logged, visualized as 30 min increases ("Warning log count diffs"). If the derivative of this metric exceeds 1 over the last hour, at least a low-severity alert is warranted. Note that the dashboard query groups metrics by kubernetes pod name, so you may need to adjust this query if you are not running in a kubernetes environment. |

## Charts

The dashboard template includes the following charts.

| Chart | Description |
|-----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Signed Checkpoint Diffs (30m)` | The latest checkpoint observed by the validator, visualized as 30 min increases. If values not positive in spite of `Latest Block Diffs (30m)` stable, it could mean that either there are no new messages, or that the validator stopped signing checkpoints. |
| `Diff observed - processed checkpoints` | The difference between checkpoints which validator observed on blockchain and the checkpoints which validator signed and published. This chart should be zero most of the time with occasional and short lived positive values. If it is positive for extended period of time, it means that validator has issues with signing and publishing checkpoints. |
| `Latest Block Diffs (30m)` | The block height of the RPC node the agent is connected to, visualized as 30 min increases ("Latest block diffs"). If this metric is not increasing, the RPC may be unhealthy and need changing. |
| `Contract Sync Liveness` | Renders metric `hyperlane_contract_sync_liveness` as it is. The graph should always go up. If it is constant, it indicates an issue. If chart contains metrics for several chains, they will be rendered on top of each other. It is normal when nothing is broken. Deviations will be visible if there is an issue which should be investigated. |
| `Contract Sync Block Diffs (30m)` | Renders metric `hyperlane_contract_sync_block_height` as 30 min increases. It should approximately match the number of blocks produced by the chain during last 30 minutes. If the metric deviate significantly in any direction, becomes zero, it should be investigated. |
| `Cursor Block Diffs (30m)` | Renders metric `hyperlane_cursor_current_block` as 30 min increases. It should approximately match the number of blocks produced by the chain during last 30 minutes. If the metric deviate significantly in any direction, becomes zero, it should be investigated. |
| `Error log count diffs (30m)` | The total number of errors logged, visualized as 30 min increases ("Error log count diffs"). If the derivative of this metric exceeds 1 over the last hour, at least a low-severity alert is warranted. Note that the dashboard query groups metrics by kubernetes pod name, so you may need to adjust this query if you are not running in a kubernetes environment. |
| `Warning log count diffs (30m)` | The total number of warnings logged, visualized as 30 min increases ("Warning log count diffs"). If the derivative of this metric exceeds 1 over the last hour, at least a low-severity alert is warranted. Note that the dashboard query groups metrics by kubernetes pod name, so you may need to adjust this query if you are not running in a kubernetes environment. |
ameten marked this conversation as resolved.
Show resolved Hide resolved


## Alerts
Expand Down
Binary file modified static/img/dashboard-template-validator.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.