(#236) agent_state_summary: Count nodes without report as unhealthy
It's possible that a Puppet Agent was stopped or disabled and all old
reports were garbage collected from PuppetDB. The node still exists in
PuppetDB, but when checking for a report the timestamp is null:

```
puppet query 'nodes[certname,report_timestamp]{}'
```

```json
[
  {
    "certname": "pe.tim.local",
    "report_timestamp": "2024-09-30T13:21:17.042Z"
  },
  {
    "certname": "pe2.tim.local",
    "report_timestamp": null
  }
]
```
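
As a rough illustration (Python, not part of the plan itself), the response above can be split into nodes with and without a report by testing for the null timestamp. The variable names are made up for this sketch.

```python
import json

# Response copied from the query output above.
response = """
[
  {"certname": "pe.tim.local", "report_timestamp": "2024-09-30T13:21:17.042Z"},
  {"certname": "pe2.tim.local", "report_timestamp": null}
]
"""

nodes = json.loads(response)

# JSON null is parsed as None, which plays the role of Puppet's undef here.
no_report = [n["certname"] for n in nodes if n["report_timestamp"] is None]
with_report = [n["certname"] for n in nodes if n["report_timestamp"] is not None]

print("no report:", no_report)
print("with report:", with_report)
```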

Previously we always assumed that `report_timestamp` holds a valid
timestamp. With this patch we explicitly check for a missing timestamp
and count nodes without a timestamp as unhealthy.

Now with the fix:

```
puppet plan run pe_status_check::agent_state_summary --environment peadm log_healthy_nodes=true log_unhealthy_nodes=true
```

```json
{
    "responsive": [
        "pe.tim.local",
        "pe2.tim.local"
    ],
    "healthy_counter": 0,
    "total_counter": 2,
    "unhealthy_counter": 2,
    "noop": [],
    "unhealthy": [
        "pe2.tim.local",
        "pe.tim.local"
    ],
    "healthy": [],
    "changed": [
        "pe.tim.local"
    ],
    "no_report": [
        "pe.tim.local"
    ],
    "corrective_changes": [],
    "used_cached_catalog": [
        "pe2.tim.local"
    ],
    "unresponsive": [],
    "failed": []
}
```
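
The counters in this result follow from the category lists: `unhealthy` is the deduplicated union of all problem categories, and `healthy` is every remaining certname. A minimal Python sketch of that relationship follows (not the plan's Puppet code). The `summarize` helper is hypothetical, and deriving `total_counter` from the number of certnames is an assumption based on the sample output.

```python
def summarize(fqdns, categories):
    """Hypothetical helper: derive unhealthy/healthy lists and their counters."""
    unhealthy = []
    # Flatten all category lists and keep the first occurrence of each name.
    for names in categories.values():
        for name in names:
            if name not in unhealthy:
                unhealthy.append(name)
    # Every node that appears in no problem category counts as healthy.
    healthy = [name for name in fqdns if name not in unhealthy]
    return {
        "unhealthy": unhealthy,
        "healthy": healthy,
        "unhealthy_counter": len(unhealthy),
        "healthy_counter": len(healthy),
        "total_counter": len(fqdns),  # assumption: total is the number of certnames
    }


# Category lists taken from the sample output above.
result = summarize(
    ["pe.tim.local", "pe2.tim.local"],
    {
        "noop": [],
        "corrective_changes": [],
        "used_cached_catalog": ["pe2.tim.local"],
        "failed": [],
        "changed": ["pe.tim.local"],
        "unresponsive": [],
        "no_report": ["pe.tim.local"],
    },
)
print(result)
```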
bastelfreak committed Oct 7, 2024
1 parent 1941487 commit e241f42
Showing 2 changed files with 11 additions and 3 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -168,6 +168,7 @@ The plan `pe_status_check::agent_state_summary` provides you a hash with all nod
"failed": [ ],
"changed": [ "student2.local" ],
"unresponsive": [ "student3.local", "student4.local", "student1.local", "login.local" ],
+ "no_report": [ "newnode.with.report.local" ],
"responsive": [ "pe.bastelfreak.local"],
"unhealthy": [ "student2.local", "student3.local", "student4.local", "student1.local", "login.local" ],
"unhealthy_counter": 5,
@@ -181,6 +182,7 @@ The plan `pe_status_check::agent_state_summary` provides you a hash with all nod
* `failed`: The last catalog couldn't be compiled or catalog application raised an error
* `changed`: A node reported a change
* `unresponsive`: Last report is older than 30 minutes (can be configured via the `runinterval` parameter)
+ * `no_report`: The node exists in PuppetDB but has no reports
* `corrective_changes`: A node reported corrective changes
* `used_cached_catalog`: The node didn't apply a new catalog but used a cached version
* `unhealthy`: List of nodes that are in any of the above categories
12 changes: 9 additions & 3 deletions plans/agent_state_summary.pp
@@ -16,10 +16,15 @@
$nodes = puppetdb_query('nodes[certname,latest_report_noop,latest_report_corrective_change,cached_catalog_status,latest_report_status,report_timestamp]{}')
$fqdns = $nodes.map |$node| { $node['certname'] }

- # check if the last catalog is older than X minutes
+ # check if the node has a report
+ # `report_timestamp` will be undef, or null, if no report exists
+ $no_report_nodes = $nodes.filter |$node| { $node['report_timestamp'] =~ Undef }
+ $no_report = $no_report_nodes.map |$node| { $node['certname'] }
+
+ # check if the last report is older than X minutes, for all nodes that have a report
$current_timestamp = Integer(Timestamp().strftime('%s'))
$runinterval_seconds = $runinterval * 60
- $unresponsive = $nodes.map |$node| {
+ $unresponsive = $no_report_nodes.map |$node| {
$old_timestamp = Integer(Timestamp($node['report_timestamp']).strftime('%s'))
if ($current_timestamp - $old_timestamp) >= $runinterval_seconds {
$node['certname']
@@ -45,7 +50,7 @@
$changed = $nodes.map |$node| { if ($node['latest_report_status'] == 'changed'){ $node['certname'] } }.filter |$node| { $node =~ NotUndef }

# all nodes that aren't healthy in any form
- $unhealthy = [$noop, $corrective_changes, $used_cached_catalog, $failed, $changed, $unresponsive].flatten.unique
+ $unhealthy = [$noop, $corrective_changes, $used_cached_catalog, $failed, $changed, $unresponsive, $no_report].flatten.unique

# all healthy nodes
$healthy = $fqdns - $unhealthy
@@ -58,6 +63,7 @@
'failed' => $failed,
'changed' => $changed,
'unresponsive' => $unresponsive,
+ 'no_report' => $no_report,
'responsive' => $responsive,
'unhealthy' => $unhealthy,
'unhealthy_counter' => $unhealthy.count,

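The comment in the plan says the age check applies to nodes that have a report. A minimal Python sketch of that intent follows; it is an illustration under stated assumptions, not the plan's Puppet code. The `unresponsive_nodes` helper and its parameters are hypothetical. It skips nodes whose `report_timestamp` is null and compares whole seconds, as the plan's `strftime('%s')` conversion does.

```python
from datetime import datetime, timezone


def unresponsive_nodes(nodes, runinterval_minutes, now):
    """Hypothetical helper: certnames whose last report is too old."""
    limit_seconds = runinterval_minutes * 60
    current = int(now.timestamp())
    stale = []
    for node in nodes:
        raw = node["report_timestamp"]
        if raw is None:
            # No report at all: an age cannot be computed, so skip the node.
            continue
        # PuppetDB returns ISO 8601 with a trailing "Z"; normalise it to an offset.
        reported = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if current - int(reported.timestamp()) >= limit_seconds:
            stale.append(node["certname"])
    return stale


# Node data taken from the query output in the commit message.
nodes = [
    {"certname": "pe.tim.local", "report_timestamp": "2024-09-30T13:21:17.042Z"},
    {"certname": "pe2.tim.local", "report_timestamp": None},
]

# Assumed "current time" for the example: exactly 30 minutes after the report.
now = datetime(2024, 9, 30, 13, 51, 17, tzinfo=timezone.utc)
print(unresponsive_nodes(nodes, 30, now))
```

Passing `now` as a parameter keeps the check deterministic; a caller would typically supply `datetime.now(timezone.utc)`.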