Skip to content

Latest commit

 

History

History
200 lines (138 loc) · 8.96 KB

README.md

File metadata and controls

200 lines (138 loc) · 8.96 KB

Prometheus Slurm Exporter

Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.

Exported Metrics

State of the CPUs

  • Allocated: CPUs which have been allocated to a job.
  • Idle: CPUs not allocated to a job and thus available for use.
  • Other: CPUs which are unavailable for use at the moment.
  • Total: total number of CPUs.

State of the Nodes

  • Allocated: nodes which has been allocated to one or more jobs.
  • Completing: all jobs associated with these nodes are in the process of being completed.
  • Down: nodes which are unavailable for use.
  • Drain: with this metric two different states are accounted for:
    • nodes in drained state (marked unavailable for use per system administrator request)
    • nodes in draining state (currently executing jobs but which will not be allocated for new ones).
  • Fail: these nodes are expected to fail soon and are unavailable for use per system administrator request.
  • Error: nodes which are currently in an error state and not capable of running any jobs.
  • Idle: nodes not allocated to any jobs and thus available for use.
  • Maint: nodes which are currently marked with the maintenance flag.
  • Mixed: nodes which have some of their CPUs ALLOCATED while others are IDLE.
  • Resv: these nodes are in an advanced reservation and not generally available.

Information extracted from the SLURM sinfo command

Status of the Jobs

  • PENDING: Jobs awaiting for resource allocation.
  • PENDING_DEPENDENCY: Jobs awaiting because of a unexecuted job dependency.
  • RUNNING: Jobs currently allocated.
  • SUSPENDED: Job has an allocation but execution has been suspended and CPUs have been released for other jobs.
  • CANCELLED: Jobs which were explicitly cancelled by the user or system administrator.
  • COMPLETING: Jobs which are in the process of being completed.
  • COMPLETED: Jobs have terminated all processes on all nodes with an exit code of zero.
  • CONFIGURING: Jobs have been allocated resources, but are waiting for them to become ready for use.
  • FAILED: Jobs terminated with a non-zero exit code or other failure condition.
  • TIMEOUT: Jobs terminated upon reaching their time limit.
  • PREEMPTED: Jobs terminated due to preemption.
  • NODE_FAIL: Jobs terminated due to failure of one or more allocated nodes.

Information extracted from the SLURM squeue command

Scheduler Information

  • Server Thread count: The number of current active slurmctld threads.
  • Queue size: The length of the scheduler queue.
  • Last cycle: Time in microseconds for last scheduling cycle.
  • Mean cycle: Mean of scheduling cycles since last reset.
  • Cycles per minute: Counter of scheduling executions per minute.
  • (Backfill) Last cycle: Time in microseconds of last backfilling cycle.
  • (Backfill) Mean cycle: Mean of backfilling scheduling cycles in microseconds since last reset.
  • (Backfill) Depth mean: Mean of processed jobs during backfilling scheduling cycles since last reset.

Information extracted from the SLURM sdiag command

How to build an RPM package from the relases

Consult the following document under the packaging/rpm subdirectory.

How to build the exporter from the sources

Debian

Install the Prometheus Go client library

>>> apt install golang-github-prometheus-client-golang-dev

Use the Makefile to build and test the code.

Debian Jessie: in this release, the Prometheus client library package was available only through the backport archives but the Debian maintainers discontinued it, as explained here. Now only Debian Stretch is supported with the previous build method.

CentOS

Under CentOS not all the GOlang dependencies are available as packages.

In order to use the Makefile provided with this repository you can proceed as follows:

  1. Install the GOlang compiler plus GIT and make:
yum install git golang-bin make
  1. Clone this repo and export the GOPATH environment variable:
git clone https://github.com/vpenso/prometheus-slurm-exporter.git
cd prometheus-slurm-exporter
export GOPATH=$(pwd):/usr/share/gocode
  1. Install all the necessary GOlang dependencies:
go get github.com/prometheus/client_golang/prometheus
go get github.com/sirupsen/logrus
go get gopkg.in/alecthomas/kingpin.v2

NOTE: all these packages will be saved under the src subdirectory of the Slurm exporter project.

In some cases, the go get command may report the following message:

package github.com/[...]/[...]: no buildable Go source files

This should be considered harmless and it will not affect the build process.

  1. Build the executable:
make build

Command line options

The following is the list of the command line options available on this exporter:

:~$ prometheus-slurm-exporter -h
Usage of ./prometheus-slurm-exporter:
  -listen-address string
    	The address to listen on for HTTP requests. (default ":8080")
  -log.format value
    	Set the log target and format. Example: "logger:syslog?appname=bob&local=7" or "logger:stdout?json=true" (default "logger:stderr")
  -log.level value
    	Only log messages with the given severity or above. Valid levels: [debug, info, warn, error, fatal] (default "info")

Installation

After successfully ran make, you will have a binary called prometheus-slurm-exporter under the bin/ subdirectory in your local copy of this repository. You can now copy this binary wherever you have installed the Slurm utilities (sinfo,squeue, sdiag) and then put it into execution, either interactively or through a Systemd unit (an example is available here).

Prometheus Configuration for the SLURM exporter

It is strongly advisable to configure the Prometheus server with the following parameters:

scrape_configs:

#
# SLURM resource manager:
# 
  - job_name: 'my_slurm_exporter'

    scrape_interval:  30s

    scrape_timeout:   30s

    static_configs:
      - targets: ['slurm_host.fqdn:8080']
  • scrape_interval: a 30 seconds interval will avoid possible 'overloading' on the SLURM master due to frequent calls of sdiag/squeue/sinfo commands through the exporter.
  • scrape_timeout: on a busy SLURM master a too short scraping timeout will abort the communication from the Prometheus server toward the exporter, thus generating a context_deadline_exceeded error.

The previous configuration file can be immediately used with a fresh installation of Promethues. At the same time, we highly recommend to include at least the global section into the configuration. Official documentation about configuring Prometheus is available here.

NOTE: the Prometheus server is using YAML as format for its configuration file, thus indentation is really important. Before reloading the Prometheus server it would be better to check the syntax:

$~ promtool check-config prometheus.yml

Checking prometheus.yml
  SUCCESS: 1 rule files found
[...]

Grafana Dashboard

A dashboard is available in order to visualize the exported metrics through Grafana.

The following are screenshots of the dashboard:

Status of the Nodes Status of the Jobs SLURM Scheduler Information

Prometheus references

License

Copyright 2017 Victor Penso, Matteo Dessalvi

This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.