The Cray System Management (CSM) operational activities are administrative procedures required to operate an HPE Cray EX system with CSM software installed.
The following administrative topics can be found in this guide:
- CSM product management
- Bare-metal
- Image management
- Boot orchestration
- System power off procedures
- System power on procedures
- Power management
- Artifact management
- Compute rolling upgrades
- Configuration management
- Kubernetes
- Package repository management
- Security and authentication
- Resiliency
- ConMan
- Utility storage
- System management health
- System Layout Service (SLS)
- System configuration service
- Hardware State Manager (HSM)
- Hardware Management (HM) collector
- HPE Power Distribution Unit (PDU)
- Node management
- Network
- Spire
- Update firmware with FAS
- System Admin Toolkit (SAT)
- Install and Upgrade Framework (IUF)
- Backup and recovery
- Multi-tenancy
Important procedures for configuring, managing, and validating the CSM environment.
- Validate CSM Health
- Configure Keycloak Account
- Configure the Cray Command Line Interface (Cray CLI)
- Change Passwords and Credentials
- Configure the `root` password and SSH keys in Vault
- Set up passwordless SSH
- Configure CSM Packages with CFS
- Access the LiveCD USB Device After Reboot
- Post-Install Customizations
- Validate Signed RPMs
- Remove Artifacts from Product Installation
General information on what needs to be done before the initial install of CSM.
Build and customize image recipes with the Image Management Service (IMS).
- Image Management
- Image Management Workflows
- Upload and Register an Image Recipe
- Build a New UAN Image Using the Default Recipe
- Build an Image Using IMS REST Service
- Import External Image to IMS
- Import NCN Image to IMS
- Customize an Image Root Using IMS
- Create UAN Boot Images
- Convert TGZ Archives to SquashFS Images
- Configure a Remote Build Node
- Delete or Recover Deleted IMS Content
- Configure IMS to Use DKMS
- Configure IMS to Validate RPMs
- Exporting and Importing IMS Data
- Working With `aarch64` Images
- Troubleshoot Large Image
- Troubleshoot Remote Build Node
- Troubleshoot `zypper` interaction
- IMS API
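The procedures above drive the IMS REST API. As a rough sketch of that pattern, an image-customization job is created by POSTing a small JSON body to the IMS jobs endpoint; all IDs and names below are illustrative placeholders, not values from this guide:

```json
{
  "job_type": "customize",
  "image_root_archive_name": "my-customized-image",
  "artifact_id": "00000000-0000-0000-0000-000000000000",
  "public_key_id": "00000000-0000-0000-0000-000000000000",
  "ssh_containers": [{"name": "customize", "jail": true}]
}
```

Here `artifact_id` names the image to customize and `public_key_id` selects the SSH key used to reach the customization container; see the IMS API reference above for the authoritative schema.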
Use the Boot Orchestration Service (BOS) to boot, reboot, and shut down collections of nodes.
- BOS data notice
- Boot Orchestration Service (BOS)
- BOS Cheat Sheet
- BOS Services
- BOS API Versions
- BOS Multi-tenancy
- BOS Workflows
- BOS Components
- Component Status
- BOS Session Templates
- Manage a Session Template
- Create a Session Template to Boot Compute Nodes with CPS
- Create a Session Template to Boot Compute Nodes with SBPS
- Boot UANs
- BOS Sessions
- Manage a BOS Session
- View the Status of a BOS Session
- Limit the Scope of a BOS Session
- Stage Changes with BOS
- Kernel Boot Parameters
- Troubleshoot UAN Boot Issues
- Determine Which BOS Session Booted A Node
- BOS Options
- Exporting and Importing BOS Data
- Exporting and Importing BSS Data
- Rolling Upgrades using BOS
- BOS API
- Boot Script Service (BSS) API
- Compute Node Boot Sequence
- Healthy Compute Node Boot Process
- Node Boot Root Cause Analysis
- Compute Node Boot Issue Symptom: Duplicate Address Warnings and Declined DHCP Offers in Logs
- Compute Node Boot Issue Symptom: Node is Not Able to Download the Required Artifacts
- Compute Node Boot Issue Symptom: Message About Invalid EEPROM Checksum in Node Console or Log
- Boot Issue Symptom: Node HSN Interface Does Not Appear or Show Detected Links
- Compute Node Boot Issue Symptom: Node Console or Logs Indicate that the Server Response has Timed Out
- Tools for Resolving Compute Node Boot Issues
- Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
- Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
- Troubleshoot Compute Node Boot Issues Related to the Boot Script Service
- Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
- Troubleshoot Compute Node Boot Issues Using Kubernetes
- Log File Locations and Ports Used in Compute Node Boot Troubleshooting
- Customize iPXE Binary Names
- Edit the iPXE Embedded Boot Script
- Redeploy the iPXE and TFTP Services
- Upload Node Boot Information to Boot Script Service (BSS)
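Many of the boot procedures above revolve around a BOS session template, which ties an image manifest in S3 to a CFS configuration and a set of nodes. A minimal sketch of a v2 template body follows; the image ID, configuration name, and kernel parameters are placeholders:

```json
{
  "enable_cfs": true,
  "cfs": {"configuration": "compute-config"},
  "boot_sets": {
    "compute": {
      "node_roles_groups": ["Compute"],
      "type": "s3",
      "path": "s3://boot-images/00000000-0000-0000-0000-000000000000/manifest.json",
      "etag": "",
      "kernel_parameters": "console=ttyS0,115200"
    }
  }
}
```

The template name is supplied on the create call, and a BOS session then applies the template with an operation such as `boot`, `reboot`, or `shutdown`.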
Procedures required for a full power off of an HPE Cray EX system.
Additional links to power-off sub-procedures are provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:
- Prepare the System for Power Off
- Shut Down and Power Off Managed Nodes
- Save Management Network Switch Configuration Settings
- Power Off Compute Cabinets using PCS
- Shut Down and Power Off the Management Kubernetes Cluster
- Power Off the External Lustre File System
Procedures required for a full power on of an HPE Cray EX system.
Additional links to power-on sub-procedures are provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:
- Power On and Start the Management Kubernetes Cluster
- Power On Compute Cabinets using PCS
- Power On the External Lustre File System
- Power On and Boot Managed Nodes
- Recover from a Liquid Cooled Cabinet EPO Event using PCS
HPE Cray System Management (CSM) software manages and controls power out-of-band through Redfish APIs.
- Power Management
- Cray Advanced Platform Monitoring and Control (CAPMC)
- Power Control Service (PCS)
- Liquid Cooled Node Power Management
- Standard Rack Node Power Management
- Node Card Power Management
- Ignore Nodes with CAPMC
- Set the Turbo Boost Limit
- CAPMC API
- PCS API
Use the Ceph Object Gateway Simple Storage Service (S3) API to manage artifacts on the system.
- Artifact Management
- Manage Artifacts with the Cray CLI
- Use S3 Libraries and Clients
- Generate Temporary S3 Credentials
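For cases where neither the Cray CLI nor an S3 client library is at hand, a presigned GET URL can be produced with nothing but the Python standard library. This sketch uses AWS signature version 2 query-string authentication, which Ceph RGW accepts; the endpoint, bucket, and credentials below are placeholders, not values from this guide.

```python
# Sketch: presigned S3 GET URL via AWS SigV2 query-string auth, stdlib only.
# All endpoint/bucket/credential values are illustrative placeholders.
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def presign_get(endpoint, bucket, key, access_key, secret_key, expires_in=3600):
    expires = int(time.time()) + expires_in
    # SigV2 string-to-sign for a plain GET: method, two empty headers
    # (Content-MD5, Content-Type), expiry, then the canonical resource.
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = quote(base64.b64encode(digest).decode(), safe="")
    return (f"{endpoint}/{bucket}/{key}"
            f"?AWSAccessKeyId={access_key}&Expires={expires}"
            f"&Signature={signature}")

url = presign_get("https://rgw.example.local", "boot-images", "kernel",
                  "ACCESSKEY", "SECRETKEY")
print(url)
```

The temporary credentials generated by the procedure above can be substituted for the access and secret keys.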
**NOTE:** CRUS was deprecated in CSM 1.2.0 and removed in CSM 1.5.0. See Rolling Upgrades using BOS for the replacement procedure.
The Configuration Framework Service (CFS) is available on systems for remote execution and configuration management of nodes and boot images.
- ARP Cache Tuning
- Configuration Management
- CFS Configurations
- CFS Sources
- CFS Components
- CFS Sessions
- Write Ansible Code for CFS
- Specific Use Cases
- Troubleshoot CFS Issues
- Exporting and Importing CFS Data
- CFS API Details
- Version Control Service (VCS)
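A CFS configuration is a JSON document listing layers, each pointing at a VCS repository, a branch or commit, and a playbook. A minimal sketch follows; the repository URL and names are illustrative only:

```json
{
  "layers": [
    {
      "name": "example-layer",
      "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/example-config-management.git",
      "branch": "main",
      "playbook": "site.yml"
    }
  ]
}
```

CFS runs the layers in order against the target components or image roots; see the CFS Configurations and CFS API Details pages above for the authoritative field list.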
The system management components are broken down into a series of micro-services. Each service is independently deployable, fine-grained, and uses lightweight protocols. As a result, the system's micro-services are modular, resilient, and can be updated independently. Services within the Kubernetes architecture communicate using REST APIs.
- Kubernetes Architecture
- About `kubectl`
- About Kubernetes Taints and Labels
- Kubernetes Storage
- Kubernetes Networking
- Retrieve Cluster Health Information Using Kubernetes
- Pod Resource Limits
- About etcd
- Check the Health of etcd Clusters
- Rebuild Unhealthy etcd Clusters
- Backups for Etcd Clusters Running in Kubernetes
- Create a Manual Backup of a Healthy Bare-Metal etcd Cluster
- Create a Manual Backup of a Healthy etcd Cluster
- Restore an etcd Cluster from a Backup
- Repopulate Data in etcd Clusters When Rebuilding Them
- Restore Bare-Metal etcd Clusters from an S3 Snapshot
- Check for and Clear etcd Cluster Alarms
- Report the Endpoint Status for etcd Clusters
- Clear Space in an etcd Cluster Database
- About Postgres
- `containerd`
- Kubernetes Encryption
- Kyverno policy management
- Troubleshoot Kyverno configuration manually
- Troubleshoot Intermittent HTTP 503 Code Failures
- TDS Lower CPU Requests
- Fix `Failed to start etcd` on Master NCN
- Kubernetes and Bare-Metal etcd Certificate Renewal
Repositories are added to systems to extend the system functionality beyond what is initially delivered. The Sonatype Nexus Repository Manager is the primary method for repository management. Nexus hosts the Yum, Docker, raw, and Helm repositories for software and firmware content.
- Package Repository Management
- Package Repository Management with Nexus
- Manage Repositories with Nexus
- Nexus Configuration
- Nexus Deployment
- Nexus Export and Restore
- Restrict Admin Privileges in Nexus
- Repair Yum Repository Metadata
- Nexus Space Cleanup
- Troubleshoot Nexus
Mechanisms used by the system to ensure the security and authentication of internal and external requests.
- System Security and Authentication
- Manage System Passwords
- Update NCN Passwords
- Change Root Passwords for Compute Nodes
- Set NCN Image Root Password, SSH Keys, and Timezone
- Change EX Liquid-Cooled Cabinet Global Default Password
- Provisioning a Liquid-Cooled EX Cabinet CEC with Default Credentials
- Updating the Liquid-Cooled EX Cabinet Default Credentials after a CEC Password Change
- Update Default Air-Cooled BMC and Leaf-BMC Switch SNMP Credentials
- Change Air-Cooled Node BMC Credentials
- Update Default ServerTech PDU Credentials used by the Redfish Translation Service
- Change Credentials on ServerTech PDUs
- Add Root Service Account for Gigabyte Controllers
- Recovering from Mismatched BMC Credentials
- SSH Keys
- Authenticate an Account with the Command Line
- Default Keycloak Realms, Accounts, and Clients
- Certificate Types
- Change Keycloak Token Lifetime
- Change the Keycloak Admin Password
- Create a Service Account in Keycloak
- Retrieve the Client Secret for Service Accounts
- Get a Long-Lived Token for a Service Account
- Access the Keycloak User Management UI
- Create Internal User Accounts in the Keycloak Shasta Realm
- Delete Internal User Accounts in the Keycloak Shasta Realm
- Create Internal Groups in the Keycloak Shasta Realm
- Remove Internal Groups from the Keycloak Shasta Realm
- Remove the Email Mapper from the LDAP User Federation
- Re-Sync Keycloak Users to Compute Nodes
- Keycloak Operations
- Configure Keycloak for LDAP/AD authentication
- Configure the RSA Plugin in Keycloak
- Preserve Username Capitalization for Users Exported from Keycloak
- Change the LDAP Server IP Address for Existing LDAP Server Content
- Change the LDAP Server IP Address for New LDAP Server Content
- Remove the LDAP User Federation from Keycloak
- Add LDAP User Federation
- Keycloak User Management with `kcadm.sh`
- Keycloak User Localization
- Create a Backup of the Keycloak Postgres Database
- Public Key Infrastructure (PKI)
- API Authorization
- Retrieve an Authentication Token
- Manage Sealed Secrets
- SOPS Introduction
- Audit Logs
- Cray STS Token Generator API
- Configure root user on HPE iLO BMCs
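API requests to the system go through the gateway with a bearer token from Keycloak. The sketch below builds (but does not send) the token request using the client-credentials grant; the gateway hostname follows the common CSM convention, and the client ID and secret are placeholders to be taken from the Retrieve an Authentication Token procedure.

```python
# Sketch: construct the Keycloak token request for the "shasta" realm.
# Hostname, client ID, and secret are illustrative placeholders.
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = ("https://api-gw-service-nmn.local"
             "/keycloak/realms/shasta/protocol/openid-connect/token")

def token_request(client_id, client_secret):
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    return Request(TOKEN_URL, data=body,
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

req = token_request("admin-client", "example-secret")
# The "access_token" field of the JSON response is then sent as an
# "Authorization: Bearer <token>" header on API gateway requests.
```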
HPE Cray EX systems are designed so that system management services (SMS) are fully resilient and that there is no single point of failure.
- Resiliency
- Resilience of System Management Services
- Restore System Functionality if a Kubernetes Worker Node is Down
- Recreate `StatefulSet` Pods on Another Node
- Resiliency Testing Procedure
ConMan is a tool used for connecting to remote consoles and collecting console logs. These node logs can then be used for various administrative purposes, such as troubleshooting node boot issues.
- ConMan
- Access Compute Node Logs
- Access Console Log Data Via the System Monitoring Framework (SMF)
- Manage Node Consoles
- Log in to a Node Using ConMan
- Establish a Serial Connection to NCNs
- Disable ConMan After System Software Installation
- Console Services Troubleshooting Guide
- Troubleshoot ConMan Blocking Access to a Node BMC
- Troubleshoot ConMan Failing to Connect to a Console
- Troubleshoot ConMan Asking for Password on SSH Connection
- Troubleshoot Console Node Pod Stuck in Terminating State
- Complete Reset of the Console Services
Ceph is the utility storage platform that is used to enable pods to store persistent data. It is deployed to provide block, object, and file storage to the management services running on Kubernetes, as well as for telemetry data coming from the compute nodes.
- Utility Storage
- Collect Information about the Ceph Cluster
- Manage Ceph Services
- Adjust Ceph Pool Quotas
- Add Ceph OSDs
- Shrink Ceph OSDs
- Ceph Health States
- Ceph Deep Scrubs
- Ceph Daemon Memory Profiling
- Ceph Service Check Script Usage
- Ceph Orchestrator Usage
- Ceph Storage Types
- `ceph-upgrade-tool` Usage
- Dump Ceph Crash Data
- Identify Ceph Latency Issues
- Cephadm Reference Material
- Adding a Ceph Node to the Ceph Cluster
- Shrink the Ceph Cluster
- Alternate Storage Pools
- Restore Nexus Data After Data Corruption
- Troubleshoot Failure to Get Ceph Health
- Troubleshoot a Down OSD
- Troubleshoot Ceph OSDs Not Being Created on Disks
- Troubleshoot Ceph OSDs Reporting Full
- Troubleshoot System Clock Skew
- Troubleshoot an Unresponsive S3 Endpoint
- Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
- Troubleshoot Pods Multi-Attach Error
- Troubleshoot Large Object Map Objects in Ceph Health
- Troubleshoot Failure of RGW Health Check
- Troubleshoot Ceph MDS Client Connectivity Issues
- Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
- Troubleshoot Ceph image with tag: `<none>`
- Troubleshoot Ceph Services Not Starting After a Server Crash
- Troubleshoot `HEALTH_ERR`: Module `devicehealth` has failed: table Device already exists
- Troubleshoot Insufficient Standby MDS Daemons Available
- Troubleshoot S3FS Mount Issues
- Fixing incorrect number of PG Issues
Enable system administrators to assess the health of their system. Operators need to quickly and efficiently troubleshoot system issues as they occur and be confident that a lack of issues indicates the system is operating normally.
- System Management Health
- System Management Health Checks and Alerts
- Access System Management Health Services
- Configure Prometheus Email Alert Notifications
- Grafana Dashboards by Component
- Remove Kiali
- `prometheus-kafka-adapter` errors during installation
- `grok-exporter` errors during installation
- Troubleshoot Prometheus Alerts
- Configure UAN Node Exporter
The System Layout Service (SLS) holds information about the system design, such as the physical locations of network hardware, compute nodes, and cabinets. It also stores information about the network, such as which port on which switch should be connected to each compute node.
- System Layout Service (SLS)
- Dump SLS Information
- Load SLS Database with Dump File
- Add Liquid-Cooled Cabinets to SLS
- Add UAN CAN IP Addresses to SLS
- Update SLS with UAN Aliases
- Add an alias to a service
- Create a Backup of the SLS Postgres Database
- Restore SLS Postgres Database from Backup
- Restore SLS Postgres without an Existing Backup
- SLS API
The System Configuration Service (SCSD) allows administrators to set various BMC and controller parameters. These parameters are typically set during discovery, but this tool enables parameters to be set before or after discovery. The operations to change these parameters are available in the Cray CLI under the `scsd` command.
- System Configuration Service
- Configure BMC and Controller Parameters with SCSD
- Manage Parameters with the SCSD Service
- Set BMC Credentials
- SCSD API
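As a sketch of the pattern behind these procedures, an SCSD bulk-load request names the target BMCs and the parameters to apply to them. The xname, server addresses, and key material below are illustrative placeholders:

```json
{
  "Force": false,
  "Targets": ["x3000c0s19b0"],
  "Params": {
    "NTPServerInfo": {"NTPServers": ["ncn-w001"], "Port": 123, "ProtocolEnabled": true},
    "SyslogServerInfo": {"SyslogServers": ["ncn-w001"], "Port": 514, "ProtocolEnabled": true},
    "SSHKey": "AAAAB3Nz..."
  }
}
```

See the SCSD API reference above for the authoritative payload schema.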
Use the Hardware State Manager (HSM) to monitor and interrogate hardware components in the HPE Cray EX system. HSM tracks hardware state and inventory information and makes it available via REST queries and message bus events when changes occur.
- Hardware State Manager (HSM)
- Hardware Management Services (HMS) Locking API
- Component Groups and Partitions
- Hardware State Manager (HSM) State and Flag Fields
- HSM Roles and Subroles
- Add an NCN to the HSM Database
- Add a Switch to the HSM Database
- Create a Backup of the HSM Postgres Database
- Restore HSM Postgres from a Backup
- Restore HSM Postgres without a Backup
- Set BMC Management Role
- HSM API
- Heartbeat Tracker Daemon (HBTD) API
- Hardware Management Notification Fanout Daemon (HMNFD) API
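HSM identifies every component by an xname that encodes its physical location. As an unofficial illustration of that layout, the sketch below splits a node xname such as `x3000c0s19b0n0` into its cabinet, chassis, slot, BMC, and node fields:

```python
# Sketch: decompose a node xname into its location fields.
# This is an illustration of the xname layout, not an official parser.
import re

NODE_XNAME = re.compile(
    r"^x(?P<cabinet>\d+)c(?P<chassis>\d+)s(?P<slot>\d+)"
    r"b(?P<bmc>\d+)n(?P<node>\d+)$")

def parse_node_xname(xname):
    m = NODE_XNAME.match(xname)
    if m is None:
        raise ValueError(f"not a node xname: {xname}")
    return {k: int(v) for k, v in m.groupdict().items()}

print(parse_node_xname("x3000c0s19b0n0"))
# → {'cabinet': 3000, 'chassis': 0, 'slot': 19, 'bmc': 0, 'node': 0}
```

Other component types (BMCs, slots, chassis) use prefixes of the same pattern, which is why HSM queries can select hardware by location.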
The Hardware Management (HM) Collector is used to collect telemetry and Redfish events from hardware in the system.
Procedures for managing and setting up HPE PDUs.
Monitor and manage compute nodes (CNs) and non-compute nodes (NCNs) used in the HPE Cray EX system.
- Node Management
- Node Management Workflows
- Rebuild NCNs
- Reboot NCNs
- Enable Nodes
- Disable Nodes
- Find Node Type and Manufacturer
- Add Additional Air-Cooled Cabinets to a System
- Add Additional Liquid-Cooled Cabinets to a System
- Updating Cabinet Routes on Management NCNs
- Move a liquid-cooled blade within a System
- Add a Standard Rack Node
- Clear Space in Root File System on Worker Nodes
- Troubleshoot Issues with Redfish Endpoint Discovery
- Check for Redfish Events from Nodes
- Reset Credentials on Redfish Devices
- Access and Update Settings for Replacement NCNs
- Change Settings for HMS Collector Polling of Air Cooled Nodes
- Use the Physical KVM
- Launch a Virtual KVM on Gigabyte Nodes
- Launch a Virtual KVM on Intel Nodes
- Change Java Security Settings
- Configuration of NCN Bonding
- Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
- Check the BMC Failover Mode
- Update Compute Node Mellanox HSN NIC Firmware
- TLS Certificates for Redfish BMCs
- Dump a Non-Compute Node
- Enable Passwordless Connections to Liquid Cooled Node BMCs
- Configure NTP on NCNs
- Swap a Compute Blade with a Different System
- Swap a Compute Blade with a Different System Using SAT
- Replace a Compute Blade
- Replace a Compute Blade Using SAT
- Update the Gigabyte Node BIOS Time
- S3FS Usage Guidelines
- Defragment NID Numbering
- Repurpose a Compute Node as a UAN
- Clear Gigabyte CMOS
- Set Gigabyte Node BMC to Factory Defaults
- NCN Network Troubleshooting
- NCN Drive Identification
- Manual Wipe Procedures
- Build NCN Images Locally
- NCN Lifecycle Service (NLS) API
- Enable IPMI access on HPE iLO BMCs
- Update the HPE Node BIOS Time
- Switch PXE Boot from Onboard NIC to PCIe
- NCN NIC Replacement
Overview of the several different networks supported by the HPE Cray EX system.
- Network
- Access to System Management Services
- Default IP Address Ranges
- Connect to the HPE Cray EX Environment
- Connect to Switch over USB-Serial Cable
- Create a CSM Configuration Upgrade Plan
- Gateway Testing
HPE Cray EX systems can have network switches in many roles: spine switches, leaf switches, Leaf-BMC switches, and CDU switches. Newer systems have HPE Aruba switches, while older systems have Dell and Mellanox switches. Switch IP addresses are generated by Cray Site Init (CSI).
- HPE Cray EX Management Network Installation and Configuration Guide
- Update Management Network Firmware
- BICAN switch configuration
- Bonded UAN Configuration
The customer accessible networks (CMN/CAN/CHN) provide access from outside the customer network to services, NCNs, and User Access Nodes (UANs) in the system.
- Customer Accessible Networks
- Externally Exposed Services
- Connect to the CMN and CAN
- BI-CAN Aruba/Arista Configuration
- MetalLB Peering with Arista Edge Router
- CAN/CMN with Dual-Spine Configuration
- Troubleshoot CMN Issues
The DHCP service on the HPE Cray EX system uses the Internet Systems Consortium (ISC) Kea tool. Kea provides more robust management capabilities for DHCP servers.
The central DNS infrastructure provides the structural networking hierarchy and datastore for the system.
- DNS
- Manage the DNS Unbound Resolver
- Enable `ncsd` on UANs
- PowerDNS Configuration
- PowerDNS Migration Guide
- Troubleshoot Common DNS Issues
- Troubleshoot PowerDNS
External DNS, along with the Customer Management Network (CMN), Border Gateway Protocol (BGP), and MetalLB, makes it simpler to access the HPE Cray EX API and system management services. Services are accessible directly from a laptop without needing to tunnel into a non-compute node (NCN) or override `/etc/hosts` settings.
- External DNS
- External DNS `csi config init` Input Values
- Update the `cmn-external-dns` Value Post-Installation
- Ingress Routing
- External DNS Failing to Discover Services Workaround
- Troubleshoot Connectivity to Services with External IP addresses
- Troubleshoot DNS Configuration Issues
MetalLB is a component in Kubernetes that manages access to `LoadBalancer` services from outside the Kubernetes cluster. There are `LoadBalancer` services on the Node Management Network (NMN), Hardware Management Network (HMN), and Customer Access Network (CAN).
MetalLB can run in either `Layer2-mode` or `BGP-mode` for each address pool it manages. `BGP-mode` is used for the NMN, HMN, and CAN. This enables true load balancing (`Layer2-mode` provides failover, not load balancing) and allows for a more robust layer 3 configuration for these networks.
- MetalLB in BGP-Mode
- MetalLB Configuration
- Check BGP Status and Reset Sessions
- Troubleshoot Services without an Allocated IP Address
- Troubleshoot BGP not Accepting Routes from MetalLB
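The BGP-mode pools described above take roughly the following shape in MetalLB's legacy ConfigMap configuration. The peer addresses, ASNs, and pool ranges here are illustrative only; take real values from the MetalLB Configuration procedure:

```yaml
# Sketch of a MetalLB BGP configuration (legacy ConfigMap form).
# Peer addresses, ASNs, and pool ranges are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config
  namespace: metallb-system
data:
  config: |
    peers:
    - peer-address: 10.252.0.2
      peer-asn: 65533
      my-asn: 65533
    address-pools:
    - name: node-management
      protocol: bgp
      addresses:
      - 10.92.100.0/24
```

Each pool advertised over BGP is load-balanced across the worker nodes, which is why BGP session health (see Check BGP Status and Reset Sessions) directly affects service reachability.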
Spire provides the ability to authenticate nodes and workloads, and to securely distribute and manage their identities along with the credentials associated with them.
- Restore Spire Postgres without a Backup
- Troubleshoot Spire Failing to Start on NCNs
- Update Spire Intermediate CA Certificate
- Xname Validation
- Restore Missing Spire Meta-Data
- Create a Backup of the Spire Postgres Database
The Firmware Action Service (FAS) provides an interface for managing firmware versions of Redfish-enabled hardware in the system. FAS interacts with the Hardware State Manager (HSM), device data, and image data in order to update firmware.
See Update Firmware with FAS for a list of components that can be updated with FAS. Refer to the HPC Firmware Pack (HFP) product stream to update firmware on other components.
- Update Firmware with FAS
- Using the `FASUpdate` Script
- FAS CLI
- FAS API
- FAS Filters
- FAS Recipes and Procedures
- FAS Recipes
- FAS Admin Procedures
- Upload Olympus BMC Recovery Firmware into TFTP Server
- Updating Firmware on `m001`
- Updating Firmware without FAS
- Update iLO 5 firmware above `v2.78`
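A FAS action pairs filters that select hardware and a firmware image with a command block that controls the run. The sketch below uses illustrative IDs and values; with `overrideDryrun` set to `false`, the action remains a dry run and reports what would be updated without flashing anything:

```json
{
  "stateComponentFilter": {"deviceTypes": ["nodeBMC"]},
  "inventoryHardwareFilter": {"manufacturer": "cray"},
  "imageFilter": {"imageID": "00000000-0000-0000-0000-000000000000"},
  "command": {
    "overrideDryrun": false,
    "restoreNotPossibleOverride": true,
    "timeLimit": 1000,
    "description": "Dry run of nodeBMC firmware update"
  }
}
```

See FAS Filters and the FAS API reference above for the full set of filter fields.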
The System Admin Toolkit (SAT) is a command-line interface that can assist administrators with common tasks, such as troubleshooting and querying information about the HPE Cray EX System, system boot and shutdown, replacing hardware components, and more. In CSM 1.3 and newer, the `sat` command is available on the Kubernetes NCNs without installing the SAT product stream.
Starting in CSM 1.6.0, SAT is fully included in CSM. There is no longer a separate SAT product stream to install. SAT 2.6 releases, which accompanied CSM 1.5, are the last releases of SAT as a separate product.
The Install and Upgrade Framework (IUF) provides a CLI and API that automate the operations required to install, upgrade, and deploy non-CSM product content onto an HPE Cray EX system. Each product distribution includes an `iuf-product-manifest.yaml` file, which IUF uses to determine what operations are needed to install, upgrade, and deploy the product.
- Install and Upgrade Framework (IUF)
- Install and Upgrade Observability Framework
- Using the Argo UI
- Using Argo Workflows
Information on how to perform backups of individual services or the entire system, and how to restore from these backups.
- Create a Manual Backup of a Healthy Bare-Metal etcd Cluster
- Create a Manual Backup of a Healthy etcd Cluster
- Restore an etcd Cluster from a Backup
- Repopulate Data in etcd Clusters When Rebuilding Them
- Restore Bare-Metal etcd Clusters from an S3 Snapshot
- Create a Backup of the SLS Postgres Database
- Restore SLS Postgres Database from Backup
- Restore SLS Postgres without an Existing Backup
- Create a Backup of the HSM Postgres Database
- Restore HSM Postgres from a Backup
- Restore HSM Postgres without a Backup
- Create a Backup of the Spire Postgres Database
- Restore Spire Postgres without a Backup
- Spire Service Recovery
- Multi-tenancy Support
- Cray Hierarchical Namespace Controller (HNC) Manager
- Tenant Administrator Configuration
- Creating a Tenant
- Modifying a Tenant
- Removing a Tenant
- Slurm Operator
- HPE Slingshot Network Operator
- Tenant and Partition Management System (TAPMS) Overview
- TAPMS Tenant Status API
- Global Tenant Hooks
- Example Workflow