The Cray System Management (CSM) operational activities are administrative procedures required to operate an HPE Cray EX system with CSM software installed.
The following administrative topics can be found in this guide:
- CSM product management
- Bare-metal
- Image management
- Boot orchestration
- System power off procedures
- System power on procedures
- Power management
- Artifact management
- Compute rolling upgrades
- Configuration management
- Kubernetes
- Package repository management
- Security and authentication
- Resiliency
- ConMan
- Utility storage
- System management health
- System Layout Service (SLS)
- System configuration service
- Hardware State Manager (HSM)
- Hardware Management (HM) collector
- HPE Power Distribution Unit (PDU)
- Node management
- Network
- Spire
- Update firmware with FAS
- System Admin Toolkit (SAT)
- Install and Upgrade Framework (IUF)
- Backup and recovery
- Multi-tenancy
Important procedures for configuring, managing, and validating the CSM environment.
- Validate CSM Health
- Configure Keycloak Account
- Configure the Cray Command Line Interface (Cray CLI)
- Change Passwords and Credentials
- Configure the `root` password and SSH keys in Vault
- Set up passwordless SSH
- Configure CSM Packages with CFS
- Access the LiveCD USB Device After Reboot
- Post-Install Customizations
- Validate Signed RPMs
- Remove Artifacts from Product Installation
General information on what needs to be done before the initial install of CSM.
Build and customize image recipes with the Image Management Service (IMS).
- Image Management
- Image Management Workflows
- Upload and Register an Image Recipe
- Build a New UAN Image Using the Default Recipe
- Build an Image Using IMS REST Service
- Import External Image to IMS
- Import NCN Image to IMS
- Customize an Image Root Using IMS
- Create UAN Boot Images
- Convert TGZ Archives to SquashFS Images
- Configure a Remote Build Node
- Delete or Recover Deleted IMS Content
- Configure IMS to Use DKMS
- Configure IMS to Validate RPMs
- Exporting and Importing IMS Data
- Working With `aarch64` Images
- Troubleshoot Large Image
- Troubleshoot Remote Build Node
- Troubleshoot `zypper` interaction
- IMS API
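The procedures above drive the IMS REST API. As a rough sketch of that pattern, an image-customization job is created by POSTing a small JSON body to the IMS jobs endpoint; all IDs and names below are illustrative placeholders, not values from this guide:

```json
{
  "job_type": "customize",
  "image_root_archive_name": "my-customized-image",
  "artifact_id": "00000000-0000-0000-0000-000000000000",
  "public_key_id": "00000000-0000-0000-0000-000000000000",
  "ssh_containers": [{"name": "customize", "jail": true}]
}
```

Here `artifact_id` names the image to customize and `public_key_id` selects the SSH key used to reach the customization container; see the IMS API reference above for the authoritative schema.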
Use the Boot Orchestration Service (BOS) to boot, reboot, and shut down collections of nodes.
- BOS data notice
- Boot Orchestration Service (BOS)
- BOS Cheat Sheet
- BOS Services
- BOS API Versions
- BOS Multi-tenancy
- BOS Workflows
- BOS Components
- Component Status
- BOS Session Templates
- Manage a Session Template
- Create a Session Template to Boot Compute Nodes with CPS
- Create a Session Template to Boot Compute Nodes with SBPS
- Boot UANs
- BOS Sessions
- Manage a BOS Session
- View the Status of a BOS Session
- Limit the Scope of a BOS Session
- Stage Changes with BOS
- Kernel Boot Parameters
- Troubleshoot UAN Boot Issues
- Determine Which BOS Session Booted A Node
- BOS Options
- Exporting and Importing BOS Data
- Exporting and Importing BSS Data
- Rolling Upgrades using BOS
- BOS API
- Boot Script Service (BSS) API
- Compute Node Boot Sequence
- Healthy Compute Node Boot Process
- Node Boot Root Cause Analysis
- Compute Node Boot Issue Symptom: Duplicate Address Warnings and Declined DHCP Offers in Logs
- Compute Node Boot Issue Symptom: Node is Not Able to Download the Required Artifacts
- Compute Node Boot Issue Symptom: Message About Invalid EEPROM Checksum in Node Console or Log
- Boot Issue Symptom: Node HSN Interface Does Not Appear or Show Detected Links
- Compute Node Boot Issue Symptom: Node Console or Logs Indicate that the Server Response has Timed Out
- Tools for Resolving Compute Node Boot Issues
- Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
- Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
- Troubleshoot Compute Node Boot Issues Related to the Boot Script Service
- Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
- Troubleshoot Compute Node Boot Issues Using Kubernetes
- Log File Locations and Ports Used in Compute Node Boot Troubleshooting
- Customize iPXE Binary Names
- Edit the iPXE Embedded Boot Script
- Redeploy the iPXE and TFTP Services
- Upload Node Boot Information to Boot Script Service (BSS)
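Many of the boot procedures above revolve around a BOS session template, which ties an image manifest in S3 to a CFS configuration and a set of nodes. A minimal sketch of a v2 template body follows; the image ID, configuration name, and kernel parameters are placeholders:

```json
{
  "enable_cfs": true,
  "cfs": {"configuration": "compute-config"},
  "boot_sets": {
    "compute": {
      "node_roles_groups": ["Compute"],
      "type": "s3",
      "path": "s3://boot-images/00000000-0000-0000-0000-000000000000/manifest.json",
      "etag": "",
      "kernel_parameters": "console=ttyS0,115200"
    }
  }
}
```

The template name is supplied on the create call, and a BOS session then applies the template with an operation such as `boot`, `reboot`, or `shutdown`.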
Procedures required for a full power off of an HPE Cray EX system.
Additional links to power-off sub-procedures are provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:
- Prepare the System for Power Off
- Shut Down and Power Off Managed Nodes
- Save Management Network Switch Configuration Settings
- Power Off Compute Cabinets using PCS
- Shut Down and Power Off the Management Kubernetes Cluster
- Power Off the External Lustre File System
Procedures required for a full power on of an HPE Cray EX system.
Additional links to power-on sub-procedures are provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:
- Power On and Start the Management Kubernetes Cluster
- Power On Compute Cabinets using PCS
- Power On the External Lustre File System
- Power On and Boot Managed Nodes
- Recover from a Liquid Cooled Cabinet EPO Event using PCS
HPE Cray System Management (CSM) software manages and controls power out-of-band through Redfish APIs.
- Power Management
- Cray Advanced Platform Monitoring and Control (CAPMC)
- Power Control Service (PCS)
- Liquid Cooled Node Power Management
- Standard Rack Node Power Management
- Node Card Power Management
- Ignore Nodes with CAPMC
- Set the Turbo Boost Limit
- CAPMC API
- PCS API
Use the Ceph Object Gateway Simple Storage Service (S3) API to manage artifacts on the system.
- Artifact Management
- Manage Artifacts with the Cray CLI
- Use S3 Libraries and Clients
- Generate Temporary S3 Credentials
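For cases where neither the Cray CLI nor an S3 client library is at hand, a presigned GET URL can be produced with nothing but the Python standard library. This sketch uses AWS signature version 2 query-string authentication, which Ceph RGW accepts; the endpoint, bucket, and credentials below are placeholders, not values from this guide.

```python
# Sketch: presigned S3 GET URL via AWS SigV2 query-string auth, stdlib only.
# All endpoint/bucket/credential values are illustrative placeholders.
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def presign_get(endpoint, bucket, key, access_key, secret_key, expires_in=3600):
    expires = int(time.time()) + expires_in
    # SigV2 string-to-sign for a plain GET: method, two empty headers
    # (Content-MD5, Content-Type), expiry, then the canonical resource.
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = quote(base64.b64encode(digest).decode(), safe="")
    return (f"{endpoint}/{bucket}/{key}"
            f"?AWSAccessKeyId={access_key}&Expires={expires}"
            f"&Signature={signature}")

url = presign_get("https://rgw.example.local", "boot-images", "kernel",
                  "ACCESSKEY", "SECRETKEY")
print(url)
```

The temporary credentials generated by the procedure above can be substituted for the access and secret keys.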
**NOTE:** CRUS was deprecated in CSM 1.2.0 and removed in CSM 1.5.0. See Rolling Upgrades using BOS for the replacement procedure.
The Configuration Framework Service (CFS) is available on systems for remote execution and configuration management of nodes and boot images.
- ARP Cache Tuning
- Configuration Management
- CFS Configurations
- CFS Sources
- CFS Components
- CFS Sessions
- Write Ansible Code for CFS
- Specific Use Cases
- Troubleshoot CFS Issues
- Exporting and Importing CFS Data
- CFS API Details
- Version Control Service (VCS)
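A CFS configuration is a JSON document listing layers, each pointing at a VCS repository, a branch or commit, and a playbook. A minimal sketch follows; the repository URL and names are illustrative only:

```json
{
  "layers": [
    {
      "name": "example-layer",
      "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/example-config-management.git",
      "branch": "main",
      "playbook": "site.yml"
    }
  ]
}
```

CFS runs the layers in order against the target components or image roots; see the CFS Configurations and CFS API Details pages above for the authoritative field list.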
The system management components are broken down into a series of micro-services. Each service is independently deployable, fine-grained, and uses lightweight protocols. As a result, the system's micro-services are modular, resilient, and can be updated independently. Services within the Kubernetes architecture communicate using REST APIs.
- Kubernetes Architecture
- About `kubectl`
- About Kubernetes Taints and Labels
- Kubernetes Storage
- Kubernetes Networking
- Retrieve Cluster Health Information Using Kubernetes
- Pod Resource Limits
- About etcd
- Check the Health of etcd Clusters
- Rebuild Unhealthy etcd Clusters
- Backups for Etcd Clusters Running in Kubernetes
- Create a Manual Backup of a Healthy Bare-Metal etcd Cluster
- Create a Manual Backup of a Healthy etcd Cluster
- Restore an etcd Cluster from a Backup
- Repopulate Data in etcd Clusters When Rebuilding Them
- Restore Bare-Metal etcd Clusters from an S3 Snapshot
- Check for and Clear etcd Cluster Alarms
- Report the Endpoint Status for etcd Clusters
- Clear Space in an etcd Cluster Database
- About Postgres
- `containerd`
- Kubernetes Encryption
- Kyverno policy management
- Troubleshoot Kyverno configuration manually
- Troubleshoot Intermittent HTTP 503 Code Failures
- TDS Lower CPU Requests
- Fix `Failed to start etcd` on Master NCN
- Kubernetes and Bare-Metal etcd Certificate Renewal
Repositories are added to systems to extend the system functionality beyond what is initially delivered. The Sonatype Nexus Repository Manager is the primary method for repository management. Nexus hosts the Yum, Docker, raw, and Helm repositories for software and firmware content.
- Package Repository Management
- Package Repository Management with Nexus
- Manage Repositories with Nexus
- Nexus Configuration
- Nexus Deployment
- Nexus Export and Restore
- Restrict Admin Privileges in Nexus
- Repair Yum Repository Metadata
- Nexus Space Cleanup
- Troubleshoot Nexus
Mechanisms used by the system to ensure the security and authentication of internal and external requests.
- System Security and Authentication
- Manage System Passwords
- Update NCN Passwords
- Change Root Passwords for Compute Nodes
- Set NCN Image Root Password, SSH Keys, and Timezone
- Change EX Liquid-Cooled Cabinet Global Default Password
- Provisioning a Liquid-Cooled EX Cabinet CEC with Default Credentials
- Updating the Liquid-Cooled EX Cabinet Default Credentials after a CEC Password Change
- Update Default Air-Cooled BMC and Leaf-BMC Switch SNMP Credentials
- Change Air-Cooled Node BMC Credentials
- Update Default ServerTech PDU Credentials used by the Redfish Translation Service
- Change Credentials on ServerTech PDUs
- Add Root Service Account for Gigabyte Controllers
- Recovering from Mismatched BMC Credentials
- SSH Keys
- Authenticate an Account with the Command Line
- Default Keycloak Realms, Accounts, and Clients
- Certificate Types
- Change Keycloak Token Lifetime
- Change the Keycloak Admin Password
- Create a Service Account in Keycloak
- Retrieve the Client Secret for Service Accounts
- Get a Long-Lived Token for a Service Account
- Access the Keycloak User Management UI
- Create Internal User Accounts in the Keycloak Shasta Realm
- Delete Internal User Accounts in the Keycloak Shasta Realm
- Create Internal Groups in the Keycloak Shasta Realm
- Remove Internal Groups from the Keycloak Shasta Realm
- Remove the Email Mapper from the LDAP User Federation
- Re-Sync Keycloak Users to Compute Nodes
- Keycloak Operations
- Configure Keycloak for LDAP/AD authentication
- Configure the RSA Plugin in Keycloak
- Preserve Username Capitalization for Users Exported from Keycloak
- Change the LDAP Server IP Address for Existing LDAP Server Content
- Change the LDAP Server IP Address for New LDAP Server Content
- Remove the LDAP User Federation from Keycloak
- Add LDAP User Federation
- Keycloak User Management with `kcadm.sh`
- Keycloak User Localization
- Create a Backup of the Keycloak Postgres Database
- Public Key Infrastructure (PKI)
- API Authorization
- Retrieve an Authentication Token
- Manage Sealed Secrets
- SOPS Introduction
- Audit Logs
- Cray STS Token Generator API
- Configure root user on HPE iLO BMCs
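API requests to the system go through the gateway with a bearer token from Keycloak. The sketch below builds (but does not send) the token request using the client-credentials grant; the gateway hostname follows the common CSM convention, and the client ID and secret are placeholders to be taken from the Retrieve an Authentication Token procedure.

```python
# Sketch: construct the Keycloak token request for the "shasta" realm.
# Hostname, client ID, and secret are illustrative placeholders.
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = ("https://api-gw-service-nmn.local"
             "/keycloak/realms/shasta/protocol/openid-connect/token")

def token_request(client_id, client_secret):
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    return Request(TOKEN_URL, data=body,
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

req = token_request("admin-client", "example-secret")
# The "access_token" field of the JSON response is then sent as an
# "Authorization: Bearer <token>" header on API gateway requests.
```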
HPE Cray EX systems are designed so that system management services (SMS) are fully resilient and that there is no single point of failure.
- Resiliency
- Resilience of System Management Services
- Restore System Functionality if a Kubernetes Worker Node is Down
- Recreate `StatefulSet` Pods on Another Node
- Resiliency Testing Procedure
ConMan is a tool used for connecting to remote consoles and collecting console logs. These node logs can then be used for various administrative purposes, such as troubleshooting node boot issues.
- ConMan
- Access Compute Node Logs
- Access Console Log Data Via the System Monitoring Framework (SMF)
- Manage Node Consoles
- Log in to a Node Using ConMan
- Establish a Serial Connection to NCNs
- Disable ConMan After System Software Installation
- Console Services Troubleshooting Guide
- Troubleshoot ConMan Blocking Access to a Node BMC
- Troubleshoot ConMan Failing to Connect to a Console
- Troubleshoot ConMan Asking for Password on SSH Connection
- Troubleshoot Console Node Pod Stuck in Terminating State
- Complete Reset of the Console Services
Ceph is the utility storage platform that is used to enable pods to store persistent data. It is deployed to provide block, object, and file storage to the management services running on Kubernetes, as well as for telemetry data coming from the compute nodes.
- Utility Storage
- Collect Information about the Ceph Cluster
- Manage Ceph Services
- Adjust Ceph Pool Quotas
- Add Ceph OSDs
- Shrink Ceph OSDs
- Ceph Health States
- Ceph Deep Scrubs
- Ceph Daemon Memory Profiling
- Ceph Service Check Script Usage
- Ceph Orchestrator Usage
- Ceph Storage Types
- `ceph-upgrade-tool` Usage
- Dump Ceph Crash Data
- Identify Ceph Latency Issues
- Cephadm Reference Material
- Adding a Ceph Node to the Ceph Cluster
- Shrink the Ceph Cluster
- Alternate Storage Pools
- Restore Nexus Data After Data Corruption
- Troubleshoot Failure to Get Ceph Health
- Troubleshoot a Down OSD
- Troubleshoot Ceph OSDs Not Being Created on Disks
- Troubleshoot Ceph OSDs Reporting Full
- Troubleshoot System Clock Skew
- Troubleshoot an Unresponsive S3 Endpoint
- Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
- Troubleshoot Pods Multi-Attach Error
- Troubleshoot Large Object Map Objects in Ceph Health
- Troubleshoot Failure of RGW Health Check
- Troubleshoot Ceph MDS Client Connectivity Issues
- Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
- Troubleshoot Ceph image with tag: `<none>`
- Troubleshoot Ceph Services Not Starting After a Server Crash
- Troubleshoot `HEALTH_ERR`: Module `devicehealth` has failed: table Device already exists
- Troubleshoot Insufficient Standby MDS Daemons Available
- Troubleshoot S3FS Mount Issues
- Fixing incorrect number of PG Issues
Enable system administrators to assess the health of their system. Operators need to quickly and efficiently troubleshoot system issues as they occur and be confident that a lack of issues indicates the system is operating normally.
- System Management Health
- System Management Health Checks and Alerts
- Access System Management Health Services
- Configure Prometheus Email Alert Notifications
- Grafana Dashboards by Component
- Remove Kiali
- `prometheus-kafka-adapter` errors during installation
- `grok-exporter` errors during installation
- Troubleshoot Prometheus Alerts
- Configure UAN Node Exporter
The System Layout Service (SLS) holds information about the system design, such as the physical locations of network hardware, compute nodes, and cabinets. It also stores information about the network, such as which port on which switch should be connected to each compute node.
- System Layout Service (SLS)
- Dump SLS Information
- Load SLS Database with Dump File
- Add Liquid-Cooled Cabinets to SLS
- Add UAN CAN IP Addresses to SLS
- Update SLS with UAN Aliases
- Add an alias to a service
- Create a Backup of the SLS Postgres Database
- Restore SLS Postgres Database from Backup
- Restore SLS Postgres without an Existing Backup
- SLS API
The System Configuration Service (SCSD) allows administrators to set various BMC and controller parameters. These parameters are typically set during discovery, but this tool enables parameters to be set before or after discovery. The operations to change these parameters are available in the Cray CLI under the `scsd` command.
- System Configuration Service
- Configure BMC and Controller Parameters with SCSD
- Manage Parameters with the SCSD Service
- Set BMC Credentials
- SCSD API
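As a sketch of the pattern behind these procedures, an SCSD bulk-load request names the target BMCs and the parameters to apply to them. The xname, server addresses, and key material below are illustrative placeholders:

```json
{
  "Force": false,
  "Targets": ["x3000c0s19b0"],
  "Params": {
    "NTPServerInfo": {"NTPServers": ["ncn-w001"], "Port": 123, "ProtocolEnabled": true},
    "SyslogServerInfo": {"SyslogServers": ["ncn-w001"], "Port": 514, "ProtocolEnabled": true},
    "SSHKey": "AAAAB3Nz..."
  }
}
```

See the SCSD API reference above for the authoritative payload schema.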
Use the Hardware State Manager (HSM) to monitor and interrogate hardware components in the HPE Cray EX system. HSM tracks hardware state and inventory information and makes it available via REST queries and message bus events when changes occur.
- Hardware State Manager (HSM)
- Hardware Management Services (HMS) Locking API
- Component Groups and Partitions
- Hardware State Manager (HSM) State and Flag Fields
- HSM Roles and Subroles
- Add an NCN to the HSM Database
- Add a Switch to the HSM Database
- Create a Backup of the HSM Postgres Database
- Restore HSM Postgres from a Backup
- Restore HSM Postgres without a Backup
- Set BMC Management Role
- HSM API
- Heartbeat Tracker Daemon (HBTD) API
- Hardware Management Notification Fanout Daemon (HMNFD) API
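HSM identifies every component by an xname that encodes its physical location. As an unofficial illustration of that layout, the sketch below splits a node xname such as `x3000c0s19b0n0` into its cabinet, chassis, slot, BMC, and node fields:

```python
# Sketch: decompose a node xname into its location fields.
# This is an illustration of the xname layout, not an official parser.
import re

NODE_XNAME = re.compile(
    r"^x(?P<cabinet>\d+)c(?P<chassis>\d+)s(?P<slot>\d+)"
    r"b(?P<bmc>\d+)n(?P<node>\d+)$")

def parse_node_xname(xname):
    m = NODE_XNAME.match(xname)
    if m is None:
        raise ValueError(f"not a node xname: {xname}")
    return {k: int(v) for k, v in m.groupdict().items()}

print(parse_node_xname("x3000c0s19b0n0"))
# → {'cabinet': 3000, 'chassis': 0, 'slot': 19, 'bmc': 0, 'node': 0}
```

Other component types (BMCs, slots, chassis) use prefixes of the same pattern, which is why HSM queries can select hardware by location.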
The Hardware Management (HM) Collector is used to collect telemetry and Redfish events from hardware in the system.
Procedures for managing and setting up HPE PDUs.
Monitor and manage compute nodes (CNs) and non-compute nodes (NCNs) used in the HPE Cray EX system.
- Node Management
- Node Management Workflows
- Rebuild NCNs
- Reboot NCNs
- Enable Nodes
- Disable Nodes
- Find Node Type and Manufacturer
- Add Additional Air-Cooled Cabinets to a System
- Add Additional Liquid-Cooled Cabinets to a System
- Updating Cabinet Routes on Management NCNs
- Move a liquid-cooled blade within a System
- Add a Standard Rack Node
- Clear Space in Root File System on Worker Nodes
- Troubleshoot Issues with Redfish Endpoint Discovery
- Check for Redfish Events from Nodes
- Reset Credentials on Redfish Devices
- Access and Update Settings for Replacement NCNs
- Change Settings for HMS Collector Polling of Air Cooled Nodes
- Use the Physical KVM
- Launch a Virtual KVM on Gigabyte Nodes
- Launch a Virtual KVM on Intel Nodes
- Change Java Security Settings
- Configuration of NCN Bonding
- Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
- Check the BMC Failover Mode
- Update Compute Node Mellanox HSN NIC Firmware
- TLS Certificates for Redfish BMCs
- Dump a Non-Compute Node
- Enable Passwordless Connections to Liquid Cooled Node BMCs
- Configure NTP on NCNs
- Swap a Compute Blade with a Different System
- Swap a Compute Blade with a Different System Using SAT
- Replace a Compute Blade
- Replace a Compute Blade Using SAT
- Update the Gigabyte Node BIOS Time
- S3FS Usage Guidelines
- Defragment NID Numbering
- Repurpose a Compute Node as a UAN
- Clear Gigabyte CMOS
- Set Gigabyte Node BMC to Factory Defaults
- NCN Network Troubleshooting
- NCN Drive Identification
- Manual Wipe Procedures
- Build NCN Images Locally
- NCN Lifecycle Service (NLS) API
- Enable IPMI access on HPE iLO BMCs
- Update the HPE Node BIOS Time
- Switch PXE Boot from Onboard NIC to PCIe
- NCN NIC Replacement
Overview of the several different networks supported by the HPE Cray EX system.
- Network
- Access to System Management Services
- Default IP Address Ranges
- Connect to the HPE Cray EX Environment
- Connect to Switch over USB-Serial Cable
- Create a CSM Configuration Upgrade Plan
- Gateway Testing
HPE Cray EX systems can have network switches in many roles: spine switches, leaf switches, Leaf-BMC switches, and CDU switches. Newer systems have HPE Aruba switches, while older systems have Dell and Mellanox switches. Switch IP addresses are generated by Cray Site Init (CSI).
- HPE Cray EX Management Network Installation and Configuration Guide
- Update Management Network Firmware
- BICAN switch configuration
- Bonded UAN Configuration
The customer accessible networks (CMN/CAN/CHN) provide access from outside the customer network to services, NCNs, and User Access Nodes (UANs) in the system.
- Customer Accessible Networks
- Externally Exposed Services
- Connect to the CMN and CAN
- BI-CAN Aruba/Arista Configuration
- MetalLB Peering with Arista Edge Router
- CAN/CMN with Dual-Spine Configuration
- Troubleshoot CMN Issues
The DHCP service on the HPE Cray EX system uses the Internet Systems Consortium (ISC) Kea tool. Kea provides more robust management capabilities for DHCP servers.
The central DNS infrastructure provides the structural networking hierarchy and datastore for the system.
- DNS
- Manage the DNS Unbound Resolver
- Enable `ncsd` on UANs
- PowerDNS Configuration
- PowerDNS Migration Guide
- Troubleshoot Common DNS Issues
- Troubleshoot PowerDNS
External DNS, along with the Customer Management Network (CMN), Border Gateway Protocol (BGP), and MetalLB, makes it simpler to access the HPE Cray EX API and system management services. Services are accessible directly from a laptop without needing to tunnel into a non-compute node (NCN) or override `/etc/hosts` settings.
- External DNS
- External DNS `csi config init` Input Values
- Update the `cmn-external-dns` Value Post-Installation
- Ingress Routing
- External DNS Failing to Discover Services Workaround
- Troubleshoot Connectivity to Services with External IP addresses
- Troubleshoot DNS Configuration Issues
MetalLB is a component in Kubernetes that manages access to `LoadBalancer` services from outside the Kubernetes cluster. There are `LoadBalancer` services on the Node Management Network (NMN), Hardware Management Network (HMN), and Customer Access Network (CAN).
MetalLB can run in either `Layer2-mode` or `BGP-mode` for each address pool it manages. `BGP-mode` is used for the NMN, HMN, and CAN. This enables true load balancing (`Layer2-mode` provides failover, not load balancing) and allows for a more robust layer 3 configuration for these networks.
- MetalLB in BGP-Mode
- MetalLB Configuration
- Check BGP Status and Reset Sessions
- Troubleshoot Services without an Allocated IP Address
- Troubleshoot BGP not Accepting Routes from MetalLB
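The BGP-mode pools described above take roughly the following shape in MetalLB's legacy ConfigMap configuration. The peer addresses, ASNs, and pool ranges here are illustrative only; take real values from the MetalLB Configuration procedure:

```yaml
# Sketch of a MetalLB BGP configuration (legacy ConfigMap form).
# Peer addresses, ASNs, and pool ranges are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config
  namespace: metallb-system
data:
  config: |
    peers:
    - peer-address: 10.252.0.2
      peer-asn: 65533
      my-asn: 65533
    address-pools:
    - name: node-management
      protocol: bgp
      addresses:
      - 10.92.100.0/24
```

Each pool advertised over BGP is load-balanced across the worker nodes, which is why BGP session health (see Check BGP Status and Reset Sessions) directly affects service reachability.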
Spire provides the ability to authenticate nodes and workloads, and to securely distribute and manage their identities along with the credentials associated with them.
- Restore Spire Postgres without a Backup
- Troubleshoot Spire Failing to Start on NCNs
- Update Spire Intermediate CA Certificate
- Xname Validation
- Restore Missing Spire Meta-Data
- Create a Backup of the Spire Postgres Database
The Firmware Action Service (FAS) provides an interface for managing firmware versions of Redfish-enabled hardware in the system. FAS interacts with the Hardware State Manager (HSM), device data, and image data in order to update firmware.
See Update Firmware with FAS for a list of components that can be updated with FAS. Refer to the HPC Firmware Pack (HFP) product stream to update firmware on other components.
- Update Firmware with FAS
- Using the `FASUpdate` Script
- FAS CLI
- FAS API
- FAS Filters
- FAS Recipes and Procedures
- FAS Recipes
- FAS Admin Procedures
- Upload Olympus BMC Recovery Firmware into TFTP Server
- Updating Firmware on `m001`
- Updating Firmware without FAS
- Update iLO 5 firmware above `v2.78`
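A FAS action pairs filters that select hardware and a firmware image with a command block that controls the run. The sketch below uses illustrative IDs and values; with `overrideDryrun` set to `false`, the action remains a dry run and reports what would be updated without flashing anything:

```json
{
  "stateComponentFilter": {"deviceTypes": ["nodeBMC"]},
  "inventoryHardwareFilter": {"manufacturer": "cray"},
  "imageFilter": {"imageID": "00000000-0000-0000-0000-000000000000"},
  "command": {
    "overrideDryrun": false,
    "restoreNotPossibleOverride": true,
    "timeLimit": 1000,
    "description": "Dry run of nodeBMC firmware update"
  }
}
```

See FAS Filters and the FAS API reference above for the full set of filter fields.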
The System Admin Toolkit (SAT) is a command-line interface that can assist administrators with common tasks, such as troubleshooting and querying information about the HPE Cray EX System, system boot and shutdown, replacing hardware components, and more. In CSM 1.3 and newer, the `sat` command is available on the Kubernetes NCNs without installing the SAT product stream.
Starting in CSM 1.6.0, SAT is fully included in CSM. There is no longer a separate SAT product stream to install. SAT 2.6 releases, which accompanied CSM 1.5, are the last releases of SAT as a separate product.
The Install and Upgrade Framework (IUF) provides a CLI and API that automate the operations required to install, upgrade, and deploy non-CSM product content onto an HPE Cray EX system. Each product distribution includes an `iuf-product-manifest.yaml` file, which IUF uses to determine what operations are needed to install, upgrade, and deploy the product.
- Install and Upgrade Framework (IUF)
- Install and Upgrade Observability Framework
- Using the Argo UI
- Using Argo Workflows
Information on how to perform backups of individual services or the entire system, and how to restore from these backups.
- Create a Manual Backup of a Healthy Bare-Metal etcd Cluster
- Create a Manual Backup of a Healthy etcd Cluster
- Restore an etcd Cluster from a Backup
- Repopulate Data in etcd Clusters When Rebuilding Them
- Restore Bare-Metal etcd Clusters from an S3 Snapshot
- Create a Backup of the SLS Postgres Database
- Restore SLS Postgres Database from Backup
- Restore SLS Postgres without an Existing Backup
- Create a Backup of the HSM Postgres Database
- Restore HSM Postgres from a Backup
- Restore HSM Postgres without a Backup
- Create a Backup of the Spire Postgres Database
- Restore Spire Postgres without a Backup
- Spire Service Recovery
- Multi-tenancy Support
- Cray Hierarchical Namespace Controller (HNC) Manager
- Tenant Administrator Configuration
- Creating a Tenant
- Modifying a Tenant
- Removing a Tenant
- Slurm Operator
- HPE Slingshot Network Operator
- Tenant and Partition Management System (TAPMS) Overview
- TAPMS Tenant Status API
- Global Tenant Hooks
- Example Workflow