Overview:
The slurm-pytorch-ddp-boilerplate v0.1.0 offers a foundational setup to streamline deep learning projects on High-Performance Computing (HPC) clusters. This release integrates PyTorch's Distributed Data Parallel (DDP) with SLURM job scheduling and introduces support for Weights & Biases (wandb) for detailed experiment logging and sweeps.
Key Features:
- MNIST DDP Example: An introductory DDP solution for an MNIST classification task.
- Configuration Management: Implemented CurrentConfig for uniform configuration handling.
- SLURM Integration: Provides SLURM scripts tailored for HPC clusters.
- DDP Utilities: Includes DDP identity management, iterable datasets tailored for DDP, and device management.
- Weights and Biases Integration: Features a DDP wrapper for wandb to ensure distributed experiment logging and hyperparameter sweeps.
- Environment Setup: Offers environment setup scripts optimized for Linux/Mac and Windows.
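To make the DDP identity and DDP-aware dataset features more concrete, here is a minimal sketch of how a process launched by SLURM might work out its rank and take its share of a data stream. It is not the project's actual code: DDPIdentity, identity_from_env, and shard_for_rank are hypothetical names chosen for illustration. The only external facts assumed are the standard SLURM task variables SLURM_PROCID, SLURM_LOCALID, and SLURM_NTASKS.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DDPIdentity:
    """Hypothetical container; the project's own identity class may differ."""

    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes in the job

    @property
    def is_main(self) -> bool:
        # By convention, rank 0 handles logging and checkpointing.
        return self.rank == 0


def identity_from_env(env=None) -> DDPIdentity:
    """Read the identity from SLURM task variables, falling back to a
    single-process identity when they are not set (e.g. a local run)."""
    env = os.environ if env is None else env
    return DDPIdentity(
        rank=int(env.get("SLURM_PROCID", 0)),
        local_rank=int(env.get("SLURM_LOCALID", 0)),
        world_size=int(env.get("SLURM_NTASKS", 1)),
    )


def shard_for_rank(items, identity):
    """Yield every world_size-th item, starting at this rank's offset,
    so that no two ranks see the same sample."""
    for index, item in enumerate(items):
        if index % identity.world_size == identity.rank:
            yield item
```

A wandb wrapper for DDP could use the same is_main check so that only one process reports metrics, although how the project's wrapper does this is not stated here.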
Usage:
The primary entry point is main.py, which exposes a comprehensive set of command-line arguments for training configuration, DDP setup, and wandb integration. For deployment on HPC clusters, users can submit the provided SLURM scripts after adjusting them for their cluster.
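As a rough illustration of what a command-line entry point of this kind can look like, the sketch below builds a small argparse parser. Every flag in it (--epochs, --batch-size, --lr, --use-wandb) is hypothetical and chosen only for the example. The real argument names are defined in main.py and should be checked with its --help output.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser; all flag names here are hypothetical."""
    parser = argparse.ArgumentParser(description="Illustrative DDP training CLI")
    # Training configuration (hypothetical flags).
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=1e-3)
    # Experiment logging toggle (hypothetical flag).
    parser.add_argument("--use-wandb", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:], as a real script would.
    return build_parser().parse_args(argv)
```

Taking an explicit argument list in parse_args keeps the parser easy to exercise from tests or notebooks without touching sys.argv.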
Upcoming Features:
- Apptainer Support: Future releases will aim to integrate Apptainer for containerized deep learning environments.
This release marks the initial phase of the project, and we appreciate feedback and contributions from the community to enhance its capabilities in subsequent versions.