Instructions for installing and running NVIDIA's COSMOS model on ARC HPC systems
- Download NVIDIA's COSMOS model
git clone https://github.com/NVIDIA/Cosmos.git
(or over SSH)
git clone git@github.com:NVIDIA/Cosmos.git
- Download the environment file
cd Cosmos
wget https://raw.githubusercontent.com/umich-arc/arc-tutorial-nvidia-cosmos/refs/heads/main/cosmos.yml
- Load dependencies & create mamba environment
module load gcc cuda/12.6 cudnn/12.6 mamba/py3.11
source /sw/pkgs/arc/mamba/py3.11/etc/profile.d/conda.sh
mamba env create -f cosmos.yml
mamba activate cosmos_transformer
Warning
If CUDA-dependent build errors occur with mamba env create -f cosmos.yml, try running that command from within a GPU compute node.
For example:
salloc --partition=gpu --mem=30GB --gpus=1 --time=02:00:00
module load gcc cuda/12.6 cudnn/12.6 mamba/py3.11
source /sw/pkgs/arc/mamba/py3.11/etc/profile.d/conda.sh
mamba env create -f cosmos.yml
mamba activate cosmos_transformer
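Before retrying the build, it can help to confirm that the shell is actually running on a GPU node. The following is a minimal sketch, not part of the Cosmos repository or ARC tooling: gpu_visible is a hypothetical helper name, and it assumes the nvidia-smi utility from the NVIDIA driver is on the PATH of GPU nodes.

```shell
# Hypothetical helper (not part of the Cosmos repository or ARC tooling).
# Reports whether an NVIDIA GPU is visible from the current shell.
gpu_visible() {
    # nvidia-smi ships with the NVIDIA driver; -L lists the detected GPUs.
    # Assumption: on a node without a GPU the command is missing or fails.
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
        echo "GPU visible"
        return 0
    fi
    echo "no GPU visible"
    return 1
}
```

One possible use inside the salloc session is gpu_visible && mamba env create -f cosmos.yml, so that the environment build only starts when a GPU is detected.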
- After a successful installation, return to https://github.com/NVIDIA/Cosmos.git and follow the instructions for downloading the model weights and running the inference pipeline.
Tip
The model weights can easily reach 300 to 500 GB. We recommend downloading the weights to a persistent, high-performance storage volume such as Turbo, or to scratch (note: scratch is not persistent storage; refer to ARC's scratch storage policy for more information).
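Given the size of the weights, a quick free-space check on the target volume before downloading can save a failed transfer. This is a minimal sketch under stated assumptions: has_free_space is a hypothetical helper, not an ARC or Cosmos command, and df reports filesystem-level free space, which may differ from any per-group quota on the volume.

```shell
# Hypothetical helper: succeed only if the filesystem holding DIR has at
# least REQUIRED_GB gibibytes available.
has_free_space() {
    dir="$1"
    required_gb="$2"
    # df -P prints POSIX-format output (one line per filesystem);
    # -k reports sizes in 1024-byte blocks. Column 4 is the available space.
    avail_kb="$(df -Pk "$dir" | awk 'NR==2 {print $4}')"
    required_kb=$((required_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$required_kb" ]
}
```

For example, has_free_space /nfs/turbo/arcts-sw-ops/cosmos 500 || echo "not enough space" checks the example Turbo path used in this guide against the upper size estimate above.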
To configure the COSMOS repository to use model weights and parameters stored in a shared NFS Turbo location, follow these steps:
- Place the Checkpoints in the Turbo Location: Ensure the COSMOS model weights and parameters (referred to as "checkpoints") are stored in the shared NFS Turbo location. For example:
/nfs/turbo/arcts-sw-ops/cosmos/checkpoints
- Create a Symbolic Link in the Cosmos Repository: Navigate to the root of your local Cosmos repository and create a symbolic link pointing to the Turbo location:
cd /path/to/your/Cosmos
ln -s /nfs/turbo/arcts-sw-ops/cosmos/checkpoints checkpoints
- Verify the Setup: Run the following command to confirm that the symbolic link was created successfully:
ls -l
The output should include a line similar to:
lrwxrwxrwx 1 user group 42 Jan 23 15:20 checkpoints -> /nfs/turbo/arcts-sw-ops/cosmos/checkpoints
This indicates that the checkpoints symlink in your Cosmos repository correctly points to the desired turbo location.
- Use the Checkpoints: The repository will now use the model weights and parameters stored in the Turbo location whenever it accesses the checkpoints directory.
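The link and verify steps above can be combined into one small function. This is a sketch only: link_checkpoints is a hypothetical helper name, and it assumes the repository expects the weights under a checkpoints entry at its root, as described in the steps above.

```shell
# Hypothetical helper: link REPO/checkpoints to TARGET and confirm the result.
link_checkpoints() {
    target="$1"
    repo="$2"
    # Refuse to link to a directory that does not exist.
    [ -d "$target" ] || { echo "missing target: $target" >&2; return 1; }
    # Leave any existing checkpoints entry (directory or link) untouched.
    if [ -e "$repo/checkpoints" ] || [ -L "$repo/checkpoints" ]; then
        echo "already exists: $repo/checkpoints" >&2
        return 1
    fi
    ln -s "$target" "$repo/checkpoints" || return 1
    # readlink prints the path stored in the symlink; compare it to TARGET.
    [ "$(readlink "$repo/checkpoints")" = "$target" ]
}
```

With the example paths from this guide, the call would be link_checkpoints /nfs/turbo/arcts-sw-ops/cosmos/checkpoints /path/to/your/Cosmos. The function returns a non-zero status instead of overwriting an existing checkpoints directory, which guards against accidentally hiding weights that were already downloaded locally.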