
Improper file renaming during multi-node training on ROCm #868

Open
rodosingh opened this issue Jan 27, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@rodosingh

rodosingh commented Jan 27, 2025

Environment

  • OS: [Ubuntu 22.04]
  • Hardware (GPU, or instance type): [MI300, ROCm==6.1]
  • NUM_NODES: [2, NODE-001 & NODE-025]
  • GPUs/NODE: [8]

Context

I am trying to run multi-node SFT with the datasets being downloaded from S3 into temporary storage inside an Apptainer sandbox. While doing so I hit the following error, which strangely disappears after two or three retries.

The rename fails because the source file does not exist:

0: [rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/home/<user>/local/<user>/tmp/data_8201/tmp/tmp_5/index.json.tmp' -> '/home/<user>/local/<user>/tmp/data_8201/tmp/tmp_5/index.json'
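For context, here is a minimal, hypothetical sketch of what I suspect is happening (my own guess, not the actual Streaming code; the helper name materialize_index is made up): two processes sharing the same local cache directory both follow a write-tmp-then-rename pattern, and whichever renames second finds the .tmp file already gone.

# Hypothetical reproduction sketch, NOT the actual Streaming implementation:
# two workers share one local cache dir and both write index.json.tmp,
# then os.rename() it to index.json. The worker that renames second may
# find the .tmp file already gone and raise FileNotFoundError, similar
# to the trace above.
import multiprocessing as mp
import os
import tempfile


def materialize_index(local_dir: str) -> None:  # hypothetical helper name
    tmp_path = os.path.join(local_dir, "index.json.tmp")
    final_path = os.path.join(local_dir, "index.json")
    with open(tmp_path, "w") as f:
        f.write("{}")  # stand-in for the downloaded index contents
    os.rename(tmp_path, final_path)  # second rename can hit FileNotFoundError


if __name__ == "__main__":
    local_dir = tempfile.mkdtemp()
    workers = [mp.Process(target=materialize_index, args=(local_dir,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # one worker may intermittently exit with FileNotFoundError

If something like that is happening, it might also explain why a retry succeeds: by then index.json already exists locally.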

Here is the complete error trace.

0: [rank0]: Traceback (most recent call last):
0: [rank0]:   File "/home/<user>/PROJECTS/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
0: [rank0]:     train()
0: [rank0]:   File "/home/<user>/PROJECTS/LLaVA-NeXT/llava/train/train.py", line 2012, in train
0: [rank0]:     data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
0: [rank0]:   File "/home/<user>/PROJECTS/LLaVA-NeXT/llava/train/train.py", line 1541, in make_supervised_data_module
0: [rank0]:     train_dataset = StreamingLLaVADataset(data_path=data_args.data_path, local=os.path.join(os.environ['DATA_CACHE'], 'tmp'),
0: [rank0]:   File "/home/<user>/PROJECTS/LLaVA-NeXT/llava/train/train.py", line 1381, in __init__
0: [rank0]:     super().__init__(streams=streams,
0: [rank0]:   File "/home/<user>/PROJECTS/LLaVA-NeXT/streaming/streaming/base/dataset.py", line 487, in __init__
0: [rank0]:     stream_shards = stream.get_shards(self._unique_rank_world, self.allow_unsafe_types)
0: [rank0]:   File "/home/<user>/PROJECTS/LLaVA-NeXT/streaming/streaming/base/stream.py", line 451, in get_shards
0: [rank0]:     os.rename(tmp_filename, filename)
0: [rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/home/<user>/local/<user>/tmp/data_8201/tmp/tmp_5/index.json.tmp' -> '/home/<user>/local/<user>/tmp/data_8201/tmp/tmp_5/index.json'
0: [rank0]:[W127 01:56:31.018773538 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
0: useocpm2m-401-001:1461592:1464352 [0] NCCL INFO [Service thread] Connection closed by localRank 0
0: useocpm2m-401-001:1461592:1464363 [0] NCCL INFO comm 0x410cd630 rank 0 nranks 16 cudaDev 0 busId 11000 - Abort COMPLETE
0: W0127 01:56:35.148000 139648015943488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1461593 closing signal SIGTERM
0: W0127 01:56:35.149000 139648015943488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1461594 closing signal SIGTERM
0: W0127 01:56:35.152000 139648015943488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1461595 closing signal SIGTERM
0: W0127 01:56:35.155000 139648015943488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1461596 closing signal SIGTERM
0: W0127 01:56:35.156000 139648015943488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1461597 closing signal SIGTERM
0: W0127 01:56:35.158000 139648015943488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1461598 closing signal SIGTERM
0: W0127 01:56:35.161000 139648015943488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1461599 closing signal SIGTERM
0: E0127 01:56:35.877000 139648015943488 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1461592) of binary: /opt/conda/envs/py_3.10/bin/python
0: Traceback (most recent call last):
0:   File "/opt/conda/envs/py_3.10/bin/torchrun", line 33, in <module>
0:     sys.exit(load_entry_point('torch==2.4.0+rocm6.1', 'console_scripts', 'torchrun')())
0:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
0:     return f(*args, **kwargs)
0:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
0:     run(args)
0:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
0:     elastic_launch(
0:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
0:     return launch_agent(self._config, self._entrypoint, list(args))
0:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
0:     raise ChildFailedError(
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
0: ============================================================
0: llava/train/train_mem.py FAILED
0: ------------------------------------------------------------
0: Failures:
0:   <NO_OTHER_FAILURES>
0: ------------------------------------------------------------
0: Root Cause (first observed failure):
0: [0]:
0:   time      : 2025-01-27_01:56:35
0:   host      : NODE-001.com
0:   rank      : 0 (local_rank: 0)
0:   exitcode  : 1 (pid: 1461592)
0:   error_file: <N/A>
0:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
0: ============================================================
1: W0127 01:56:36.666000 140355291236160 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 346571 closing signal SIGTERM
1: W0127 01:56:36.667000 140355291236160 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 346572 closing signal SIGTERM
1: W0127 01:56:36.669000 140355291236160 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 346573 closing signal SIGTERM
1: W0127 01:56:36.673000 140355291236160 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 346574 closing signal SIGTERM
1: W0127 01:56:36.675000 140355291236160 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 346575 closing signal SIGTERM
1: W0127 01:56:36.677000 140355291236160 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 346576 closing signal SIGTERM
1: W0127 01:56:36.680000 140355291236160 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 346577 closing signal SIGTERM
1: W0127 01:56:36.681000 140355291236160 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 346578 closing signal SIGTERM
srun: error: NODE-001: task 0: Exited with exit code 1
1: W0127 01:56:37.321000 140355291236160 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'NODE-025.com_346190_0' has failed to shutdown the rendezvous '16243' due to an error of type RendezvousConnectionError.
1: W0127 01:56:37.360000 140355291236160 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'NODE-025.com_346190_0' has failed to shutdown the rendezvous '16243' due to an error of type RendezvousConnectionError.
1: Traceback (most recent call last):
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 114, in _call_store
1:     return getattr(self._store, store_op)(*args, **kwargs)
1: torch.distributed.DistNetworkError: Connection reset by peer
1:
1: The above exception was the direct cause of the following exception:
1:
1: Traceback (most recent call last):
1:   File "/opt/conda/envs/py_3.10/bin/torchrun", line 33, in <module>
1:     sys.exit(load_entry_point('torch==2.4.0+rocm6.1', 'console_scripts', 'torchrun')())
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
1:     return f(*args, **kwargs)
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
1:     run(args)
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
1:     elastic_launch(
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
1:     return launch_agent(self._config, self._entrypoint, list(args))
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
1:     result = agent.run()
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
1:     result = f(*args, **kwargs)
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
1:     result = self._invoke_run(role)
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 867, in _invoke_run
1:     num_nodes_waiting = rdzv_handler.num_nodes_waiting()
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1189, in num_nodes_waiting
1:     self._state_holder.sync()
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 428, in sync
1:     get_response = self._backend.get_state()
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 74, in get_state
1:     base64_state: bytes = self._call_store("get", self._key)
1:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 116, in _call_store
1:     raise RendezvousConnectionError(
1: torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: NODE-025: task 1: Exited with exit code 1
Attempt 1 of 10 failed with exit code 1. Retrying in 5 seconds...

Streaming team, please lend me a hand in resolving this issue.
If any further details are needed, do let me know in the thread.

Thanks and any help is greatly appreciated!

@rodosingh rodosingh added the bug label Jan 27, 2025
@ethantang-db
Contributor

I don't think this is related to ROCm itself, since Streaming is, I believe, hardware agnostic... can you confirm that said file does indeed not exist in your environment?

@rodosingh
Author

Thanks 🙏 @ethantang-db for getting back to me.

To work around this, we wrap the launch in a retry loop, something like until srun -l torchrun $TRAIN_ARGS; do ..., and try up to a maximum of 10 times.

After 2-3 retries, the error vanishes. Shall I share my training script with you?

@ethantang-db
Contributor

Yes please, if you are allowed to, along with the full launch args.

@rodosingh
Author

rodosingh commented Feb 3, 2025

Hi @ethantang-db, please find the script below. The --data_path argument refers to the .yaml file listing the data shards in the S3 bucket; I have also attached a small snippet of it (also mentioned in issue #869). --online_training True means that checkpoints and data shards are loaded from and saved to S3.

Let me know if you need further info.

The Script:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --job-name=finetune_FINAL                       #specify job name
#SBATCH --partition=<group>                             #specify your partition
#SBATCH --account=<group>                               #specify your relevant group/account
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8                                    #specify no. of GPUs per node
#SBATCH --mem=0                                         #use all available memory on the node
#SBATCH --time=210:40:00                                #specify time for the job
#SBATCH --exclusive
#SBATCH --requeue
#SBATCH --output=<some_path>
#SBATCH --exclude=<some-node>    #exclude specific nodes, if needed
#SBATCH --nodelist=<some_nodes>  #or pin the job to specific nodes

# Run commands or execute jobs/scripts
amd-smi list
 
echo $APPTAINER_CACHEDIR
echo $HF_HOME

pwd

master_port=$((20000 + $RANDOM % 40000))
echo $SLURM_JOB_NODELIST
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
head_node_ip=`getent hosts $head_node | awk '{ print $1 }'`
RDZV_ID=$SLURM_JOB_ID
NNODES=$SLURM_NNODES
echo $RDZV_ID
echo $NNODES
echo $head_node
echo $head_node_ip
echo $master_port


export OMP_NUM_THREADS=200
export HSA_FORCE_FINE_GRAIN_PCIE=1
export NCCL_DEBUG=INFO
export NCCL_ENABLE_DMABUF_SUPPORT=1
# export NCCL_DMABUF_ENABLE=1
### set these environment variables if using RoCE interconnect
export NCCL_IB_GID_INDEX=3
### set NIC cards
export NCCL_IB_HCA=mlx5_0,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_7,mlx5_8,mlx5_9   #rdma link

export NCCL_IB_DISABLE=0
export NCCL_NSOCKS_PERTHREAD=12
export NCCL_SOCKET_IFNAME=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 #ip a

export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

export RCCL_MSCCL_ENABLE=0
export RCCL_MSCCLPP_ENABLE=0


echo "RDZV_ID: ${RDZV_ID}"
echo "NNODES: ${NNODES}"
echo "head_node: ${head_node}"
echo "master_port: ${NNODES}"
ACCUM_STEPS=$((16/4/${NNODES}))


export HF_TOKEN=""
export S3_ENDPOINT_URL='https://idhpuomb10ix.compat.objectstorage.us-ashburn-1.oraclecloud.com'
export CACHE_DIR='/home/<user>/local/<user>/tmp'
srun -l rm -rf $CACHE_DIR/data_*
srun -l rm -rf $CACHE_DIR/checkpoints/
export DATA_CACHE="${CACHE_DIR}/data_${RANDOM}/"


LLM_VERSION="OLMo-1B-SFT"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="openai/clip-vit-large-patch14-336"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

APPTAINER_PATH="/home/<user>/apptainer/llava_next_v0.1"
apptainer_cmd="apptainer exec --bind /mnt --no-mount /etc/hosts $APPTAINER_PATH  "

################ FineTune ##############

PROMPT_VERSION="olmo"


MID_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain-full-finetune-stage2-si-more-chart-more-res-more-sft"
echo -e "\n\n"
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
echo -e "\n\n"



SI_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain-full-finetune-stage2-si-more-OCR-REPEAT_SCIENCE-MultiNode-NEW"
echo "SI_RUN_NAME: ${SI_RUN_NAME}"
echo -e "\n\n"



CKPT_PATH="/home/<user>/PROJECTS/LLaVA-NeXT/checkpoints/${MID_RUN_NAME}"  # this could also be the previous stage checkpoint

NUM_GPUS=$((8*NNODES))
# ACCELERATE_CPU_AFFINITY=1 
TRAIN_ARGS="--nproc_per_node=8 --nnodes=${NNODES}  \
    --rdzv_id=${RDZV_ID} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$head_node:$master_port \
    llava/train/train_mem.py \
    --deepspeed scripts/zero2.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path scripts/train/LLaVA-Stage2-Single-Image-dataset-small_shard_12152024.yaml \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres \
    --image_grid_pinpoints \"[(336, 336), (336, 672), (336, 1008), (336, 1344), (336, 1680), (672, 336), (672, 672), (1008, 336), (1344, 336), (1680, 336)]\" \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --run_name $SI_RUN_NAME \
    --output_dir \"<user>/checkpoints/NEW/${SI_RUN_NAME}\" \
    --resume_from_checkpoint True \
    --num_nodes $NNODES \
    --num_gpus $NUM_GPUS \
    --num_train_epochs 2 \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps ${ACCUM_STEPS} \
    --eval_strategy "no" \
    --save_strategy "steps" \
    --save_steps 250 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --dataloader_drop_last True \
    --online_training True \
    --dataloader_pin_memory False \
    --dispatch_batches False"


# export MIOPEN_USER_DB_PATH="/tmp/my-miopen-cache"
export MIOPEN_USER_DB_PATH="$CACHE_DIR/my-miopen-cache"
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export MIOPEN_DISABLE_CACHE=true
rm -rf ${MIOPEN_USER_DB_PATH}
mkdir -p ${MIOPEN_USER_DB_PATH}
touch ${MIOPEN_USER_DB_PATH}/gfx942130.HIP.3_2_0_36bb7fd4a-dirty.ufdb.txt
touch ${MIOPEN_USER_DB_PATH}/gfx942130.HIP.3_2_0_36bb7fd4a-dirty.udb.txt
touch ${MIOPEN_USER_DB_PATH}/gfx942130.ukdb
export MIOPEN_DEBUG_DISABLE_SQL_WAL=1



cd "/home/<user>/PROJECTS/LLaVA-NeXT/"
max_attempts=10
attempt=1



until srun -l $apptainer_cmd bash -c "torchrun $TRAIN_ARGS"; do
    echo "Attempt $attempt of $max_attempts failed with exit code $?. Retrying in 5 seconds..."
    echo -e "\n\n"
    if [ $attempt -ge $max_attempts ]; then
        echo "Maximum attempts reached. Giving up."
        rm -rf $CACHE_DIR/
        exit 1
    fi
    attempt=$((attempt+1))
    sleep 5
done

rm -rf $CACHE_DIR/

--data_path:

datasets:
- shard_path: 's3://object/data/shards/LLaVA_Stage2/VQA-RAD/'
- shard_path: 's3://object/data/shards/LLaVA_Stage2/infographic_vqa/'
- shard_path: 's3://object/data/shards/LLaVA_Stage2/iconqa/'
  choose: 1365
- shard_path: 's3://object/data/shards/LLaVA_Stage2/TabMWP/'  
- shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa_nona_context/'  
  choose: 960
- shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa_nona_context/'  
- shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa/' 
  repeat: 2
- shard_path: 's3://object/data/shards/LLaVA_Stage2/ureader_kg/' 
- shard_path: 's3://object/data/shards/LLaVA_Stage2/aokvqa/' 
- shard_path: 's3://object/data/shards/LLaVA_Stage2/k12_printing/' 
  choose: 2566
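
For reference, here is a rough, simplified sketch of how these YAML entries get turned into streams in our dataset class (paraphrased from memory, not the exact train.py code; it assumes the standard streaming.Stream arguments remote, local, choose, and repeat, and the per-stream tmp_<i> cache directories are my guess at where paths like tmp_5 in the error come from):

# Simplified sketch of mapping the YAML above to Streaming streams
# (paraphrased, not our exact train.py; assumes streaming.Stream supports
# remote/local/choose/repeat as documented for MosaicML Streaming).
import os

import yaml
from streaming import Stream, StreamingDataset


def build_streams(yaml_path: str, local_root: str) -> list[Stream]:
    with open(yaml_path) as f:
        config = yaml.safe_load(f)
    streams = []
    for i, entry in enumerate(config["datasets"]):
        streams.append(
            Stream(
                remote=entry["shard_path"],                  # S3 prefix of the shard set
                local=os.path.join(local_root, f"tmp_{i}"),  # per-stream local cache dir (my guess)
                choose=entry.get("choose"),                  # optional subsampling
                repeat=entry.get("repeat"),                  # optional oversampling
            )
        )
    return streams


# Usage (hypothetical paths):
# streams = build_streams("<data_path>.yaml", os.path.join(os.environ["DATA_CACHE"], "tmp"))
# dataset = StreamingDataset(streams=streams, shuffle=True, batch_size=6)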
