Facing issue with improper file renaming while multi-node training in ROCm #868
I don't think this is related to ROCm itself, as streaming should be hardware agnostic... can you confirm that the file in question really doesn't exist in your environment?
Thanks 🙏 @ethantang-db for the reply. To tackle this, we rerun the launch in a retry loop (see the `until` loop in the script below); after 2-3 retries, the error vanishes. Shall I provide you my training script?
Yes please, if you are allowed to, along with the full launch args.
Hi @ethantang-db, please find the script below, along with the dataset YAML it references. Let me know if you need further info.

The script:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --job-name=finetune_FINAL #specify job name
#SBATCH --partition=<group> #specify your partition
#SBATCH --account=<group> #specify your relevant group/account
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --mem=0 #request all memory on the node
#SBATCH --time=210:40:00 #specify time for the job
#SBATCH --exclusive
#SBATCH --requeue
#SBATCH --output=<some_path>
#SBATCH --exclude=<some-node> #specify specific nodes, if you want those specific nodes
#SBATCH --nodelist=<some_nodes>
# Run commands or execute jobs/scripts
amd-smi list
echo $APPTAINER_CACHEDIR
echo $HF_HOME
pwd
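# derive a random master port from $RANDOM so concurrent jobs on the same head node don't collide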
master_port=$((20000 + $RANDOM % 40000))
echo $SLURM_JOB_NODELIST
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
head_node_ip=`getent hosts $head_node | awk '{ print $1 }'`
RDZV_ID=$SLURM_JOB_ID
NNODES=$SLURM_NNODES
echo $RDZV_ID
echo $NNODES
echo $head_node
echo $head_node_ip
echo $master_port
export OMP_NUM_THREADS=200
export HSA_FORCE_FINE_GRAIN_PCIE=1
export NCCL_DEBUG=INFO
export NCCL_ENABLE_DMABUF_SUPPORT=1
# export NCCL_DMABUF_ENABLE=1
### set these environment variables if using RoCE interconnect
export NCCL_IB_GID_INDEX=3
### set NIC cards
export NCCL_IB_HCA=mlx5_0,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_7,mlx5_8,mlx5_9 #rdma link
export NCCL_IB_DISABLE=0
export NCCL_NSOCKS_PERTHREAD=12
export NCCL_SOCKET_IFNAME=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 #ip a
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export RCCL_MSCCL_ENABLE=0
export RCCL_MSCCLPP_ENABLE=0
echo "RDZV_ID: ${RDZV_ID}"
echo "NNODES: ${NNODES}"
echo "head_node: ${head_node}"
echo "master_port: ${NNODES}"
ACCUM_STEPS=$((16/4/${NNODES}))
export HF_TOKEN=""
export S3_ENDPOINT_URL='https://idhpuomb10ix.compat.objectstorage.us-ashburn-1.oraclecloud.com'
export CACHE_DIR='/home/<user>/local/<user>/tmp'
srun -l rm -rf $CACHE_DIR/data_*
srun -l rm -rf $CACHE_DIR/checkpoints/
export DATA_CACHE="${CACHE_DIR}/data_${RANDOM}/"
LLM_VERSION="OLMo-1B-SFT"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="openai/clip-vit-large-patch14-336"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"
APPTAINER_PATH="/home/<user>/apptainer/llava_next_v0.1"
apptainer_cmd="apptainer exec --bind /mnt --no-mount /etc/hosts $APPTAINER_PATH "
################ FineTune ##############
PROMPT_VERSION="olmo"
MID_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain-full-finetune-stage2-si-more-chart-more-res-more-sft"
echo -e "\n\n"
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
echo -e "\n\n"
SI_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain-full-finetune-stage2-si-more-OCR-REPEAT_SCIENCE-MultiNode-NEW"
echo "SI_RUN_NAME: ${SI_RUN_NAME}"
echo -e "\n\n"
CKPT_PATH="/home/<user>/PROJECTS/LLaVA-NeXT/checkpoints/${MID_RUN_NAME}" # this could also be the previous stage checkpoint
NUM_GPUS=$((8*NNODES))
# ACCELERATE_CPU_AFFINITY=1
TRAIN_ARGS="--nproc_per_node=8 --nnodes=${NNODES} \
--rdzv_id=${RDZV_ID} \
--rdzv_backend=c10d \
--rdzv_endpoint=$head_node:$master_port \
llava/train/train_mem.py \
--deepspeed scripts/zero2.json \
--model_name_or_path ${CKPT_PATH} \
--version ${PROMPT_VERSION} \
--data_path scripts/train/LLaVA-Stage2-Single-Image-dataset-small_shard_12152024.yaml \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres \
--image_grid_pinpoints \"[(336, 336), (336, 672), (336, 1008), (336, 1344), (336, 1680), (672, 336), (672, 672), (1008, 336), (1344, 336), (1680, 336)]\" \
--mm_patch_merge_type spatial_unpad \
--bf16 True \
--run_name $SI_RUN_NAME \
--output_dir \"<user>/checkpoints/NEW/${SI_RUN_NAME}\" \
--resume_from_checkpoint True \
--num_nodes $NNODES \
--num_gpus $NUM_GPUS \
--num_train_epochs 2 \
--per_device_train_batch_size 6 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps ${ACCUM_STEPS} \
--eval_strategy "no" \
--save_strategy "steps" \
--save_steps 250 \
--save_total_limit 2 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb \
--dataloader_drop_last True \
--online_training True \
--dataloader_pin_memory False \
--dispatch_batches False"
# export MIOPEN_USER_DB_PATH="/tmp/my-miopen-cache"
export MIOPEN_USER_DB_PATH="$CACHE_DIR/my-miopen-cache"
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export MIOPEN_DISABLE_CACHE=true
rm -rf ${MIOPEN_USER_DB_PATH}
mkdir -p ${MIOPEN_USER_DB_PATH}
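# Pre-create the MIOpen user-DB files for this gfx942 / ROCm build so that
# concurrent ranks don't race on creating them at first kernel lookup
# (assumption: that race is what these touch commands work around)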
touch ${MIOPEN_USER_DB_PATH}/gfx942130.HIP.3_2_0_36bb7fd4a-dirty.ufdb.txt
touch ${MIOPEN_USER_DB_PATH}/gfx942130.HIP.3_2_0_36bb7fd4a-dirty.udb.txt
touch ${MIOPEN_USER_DB_PATH}/gfx942130.ukdb
export MIOPEN_DEBUG_DISABLE_SQL_WAL=1
cd "/home/<user>/PROJECTS/LLaVA-NeXT/"
max_attempts=10
attempt=1
until srun -l $apptainer_cmd bash -c "torchrun $TRAIN_ARGS"; do
echo "Attempt $attempt of $max_attempts failed with exit code $?. Retrying in 5 seconds..."
echo -e "\n\n"
if [ $attempt -ge $max_attempts ]; then
echo "Maximum attempts reached. Giving up."
rm -rf $CACHE_DIR/
exit 1
fi
attempt=$((attempt+1))
sleep 5
done
rm -rf $CACHE_DIR/
```
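Since the whole torchrun invocation is retried, one thing that can bite between attempts is stale shared-memory state left behind by a crashed streaming run. A minimal pre-attempt cleanup sketch; `clean_stale_shared_memory` exists in recent mosaicml-streaming releases, but running it as a separate step before each srun attempt is my own suggestion, not something the script above does:

```python
# Optional per-node cleanup to run before relaunching training.
# Clears shared-memory blocks left behind by a previous StreamingDataset
# run that crashed mid-epoch; safe when no other streaming job is active.
from streaming.base.util import clean_stale_shared_memory

if __name__ == '__main__':
    clean_stale_shared_memory()
```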
The dataset YAML:

```yaml
datasets:
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/VQA-RAD/'
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/infographic_vqa/'
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/iconqa/'
    choose: 1365
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/TabMWP/'
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa_nona_context/'
    choose: 960
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa_nona_context/'
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa/'
    repeat: 2
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/ureader_kg/'
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/aokvqa/'
  - shard_path: 's3://object/data/shards/LLaVA_Stage2/k12_printing/'
    choose: 2566
```
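For context on the `choose` and `repeat` keys: they line up with the sampling options on streaming's `Stream` class, so I'd expect the loader to expand this file into per-source streams roughly like the sketch below. The YAML parsing itself is LLaVA-NeXT internals, so treat this as an assumption about the mapping, not the actual code:

```python
import yaml
from streaming import Stream, StreamingDataset

with open('LLaVA-Stage2-Single-Image-dataset-small_shard_12152024.yaml') as f:
    cfg = yaml.safe_load(f)

# One Stream per shard_path; 'choose' caps the samples drawn from a source
# per epoch and 'repeat' over-samples it, mirroring the YAML keys above.
streams = [
    Stream(
        remote=d['shard_path'],
        choose=d.get('choose'),
        repeat=d.get('repeat'),
    )
    for d in cfg['datasets']
]

dataset = StreamingDataset(streams=streams, shuffle=True)
```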
Environment
Context
Trying to do multi-node SFT with datasets downloaded from S3 into the temporary storage of an apptainer sandbox. While doing so I am getting the following error, which vanishes after two or three retries, which is kind of strange: "Unable to rename a file as it doesn't exist".
Here is the complete error trace.
Streaming team, please lend me a hand in resolving this issue.
If any further details are needed, do let me know in the thread.
Thanks and any help is greatly appreciated!
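For anyone hitting the same thing, a workaround worth trying while this is investigated: give every node (and ideally every attempt) its own local cache directory, so no two processes ever race on renaming the same shard file. A minimal sketch, assuming mosaicml-streaming's `StreamingDataset` API and the SLURM variables the script above already relies on; the exact env-var plumbing here is hypothetical:

```python
import os
from streaming import StreamingDataset

# Key the local cache on job id + node id so concurrent jobs and retried
# attempts never share (and never race on renaming) the same shard files.
job_id = os.environ.get('SLURM_JOB_ID', 'local')
node_id = os.environ.get('SLURM_NODEID', '0')
local_dir = os.path.join(os.environ['CACHE_DIR'], f'data_{job_id}_{node_id}')

dataset = StreamingDataset(
    remote='s3://object/data/shards/LLaVA_Stage2/VQA-RAD/',  # one of the shard paths above
    local=local_dir,
    shuffle=True,
)
```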