[BUG] CUDNN_STATUS_MAPPING_ERROR with cudnnSetStream #433
Comments
@minseokl Could you help investigate this issue with running NVIDIA's MLPerf Training workload?
Hi @rgandikota, what configuration are you using? Is it this one? If yes, could you turn off CUDA graphs and overlap to see if there is a more specific error message? CUDA graphs and overlap can be turned off here.
Hi @shijieliu. We ran the same training after turning off both CUDA graphs and overlap; the error is still the same. Please find the full stack trace below. We also wanted to highlight a warning we are seeing; not sure whether it can cause issues. Logs
It would be helpful if you could make the UCP version compatible. What command-line arguments are you passing to the training script?
@shijieliu - Here are the details (both test runs were on a single node).

Test run 1: we set the environment as configured here.
root@5795011ad9d8:/workspace/dlrm# source config_DGXH100_1x8x6912.sh
docker run --shm-size=1g --ulimit memlock=-1 --cap-add=sys_nice --security-opt seccomp=unconfined --runtime=nvidia --rm -it -v /mnt/weka/mlperf/data/dlrm/dataset/criteo_binary/:/data -v /mnt/weka/mlperf/data/dlrm/dataset/criteo_binary/:/data_val -it dlrm-mlperf:1

Test run 2: with default values.
root@5795011ad9d8:/workspace/dlrm# python train.py
I ran with the default values and it works well. This is my log.txt. I suspect the data preprocessing may be wrong, leading to an illegal memory access because of an overflow in the embedding input. Could you use the following code to check your dataset?
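(The code snippet originally attached here is not preserved in this thread. Below is a minimal sketch of such a check, assuming a raw one-file binary layout of 1 int32 label + 13 int32 dense + 26 int32 categorical values per sample; the path, layout, and chunk size are assumptions to adapt to your own preprocessing output.)

```python
# Minimal sketch of a category-range check, not the original snippet from the
# thread. Assumed layout per sample: 1 int32 label, 13 int32 dense features,
# 26 int32 categorical features, stored back to back in one binary file.
import numpy as np

NUM_DENSE = 13
NUM_CATEGORICAL = 26
INTS_PER_SAMPLE = 1 + NUM_DENSE + NUM_CATEGORICAL  # label + dense + categorical

def check_category_range(path, samples_per_chunk=1_000_000):
    mins = np.full(NUM_CATEGORICAL, np.iinfo(np.int64).max, dtype=np.int64)
    maxs = np.full(NUM_CATEGORICAL, np.iinfo(np.int64).min, dtype=np.int64)
    with open(path, "rb") as f:
        while True:
            buf = np.fromfile(f, dtype=np.int32,
                              count=samples_per_chunk * INTS_PER_SAMPLE)
            if buf.size < INTS_PER_SAMPLE:
                break
            # Drop any incomplete trailing sample, then slice out the
            # categorical columns of this chunk.
            buf = buf[: (buf.size // INTS_PER_SAMPLE) * INTS_PER_SAMPLE]
            cats = buf.reshape(-1, INTS_PER_SAMPLE)[:, 1 + NUM_DENSE:]
            mins = np.minimum(mins, cats.min(axis=0))
            maxs = np.maximum(maxs, cats.max(axis=0))
    for i in range(NUM_CATEGORICAL):
        print(f"categorical feature {i}: min={mins[i]}, max={maxs[i]}")

if __name__ == "__main__":
    check_category_range("/data/train_data.bin")  # path is an assumption
```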
The script prints the range of the input categories; on my side the output is:
You can share your output so I can help check.
@shijieliu The script for validating our data fails on a size assertion by the looks of it. Please find the output below:
Here is the process we used to preprocess our data with NVTabular to speed up the preprocessing: Is the script you shared specific to the output of the preprocessing method mentioned in this ReadMe, which uses CPU only?
Yes, this ReadMe should match our training scripts. At least one difference I can tell between the NVTabular flow and our ReadMe is that the NVTabular flow lacks the conversion of the one-hot dataset to a multi-hot dataset, which is step 1.5 in our ReadMe.
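(For readers unfamiliar with the distinction: a multi-hot sample carries a small fixed-length list of category ids per feature instead of a single id. The toy expansion below is only a conceptual illustration with made-up hotness and expansion rule; the actual conversion is done by the step 1.5 script referenced in the ReadMe.)

```python
# Toy illustration of one-hot -> multi-hot expansion (not the MLPerf script).
import numpy as np

def expand_to_multi_hot(one_hot_ids, hotness, cardinality, seed=0):
    """Expand a batch of single category ids into fixed-length id lists."""
    rng = np.random.default_rng(seed)
    # Derive (hotness - 1) additional ids from each original id; the real
    # benchmark defines its own deterministic mapping and per-feature hotness.
    offsets = rng.integers(1, cardinality, size=(len(one_hot_ids), hotness - 1))
    extra = (one_hot_ids[:, None] + offsets) % cardinality
    return np.concatenate([one_hot_ids[:, None], extra], axis=1)

ids = np.array([5, 17, 42])
print(expand_to_multi_hot(ids, hotness=3, cardinality=1000))
```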
Thanks much, @shijieliu. Is there any way we can download pre-processed data? It would be a great help if we could download it.
@shijieliu - Thanks for providing the snippet; it was helpful in verifying the dataset. We executed and tested the code you provided on the dataset, and you were right in pointing out that the conversion from a one-hot dataset to a multi-hot dataset was missing in a previous step. However, we are currently facing another issue on the A100 (multi-node) platform with HugeCTR, specifically regarding communication. FYI, we were able to successfully run BERT. Any input or information would be greatly appreciated and will assist us in moving forward.

A100-02:2358430:2358556 [1] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0
A100-02:2358430:2358556 [1] proxy.cc:1519 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 15, retcode 3
A100-02:2358430:2358556 [1] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0
A100-02:2358430:2358556 [1] proxy.cc:1519 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 14, retcode 3

/workspace/dlrm# python validate_dataset.py
Hi @rgandikota, glad to see the dataset seems right! As for your question, it seems that the NCCL init failed, which happens before training starts. Could you try setting
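(The exact setting suggested here is not preserved in the thread. As a hedged illustration, a common first step when NCCL init fails is to turn on NCCL debug logging and re-run, for example with Open MPI:)

```bash
# Illustrative only: enable verbose NCCL logging for the init and network
# subsystems; process count and script are placeholders.
mpirun -np 16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET python train.py
```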
Please find the log attached with the suggested configuration. (Anything more than 2 nodes results in an error.) Example run on 4 nodes: slurm-938.log
Error snippet:
Note: it works with 2 nodes [slurm-935_2nodes.log]
From the 4-node log, it seems there is some problem with the IB connection on 4 nodes, so the setup of IB failed on
Alternatively, you can use the env var NCCL_IB_DISABLE to disable IB in NCCL, but it will hurt performance a lot.
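(As a hedged illustration of the fallback mentioned above:)

```bash
# Disable the InfiniBand transport so NCCL falls back to TCP sockets.
# Useful only for isolating the failure; multi-node throughput will drop a lot.
export NCCL_IB_DISABLE=1
```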
@shijieliu - Thank you.
Please find the mpirun results.
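(The actual commands and numbers are not preserved in this thread. For reference, nccl-tests benchmarks are typically launched roughly as below; the host file, process counts, and message sizes are placeholders.)

```bash
# Illustrative nccl-tests runs for allreduce and all-to-all bandwidth.
mpirun -np 16 --hostfile hosts ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
mpirun -np 16 --hostfile hosts ./build/alltoall_perf -b 8 -e 4G -f 2 -g 1
```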
Thanks @jndinesh. The allreduce and all2all performance in NCCL is very poor compared with the expected performance over IB. This is a perf issue, but it may imply a functional issue as well, e.g. that nccl_test is not actually using IB. Could you try setting
According to the NCCL docs, could you also try to check
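(The specific item to check is not preserved above; since the reply below concerns memlock limits, a common check on each node is shown here as a hedged illustration.)

```bash
# Print the max locked-memory limit seen by the shell; RDMA/IB transports
# generally require this to be "unlimited".
ulimit -l
```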
@shijieliu - Thank you for your quick response and suggestion; your attention to detail is much appreciated. mpirun was executed on the host machine (bare-metal server), not in a container environment. Please find the config on the host machine: /etc/security/limits.conf
Thanks,
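(The actual file contents are not preserved above; for reference, memlock entries for RDMA/IB workloads typically look like this illustration.)

```
# /etc/security/limits.conf - illustrative memlock entries only
* soft memlock unlimited
* hard memlock unlimited
```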
@jndinesh The reason for your poor nccl_test perf may be a wrong NUMA configuration in the nccl_test run. Could you try running nccl_test with
Another thing I want to check is your card type. Could you share the output of
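(The exact command requested is not preserved; since the reply below attaches an lspci.txt, a typical way to report the adapter model and link state is shown here as an illustration.)

```bash
# List Mellanox/NVIDIA network adapters and show InfiniBand port status.
lspci | grep -i mellanox
ibstat
```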
Please find attached the lspci.txt log and the mpirun output below. Please note that instead of 64, it was configured to use 8 in mpirun -np 8; configuring with 64 instead of 8 results in an error. For your reference, the log with that configuration is also attached.
After checking with @jndinesh @rgandikota, the issue is solved by
Describe the bug
Facing a CUDNN_STATUS_MAPPING_ERROR with cudnnSetStream while running the MLCommons Training benchmark:
https://github.com/mlcommons/training_results_v3.1/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr
To Reproduce
Expected behavior
The DLRM reference implementation should start training on the cluster.
Environment (please complete the following information):
Additional context
terminate called recursively
what(): Runtime error: CUDNN_STATUS_MAPPING_ERROR
cudnnSetStream(cudnn_handle_, current_stream) (set_stream @ /workspace/dlrm/hugectr/HugeCTR/include/gpu_resource.hpp:80)
terminate called recursively
terminate called recursively
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
[A100-06:286040] *** Process received signal ***