[BUG] Enabling regularization causes CUDNN_STATUS_MAPPING_ERROR for deepfm example #445

Open
klmentzer opened this issue Mar 16, 2024 · 4 comments

klmentzer commented Mar 16, 2024

Describe the bug
Enabling regularization causes CUDNN_STATUS_MAPPING_ERROR for the deepfm example (it runs without problems when regularization is disabled). Also, passing the regularization parameter as the keyword argument lambda causes a syntax error, though this can be avoided by passing **{"lambda": 1e-3} as an argument instead.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the instructions for the DeepFM sample here
  2. Add the keyword argument use_regularization=True to the hugectr.Layer_t.BinaryCrossEntropyLoss layer and run the code. It fails with CUDNN_STATUS_MAPPING_ERROR.
  3. (Only for the syntax error) Specify the regularization parameter with the keyword argument lambda and attempt to rerun.
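
The syntax error in step 3 can be reproduced without HugeCTR, because lambda is a reserved word in Python and the parser rejects it as a keyword-argument name before any library code runs. The sketch below uses a hypothetical stand-in function, fake_layer, which is not part of HugeCTR and only echoes its keyword arguments. It shows why the dict-unpacking form gets through.

```python
# Hypothetical stand-in for a constructor that accepts a keyword
# argument named "lambda". It is NOT a HugeCTR function.
def fake_layer(**kwargs):
    return kwargs

# Writing the keyword directly is rejected by the Python parser,
# so the call never reaches the library.
try:
    compile("fake_layer(use_regularization=True, lambda=1e-3)", "<example>", "eval")
    direct_call_parses = True
except SyntaxError:
    direct_call_parses = False

# Unpacking a dict passes the same name as a string key, so the
# reserved word never appears as a keyword token in the source.
layer_args = fake_layer(use_regularization=True, **{"lambda": 1e-3})

print(direct_call_parses)
print(layer_args)
```

The same unpacking pattern works for any callable that takes keyword arguments whose names collide with Python reserved words.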

Expected behavior
The model should train with regularization, and the keyword argument should not cause a syntax error.

Screenshots

=====================================================Model Fit=====================================================
[HCTR][00:16:49.881][INFO][RK0][main]: Use non-epoch mode with number of iterations: 2300
[HCTR][00:16:49.881][INFO][RK0][main]: Training batchsize: 16384, evaluation batchsize: 16384
[HCTR][00:16:49.881][INFO][RK0][main]: Evaluation interval: 1000, snapshot interval: 1000000
[HCTR][00:16:49.881][INFO][RK0][main]: Dense network trainable: True
[HCTR][00:16:49.881][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][00:16:49.881][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][00:16:49.881][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][00:16:49.881][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][00:16:49.881][INFO][RK0][main]: Training source file: ./criteo_data/train/_file_list.txt
[HCTR][00:16:49.881][INFO][RK0][main]: Evaluation source file: ./criteo_data/val/_file_list.txt
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
  what():  Runtime error: CUDNN_STATUS_MAPPING_ERROR
        cudnnSetStream(cudnn_handle_, current_stream) (set_stream @ /hugectr/HugeCTR/include/gpu_resource.hpp:80)
[bf8877f31c66:585273] *** Process received signal ***
[bf8877f31c66:585273] Signal: Aborted (6)
[bf8877f31c66:585273] Signal code:  (-6)
[bf8877f31c66:585273] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f54ea3c5520]
[bf8877f31c66:585273] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f54ea4199fc]
[bf8877f31c66:585273] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f54ea3c5476]
[bf8877f31c66:585273] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f54ea3ab7f3]
[bf8877f31c66:585273] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f54e3257b9e]
[bf8877f31c66:585273] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f54e326320c]
[bf8877f31c66:585273] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f54e32621e9]
[bf8877f31c66:585273] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f54e3262959]
[bf8877f31c66:585273] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f54e4225884]
[bf8877f31c66:585273] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f54e4225f41]
[bf8877f31c66:585273] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7f54e32634cb]
[bf8877f31c66:585273] [11] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR11GPUResource10set_streamERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x345)[0x7f54e4dc33f5]
[bf8877f31c66:585273] [12] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR13StreamContextD1Ev+0x1b)[0x7f54e4dc367b]
[bf8877f31c66:585273] [13] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x2736f3)[0x7f54e45986f3]
[bf8877f31c66:585273] [14] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xa9bf69)[0x7f54e4dc0f69]
[bf8877f31c66:585273] [15] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR12GraphWrapper7captureESt8functionIFvP11CUstream_stEES3_+0x7b)[0x7f54e4a4452b]
[bf8877f31c66:585273] [16] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR17GraphScheduleable3runESt10shared_ptrINS_11GPUResourceEEb+0x1cc)[0x7f54e4dc0a5c]
[bf8877f31c66:585273] [17] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR8Pipeline9run_graphEv+0x10e)[0x7f54e4dc11ae]
[bf8877f31c66:585273] [18] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xb025a8)[0x7f54e4e275a8]
[bf8877f31c66:585273] [19] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x7f54aadcaa16]
[bf8877f31c66:585273] [20] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model5trainEv+0x14c)[0x7f54e4e2695c]
[bf8877f31c66:585273] [21] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model3fitEiiiiiNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb97)[0x7f54e4e3ce87]
[bf8877f31c66:585273] [22] /usr/local/hugectr/lib/hugectr.so(+0xdd164)[0x7f54e9f2d164]
[bf8877f31c66:585273] [23] /usr/local/hugectr/lib/hugectr.so(+0xa3644)[0x7f54e9ef3644]
[bf8877f31c66:585273] [24] python(+0x15a10e)[0x56453d33b10e]
[bf8877f31c66:585273] [25] python(_PyObject_MakeTpCall+0x25b)[0x56453d331a7b]
[bf8877f31c66:585273] [26] python(+0x168acb)[0x56453d349acb]
[bf8877f31c66:585273] [27] python(_PyEval_EvalFrameDefault+0x198c)[0x56453d32553c]
[bf8877f31c66:585273] [28] python(+0x13f9c6)[0x56453d3209c6]
[bf8877f31c66:585273] [29] python(PyEval_EvalCode+0x86)[0x56453d416256]
[bf8877f31c66:585273] *** End of error message ***
Aborted (core dumped)

Environment (please complete the following information):

Thanks for your help!

@JacoCheung
Collaborator

Hi @klmentzer, thanks for trying this out. There is a bug when the regularizer is used together with solver.use_cuda_graph=True. We will fix the bug in the upcoming release. Could you please disable CUDA graph as a workaround (WAR)?
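
A minimal sketch of that workaround follows. It assumes the solver is built with hugectr.CreateSolver and that use_cuda_graph is accepted there as a keyword argument, which is inferred from the solver.use_cuda_graph wording above and not verified against a particular release. The batch sizes and learning rate are copied from the log in the report. The helper names solver_kwargs and create_solver are hypothetical.

```python
def solver_kwargs(**overrides):
    """Return solver settings with CUDA graph capture forced off."""
    kwargs = {
        "batchsize": 16384,       # "Training batchsize" in the log
        "batchsize_eval": 16384,  # "evaluation batchsize" in the log
        "lr": 0.001,              # "lr" in the log
    }
    kwargs.update(overrides)
    # Workaround: never let a caller re-enable CUDA graph by accident.
    kwargs["use_cuda_graph"] = False
    return kwargs


def create_solver(**overrides):
    """Build the solver; assumes hugectr.CreateSolver takes these keywords."""
    import hugectr  # imported lazily so the helper above works without HugeCTR
    return hugectr.CreateSolver(**solver_kwargs(**overrides))


# Even if a caller passes use_cuda_graph=True, the setting stays off.
print(solver_kwargs(use_cuda_graph=True)["use_cuda_graph"])
```

With this in place, the "use cuda graph" field in the Model Fit log should read False if the setting took effect.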

JacoCheung self-assigned this May 14, 2024

Abatpool commented Aug 5, 2024

Is there any solution to this? I am getting the same issue when trying to run the DLRM training v3.1 benchmark on a DGX H100. I have tried versions of Nvidia-Merlin/HugeCTR after v23.08.00, such as v23.09.00 and the latest one too, but the same error persists. Can you please tell me how to fix it? @JacoCheung

Collaborator

JacoCheung commented Aug 5, 2024

Hi @Abatpool, have you tried turning cuda_graph off?


Abatpool commented Aug 5, 2024

Hi @Abatpool , have you tried turning cuda_graph off?

I did set it to False and used Nvidia-Merlin/HugeCTR v24.04.00 (verified release), but I am still facing the same error, as shown in the attached screenshot below.
image
