
LSTM training fails on single GPU, but not with multiple GPUs #338

Open
thistlillo opened this issue Apr 11, 2022 · 4 comments

Comments

@thistlillo

With the latest versions of EDDL (1.2.0) and ECVL (1.1.0), I get a CUDA error when training the model on a single GPU. I have no problems when using 2 or 4 GPUs. The error occurs systematically at the beginning of the third epoch and does not seem to depend on the batch size. It also does not depend on the memory-consumption parameter ("full_mem", "mid_mem", or "low_mem"); I tried all of them. The GPU is an NVIDIA V100. With previous versions of the libraries this error did not occur (but I was using a different GPU).

Traceback (most recent call last):
  File "C01_2_rec_mod_edll.py", line 98, in <module>
    fire.Fire({
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C01_2_rec_mod_edll.py", line 46, in train
    rec_mod.train()
  File "/mnt/datasets/uc5/UC5_pipeline_forked/src/eddl_lib/recurrent_module.py", line 289, in train
    eddl.train_batch(rnn, [cnn_visual, thresholded], [Y])
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/pyeddl/eddl.py", line 435, in train_batch
    return _eddl.train_batch(net, in_, out)
RuntimeError: [CUDA ERROR]: invalid argument (1) raised in delete_tensor | (check_cuda)

The code is not yet available in the repository; please let me know what details I can add.

@salvacarrion
Contributor

Could you send a minimal script to debug it? With the information provided so far, I'm a bit lost.

@bernia

bernia commented May 9, 2022

Hello @thistlillo, we have been debugging this issue but we have not been able to reproduce the problem. Our tests run past five epochs in both configurations, with 1 and 2 GPUs. Do you think a virtual meeting would help?

@thistlillo
Author

Hello @bernia, and sorry for the late reply, but I did not receive any notification from GitHub about your response. I have now installed version 1.3 and next week I will perform some more tests. I will report back here.

The code published for UC5 is not up to date; it now also uses the ECVL dataloader. I work on a fork that I periodically merge back after cleaning up the code. I will also try to update the repository with the clean code.

@thistlillo
Author

Hello, I have found the cause of the issue. It is related to the size of the last batch: when the last batch contains fewer than "batch size" items, training an LSTM-based network fails. Training does not fail when the last (partial) batch is kept while training a convolutional neural network (ResNet-18 in my case).

Contrary to what I said earlier, the LSTM training fails both on a single GPU and on multiple GPUs. I was able to replicate the issue with the latest versions of ECVL and EDDL, both with and without cuDNN enabled.
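A minimal workaround sketch in plain Python, until the underlying bug is fixed: drop the trailing partial batch so that every call to eddl.train_batch receives exactly batch-size samples. The helper `full_batch_starts` and the sample counts below are my own illustration, not part of the EDDL/pyeddl API.

```python
def full_batch_starts(num_samples, batch_size):
    """Return the start indices of batches that contain exactly
    `batch_size` items, discarding any trailing partial batch."""
    num_full = num_samples // batch_size  # integer division drops the remainder
    return [i * batch_size for i in range(num_full)]

# Example: 1050 samples with batch size 100 -> 10 full batches,
# the trailing 50 samples are skipped in each epoch.
starts = full_batch_starts(num_samples=1050, batch_size=100)
print(len(starts))   # 10
print(starts[-1])    # 900
```

In the training loop, one would then iterate only over these start indices (e.g. `for s in starts: batch = data[s:s + batch_size]`) before calling eddl.train_batch, so the partial batch that triggers the CUDA "invalid argument" error is never fed to the network.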
