LSTM training fails on single GPU, but not with multiple GPUs #338
Comments
Can you send a minimal script to debug it? With the information given so far, I'm a bit lost.
Hello @thistlillo, we have been debugging this issue but have not been able to reproduce the problem. Our tests run past five epochs in both configurations, with 1 and 2 GPUs. Do you think it would help to discuss this in a virtual meeting?
Hello @bernia, sorry for this late reply, but I did not receive any notification from GitHub about your message. I have now installed version 1.3 and will run some more tests next week; I will report back here. The code published for UC5 is not up to date: it now also uses the ECVL dataloader. I work on a fork that I periodically merge back after cleaning up the code, and I will also try to update the repository with the clean version.
Hello, I have found the cause of the issue: it is related to the size of the last batch. When the last batch contains fewer than batch-size items, training of an LSTM-based network fails. Training a convolutional network (resnet18 in my case) with the same incomplete last batch does not fail. Contrary to what I said earlier, the LSTM training fails both on a single GPU and on multiple GPUs. I was able to replicate the issue with the latest versions of ECVL and EDDL, both with and without cuDNN.
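For reference, a minimal sketch of the workaround implied above: drop the incomplete final batch so that every batch fed to the recurrent network has exactly `batch_size` items. The helper below is plain NumPy; the commented `train_on_batch` call and the array names are hypothetical placeholders, not part of ECVL or EDDL.

```python
import numpy as np

def iter_full_batches(x, y, batch_size):
    """Yield only complete batches; the trailing partial batch is dropped."""
    n_full = x.shape[0] // batch_size  # integer division discards the remainder
    for b in range(n_full):
        lo, hi = b * batch_size, (b + 1) * batch_size
        yield x[lo:hi], y[lo:hi]

# Hypothetical usage inside a training loop:
# for epoch in range(n_epochs):
#     for xb, yb in iter_full_batches(x_train, y_train, batch_size=32):
#         train_on_batch(net, xb, yb)  # placeholder for the per-batch training call
```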
With the latest versions of EDDL (1.2.0) and ECVL (1.1.0), I get a CUDA error when training the model on a single GPU. I have no problems when using 2 or 4 GPUs. The error occurs systematically at the beginning of the third epoch and does not seem to depend on the batch size. It also does not depend on the memory consumption parameter ("full_mem", "mid_mem" or "low_mem"); I tried all of them. The GPU is an NVIDIA V100. With previous versions of the libraries this error did not occur (but I was using a different GPU).
The code is not yet available in the repository; please let me know what details I can add.
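For context, a minimal sketch of how the two configurations being compared might be built with pyeddl. The tiny model, optimizer and the exact computing-service arguments are assumptions based on the public pyeddl examples, not the UC5 code.

```python
# Sketch only: builds a small model twice, once per computing service,
# to contrast the single-GPU (failing) and multi-GPU (working) setups.
from pyeddl import eddl

def build_net(cs):
    in_ = eddl.Input([784])
    out = eddl.Softmax(eddl.Dense(eddl.ReLu(eddl.Dense(in_, 128)), 10))
    net = eddl.Model([in_], [out])
    eddl.build(
        net,
        eddl.adam(0.001),
        ["soft_cross_entropy"],
        ["categorical_accuracy"],
        cs,
    )
    return net

# Single GPU, "full_mem" as one of the settings mentioned in the report:
net_1gpu = build_net(eddl.CS_GPU([1], mem="full_mem"))

# Two GPUs; the list acts as a per-GPU mask:
net_2gpu = build_net(eddl.CS_GPU([1, 1], mem="full_mem"))
```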