multi GPU training doesn't seem to work #6

Open
ppisljar opened this issue Jun 18, 2023 · 2 comments

@ppisljar

I tested with a single GPU and training works fine. I am now testing with multiple GPUs and I noticed that the outer bar (counting the total number of steps) is not updating. After adding some print statements to the code, it seems that this statement in train.py:

for batchs in loader

returns batchs = [], [], [], [] (i.e. empty batches).

It seems something goes wrong in the data loader?
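A minimal sketch of the kind of per-rank check described above; `loader` and `batchs` follow the snippet quoted from train.py, while `rank` is a placeholder for the process index, so this is not the repo's exact code:

```python
# Hypothetical debugging snippet, not the repo's actual code: print what each
# rank's loader yields to confirm whether the grouped batches come back empty.
for batchs in loader:
    print(f"rank {rank}: {len(batchs)} groups, sizes = {[len(b) for b in batchs]}")
    break  # one iteration is enough to see whether the groups are empty
```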

@ppisljar (Author)

In database.py the batch size is set to the total batch size (rather than the batch size per GPU). This makes _collate_fn return an empty batch array. After fixing this I get batches in train.py (a sketch of that kind of change is included after the traceback below), but the process now fails with:

Epoch 1:   0%|1                                                                                                                                                                     | 1/894 [00:08<2:00:37,  8.10s/it]
Traceback (most recent call last):
  File "train.py", line 342, in <module>
    mp.spawn(train, nprocs=num_gpus, args=(args, configs, batch_size, num_gpus))
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/tts/Comprehensive-E2E-TTS/train.py", line 152, in train
    output = model(*(batch[2:]), step=step)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss.
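
For reference, a minimal sketch of the batch-size fix described before the traceback, assuming the loader is built with a DistributedSampler; `dataset`, `total_batch_size`, `num_gpus`, and `rank` are placeholder names, not the repo's exact API:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_loader(dataset, total_batch_size, num_gpus, rank):
    """Hypothetical helper: derive the per-GPU batch size from the total."""
    # Point from the comment above the traceback: the collate path must see the
    # per-GPU batch size, not the total, or the grouping produces empty batches.
    per_gpu_batch_size = total_batch_size // num_gpus
    sampler = DistributedSampler(dataset, num_replicas=num_gpus, rank=rank, shuffle=True)
    return DataLoader(
        dataset,
        batch_size=per_gpu_batch_size,  # per GPU, not the total across all GPUs
        sampler=sampler,
        collate_fn=getattr(dataset, "collate_fn", None),  # use the dataset's own collate hook if present
    )
```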




@ppisljar (Author)

Trying to set find_unused_parameters=True on DistributedDataParallel does NOT solve the problem.
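
For completeness, a sketch of where that flag goes when wrapping the model; `model` and `rank` are placeholders, and as noted above this alone did not resolve the error here, which usually points to forward outputs that never contribute to the loss:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch only: `model` and `rank` are placeholders. This is the flag suggested by
# the RuntimeError above; as noted, it did not fix the failure in this case.
model = model.to(rank)
model = DDP(model, device_ids=[rank], find_unused_parameters=True)
```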
