Training problem #82

Open
hyyuan123 opened this issue Mar 16, 2023 · 2 comments
Comments

@hyyuan123

I did not change the original code; I only increased the amount of training data. When training PBAFN_e2e, training reaches epoch 78 and then stalls: the network neither makes progress nor reports an error. What could be the reason?
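One way to find out where a silently stalled run is stuck (for example, a rank waiting in a distributed collective or a DataLoader worker deadlock) is to have Python dump the stack of every thread when no new batch arrives for a while. Below is a minimal sketch using the standard-library `faulthandler` module. The `watch` helper name and the 10-minute default timeout are assumptions for illustration, not part of PF-AFN.

```python
import faulthandler
import sys
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def watch(batches: Iterable[T], timeout_s: float = 600.0) -> Iterator[T]:
    """Yield items from `batches`. If more than `timeout_s` seconds pass
    between two yields (processing one batch plus fetching the next), dump
    the Python stack of every thread to stderr. The process keeps running."""
    try:
        for batch in batches:
            # Calling dump_traceback_later again replaces the previous timer,
            # so the dump only fires when a single step stalls too long.
            faulthandler.dump_traceback_later(timeout_s, repeat=False,
                                              file=sys.stderr, exit=False)
            yield batch
    finally:
        # Disarm the timer when the loop finishes or is abandoned.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage inside the existing training loop
# (train_loader stands for whatever DataLoader the script builds):
# for i, data in enumerate(watch(train_loader, timeout_s=900)):
#     ...  # existing training step
```

With multiple GPUs each rank prints its own dump, which should show whether one rank is blocked in a collective while another is somewhere else.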

@MosbehBarhoumiRAI

Hello @hyyuan123, I am planning to train the same model on a larger dataset. Does the training problem still exist? If not, how did you resolve it?

@xxxxl888

I ran into a problem at the 50th epoch of training. Could you take a look and see what might have caused it?
```
Traceback (most recent call last):
  File "/home/xulei/CodePAF/PF-AFN-main/PF-AFN_train/train_PBAFN_stage1.py", line 196, in <module>
    warp_model.module.update_learning_rate(optimizer_warp)
  File "/usr/local/anaconda3/envs/xulei/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AFWM' object has no attribute 'module'
update learning rate: 0.000050 -> 0.000049
Traceback (most recent call last):
  File "/home/xulei/CodePAF/PF-AFN-main/PF-AFN_train/train_PBAFN_stage1.py", line 196, in <module>
    warp_model.module.update_learning_rate(optimizer_warp)
  File "/usr/local/anaconda3/envs/xulei/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AFWM' object has no attribute 'module'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2184173) of binary: /usr/local/anaconda3/envs/xulei/bin/python
```
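The `AttributeError` says that `warp_model` is a bare `AFWM` instance rather than a `DistributedDataParallel`/`DataParallel` wrapper, so it has no `.module` attribute. Presumably the failing call only runs once learning-rate decay begins (the log shows `update learning rate: 0.000050 -> 0.000049`), which would explain why earlier epochs did not hit it. A minimal workaround sketch, assuming the rest of the script is unchanged, is to unwrap the model only when it is actually wrapped. The `unwrap` helper and the stand-in classes below are hypothetical and exist only so the sketch runs without torch; in the real script the call at line 196 would become `unwrap(warp_model).update_learning_rate(optimizer_warp)`.

```python
def unwrap(model):
    """Return the underlying model.

    torch.nn.DataParallel and DistributedDataParallel keep the wrapped model
    in `.module`. A bare nn.Module raises AttributeError for `.module`, and
    getattr() turns that into the default (the model itself).
    Caveat: a model with its own submodule named "module" would return it.
    """
    return getattr(model, "module", model)


# Hypothetical stand-ins so the sketch runs without torch.
class FakeAFWM:
    def __init__(self):
        self.lr_updates = 0

    def update_learning_rate(self, optimizer):
        self.lr_updates += 1


class FakeDDP:
    def __init__(self, module):
        self.module = module


bare = FakeAFWM()
wrapped = FakeDDP(FakeAFWM())

# Works for both; `warp_model.module.update_learning_rate(...)` fails for `bare`.
for warp_model in (bare, wrapped):
    unwrap(warp_model).update_learning_rate(optimizer=None)
```

If multi-GPU training is intended, it is also worth checking why `warp_model` was not wrapped in `DistributedDataParallel` in the first place, since the `torch.distributed.elastic` error shows the script was started through the distributed launcher.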
