Multi-GPU training hangs #108

Open
SEUvictor opened this issue Mar 1, 2023 · 1 comment

Comments


SEUvictor commented Mar 1, 2023

I modified the code to train on my own data and found that multi-GPU training always gets stuck, while single-GPU training works fine. Debugging shows it hangs at the following step:
msg = model.load_state_dict(checkpoint, strict=False)
[screenshot of the code where it hangs]
Why does this happen?
The training command is:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 --master_port 12345 main.py
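
(For reference, a common pattern for loading a checkpoint under DDP is to map it to CPU on every rank and synchronize once loading is done. The sketch below is only a minimal illustration of that pattern; the helper name and the checkpoint layout are assumptions, not this repo's actual code:)

```python
import torch
import torch.distributed as dist

def load_pretrained(model, ckpt_path):
    # Map to CPU so every rank reads the file without piling onto GPU 0.
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    # Some checkpoints nest the weights under a 'model' key.
    state_dict = checkpoint.get('model', checkpoint)
    msg = model.load_state_dict(state_dict, strict=False)
    print(f"rank {dist.get_rank()}: missing={msg.missing_keys}, unexpected={msg.unexpected_keys}")
    # Make sure every rank has finished loading before training continues.
    dist.barrier()
    return msg
```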

SEUvictor (Author) commented Mar 1, 2023

I changed `run` to `launch` in the training command and multi-GPU training now works, but why is that? Why doesn't `run` work?
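
(One difference between the two launchers that may be relevant: `torch.distributed.launch` passes the local rank to the script as a `--local_rank` command-line argument, while `torch.distributed.run` / torchrun only sets the `LOCAL_RANK` environment variable. A minimal sketch of handling both launchers follows; the argument parsing shown here is illustrative, not necessarily what main.py actually does:)

```python
import argparse
import os
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank on the command line;
# torch.distributed.run (torchrun) sets the LOCAL_RANK environment variable instead.
parser.add_argument('--local_rank', type=int, default=-1)
args, _ = parser.parse_known_args()

local_rank = args.local_rank if args.local_rank >= 0 else int(os.environ.get('LOCAL_RANK', 0))

torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(backend='nccl')
```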

I also have a new question.
Because my own dataset has only 5 classes, I modified the loaded pretrained weights rather than the network structure; the fully connected head still has 1000 output nodes. My change was as follows: after loading the pretrained weights, I rewrote a few fully connected layers individually and then moved them to CUDA so that this part of the network would not be left on the CPU (I suspect this is what triggered the errors below):
[screenshot of the code change]
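
(The screenshot is not reproduced here; the sketch below is only a hypothetical reconstruction of the kind of change described, using a toy network and placeholder names such as `Net` and `head`:)

```python
import torch
import torch.nn as nn

# Toy stand-in for the real network; 'head' plays the role of the final FC layer.
class Net(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.backbone = nn.Linear(16, 32)
        self.head = nn.Linear(32, num_classes)

model = Net()
# model.load_state_dict(torch.load('pretrained.pth', map_location='cpu'), strict=False)

# After loading, individual fully connected layers are re-created and moved to
# the GPU so that this part of the network does not stay on the CPU.
model.head = nn.Linear(model.head.in_features, 1000)
if torch.cuda.is_available():
    model.head = model.head.cuda()
```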
However, this approach raises an error during multi-GPU training, as follows:
[screenshot of the error]
I made the change suggested by the error message:
[screenshot of the change]
After rerunning, training got through 1 epoch and then reported the following error:
[screenshot of the error]

I then switched to a different way of loading the pretrained weights: change the number of classes in the network structure to 5 and, after loading the checkpoint, delete the fully connected layer's parameters:
[screenshot of the code]
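
(Roughly along the lines of the sketch below; again a toy network with placeholder names, shown only to illustrate the delete-the-head-keys plus strict=False pattern:)

```python
import torch
import torch.nn as nn

# Toy stand-in for the real network; 'head' plays the role of the final FC layer.
class Net(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Linear(16, 32)
        self.head = nn.Linear(32, num_classes)

# Pretend this is the 1000-class pretrained checkpoint read from disk.
pretrained = Net(num_classes=1000).state_dict()

# Build the network directly with the new number of classes...
model = Net(num_classes=5)

# ...drop the old classifier weights so their shapes cannot clash...
for k in ['head.weight', 'head.bias']:
    pretrained.pop(k, None)

# ...and load the rest; strict=False simply reports the head keys as missing,
# leaving the new 5-class head with its random initialization.
msg = model.load_state_dict(pretrained, strict=False)
print(msg.missing_keys)  # ['head.weight', 'head.bias']
```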
This approach seems to work; no errors so far.

Could you help explain what is going on here? Thanks a lot!
