I modified the code to train on my own data and found that multi-GPU training always stalls, while single-GPU training works fine. Debugging shows it hangs at this line:

msg = model.load_state_dict(checkpoint, strict=False)

Why does this happen? The training command is:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 --master_port 12345 main.py
I changed `run` to `launch` in the training command and found that multi-GPU training now works. But why is that? Why does `run` not work?
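One known difference between the two launchers (a plausible cause here, though I can't confirm it from this thread alone): the deprecated `torch.distributed.launch` passes the local rank to each worker as a `--local_rank` command-line argument, while `torch.distributed.run` / `torchrun` exports a `LOCAL_RANK` environment variable instead. If `main.py` only reads `--local_rank`, then under `run` every process sees the default rank, all workers pin the same GPU, and the first collective can hang. A minimal sketch that handles both launchers (the function name is mine, not from the repo):

```python
import argparse
import os

def get_local_rank() -> int:
    """Resolve the local rank regardless of launcher.

    torch.distributed.launch passes it as a --local_rank argument,
    while torch.distributed.run / torchrun exports a LOCAL_RANK
    environment variable instead.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    args, _ = parser.parse_known_args()
    if args.local_rank >= 0:
        return args.local_rank                      # launch style
    return int(os.environ.get("LOCAL_RANK", "0"))   # run/torchrun style

# In main.py the device would then be pinned per process, e.g.:
# torch.cuda.set_device(get_local_rank())
```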
I also have a new question. My dataset has only 5 classes, so I modified the loaded pretrained weights without changing the network structure; the fully connected layer still has 1000 nodes. My change was the following: after loading the pretrained weights, I rewrote a few fully connected layers individually and then moved them to CUDA so that this part of the network would not stay on the CPU (I suspect this is exactly what triggered the later errors):

However, this approach raises an error in multi-GPU training; the error was:

Following the error message, I made this change:

After rerunning, training went through 1 epoch and then raised the error below:
I then switched to a different way of loading the pretrained weights: change the number of classes in the network definition to 5, and after loading the checkpoint, delete the fully connected layer's parameters from it. This approach seems to work; no errors so far.
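That second approach (drop the mismatched head weights from the checkpoint, then load with `strict=False`) can be sketched as follows. This is a minimal stand-in sketch: shapes are plain tuples here rather than torch tensors (in real code one would compare `tensor.shape`), and the key names `head.weight`/`head.bias` are placeholders for the actual FC-layer names in the model:

```python
# Stand-in state dicts: key -> shape tuple.
checkpoint = {
    "head.weight": (1000, 768),            # pretrained 1000-class head
    "head.bias": (1000,),
    "blocks.0.attn.qkv.weight": (2304, 768),
}
model_state = {
    "head.weight": (5, 768),               # model redefined with 5 classes
    "head.bias": (5,),
    "blocks.0.attn.qkv.weight": (2304, 768),
}

def strip_mismatched(checkpoint, model_state):
    """Delete checkpoint entries whose shape disagrees with the model,
    so load_state_dict(..., strict=False) can skip them cleanly."""
    dropped = [k for k, shape in checkpoint.items()
               if k in model_state and model_state[k] != shape]
    for k in dropped:
        del checkpoint[k]
    return dropped

dropped = strip_mismatched(checkpoint, model_state)
# The backbone weights survive; only the 1000-class head entries are removed.
# In real code the next step would be:
# msg = model.load_state_dict(checkpoint, strict=False)
```

With the mismatched keys gone, every rank loads an identical, consistent checkpoint, which also avoids the per-layer rewriting that seemed to break multi-GPU training.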
Could you help explain this? Many thanks!