My understanding is that FSDP does not shard model buffers. Parameters are freed and return to their sharded state after state_dict()/summon_full_params(), but this does not happen with buffers. Even so, buffers still seem to be cloned, which may be unnecessary, so a small optimization could be made here: https://github.com/facebookresearch/fairscale/blob/main/fairscale/nn/data_parallel/fully_sharded_data_parallel.py#L2516
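To make the suggestion concrete, here is a minimal sketch of what skipping the clone for buffers could look like. It is not fairscale's code: `clone_params_only` and its `buffer_names` argument are hypothetical names, and the only assumption about the values is that they expose a `clone()` method, as `torch.Tensor` does.

```python
from typing import Any, Dict, Iterable, Mapping


def clone_params_only(
    state_dict: Mapping[str, Any], buffer_names: Iterable[str]
) -> Dict[str, Any]:
    """Hypothetical helper: clone every entry except the named buffers.

    Entries whose key is in `buffer_names` are passed through as-is, on the
    assumption that buffers are never sharded and so are not freed afterwards.
    All other entries are cloned so they survive the parameters being freed.
    """
    skip = set(buffer_names)
    return {
        key: value if key in skip else value.clone()
        for key, value in state_dict.items()
    }
```

With a PyTorch module, the buffer names could presumably be collected from `module.named_buffers()`, provided the names match the keys in the state dict (a wrapper may add prefixes). One trade-off: the buffers that are passed through alias the live module buffers, so a later in-place update to a buffer would also be visible in the returned dict.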
Are you talking about buffers that are separate from model parameters, or just the state_dict itself, which we end up calling clone on?
Yes, I do think we need to optimize the part where we call clone, since it runs into OOM errors.
Do you have any suggestions for what we can do to improve this?