[cleanup] remove ssd offload to simplify the FSDP code (#1080)
* simplified the readme

* clean up ssd offload

* try to fix readthedocs

Co-authored-by: Min Xu <[email protected]>
min-xu-ai and flying-x authored Sep 24, 2022
1 parent f4fcee7 commit e71d257
Showing 11 changed files with 66 additions and 2,266 deletions.
29 changes: 29 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,29 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# We need python > 3.8 due to a dependency on numpy.
build:
  os: ubuntu-20.04
  tools:
    python: "3.9"
    # You can also specify other tool versions:
    # nodejs: "16"
    # rust: "1.55"
    # golang: "1.17"

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/source/conf.py

# If using Sphinx, optionally build your docs in additional formats such as PDF
# formats:
# - pdf

# Optionally declare the Python requirements required to build your docs
python:
  install:
    - requirements: docs/requirements.txt
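
For context (not part of the diff): the Sphinx build this configuration drives can also be reproduced locally. A minimal sketch, assuming the packages from docs/requirements.txt are installed and using an arbitrary docs/_build/html output directory:

```python
# Local equivalent of the Read the Docs build configured above.
# Assumes the dependencies from docs/requirements.txt are installed.
from sphinx.cmd.build import build_main

# Same as running: sphinx-build -b html docs/source docs/_build/html
exit_code = build_main(["-b", "html", "docs/source", "docs/_build/html"])
raise SystemExit(exit_code)
```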
160 changes: 11 additions & 149 deletions README.md
@@ -25,23 +25,6 @@ FairScale was designed with the following values in mind:

[![Explain Like I’m 5: FairScale](https://img.youtube.com/vi/oDt7ebOwWIc/0.jpg)](https://www.youtube.com/watch?v=oDt7ebOwWIc)

## What's New:

* March 2022 [fairscale 0.4.6 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.6).
* We have support for CosFace's LMCL in MEVO. This is a loss function that is suitable for a large number of prediction target classes.
* January 2022 [fairscale 0.4.5 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.5).
* We have experimental support for layer wise gradient scaling.
* We enabled reduce_scatter operation overlapping in FSDP backward propagation.
* December 2021 [fairscale 0.4.4 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.4).
* FairScale is tested with the following PyTorch versions (with CUDA 11.2): 1.8.1, 1.10.0 and 1.11.0.dev20211101+cu111.
* November 2021 [fairscale 0.4.3 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.3).
* We have experimental support for offloading params to disk when using the FSDP API for evaluation workloads.
* We have an experimental layer that fuses multiple layers together to support large vocab size trainings.
* November 2021 [fairscale 0.4.2 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.2).
* We have a new experimental API called the LayerwiseMemoryTracker to help track, visualize and suggest fixes for memory issues occurring during the forward/backward pass of your models.
* Introducing SlowMoDistributedDataParallel API, a distributed training wrapper that is useful on clusters with slow network interconnects (e.g. Ethernet).
* September 2021 [`master` branch renamed to `main`](https://github.com/github/renaming).

## Installation

To install FairScale, please see the following [instructions](https://github.com/facebookresearch/fairscale/blob/main/docs/source/installation_instructions.rst).
@@ -50,134 +33,26 @@ You should be able to install a package with pip or conda, or build directly fro
## Getting Started
The full [documentation](https://fairscale.readthedocs.io/) contains instructions for getting started, deep dives and tutorials about the various FairScale APIs.

## Examples

Here are a few sample snippets from a subset of FairScale offerings:

### Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on cuda:0 and the next two layers run on cuda:1.

```python
import torch

import fairscale

# a, b, c, d are the four layer modules (e.g. torch.nn.Linear instances) defined elsewhere
model = torch.nn.Sequential(a, b, c, d)
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)
```

### Optimizer state sharding (ZeRO)
See a more complete example [here](https://github.com/facebookresearch/fairscale/blob/main/benchmarks/oss.py), but a minimal example could look like the following:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(
    rank: int,
    world_size: int,
    epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()
    base_optimizer = torch.optim.SGD  # pick any pytorch compliant optimizer here
    base_optimizer_arguments = {}  # pass any optimizer specific arguments here, or directly below when instantiating OSS

    # Wrap the optimizer in its state sharding brethren
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Supposing that WORLD_SIZE and EPOCHS are somehow defined somewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )
```

### AdaScale SGD

AdaScale can be used to wrap an SGD optimizer in DDP (Distributed Data Parallel)
training or in non-DDP training with gradient accumulation. The benefit is being able to re-use the same LR
schedule from a baseline batch size when the effective batch size is bigger.

Note that AdaScale does _not_ help increase per-GPU batch size.

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR # or your scheduler
from fairscale.optim import AdaScale

...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...
        step += optim.gain()
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True
```
## FSDP

The primary goal is to allow scaling to bigger batch sizes without losing model accuracy.
(However, training time might be longer compared to training without AdaScale.)

At a high level, we want ML researchers to:
* go parallel more easily (i.e. no need to find new learning rate schedules)
* not worry about losing accuracy
* potentially get higher GPU efficiency (fewer steps, less networking overhead, etc.)
FullyShardedDataParallel (FSDP) is the recommended method for scaling to large NN models.
This library has been [upstreamed to PyTorch](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/).
The version of FSDP here remains available for historical reference and for experimenting with
new research ideas in scaling techniques. Please see the following blog post
for [how to use FairScale FSDP and how it works](https://engineering.fb.com/2021/07/15/open-source/fsdp/).
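
The snippet below is not part of this commit; it is a minimal sketch of wrapping a model in FairScale's FSDP, using a placeholder single-node process-group setup, a tiny stand-in module, and synthetic data — adapt all of these to your own setup:

```python
import torch
import torch.distributed as dist

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def train(rank: int, world_size: int, epochs: int):
    # Placeholder single-node init; adjust backend/init_method for your cluster.
    dist.init_process_group(backend="nccl", init_method="tcp://localhost:29501", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Any torch.nn.Module works; a tiny MLP keeps this sketch self-contained.
    module = torch.nn.Sequential(
        torch.nn.Linear(32, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
    ).cuda()
    model = FSDP(module)  # parameters are flattened and sharded across ranks

    # Create the optimizer after wrapping so it references the sharded parameters.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model.train()
    for _ in range(epochs):
        inputs = torch.randn(64, 32, device="cuda")         # synthetic batch
        labels = torch.randint(0, 10, (64,), device="cuda")
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()  # gradients are reduce-scattered back to their owning shards
        optimizer.step()

    dist.destroy_process_group()
```

As in the OSS example above, one such process would typically be launched per GPU, for example with `torch.multiprocessing.spawn`.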

## Testing

We use CircleCI to test FairScale with the following PyTorch versions (with CUDA 11.2):
* the latest stable release (1.10.0)
* the latest LTS release (1.8.1)
* a recent nightly release (1.11.0.dev20211101+cu111)
* the latest stable release (e.g. 1.10.0)
* the latest LTS release (e.g. 1.8.1)
* a recent nightly release (e.g. 1.11.0.dev20211101+cu111)

Please create an [issue](https://github.com/facebookresearch/fairscale/issues) if you are having trouble with installation.

## Contributors

We welcome outside contributions! Please see the [CONTRIBUTING](CONTRIBUTING.md) instructions for how you can contribute to FairScale.
We welcome contributions! Please see the [CONTRIBUTING](CONTRIBUTING.md) instructions for how you can contribute to FairScale.

## License

@@ -198,22 +73,9 @@ If you use FairScale in your publication, please cite it by using the following

```BibTeX
@Misc{FairScale2021,
author = {Mandeep Baines and Shruti Bhosale and Vittorio Caggiano and Naman Goyal and Siddharth Goyal and Myle Ott and Benjamin Lefaudeux and Vitaliy Liptchinsky and Mike Rabbat and Sam Sheiffer and Anjali Sridhar and Min Xu},
author = {FairScale authors},
title = {FairScale: A general purpose modular PyTorch library for high performance and large scale training},
howpublished = {\url{https://github.com/facebookresearch/fairscale}},
year = {2021}
}
```

## FAQ
1. If you experience an error indicating that a default branch does not exist, it is probably due to the latest update, which switched the default branch from "master" to "main":
```
error: pathspec 'non-existing-branch' did not match any file(s) known to git
```
Please run the following commands to update to the main branch.
```
git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a
```
13 changes: 2 additions & 11 deletions benchmarks/fsdp.py
@@ -25,7 +25,6 @@
from benchmarks.golden_configs.lm_wikitext2 import FSDP as lm_wikitext2
from fairscale.nn import auto_wrap, default_auto_wrap_policy, enable_wrap
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.data_parallel import OffloadConfig

RPC_PORT = 29501

@@ -95,10 +94,7 @@ def get_lm_model(args, device, config):
    nhid = config["nhid"]
    ndecoder = config["num_decoder_layers"]

    if args.ssd_offload:
        return transformer_lm.TransformerLM(vocab_size, ninp, nhead, nhid, dropout, initrange, ndecoder)
    else:
        return transformer_lm.TransformerLM(vocab_size, ninp, nhead, nhid, dropout, initrange, ndecoder).to(device)
    return transformer_lm.TransformerLM(vocab_size, ninp, nhead, nhid, dropout, initrange, ndecoder).to(device)


def get_tensors_by_size_bucket():
@@ -200,7 +196,7 @@ def get_batch(source):
        if i > 0:
            total_tokens += source.numel()

        if args.benchmark_eval or args.ssd_offload:
        if args.benchmark_eval:
            input = source.cuda()
            target = target.cuda()
            output = model(input)
@@ -250,7 +246,6 @@ def get_number_of_words(data):


def benchmark_language_model(model_config, model, benchmark_config, model_specs, args):
    # TODO(anj): Uncomment and add a check for regression once we have a couple of runs.
    golden_config = get_golden_config(args.model_name, args)
    epoch = benchmark_config["epochs"]
    start_time = time.time()
@@ -358,8 +353,6 @@ def benchmark_fsdp(rank, args, world_size):
    model_config = create_model_config(args, benchmark_config=benchmark_config, model_specs=model_specs)
    model = model_config["model"]
    config = {}
    if args.ssd_offload:
        config["offload_config"] = OffloadConfig(offload_type="ssd_offload")

    if args.full_fp16:
        config["compute_dtype"] = torch.float16
@@ -386,15 +379,13 @@
parser.add_argument("--use_synthetic_data", action="store_true", help="Uses synthetic data for running benchmarks.")
parser.add_argument("--dry_run", action="store_true", help="Run a sample training run without regression testing.")
parser.add_argument(
    # TODO(anj-s): In the process of adding more models and hence the requirement for a flag.
    "--model_name",
    default="lm",
    help="Language Model(LM) used to benchmark FSDP.",
)
parser.add_argument("--debug", action="store_true", default=False, help="Display additional debug information")
parser.add_argument("--enable_auto_wrap", action="store_true", default=False, help="Use auto_wrap with FSDP")
parser.add_argument("--benchmark_eval", action="store_true", default=False, help="Benchmark evaluation workflow.")
parser.add_argument("--ssd_offload", action="store_true", default=False, help="Benchmark ssd_offload workflow.")
parser.add_argument("--full_fp16", action="store_true", default=False, help="Benchmark in full fp16 mode.")

if __name__ == "__main__":
4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -30,7 +30,7 @@
# -- Project information -----------------------------------------------------

project = "FairScale"
copyright = "2020-2021, Facebook/Meta AI Research"
copyright = "2020-2022, Facebook/Meta AI Research"
author = "Facebook/Meta AI Research"

# -- General configuration ---------------------------------------------------
@@ -68,7 +68,7 @@
autodoc_member_order = "bysource"

intersphinx_mapping = {
"python": ("https://docs.python.org/3.6", None),
"python": ("https://docs.python.org/3.8", None),
"numpy": ("https://numpy.org/doc/stable/", None),
"torch": ("https://pytorch.org/docs/stable/", None),
}
