
Add troubleshooting section in README and improvements to training_example.py #192

Merged: 29 commits, Feb 12, 2024
Commits
1f74126
Add distributed-ml-training blueprint
Dec 12, 2023
3ebc805
Fix docs
Dec 12, 2023
d15e674
Fix docs
Dec 12, 2023
20f7587
Fix format
Dec 12, 2023
3148419
Add solution diagram and reference links
Dec 12, 2023
2d7bdbe
Add solution diagram and reference links
Dec 12, 2023
29a20e6
Fix typo in docs
Dec 13, 2023
fbadf11
Merge branch 'main' into main
sfloresk Dec 13, 2023
e372d83
Add training example file and simplify docs
Dec 14, 2023
6ccd034
Add training example script file
Dec 14, 2023
35cf696
Update task to use read only root fs
Dec 14, 2023
ecc44f3
Remove EFS - Add S3
sfloresk Jan 9, 2024
9c148b0
Merge branch 'main' of https://github.com/sfloresk/ecs-blueprints
sfloresk Jan 9, 2024
a9b4436
Fix bugs and remove deployment of supporting resources
sfloresk Jan 11, 2024
4001305
Change aws provider version to >= 5.0
sfloresk Jan 11, 2024
49f5c07
Improve docs format
sfloresk Jan 11, 2024
486dd47
Include region in test commands, change output bucket arn to id
sfloresk Jan 11, 2024
70500d1
Merge branch 'main' into main
sfloresk Jan 11, 2024
a32c91d
Change bucket ARN for name in the docs
sfloresk Jan 11, 2024
16daf47
Merge branch 'main' of https://github.com/sfloresk/ecs-blueprints
sfloresk Jan 11, 2024
da408a8
Add result of training script
sfloresk Jan 11, 2024
78c5fca
fix README type and enforce metadata service to v2
sfloresk Jan 12, 2024
7d13157
Merge branch 'main' into main
sfloresk Jan 12, 2024
e7f9bb4
Add troubleshooting section to README. Add checkpoint and remove shar…
sfloresk Feb 8, 2024
058a79c
Merge branch 'main' into main
joozero Feb 8, 2024
457dbe1
Fix capacity provider bug and trailing spaces
sfloresk Feb 11, 2024
da1da1b
Merge branch 'main' of https://github.com/sfloresk/ecs-blueprints
sfloresk Feb 11, 2024
aed800c
Add additional troubleshooting step
sfloresk Feb 12, 2024
338026f
Merge branch 'main' into main
joozero Feb 12, 2024
10 changes: 7 additions & 3 deletions terraform/ec2-examples/distributed-ml-training/README.md
@@ -144,9 +144,7 @@ Wrapping provided model in DistributedDataParallel.

Result(
metrics={'loss': 0.4192830347106792, 'accuracy': 0.8852},
path='dt-results-EXAMPLE/ecs_dt_results/TorchTrainer_d1824_00000_0_(...)',
filesystem='s3',
checkpoint=None
(...)
)
```

@@ -166,6 +164,12 @@ terraform destroy

```

## Troubleshooting

* Error: creating ECS Service (...): InvalidParameterException: The specified capacity provider (...) was not found: In some cases the capacity provider is still being created and is not yet ready to be used by a service. Execute "terraform apply" again to solve the issue (see the status-check sketch after this list).

* Error: waiting for ECS Service (...) delete: timeout while waiting for state to become 'INACTIVE' (last state: 'DRAINING', timeout: 20m0s): It can take several minutes for the service to finish draining. Wait 30 minutes and execute "terraform destroy" again to solve the issue.
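
Both errors come down to ECS state lagging behind Terraform. One way to confirm the state before retrying is to query ECS directly. The sketch below is illustrative only: it assumes boto3 is available and uses placeholder cluster, service, and region names that you would replace with the ones this blueprint creates.

```python
import boto3

# Placeholder names and region; substitute the values created by this blueprint.
CLUSTER = "distributed-ml-training"
SERVICES = ["distributed_ml_training_head", "distributed_ml_training_workers"]

ecs = boto3.client("ecs", region_name="us-west-2")

# Before re-running "terraform apply": are the capacity providers ready?
cluster = ecs.describe_clusters(clusters=[CLUSTER])["clusters"][0]
providers = cluster.get("capacityProviders", [])
if providers:
    for cp in ecs.describe_capacity_providers(capacityProviders=providers)["capacityProviders"]:
        print(cp["name"], cp["status"])  # ACTIVE means services can use it

# Before re-running "terraform destroy": have the services finished draining?
for svc in ecs.describe_services(cluster=CLUSTER, services=SERVICES)["services"]:
    print(svc["serviceName"], svc["status"])  # wait for DRAINING to become INACTIVE
```

Once the capacity providers report ACTIVE (for apply) or the services report INACTIVE (for destroy), re-running the corresponding Terraform command should succeed.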

## Support

Please open an issue for questions or unexpected behavior
14 changes: 12 additions & 2 deletions terraform/ec2-examples/distributed-ml-training/main.tf
@@ -128,6 +128,11 @@ module "autoscaling_head" {
}

tags = local.tags

metadata_options = {
http_endpoint = "enabled"
http_tokens = "required"
}
}

module "autoscaling_workers" {
@@ -177,6 +182,11 @@ module "autoscaling_workers" {
]

tags = local.tags

metadata_options = {
http_endpoint = "enabled"
http_tokens = "required"
}
}
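
The two metadata_options blocks above enforce IMDSv2 on the head and worker instances (http_tokens = "required"), so any process on those hosts has to use the session-token flow to read instance metadata. As a rough illustration of what that flow looks like (standard-library Python against the documented EC2 metadata endpoint; not part of this blueprint):

```python
import urllib.request

IMDS = "http://169.254.169.254"

# With http_tokens = "required", token-less IMDSv1 requests are rejected,
# so first obtain a session token (valid here for 6 hours).
token_request = urllib.request.Request(
    f"{IMDS}/latest/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
)
token = urllib.request.urlopen(token_request).read().decode()

# Send the token with every metadata call, e.g. to read the instance ID.
metadata_request = urllib.request.Request(
    f"{IMDS}/latest/meta-data/instance-id",
    headers={"X-aws-ec2-metadata-token": token},
)
print(urllib.request.urlopen(metadata_request).read().decode())
```

Recent AWS SDKs and the ECS agent perform this token exchange on their own, so the setting is transparent to the tasks and mainly blocks legacy IMDSv1 callers.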

module "autoscaling_sg" {
@@ -218,7 +228,7 @@ module "ecs_service_head" {
requires_compatibilities = ["EC2"]
capacity_provider_strategy = {
default = {
capacity_provider = "distributed_ml_training_head" # needs to match name of capacity provider
capacity_provider = module.ecs_cluster.autoscaling_capacity_providers["distributed_ml_training_head"].name # needs to match name of capacity provider
weight = 1
base = 1
}
@@ -314,7 +324,7 @@ module "ecs_service_workers" {
requires_compatibilities = ["EC2"]
capacity_provider_strategy = {
default = {
capacity_provider = "distributed_ml_training_workers" # needs to match name of capacity provider
capacity_provider = module.ecs_cluster.autoscaling_capacity_providers["distributed_ml_training_workers"].name # needs to match name of capacity provider
weight = 1
base = 1
}
training_example.py
@@ -13,7 +13,7 @@
import ray
import time
import argparse

import tempfile
# Get arguments

parser = argparse.ArgumentParser()
Expand All @@ -24,12 +24,6 @@
# Connect to the Ray cluster
ray.init()

# Download the data in the shared storage
transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
train_data = FashionMNIST(root='./data',
train=True, download=True,
transform=transform)

# Define the training function that the distributed processes will run
def train_func(config):
import os
@@ -50,11 +44,11 @@ def train_func(config):
# Setup loss and optimizer
criterion = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.001)
# Retrieve the data from the shared storage.

# Prepare data
transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
with FileLock(os.path.expanduser("./data.lock")):
train_data = FashionMNIST(root='./data', train=True, download=True, transform=transform)
# Download test data from open datasets
test_data = FashionMNIST(root="./data",train=False,download=True,transform=transform)
batch_size=128
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
@@ -89,8 +83,20 @@ def train_func(config):

test_loss /= len(test_loader)
accuracy = num_correct / num_total
# Report metrics and checkpoint to Ray.
ray.train.report(metrics={"loss": test_loss, "accuracy": accuracy})

    # Save the checkpoint
    with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
        checkpoint = None
        # Only the global rank 0 worker saves the checkpoint
        if ray.train.get_context().get_world_rank() == 0:
            torch.save(
                model.module.state_dict(),
                os.path.join(temp_checkpoint_dir, "model.pt"),
            )
            # Build the checkpoint from the directory that holds model.pt
            checkpoint = ray.train.Checkpoint.from_directory(temp_checkpoint_dir)

        # Report metrics and checkpoint to Ray
        ray.train.report(metrics={"loss": test_loss, "accuracy": accuracy}, checkpoint=checkpoint)

# The scaling config defines how many worker processes to use for the training. This usually equals the number of GPUs
scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
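
Because train_func now reports a checkpoint, the Result returned by the trainer carries the rank-0 model weights instead of checkpoint=None. The snippet below is a driver-side sketch of loading them back, assuming the TorchTrainer built later in this script is named trainer and that a hypothetical build_model() helper recreates the same architecture used in train_func:

```python
import os
import torch

# Run training as configured above; "trainer" is assumed to be the TorchTrainer
# this script builds from train_func, scaling_config, and the run configuration.
result = trainer.fit()
print(result.metrics)  # e.g. {'loss': ..., 'accuracy': ...}

if result.checkpoint is not None:
    # Materialize the checkpoint locally and load the saved state dict
    with result.checkpoint.as_directory() as checkpoint_dir:
        state_dict = torch.load(os.path.join(checkpoint_dir, "model.pt"), map_location="cpu")

    model = build_model()  # hypothetical helper mirroring the model in train_func
    model.load_state_dict(state_dict)
    model.eval()
```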