Enhance CI Workflow for Scenic Simulators: Improved Volume Management, Reliability, and Cost Efficiency #310

Merged
merged 38 commits into from
Nov 1, 2024
Commits
67317e3
added to needs of job stop_ec2_instance so all jobs run before it sta…
lola831 Oct 7, 2024
96c758a
changed start_ec2_instance to create volume from latest snapshot and …
lola831 Oct 7, 2024
8453174
changed stop_ec2_instance so that it stops the instance, then takes s…
lola831 Oct 7, 2024
2eebb22
corrected volume id output
lola831 Oct 7, 2024
252c837
changed volume to be root volume /dev/sda1
lola831 Oct 7, 2024
8ee54ab
updated volume Id to use Github variables
lola831 Oct 8, 2024
4caa5cc
testing volume id
lola831 Oct 8, 2024
bea467e
change needs for stop_ec2_instance for testing purposes
lola831 Oct 8, 2024
5b8e3bd
testing volume id
lola831 Oct 8, 2024
0e0c556
using new github_output
lola831 Oct 8, 2024
0b39c4d
trying outputs to pass volume id
lola831 Oct 8, 2024
a9659f5
volume id passed correctly to last job. checking workflow with tests
lola831 Oct 8, 2024
556436a
changed volume type and size
lola831 Oct 10, 2024
ceabc75
Add disk usage checks before and after simulator tests to evaluate vo…
lola831 Oct 10, 2024
63b0714
Reorder disk usage check to run after EC2 instance start
lola831 Oct 10, 2024
8025f26
checking workflow and volume usage
lola831 Oct 10, 2024
bbcef09
Update workflow to create 100 GiB sc1 volume from snapshot
lola831 Oct 10, 2024
574bed1
Fix format workflow by ensuring Python environment and installing iso…
lola831 Oct 11, 2024
5ce23eb
Increase timeout and temporarily disable volume deletion to allow sna…
lola831 Oct 11, 2024
1a4eff9
check workflow with new volume size 100GB
lola831 Oct 15, 2024
338835b
Refactor workflow to streamline instance startup and monitoring:
lola831 Oct 15, 2024
7a43c1a
Refactor instance status check code for initialization
lola831 Oct 15, 2024
843f67e
Revert to original 400 GiB standard volume creation in workflow
lola831 Oct 17, 2024
fd06c06
Increase CARLA connection timeout and improve error handling
lola831 Oct 22, 2024
18b7794
Change volume type to gp3
lola831 Oct 22, 2024
2da31a0
Revert changes: restored file to original state
lola831 Oct 23, 2024
94f4b35
Increase CARLA startup wait time and log connection duration
lola831 Oct 24, 2024
66d6b03
revert back to previous timeout times
lola831 Oct 24, 2024
fac5811
Increase CARLA startup time to 10 mins, log startup duration, and set…
lola831 Oct 28, 2024
f1fb281
Adjust CARLA connection settings: decrease wait loop, increase timeou…
lola831 Oct 28, 2024
3156704
Increase CARLA startup loop to 360 iterations and keep timeout at 180…
lola831 Oct 28, 2024
6c7748b
Revert to original connection settings (600 loops, 60s timeout) to in…
lola831 Oct 28, 2024
ebf69fa
Increased CARLA map load timeout to 120s, adjusted startup sleep time…
lola831 Oct 28, 2024
a87a4f0
Increased CARLA timeout to 180 seconds and kept 10-second sleep to en…
lola831 Oct 28, 2024
38c4b44
Lowered Carla timeout to 180s
lola831 Oct 29, 2024
87c88a1
Add SSH keep-alive options to CARLA tests to prevent broken pipe errors
lola831 Oct 29, 2024
e364c5d
Removed logging of CARLA connection times
lola831 Oct 29, 2024
3250886
Simplify snapshot and instance status checks using AWS wait commands
lola831 Oct 31, 2024
137 changes: 77 additions & 60 deletions .github/workflows/run-simulators.yml
@@ -10,13 +10,42 @@ jobs:
     runs-on: ubuntu-latest
     concurrency:
       group: sim
+    outputs:
+      volume_id: ${{ steps.create_volume_step.outputs.volume_id }}
+    env:
+      INSTANCE_ID: ${{ secrets.AWS_EC2_INSTANCE_ID }}
+      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+      AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }}
     steps:
+      - name: Create Volume from Latest Snapshot and Attach to Instance
+        id: create_volume_step
+        run: |
+          # Retrieve the latest snapshot ID
+          LATEST_SNAPSHOT_ID=$(aws ec2 describe-snapshots --owner-ids self --query 'Snapshots | sort_by(@, &StartTime) | [-1].SnapshotId' --output text)
+          echo "Checking availability for snapshot: $LATEST_SNAPSHOT_ID"
+
+          # Wait for the snapshot to complete
+          aws ec2 wait snapshot-completed --snapshot-ids $LATEST_SNAPSHOT_ID
+          echo "Snapshot is ready."
+
+          # Create a new volume from the latest snapshot
+          volume_id=$(aws ec2 create-volume --snapshot-id $LATEST_SNAPSHOT_ID --availability-zone us-west-1b --volume-type gp3 --size 400 --throughput 250 --query "VolumeId" --output text)
+          echo "Created volume with ID: $volume_id"
+
+          # Set volume_id as output
+          echo "volume_id=$volume_id" >> $GITHUB_OUTPUT
+          cat $GITHUB_OUTPUT
+
+          # Wait until the volume is available
+          aws ec2 wait volume-available --volume-ids $volume_id
+          echo "Volume is now available"
+
+          # Attach the volume to the instance
+          aws ec2 attach-volume --volume-id $volume_id --instance-id $INSTANCE_ID --device /dev/sda1
+          echo "Volume $volume_id attached to instance $INSTANCE_ID as /dev/sda1"
+
       - name: Start EC2 Instance
-        env:
-          INSTANCE_ID: ${{ secrets.AWS_EC2_INSTANCE_ID }}
-          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
-          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
-          AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }}
         run: |
           # Get the instance state
           instance_state=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID | jq -r '.Reservations[].Instances[].State.Name')
@@ -27,7 +56,7 @@ jobs:
             sleep 10
             instance_state=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID | jq -r '.Reservations[].Instances[].State.Name')
           done

           # Check if instance state is "stopped"
           if [[ "$instance_state" == "stopped" ]]; then
             echo "Instance is stopped, starting it..."
@@ -42,34 +71,17 @@ jobs:
             exit 1
           fi

-          # wait for status checks to pass
-          TIMEOUT=300 # Timeout in seconds
-          START_TIME=$(date +%s)
-          END_TIME=$((START_TIME + TIMEOUT))
-          while true; do
-            response=$(aws ec2 describe-instance-status --instance-ids $INSTANCE_ID)
-            system_status=$(echo "$response" | jq -r '.InstanceStatuses[0].SystemStatus.Status')
-            instance_status=$(echo "$response" | jq -r '.InstanceStatuses[0].InstanceStatus.Status')
-
-            if [[ "$system_status" == "ok" && "$instance_status" == "ok" ]]; then
-              echo "Both SystemStatus and InstanceStatus are 'ok'"
-              exit 0
-            fi
-
-            CURRENT_TIME=$(date +%s)
-            if [[ "$CURRENT_TIME" -ge "$END_TIME" ]]; then
-              echo "Timeout: Both SystemStatus and InstanceStatus have not reached 'ok' state within $TIMEOUT seconds."
-              exit 1
-            fi
-
-            sleep 10 # Check status every 10 seconds
-          done
+          # Wait for instance status checks to pass
+          echo "Waiting for instance status checks to pass..."
+          aws ec2 wait instance-status-ok --instance-ids $INSTANCE_ID
+          echo "Instance is now ready for use."
+

   check_simulator_version_updates:
     name: check_simulator_version_updates
     runs-on: ubuntu-latest
     needs: start_ec2_instance
-    steps:
+    steps:
       - name: Check for Simulator Version Updates
         env:
           PRIVATE_KEY: ${{ secrets.SSH_PRIVATE_KEY }}
@@ -109,11 +121,11 @@ jobs:
               echo "NVIDIA Driver is not set"
               exit 1
             fi
-          '
+          '
       - name: NVIDIA Driver is not set
         if: ${{ failure() }}
         run: |
-          echo "NVIDIA SMI is not working, please run the steps here on the instance:"
+          echo "NVIDIA SMI is not working, please run the steps here on the instance:"
           echo "https://scenic-lang.atlassian.net/wiki/spaces/KAN/pages/2785287/Setting+Up+AWS+VM?parentProduct=JSW&initialAllowedFeatures=byline-contributors.byline-extensions.page-comments.delete.page-reactions.inline-comments.non-licensed-share&themeState=dark%253Adark%2520light%253Alight%2520spacing%253Aspacing%2520colorMode%253Alight&locale=en-US#Install-NVIDIA-Drivers"

   run_carla_simulators:
@@ -128,17 +140,17 @@ jobs:
           USER_NAME: ${{secrets.SSH_USERNAME}}
         run: |
           echo "$PRIVATE_KEY" > private_key && chmod 600 private_key
-          ssh -o StrictHostKeyChecking=no -i private_key ${USER_NAME}@${HOSTNAME} '
+          ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 -o ServerAliveCountMax=3 -i private_key ${USER_NAME}@${HOSTNAME} '
             cd /home/ubuntu/actions/Scenic &&
             source venv/bin/activate &&
             carla_versions=($(find /software -maxdepth 1 -type d -name 'carla*')) &&
             for version in "${carla_versions[@]}"; do
-              echo "============================= CARLA $version ============================="
+              echo "============================= CARLA $version ============================="
               export CARLA_ROOT="$version"
               pytest tests/simulators/carla
             done
           '

   run_webots_simulators:
     name: run_webots_simulators
     runs-on: ubuntu-latest
@@ -164,39 +176,44 @@ jobs:
             done
             kill %1
           '

   stop_ec2_instance:
     name: stop_ec2_instance
     runs-on: ubuntu-latest
-    needs: [run_carla_simulators, run_webots_simulators]
-    steps:
+    needs: [start_ec2_instance, check_simulator_version_updates, check_nvidia_smi, run_carla_simulators, run_webots_simulators]
+    if: always()
+    env:
+      VOLUME_ID: ${{ needs.start_ec2_instance.outputs.volume_id }}
+      INSTANCE_ID: ${{ secrets.AWS_EC2_INSTANCE_ID }}
+      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+      AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }}
+    steps:
       - name: Stop EC2 Instance
-        env:
-          INSTANCE_ID: ${{ secrets.AWS_EC2_INSTANCE_ID }}
-          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
-          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
-          AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }}
         run: |
-          # Get the instance state
+          # Get the instance state and stop it if running
           instance_state=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID | jq -r '.Reservations[].Instances[].State.Name')
-
-          # If the machine is pending wait for it to fully start
-          while [ "$instance_state" == "pending" ]; do
-            echo "Instance is pending startup, waiting for it to fully start..."
-            sleep 10
-            instance_state=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID | jq -r '.Reservations[].Instances[].State.Name')
-          done
-
-          # Check if instance state is "stopped"
           if [[ "$instance_state" == "running" ]]; then
-            echo "Instance is running, stopping it..."
-            aws ec2 stop-instances --instance-ids $INSTANCE_ID
-          elif [[ "$instance_state" == "stopping" ]]; then
-            echo "Instance is stopping..."
+            echo "Instance is running, stopping it..."
+            aws ec2 stop-instances --instance-ids $INSTANCE_ID
+            aws ec2 wait instance-stopped --instance-ids $INSTANCE_ID
+            echo "Instance has stopped."
           elif [[ "$instance_state" == "stopped" ]]; then
-            echo "Instance is already stopped..."
-            exit 0
+            echo "Instance is already stopped."
           else
-            echo "Unknown instance state: $instance_state"
-            exit 1
+            echo "Unexpected instance state: $instance_state"
+            exit 1
           fi
+
+      - name: Detach Volume
+        run: |
+          # Detach the volume
+          aws ec2 detach-volume --volume-id $VOLUME_ID
+          aws ec2 wait volume-available --volume-ids $VOLUME_ID
+          echo "Volume $VOLUME_ID detached."
+
+      - name: Delete Volume
+        run: |
+          # Delete the volume after snapshot is complete
+          aws ec2 delete-volume --volume-id $VOLUME_ID
+          echo "Volume $VOLUME_ID deleted."
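The `Create Volume from Latest Snapshot and Attach to Instance` step picks its snapshot with the JMESPath query `Snapshots | sort_by(@, &StartTime) | [-1].SnapshotId`. As a rough illustration only, the Python sketch below mirrors that selection on data that has already been fetched. The function name `latest_snapshot_id` and the sample records are hypothetical and not part of this PR. The sketch assumes each record has `SnapshotId` and `StartTime` fields, with `StartTime` as an ISO 8601 string in a uniform format so that string order matches time order.

```python
def latest_snapshot_id(snapshots):
    """Return the SnapshotId of the record with the greatest StartTime.

    Mirrors `sort_by(@, &StartTime) | [-1].SnapshotId`: sort ascending by
    StartTime, then take the last element.
    """
    if not snapshots:
        # Design choice for this sketch: fail loudly on an empty list.
        raise ValueError("no snapshots to choose from")
    ordered = sorted(snapshots, key=lambda snap: snap["StartTime"])
    return ordered[-1]["SnapshotId"]


# Hypothetical sample data; the IDs and timestamps are made up.
sample_snapshots = [
    {"SnapshotId": "snap-bbb", "StartTime": "2024-10-15T09:30:00.000Z"},
    {"SnapshotId": "snap-ccc", "StartTime": "2024-10-31T17:45:00.000Z"},
    {"SnapshotId": "snap-aaa", "StartTime": "2024-10-07T12:00:00.000Z"},
]

print(latest_snapshot_id(sample_snapshots))
```

Because the workflow's query uses `--owner-ids self` with no further filter, it selects the newest snapshot the account owns in that region, whichever volume it came from.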
14 changes: 8 additions & 6 deletions tests/simulators/carla/test_actions.py
@@ -43,19 +43,21 @@ def getCarlaSimulator(getAssetPath):
         f"bash {CARLA_ROOT}/CarlaUE4.sh -RenderOffScreen", shell=True
     )

-    for _ in range(30):
+    for _ in range(180):
         if isCarlaServerRunning():
             break
         time.sleep(1)
+    else:
+        pytest.fail("Unable to connect to CARLA.")

     # Extra 5 seconds to ensure server startup
-    time.sleep(5)
+    time.sleep(10)

     base = getAssetPath("maps/CARLA")

     def _getCarlaSimulator(town):
         path = os.path.join(base, f"{town}.xodr")
-        simulator = CarlaSimulator(map_path=path, carla_map=town)
+        simulator = CarlaSimulator(map_path=path, carla_map=town, timeout=180)
         return simulator, town, path

     yield _getCarlaSimulator
@@ -76,7 +78,7 @@ def test_throttle(getCarlaSimulator):
         behavior DriveWithThrottle():
             while True:
                 take SetThrottleAction(1)

         ego = new Car at (369, -326), with behavior DriveWithThrottle
         record ego.speed as CarSpeed
         terminate after 5 steps
@@ -109,8 +111,8 @@ def test_brake(getCarlaSimulator):
             do DriveWithThrottle() for 2 steps
             do Brake() for 6 steps

-        ego = new Car at (369, -326),
-            with blueprint 'vehicle.toyota.prius',
+        ego = new Car at (369, -326),
+            with blueprint 'vehicle.toyota.prius',
             with behavior DriveThenBrake
         record final ego.speed as CarSpeed
         terminate after 8 steps
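The startup wait in the `getCarlaSimulator` fixture is a bounded polling loop. It probes the server once per second, stops at the first success, and fails the test if every attempt is used up. A minimal sketch of the same `for`/`else` pattern outside pytest follows. The helper `wait_until`, the `fake_probe` function, and the injectable `sleep` parameter are assumptions made for illustration, not part of this PR.

```python
import time


def wait_until(is_ready, attempts, delay_seconds=1.0, sleep=time.sleep):
    """Poll is_ready() up to `attempts` times; return True on first success."""
    for _ in range(attempts):
        if is_ready():
            break
        sleep(delay_seconds)
    else:
        # Reached only when the loop ends without a break,
        # i.e. every attempt reported "not ready".
        return False
    return True


# Hypothetical probe that reports ready on its third call.
probe_calls = []


def fake_probe():
    probe_calls.append(1)
    return len(probe_calls) >= 3


# Replace the real sleep with a no-op so the example runs instantly.
result = wait_until(fake_probe, attempts=180, sleep=lambda _seconds: None)
print(result, len(probe_calls))
```

In the fixture the `else` branch calls `pytest.fail` instead of returning `False`, so a server that never comes up is reported as a failed connection.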