
Production server migration finished temporarily (2025-01-05) #1241

Open
1 task
homework36 opened this issue Dec 18, 2024 · 5 comments
Comments

homework36 commented Dec 18, 2024

I have finished setting up the new production server with three instances:

  1. data storage instance (and NFS host)
  2. docker swarm manager with vGPU (16GiB memory version)
  3. docker swarm worker without vGPU (but a lot of supporting vCPUs)

This new distributed system responds much faster, and all containers are given far more resources, especially celery. We can now place services across the different instances according to our needs without many limitations.
The NFS data sharing is now handled entirely by docker swarm, which simplifies the configuration steps and is more secure, because the shared mount exists only while the docker stack is deployed and running properly.

  • TODO: Update the wiki page to include new steps.

Some problems:

  1. I tested classifying, and it works fast with the vGPU. However, I still get the same "model 3 does not exist" error for training that I saw on a smaller vGPU instance last summer.
  2. A startup script prevents rodan-main, celery, etc. from initializing when there is a lot of data. We figured out a temporary workaround but still need a long-term solution (see slow start up script in rodan-main with a lot of data #1243).

Also, I just discovered that when we delete a workflow, the related resources can also be deleted.
It is recommended to delete finished/failed workflow runs once they are no longer needed, as our storage disk is 81% full at the moment, and expanding this hardware will be very hard.
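Since disk pressure is the concern, a small check like the following could be cron'd on the data host to warn before the volume fills up. This is only a sketch; the mount path and the 80% threshold are assumptions, not part of the current setup.

```shell
# check_disk: warn when a filesystem is above a usage threshold.
# The path and threshold passed in are illustrative assumptions.
check_disk() {
  mount_point=$1
  threshold=$2
  usage=$(df --output=pcent "$mount_point" | tail -n 1 | tr -dc '0-9')
  if [ "$usage" -ge "$threshold" ]; then
    echo "WARNING: $mount_point is ${usage}% full (threshold ${threshold}%)"
    return 1
  fi
}
# Example cron usage on the NFS host (hypothetical path):
#   check_disk /var/lib/docker/volumes 80
```

The warning could be piped into the same Slack channel mentioned below instead of relying on someone noticing the disk is nearly full.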

Please send me Slack messages or emails if something is not working as before!

Original post in 2024-12:
... and hopefully we can complete it before the next semester.
Please make sure you download everything important! Ideally we will have all the data in the new server but just in case!
Staging might or might not be updated but user data will not be affected regardless.

homework36 commented Dec 24, 2024

Update:
Currently we are unable to launch a new vGPU instance. We are seeing the same error message as during the summer; Arbutus's response back then was as follows:

Message Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 91acc7be-7ed5-475e-b87f-4734eb807a52.
Code 500

After some investigations, my colleague confirmed that the reason you could not launch the larger vGPU flavor (g1-16gb-c8-40gb) is because the hardware that host this flavor is completely full. If you require a machine of this size, I suggest try launching the machine periodically; when someone frees up the resources you will be able to launch the machine.

We will keep trying for the next week or so, but since it is the holidays, it's possible that no one will shut down an existing VM.
The previous vGPU server has been converted to a data storage server with no GPU and limited RAM, so even if the production link works, the server behind it does not.

Update:
Weird. We can launch a new instance with the desired flavor if we do not create a new volume (as the root disk) at the same time, but the launch fails if we try to create the volume simultaneously. We have tried five times a day for a week so far. So we will now deploy an instance with only the default 80 GiB root disk and attach a 1000 GiB volume to it. Some settings and configurations will be different, but we don't want to wait indefinitely.
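Since capacity only frees up when someone releases a VM, the periodic launch attempts could be automated with a simple retry wrapper. A sketch, where the actual `openstack server create` invocation (image, network, server name) is an assumed placeholder:

```shell
# try_launch: run a launch command until it succeeds or attempts run out.
try_launch() {
  cmd=$1
  max_attempts=$2
  delay=$3
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    if $cmd; then
      echo "launched after $((attempt + 1)) attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "gave up after $max_attempts attempts"
  return 1
}
# Hypothetical usage: retry the g1-16gb-c8-40gb flavor hourly for a day.
#   try_launch "openstack server create --flavor g1-16gb-c8-40gb ..." 24 3600
```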

Update:
We are back at the first step, with a new 16GiB vGPU instance running Rodan successfully without the old data.
Arbutus has updated some of the hardware, so I need to test some extra settings. I'm also looking into GlusterFS as an alternative to NFS.

homework36 commented Jan 4, 2025

Experimented with a docker NFS volume (instead of a host-level NFS mount):

On the data host:
nano /etc/exports

Add the following:

/var/lib/docker/volumes/rodan_resources/_data 192.168.17.55(rw,sync,no_subtree_check,no_root_squash)
/var/lib/docker/volumes/rodan_pg_backup/_data        192.168.17.55(rw,sync,no_subtree_check,no_root_squash)
/var/lib/docker/volumes/rodan_pg_data/_data          192.168.17.55(rw,sync,no_subtree_check,no_root_squash)

Save and exit, then run:

sudo systemctl restart nfs-kernel-server
sudo systemctl status nfs-kernel-server
sudo exportfs -v

Modify production.yml and make sure the volumes are set as:

volumes:
  resources_nfs:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.17.244,rw,nfsvers=4
      device: ":/var/lib/docker/volumes/rodan_resources/_data"
  pg_backup_nfs:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.17.244,rw,nfsvers=4
      device: ":/var/lib/docker/volumes/rodan_pg_backup/_data"
  pg_data_nfs:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.17.244,rw,nfsvers=4
      device: ":/var/lib/docker/volumes/rodan_pg_data/_data"

With this setting, rodan_postgres starts properly, but rodan_rodan-main and the other related containers still do not.
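When a service using one of these volumes fails, it can help to rule out the export itself by mounting it by hand on the worker before letting docker create the volume. A sketch using the address and path from the config above (run on the swarm worker; assumes the NFS client utilities are installed):

```shell
# Manually verify the export is reachable and writable before blaming docker.
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs -o nfsvers=4 \
  192.168.17.244:/var/lib/docker/volumes/rodan_resources/_data /mnt/nfs-test
touch /mnt/nfs-test/.write-test && rm /mnt/nfs-test/.write-test
sudo umount /mnt/nfs-test
```

If the manual mount or the write test fails, the problem is in the export or network configuration rather than in the swarm volume definition.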

@homework36

 {
        "Id": "1947a25ee212781b783d5740a765101ce247894a066df3feb722b6219492e369",
        "Created": "2025-01-04T13:15:11.27505427Z",
        "Path": "/opt/entrypoint",
        "Args": [
            "/run/start"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 137,
            "Error": "",
            "StartedAt": "2025-01-04T13:15:32.791360604Z",
            "FinishedAt": "2025-01-04T13:19:50.77183377Z",

This is not the first time rodan-main has failed to launch properly. However, the docker log says this was not an OOM kill (OOMKilled is false even though the exit code is 137). I wonder if this has something to do with NFS.
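For what it's worth, exit codes above 128 encode "killed by signal (code − 128)", so 137 means the process received SIGKILL from outside; with OOMKilled false, that usually points at Docker itself (for example a failing healthcheck making Swarm kill the task, or a stop timeout) rather than the kernel OOM killer. A quick sketch:

```shell
# 137 = 128 + 9, i.e. the process was terminated by SIGKILL.
exit_code=137
signal=$((exit_code - 128))
echo "container killed by signal $signal"
# To read the same fields from a live container (container name illustrative):
#   docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' <container-id>
```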

Currently, there are two (easiest) ways to share data using NFS:
(1) use the docker volume NFS settings directly (docker swarm handles the NFS client side while the stack is running)
(2) mount the NFS export locally and use a docker bind mount (we set up the NFS client ourselves and docker connects to the local mount point)

I have tried option (1); everything up to postgres works fine.
I will try option (2) tomorrow.
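For option (2), each swarm node would mount the export itself (e.g. via /etc/fstab) and the services would use a plain bind mount. A sketch; the /mnt/rodan_resources mount point and the mount options are assumptions:

```shell
# build_fstab_entry: compose an /etc/fstab line for an NFSv4 export.
build_fstab_entry() {
  server=$1
  export_path=$2
  mount_point=$3
  printf '%s:%s %s nfs4 rw,soft,timeo=600 0 0\n' \
    "$server" "$export_path" "$mount_point"
}
build_fstab_entry 192.168.17.244 \
  /var/lib/docker/volumes/rodan_resources/_data /mnt/rodan_resources
# Append the output to /etc/fstab on each node, run `sudo mount -a`, then in
# production.yml replace the named volume with a bind mount:
#   volumes:
#     - "/mnt/rodan_resources:/rodan/data"
```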

homework36 commented Jan 5, 2025

This has been moved to a dedicated issue: #1243.
I figured out that it happens because we already have too many files, so this line in scripts/start and scripts/start-celery takes a very long time to run:

chmod -R a+rwx /rodan

Currently we first run this on the NFS host

chmod -R a+rwx /var/lib/docker/volumes/rodan_resources/_data

and then verify using

find /var/lib/docker/volumes/rodan_resources/_data ! -perm 0777 -exec ls -ld {} \;

Then we replace the original command section in production.yml with tail -f /dev/null.
After launching the docker stack, we manually go into each container, comment out the chmod -R a+rwx /rodan line in /run/start or /run/start-celery, and execute the script:

docker ps --filter "name=rodan_rodan-main" --format "{{.ID}}" | xargs -I {} docker exec {} sed -n "11p" /run/start
docker ps --filter "name=rodan_rodan-main" --format "{{.ID}}" | xargs -I {} bash -c 'docker exec {} sed -i "11s/^/ #/" /run/start && docker exec {} /run/start'


docker ps --filter "name=rodan_celery" --format "{{.ID}}" | xargs -I {} docker exec {} sed -n "8p" /run/start-celery
docker ps --filter "name=rodan_celery" --format "{{.ID}}" | xargs -I {} bash -c 'docker exec {} sed -i "8s/^/ #/" /run/start-celery && docker exec {} /run/start-celery'

docker ps --filter "name=rodan_gpu-celery" --format "{{.ID}}" | xargs -I {} docker exec {} sed -n "8p" /run/start-celery
docker ps --filter "name=rodan_gpu-celery" --format "{{.ID}}" | xargs -I {} bash -c 'docker exec {} sed -i "8s/^/ #/" /run/start-celery && docker exec {} /run/start-celery'

docker ps --filter "name=rodan_py3-celery" --format "{{.ID}}" | xargs -I {} docker exec {} sed -n "8p" /run/start-celery
docker ps --filter "name=rodan_py3-celery" --format "{{.ID}}" | xargs -I {} bash -c 'docker exec {} sed -i "8s/^/ #/" /run/start-celery && docker exec {} /run/start-celery'

We might need to rebuild all the images to remove this line, but we also need some way to keep the permissions correct. Still thinking about a long-term solution.
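One possible direction (untested on the production data, so treat it as a sketch): make the permission fix conditional, so it only issues a chmod for entries whose bits are actually wrong. After one full pass, repeat runs mostly read metadata instead of writing to every file over NFS, which should be far cheaper than the unconditional `chmod -R`:

```shell
# fix_perms: chmod only entries missing some a+rwx bit; a near no-op once
# the tree has been fixed a first time.
fix_perms() {
  root=$1
  find "$root" ! -perm -777 -exec chmod a+rwx {} +
}
# On the NFS host (path from the commands above):
#   fix_perms /var/lib/docker/volumes/rodan_resources/_data
```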
[Screenshot: 2025-01-05 at 14:22:43]

@homework36 homework36 changed the title Reminder: production server migration starting next week (23-12-2024) Production server migration finished (2025-01-05) Jan 5, 2025
@homework36 homework36 pinned this issue Jan 5, 2025
@homework36 homework36 changed the title Production server migration finished (2025-01-05) Production server migration finished temporarily (2025-01-05) Jan 5, 2025
@homework36

The production.yml used for now:

version: "3.4"

services:

  nginx:
    image: "ddmal/nginx:v3.1.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.25"
          memory: 0.5G
        limits:
          cpus: "0.25"
          memory: 0.5G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "/usr/sbin/service", "nginx", "status"]
      interval: "30s"
      timeout: "10s"
      retries: 10
      start_period: "5m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      TLS: 1
    ports:
      - "80:80"
      - "443:443"
      - "5671:5671"
      - "9002:9002"
    volumes:
      - "resources_nfs:/rodan/data"

  rodan-main:
    image: "ddmal/rodan-main:v3.1.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1.5"
          memory: 6G
        limits:
          cpus: "1.5"
          memory: 6G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "/usr/bin/curl -H 'User-Agent: docker-healthcheck' http://localhost:8000/api/?format=json || exit 1"]
      interval: "30s"
      timeout: "30s"
      retries: 5
      start_period: "15m"
    command: tail -f /dev/null
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      CELERY_JOB_QUEUE: None
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources_nfs:/rodan/data"

  rodan-client:
    image: "ddmal/rodan-client:nightly"
    deploy:
      placement:
        constraints:
          - node.role == worker
    volumes:
      - "./rodan-client/config/configuration.json:/client/configuration.json"

  iipsrv:
    image: "ddmal/iipsrv:nightly"
    volumes:
      - "resources_nfs:/rodan/data"

  celery:
    image: "ddmal/rodan-main:v3.1.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 6G
        limits:
          cpus: "3"
          memory: 6G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == worker
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@celery", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      start_period: "10m"
      retries: 5
    command: tail -f /dev/null
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      CELERY_JOB_QUEUE: celery
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources_nfs:/rodan/data"

  py3-celery:
    image: "ddmal/rodan-python3-celery:v3.1.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "2.5"
          memory: 5G
        limits:
          cpus: "2.5"
          memory: 5G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@Python3", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: tail -f /dev/null
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      CELERY_JOB_QUEUE: Python3
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources_nfs:/rodan/data"

  gpu-celery:
    image: "ddmal/rodan-gpu-celery:v3.1.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 18G
        limits:
          cpus: "1"
          memory: 18G
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@GPU", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: tail -f /dev/null
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      CELERY_JOB_QUEUE: GPU
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources_nfs:/rodan/data"

  redis:
    image: "redis:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 2G
        limits:
          cpus: "1"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == worker
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto

  postgres:
    image: "ddmal/postgres-plpython:v3.1.0"
    deploy:
      replicas: 1
      endpoint_mode: dnsrr
      resources:
        reservations:
          cpus: "1"
          memory: 2G
        limits:
          cpus: "1"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "pg_isready", "-U", "postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto
    volumes:
      - "pg_data_nfs:/var/lib/postgresql/data"
      - "pg_backup_nfs:/backups"
    env_file:
      - ./scripts/production.env

  rabbitmq:
    image: "rabbitmq:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "2"
          memory: 4G
        limits:
          cpus: "2"
          memory: 4G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == worker
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: "30s"
      timeout: "3s"
      retries: 3
    environment:
      TZ: America/Toronto
    env_file:
      - ./scripts/production.env

volumes:
  resources_nfs:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.17.244,rw,nfsvers=4,soft,timeo=600,retrans=3
      device: "192.168.17.244:/var/lib/docker/volumes/rodan_resources/_data"
  pg_backup_nfs:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.17.244,rw,nfsvers=4
      device: "192.168.17.244:/var/lib/docker/volumes/rodan_pg_backup/_data"
  pg_data_nfs:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.17.244,rw,nfsvers=4
      device: "192.168.17.244:/var/lib/docker/volumes/rodan_pg_data/_data"
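For completeness, this file is deployed as a swarm stack; the `rodan_*` container names in the commands above imply the stack is named `rodan` (an inference, not stated explicitly), so the deploy and check commands would look like:

```shell
docker stack deploy -c production.yml rodan   # create or update the stack
docker stack services rodan                   # verify replica counts
docker service logs rodan_rodan-main --tail 50
```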
