HPC integration option 1 - ssh #15

Open
16 tasks done
rlskoeser opened this issue Oct 3, 2024 · 2 comments

rlskoeser commented Oct 3, 2024

The first, simpler approach to HPC integration is to use ssh access and ssh keys, so that our app user can log in to the cluster as each user and start the slurm job on their behalf.

Note that CAS integration (included in #10) is a prerequisite for this.

implementation details

  • request access to ssh without duo from test vm to hpc machine
  • ensure access to ssh from vm to hpc (may require PUL .lib domain firewall change)
  • add a vaulted ssh key to deploy and write instructions for adding to authorized keys on hpc machine
  • write setup instructions for hpc conda env
  • write remote versions equivalent to gpu celery tasks to kick off training jobs: export needed data/model, use ssh to log in as the current user, start the slurm job
  • write htr2hpc celery replacement tasks for train and segtrain to run htr2hpc script on hpc via ssh
  • modify escriptorium to call our remote version of the task instead of running locally
  • update escriptorium task to report on status of remote slurm job
  • update working directory name to use a timestamp, to ensure uniqueness
  • delete model if training errors (confirm this is always a new model)
  • fix formatting for link to mydella when sending user notifications
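
A rough sketch of the submit step described in the list above: build a timestamped working directory name, ssh in as the current user with the deploy key, and start the slurm job. This is only an illustration under assumptions, not the htr2hpc implementation; the function names, the host, the key path, and the batch script path are all hypothetical. It relies on `sbatch` printing `Submitted batch job <id>` on success.

```python
import re
import shlex
import subprocess
from datetime import datetime, timezone


def working_dir_name(prefix, now=None):
    """Return a working directory name made unique with a UTC timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%d-%H%M%S')}"


def build_submit_command(username, host, key_path, script, workdir):
    """Build the ssh command that creates the working dir and runs sbatch in it."""
    quoted_dir = shlex.quote(workdir)
    remote_cmd = (
        f"mkdir -p {quoted_dir} && cd {quoted_dir} && sbatch {shlex.quote(script)}"
    )
    # BatchMode=yes makes ssh fail instead of prompting for a password
    return ["ssh", "-i", key_path, "-o", "BatchMode=yes", f"{username}@{host}", remote_cmd]


def parse_job_id(sbatch_output):
    """Extract the slurm job id from sbatch's success message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"could not find a job id in: {sbatch_output!r}")
    return int(match.group(1))


def submit_job(username, host, key_path, script, prefix="train"):
    """Run the remote submit command and return (job id, working directory)."""
    workdir = working_dir_name(prefix)
    cmd = build_submit_command(username, host, key_path, script, workdir)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=60)
    return parse_job_id(result.stdout), workdir
```

A celery task could then call something like `submit_job("netid", "hpc.example.edu", "/path/to/deploy_key", "/home/netid/train.sh")` (all placeholder values) and store the returned job id for later status checks.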

linear bot commented Oct 3, 2024

RSE-100 HPC integration option 1 - ssh

The first, simpler approach to HPC integration is to use ssh access and ssh keys, so that our app user can log in to the cluster as each user and start the slurm job on their behalf.

Note that CAS integration (included in #10) is a prerequisite for this.

implementation details

  • request access to ssh without duo from test vm to hpc machine
  • ensure access to ssh from vm to hpc (may require PUL .lib domain firewall change)
  • add a vaulted ssh key to deploy and write instructions for adding to authorized keys on hpc machine
  • write remote versions equivalent to gpu celery tasks to kick off training jobs: export needed data/model, use scp/rsync to transfer files and ssh to log in as the current user, start the slurm job
  • modify escriptorium to call our remote version of the task instead of running locally (think about how to make configurable but this version doesn't have to be elegant)
  • implement method to check status of remote slurm job
  • modify escriptorium task monitoring to handle remote slurm job
  • when the job completes, update the refined model back in escriptorium and report on status
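
One possible way to check the status of a remote slurm job, sketched under assumptions: run `sacct` over ssh for the job id and read back the state. The function names, the host, and the key path are hypothetical, and the set of states treated as finished is a guess at what the task monitoring would need, not a complete list of slurm states.

```python
import subprocess

# States after which the job will not run any further (assumed subset)
FINISHED_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL",
}


def build_status_command(username, host, key_path, job_id):
    """Build the ssh command that asks sacct for the state of one job."""
    # -X: allocation only (no job steps), -n: no header, -P: parsable output
    remote_cmd = f"sacct -j {int(job_id)} -X -n -P --format=State"
    return ["ssh", "-i", key_path, "-o", "BatchMode=yes", f"{username}@{host}", remote_cmd]


def parse_state(sacct_output):
    """Return the job state, or None if sacct has no record of the job yet."""
    lines = [line.strip() for line in sacct_output.splitlines() if line.strip()]
    if not lines:
        return None
    # sacct can report "CANCELLED by <uid>"; keep only the state name
    return lines[0].split()[0]


def is_finished(state):
    """True once the job has reached a state it will not leave."""
    return state in FINISHED_STATES


def fetch_state(username, host, key_path, job_id):
    """Run the remote status command and return the parsed state."""
    cmd = build_status_command(username, host, key_path, job_id)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=60)
    return parse_state(result.stdout)
```

The monitoring task could poll `fetch_state` at an interval until `is_finished` returns true, then either update the model in escriptorium or report the failure.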

@mnaydan mnaydan moved this from IceBox to To Do in Iteration Planning Board Dec 9, 2024
@rlskoeser rlskoeser moved this from To Do to In Progress in Iteration Planning Board Dec 10, 2024
@mnaydan mnaydan moved this from In Progress to Under Review in Iteration Planning Board Dec 17, 2024
@mnaydan mnaydan moved this from Under Review to In Progress in Iteration Planning Board Jan 13, 2025
@rlskoeser
Contributor Author

For the new model check (to ensure we don't delete an existing model that is being updated), I added some logging to compare the model creation time with the start time of the task; I thought we could use the time delta. Here's the output:

model was created at 2025-01-15 22:19:20.983751+00:00; task started at 2025-01-15 22:19:21.422749+00:00; delta: 0:00:00.438998
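
A minimal sketch of how that delta could be turned into a check, using the two timestamps from the log above. The function name and the five-second threshold are assumptions chosen for illustration; a safe cutoff would need to be confirmed against more task runs.

```python
from datetime import datetime, timedelta


def is_new_model(model_created, task_started, max_delta=timedelta(seconds=5)):
    """Treat the model as new if it was created close to the task start time."""
    # abs() tolerates the model being created slightly before or after the task starts
    return abs(task_started - model_created) <= max_delta


created = datetime.fromisoformat("2025-01-15 22:19:20.983751+00:00")
started = datetime.fromisoformat("2025-01-15 22:19:21.422749+00:00")

# Model created moments before the task started: safe to delete on error
new_model = is_new_model(created, started)

# Model created a day earlier: an existing model being updated, so keep it
existing_model = is_new_model(created - timedelta(days=1), started)
```

With these values `new_model` is true and `existing_model` is false, so only the freshly created model would be deleted when training errors.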

@rlskoeser rlskoeser moved this from In Progress to Under Review in Iteration Planning Board Jan 16, 2025
@rlskoeser rlskoeser moved this from Under Review to In Progress in Iteration Planning Board Jan 28, 2025
@rlskoeser rlskoeser moved this from In Progress to Under Review in Iteration Planning Board Jan 29, 2025
Status: Under Review