improved error handling for htr2hpc train script #36

Open · 9 of 12 tasks
rlskoeser commented Dec 11, 2024

Connection issues (such as if the VM goes down for an inconvenient moment) could cause the upload to fail, so the fuller version of this should have a greater range of error handling. Not a priority, but not something we want to forget.

Originally posted by @cmroughan in #35 (comment)
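
For the connection-failure case, here is a minimal retry sketch of the kind of handling meant above, assuming the transfer goes through fabric's Connection.put (the tracebacks later in this thread show paramiko underneath). The `upload_with_retry` helper and its retry policy are illustrative, not the script's actual API:

```python
import logging
import time

from fabric import Connection
from paramiko.ssh_exception import SSHException

logger = logging.getLogger(__name__)


def upload_with_retry(conn: Connection, local_path: str, remote_path: str,
                      attempts: int = 3, backoff: float = 5.0) -> None:
    """Retry a flaky SFTP upload a few times before giving up.

    conn, the paths, and the retry policy are placeholders; only the
    control flow is the point.
    """
    for attempt in range(1, attempts + 1):
        try:
            conn.put(local_path, remote=remote_path)
            return
        except (SSHException, OSError) as err:
            logger.warning("upload attempt %d/%d failed: %s", attempt, attempts, err)
            if attempt == attempts:
                raise  # let the task-level handler notify the user / clean up
            time.sleep(backoff * attempt)  # simple linear backoff
```

A plain loop with backoff avoids pulling in a retry library; the final re-raise leaves cleanup (user notification, deleting the placeholder model) to the task-level handler.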

dev todo

nice to have

less important

testing and review

  • confirm legacy UI checkbox is not available on user profile form
  • when ssh key is not set up:
    • informative message about the error
    • new model is deleted
  • when slurm job is canceled from della
    • user notification indicates task was canceled
    • task report shows task status as canceled
    • new model created for training is deleted
@rlskoeser commented:

Additional error handling we should add is documented here: #35 (comment)

@cmroughan commented:

Tested error handling for running the htr2hpc script from the eScr frontend. First case: what happens when a user tries to run a train job before setting up their SSH key.

In our logs we get adequate warnings:

```
[2025-01-17 01:25:50,242: INFO/ForkPoolWorker-2] Authentication (publickey) failed.
[2025-01-17 01:25:50,328: ERROR/ForkPoolWorker-2] Task htr2hpc.tasks.train[fb0b3e93-8f72-412f-ab59-a4fbd91957df] raised unexpected: AuthenticationException('Authentication failed.')
```

But the user receives little information in the GUI. On the Task Usage page, the task State is shown as "Crashed".

The model that was supposed to be trained, however, continues to sit on the models page with its training status shown as in progress.

(screenshot: models page still showing training in progress)

Clicking cancel here leads to an error page (the library's "something went wrong" page).

The task instead has to be cancelled from the Task Monitoring page, even though eScr thinks no task is in progress there either. Clicking cancel there does, however, clear the in-progress training indication on the models page.
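
One possible shape for handling this case: catch the AuthenticationException shown in the logs above, tell the user, and remove the placeholder model rather than leaving it stuck in progress. The `connect`/`notify_user` helpers and the model API below are placeholders, not eScriptorium's actual internals:

```python
import logging

from paramiko.ssh_exception import AuthenticationException

logger = logging.getLogger(__name__)


def notify_user(user, message):
    # Placeholder for however eScriptorium pushes task notices to the GUI.
    logger.info("notify %s: %s", user, message)


def run_remote_training(model, user, connect):
    """Fail fast and clean when the user's SSH key isn't set up.

    model, user, and connect stand in for whatever htr2hpc actually
    passes around; only the control flow is the point.
    """
    try:
        conn = connect(user)  # raises AuthenticationException without a valid key
    except AuthenticationException:
        logger.error("SSH auth failed for %s; is their public key configured?", user)
        notify_user(user, "Training could not start: SSH key authentication failed. "
                          "Please add your SSH key to your profile and try again.")
        # drop the empty placeholder instead of leaving it showing
        # 'training in progress' on the models page
        model.delete()
        return None
    # ... proceed with data prep and slurm job submission
    return conn
```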

@cmroughan commented:

The user is also provided with several avenues from which they can try to cancel the training task.

cancelling in the cluster

Cancelling a train job in the cluster's Active Jobs page prompts a "Training finished" message in eScriptorium, but nothing uploads. The model object continues to sit on the models page -- it displays no accuracy and is just the initial copy of the model that training was meant to build on.

  • Preferable behavior would have the script delete this.
  • Preferable behavior would also report something like "Training cancelled" instead of "Training finished" (see the sketch below for one way to tell the two states apart).
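
One way to tell the two states apart: before reporting, ask slurm what actually happened to the job. A sketch, assuming the job id was recorded at submission and that this runs on (or over SSH to) a host where sacct is available:

```python
import subprocess


def slurm_job_state(job_id: str) -> str:
    """Return a job's top-level slurm state, e.g. COMPLETED, CANCELLED, FAILED.

    -X restricts output to the allocation itself, -n drops the header,
    -P gives parsable output. Cancelled jobs report as 'CANCELLED by <uid>',
    hence the split().
    """
    out = subprocess.run(
        ["sacct", "-j", job_id, "-X", "-n", "-P", "-o", "State"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return out.split()[0] if out else "UNKNOWN"
```

With that in hand, the upload/report step can branch: CANCELLED would produce a "Training cancelled" message plus deletion of the untouched model object, while COMPLETED keeps the existing "Training finished" path.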

cancelling in eScriptorium

Cancelling a train job via eScr's GUI leads to entries like the following in the logs:

```
[2025-01-17 00:07:23,247: DEBUG/MainProcess] pidbox received method revoke(task_id='072adc81-7263-4c61-a9c3-df7dcdf33f1e', terminate=True, signal='SIGTERM') [reply_to:None ticket:None]
[2025-01-17 00:07:23,248: INFO/MainProcess] Terminating 072adc81-7263-4c61-a9c3-df7dcdf33f1e (15)
```

The job, however, continues to run on the cluster. When the slurm job completes, the resulting model gets passed back to eScriptorium. Everything gets cleaned up on the cluster correctly.

  • Preferable behavior would pass the cancellation forward to the cluster (one sketch below).
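
The forwarding itself is cheap once a hook exists (finding that hook is the hard part, per the discussion below). A sketch, assuming an open fabric Connection and a slurm job id recorded at submission time:

```python
from fabric import Connection


def cancel_remote_job(conn: Connection, job_id: str) -> None:
    """Forward an eScriptorium-side cancellation to the cluster.

    warn=True keeps this from raising if the job already finished
    or was cancelled on the cluster side first.
    """
    conn.run(f"scancel {job_id}", warn=True)
```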

@cmroughan commented:

Copying error-handling-related comment from #41:

🟡 If slurm's train job fails to produce a _best.mlmodel -- whether due to lack of convergence, the slurm job timing out, or the slurm job hitting an OOM error -- no model file is uploaded back into eScr.

(This is expected behavior for the code at this stage, but especially for the latter two cases I think it would be preferable for the script to locate the highest-accuracy model so far, so long as it is more accurate than the first model and so represents some improvement, and select that to upload back into eScr.)

The (failed) completion of the slurm job and the end of the script still leads to a "Training finished!" message appearing in eScr, which is confusing for the user when they check the model page and find no model uploaded, or they find what looks like a model but with no accuracy value (and no associated model file, as in the screenshot).
(screenshot: model object with no accuracy value and no associated model file)

It would be preferable to have a more informative message if no model is uploaded. The relevant empty model object should also be deleted in such cases.
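
A sketch of the fallback-selection idea, assuming the per-epoch .mlmodel checkpoints are still in the working directory and that validation accuracy can be read from kraken's model metadata; the 'accuracy' key and its shape vary by kraken version, so treat that lookup as an assumption to verify:

```python
from pathlib import Path

from kraken.lib import vgsl


def checkpoint_accuracy(path: Path) -> float:
    """Best-effort read of a checkpoint's validation accuracy from its
    metadata; returns 0.0 if the checkpoint doesn't record any."""
    nn = vgsl.TorchVGSLModel.load_model(str(path))
    acc = nn.user_metadata.get("accuracy") or []  # typically (step, acc) pairs
    return max((a for _, a in acc), default=0.0)


def best_usable_checkpoint(workdir: Path, baseline: float) -> Path | None:
    """Pick the highest-accuracy per-epoch checkpoint that improves on the
    starting model; None means nothing improved, so nothing should upload
    (and the empty model object should be deleted, per the above)."""
    candidates = [
        (checkpoint_accuracy(p), p)
        for p in workdir.glob("*.mlmodel")
        if not p.name.endswith("_best.mlmodel")  # _best is what failed to appear
    ]
    improved = [(a, p) for a, p in candidates if a > baseline]
    return max(improved, key=lambda pair: pair[0])[1] if improved else None
```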

rlskoeser moved this from To Do to In Progress in the Iteration Planning Board, Jan 28, 2025
@rlskoeser commented:

@cmroughan I've added code to handle the slurm task being cancelled from the mydella web UI, but I don't think there's any easy way to handle the task being canceled from the eScriptorium web UI. The eScriptorium code calls a celery task revoke method, which terminates the task; I did some searching and don't see any evidence of a signal we can catch to terminate gracefully. There's a task method handler, after_return, where we could maybe add logic for this case, but that doesn't seem to be compatible with the way shared tasks are defined for celery in django (it requires a class-based task).

I'm also not sure what kind of handling or error messaging you want when della can't be reached, e.g. during maintenance. We could temporarily point the hpc remote server setting at a different hostname and see what happens, but I'm hoping the connection error handling and improved logging I've added are sufficient for this case.
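
For reference, the class-based shape that after_return needs can in principle be wired in through shared_task's base= argument; a sketch, with the big caveat that after_return may never run for a terminate=True revoke, since the pool process is SIGTERM-killed before celery records a return, which is exactly why this path looks like a dead end:

```python
from celery import Task, shared_task


class TrainTask(Task):
    """Class-based task so after_return can attempt cleanup on revoke."""

    def after_return(self, status, retval, task_id, args, kwargs, einfo):
        if status == "REVOKED":
            # e.g. scancel the recorded slurm job and delete the placeholder
            # model -- but note this hook is skipped if the worker process
            # was killed before celery could record the task's return.
            ...


@shared_task(bind=True, base=TrainTask)
def train(self, *args, **kwargs):
    ...  # existing task body
```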
