improved error handling for htr2hpc train script #36

Open · 9 of 12 tasks
rlskoeser commented Dec 11, 2024

Connection issues (such as if the VM goes down for an inconvenient moment) could cause the upload to fail, so the fuller version of this should have a greater range of error handling. Not a priority, but not something we want to forget.

Originally posted by @cmroughan in #35 (comment)
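
For the connection-failure case, here is a minimal retry sketch of the kind of handling meant above, assuming the transfer goes through fabric's Connection.put (the tracebacks later in this thread show paramiko underneath). The `upload_with_retry` helper and its retry policy are illustrative, not the script's actual API:

```python
import logging
import time

from fabric import Connection
from paramiko.ssh_exception import SSHException

logger = logging.getLogger(__name__)


def upload_with_retry(conn: Connection, local_path: str, remote_path: str,
                      attempts: int = 3, backoff: float = 5.0) -> None:
    """Retry a flaky SFTP upload a few times before giving up.

    conn, the paths, and the retry policy are placeholders; only the
    control flow is the point.
    """
    for attempt in range(1, attempts + 1):
        try:
            conn.put(local_path, remote=remote_path)
            return
        except (SSHException, OSError) as err:
            logger.warning("upload attempt %d/%d failed: %s", attempt, attempts, err)
            if attempt == attempts:
                raise  # let the task-level handler notify the user / clean up
            time.sleep(backoff * attempt)  # simple linear backoff
```

A plain loop with backoff avoids pulling in a retry library; the final re-raise leaves cleanup (user notification, deleting the placeholder model) to the task-level handler.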

dev todo

nice to have

less important

testing and review

  • confirm legacy UI checkbox is not available on user profile form
  • when ssh key is not set up:
    • informative message about the error
    • new model is deleted
  • when slurm job is canceled from della
    • user notification indicates task was canceled
    • task report shows task status as canceled
    • new model created for training is deleted
@rlskoeser commented:

Additional error handling we should add is documented here: #35 (comment)

@cmroughan commented:

Tested error handling for running the htr2hpc script from the eScr frontend. First case: what happens when a user tries to run a train job before setting up their SSH key.

In our logs we get adequate warnings:

```
[2025-01-17 01:25:50,242: INFO/ForkPoolWorker-2] Authentication (publickey) failed.
[2025-01-17 01:25:50,328: ERROR/ForkPoolWorker-2] Task htr2hpc.tasks.train[fb0b3e93-8f72-412f-ab59-a4fbd91957df] raised unexpected: AuthenticationException('Authentication failed.')
```

But the user receives little information in the GUI. On the Task Usage page, the task State is shown as "Crashed".

The model that was supposed to be trained, however, continues to sit on the models page with its training status shown as in progress.

(screenshot: models page still showing training in progress)

Clicking cancel here leads to an error page (the library's "something went wrong" page).

The task instead has to be cancelled from the Task Monitoring page, even though eScr thinks no task is in progress there either. Clicking cancel there does, however, clear the in-progress training indication on the models page.
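
One possible shape for handling this case: catch the AuthenticationException shown in the logs above, tell the user, and remove the placeholder model rather than leaving it stuck in progress. The `connect`/`notify_user` helpers and the model API below are placeholders, not eScriptorium's actual internals:

```python
import logging

from paramiko.ssh_exception import AuthenticationException

logger = logging.getLogger(__name__)


def notify_user(user, message):
    # Placeholder for however eScriptorium pushes task notices to the GUI.
    logger.info("notify %s: %s", user, message)


def run_remote_training(model, user, connect):
    """Fail fast and clean when the user's SSH key isn't set up.

    model, user, and connect stand in for whatever htr2hpc actually
    passes around; only the control flow is the point.
    """
    try:
        conn = connect(user)  # raises AuthenticationException without a valid key
    except AuthenticationException:
        logger.error("SSH auth failed for %s; is their public key configured?", user)
        notify_user(user, "Training could not start: SSH key authentication failed. "
                          "Please add your SSH key to your profile and try again.")
        # drop the empty placeholder instead of leaving it showing
        # 'training in progress' on the models page
        model.delete()
        return None
    # ... proceed with data prep and slurm job submission
    return conn
```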

@cmroughan commented:

The user is also provided with several avenues from which they can try to cancel the training task.

cancelling in the cluster

Cancelling a train job in the cluster's Active Jobs page prompts a "Training finished" message in eScriptorium, but nothing uploads. The model object continues to sit on the models page -- it displays no accuracy and is just the initial copy of the model that training was meant to build on.

  • Preferable behavior would have the script delete this.
  • Preferable behavior would also report something like "Training cancelled" instead of "Training finished" (see the sketch below for one way to tell the two states apart).
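
One way to tell the two states apart: before reporting, ask slurm what actually happened to the job. A sketch, assuming the job id was recorded at submission and that this runs on (or over SSH to) a host where sacct is available:

```python
import subprocess


def slurm_job_state(job_id: str) -> str:
    """Return a job's top-level slurm state, e.g. COMPLETED, CANCELLED, FAILED.

    -X restricts output to the allocation itself, -n drops the header,
    -P gives parsable output. Cancelled jobs report as 'CANCELLED by <uid>',
    hence the split().
    """
    out = subprocess.run(
        ["sacct", "-j", job_id, "-X", "-n", "-P", "-o", "State"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return out.split()[0] if out else "UNKNOWN"
```

With that in hand, the upload/report step can branch: CANCELLED would produce a "Training cancelled" message plus deletion of the untouched model object, while COMPLETED keeps the existing "Training finished" path.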

cancelling in eScriptorium

Cancelling a train job via eScr's GUI leads to entries like the following in the logs:

```
[2025-01-17 00:07:23,247: DEBUG/MainProcess] pidbox received method revoke(task_id='072adc81-7263-4c61-a9c3-df7dcdf33f1e', terminate=True, signal='SIGTERM') [reply_to:None ticket:None]
[2025-01-17 00:07:23,248: INFO/MainProcess] Terminating 072adc81-7263-4c61-a9c3-df7dcdf33f1e (15)
```

The job, however, continues to run on the cluster. When the slurm job completes, the resulting model gets passed back to eScriptorium. Everything gets cleaned up on the cluster correctly.

  • Preferable behavior would pass the cancellation forward to the cluster (one sketch below).
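
The forwarding itself is cheap once a hook exists (finding that hook is the hard part, per the discussion below). A sketch, assuming an open fabric Connection and a slurm job id recorded at submission time:

```python
from fabric import Connection


def cancel_remote_job(conn: Connection, job_id: str) -> None:
    """Forward an eScriptorium-side cancellation to the cluster.

    warn=True keeps this from raising if the job already finished
    or was cancelled on the cluster side first.
    """
    conn.run(f"scancel {job_id}", warn=True)
```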

@cmroughan commented:

Copying error-handling-related comment from #41:

🟡 If slurm's train job fails to produce a _best.mlmodel -- whether due to lack of convergence, the slurm job timing out, or the slurm job hitting an OOM error -- no model file is uploaded back into eScr.

(This is expected behavior for the code at this stage, but especially for the latter two cases I think it would be preferable for the script to locate the highest-accuracy model so far, so long as it is more accurate than the first model and so represents some improvement, and select that to upload back into eScr.)

The (failed) completion of the slurm job and the end of the script still leads to a "Training finished!" message appearing in eScr, which is confusing for the user when they check the model page and find no model uploaded, or they find what looks like a model but with no accuracy value (and no associated model file, as in the screenshot).
(screenshot: model object with no accuracy value and no associated model file)

It would be preferable to have a more informative message if no model is uploaded. The relevant empty model object should also be deleted in such cases.
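
A sketch of the fallback-selection idea, assuming the per-epoch .mlmodel checkpoints are still in the working directory and that validation accuracy can be read from kraken's model metadata; the 'accuracy' key and its shape vary by kraken version, so treat that lookup as an assumption to verify:

```python
from pathlib import Path

from kraken.lib import vgsl


def checkpoint_accuracy(path: Path) -> float:
    """Best-effort read of a checkpoint's validation accuracy from its
    metadata; returns 0.0 if the checkpoint doesn't record any."""
    nn = vgsl.TorchVGSLModel.load_model(str(path))
    acc = nn.user_metadata.get("accuracy") or []  # typically (step, acc) pairs
    return max((a for _, a in acc), default=0.0)


def best_usable_checkpoint(workdir: Path, baseline: float) -> Path | None:
    """Pick the highest-accuracy per-epoch checkpoint that improves on the
    starting model; None means nothing improved, so nothing should upload
    (and the empty model object should be deleted, per the above)."""
    candidates = [
        (checkpoint_accuracy(p), p)
        for p in workdir.glob("*.mlmodel")
        if not p.name.endswith("_best.mlmodel")  # _best is what failed to appear
    ]
    improved = [(a, p) for a, p in candidates if a > baseline]
    return max(improved, key=lambda pair: pair[0])[1] if improved else None
```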

rlskoeser moved this from To Do to In Progress in the Iteration Planning Board, Jan 28, 2025
@rlskoeser commented:

@cmroughan I've added code to handle the slurm task being cancelled from the mydella web UI, but I don't think there's any easy way to handle the task being canceled from the eScriptorium web UI. The eScriptorium code calls a celery task revoke method, which terminates the task; I did some searching and don't see any evidence of a signal we can catch to terminate gracefully. There's a task method handler, after_return, where we could maybe add logic for this case, but that doesn't seem to be compatible with the way shared tasks are defined for celery in django (it requires a class-based task).

I'm also not sure what kind of handling or error messaging you want when della can't be reached, e.g. during maintenance. We could temporarily point the hpc remote server setting at a different hostname and see what happens, but I'm hoping the connection error handling and improved logging I've added are sufficient for this case.
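
For reference, the class-based shape that after_return needs can in principle be wired in through shared_task's base= argument; a sketch, with the big caveat that after_return may never run for a terminate=True revoke, since the pool process is SIGTERM-killed before celery records a return, which is exactly why this path looks like a dead end:

```python
from celery import Task, shared_task


class TrainTask(Task):
    """Class-based task so after_return can attempt cleanup on revoke."""

    def after_return(self, status, retval, task_id, args, kwargs, einfo):
        if status == "REVOKED":
            # e.g. scancel the recorded slurm job and delete the placeholder
            # model -- but note this hook is skipped if the worker process
            # was killed before celery could record the task's return.
            ...


@shared_task(bind=True, base=TrainTask)
def train(self, *args, **kwargs):
    ...  # existing task body
```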
