improved error handling for htr2hpc train script #36
Additional error handling we should add is documented here: #35 (comment)
I tested error handling for running the htr2hpc script from the eScriptorium (eScr) frontend, starting with what happens when a user tries to run a training job before setting their SSH key. In our logs we get adequate warnings:
The user, however, receives little information in the GUI. On the Task Usage page the task State is shown as "Crashed", but the model that was supposed to be trained continues to sit on the models page with its training status shown as in progress. Clicking cancel there leads to an error page (the library's "something went wrong" page). The task has to be cancelled on the Task Monitoring page instead, even though eScr reports no task in progress there either. Clicking cancel on that page does clear the in-progress training indicator on the models page.
The user is also provided with several avenues from which they can try to cancel the training task.
Cancelling in the cluster
Cancelling a train job in the cluster's Active Jobs page prompts a "Training finished" message in eScriptorium, but nothing uploads. The model object continues to sit on the models page; it displays no accuracy and is just the initial copy of the model that was to be trained on top of.
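One way to avoid reporting "Training finished" for a cancelled job is to translate the final Slurm job state into an explicit outcome before notifying the user. The following is a minimal sketch, not the htr2hpc implementation: the function name and the message wording are hypothetical, and it assumes the job state string has already been fetched from the cluster (for example from `sacct`, which can report values such as `CANCELLED by <uid>`).

```python
# Hypothetical helper; names and messages are illustrative, not from htr2hpc.

# Standard Slurm terminal states that mean no trained model will be produced.
FAILURE_MESSAGES = {
    "CANCELLED": "Training was cancelled on the cluster; no model was uploaded.",
    "FAILED": "Training failed on the cluster; no model was uploaded.",
    "TIMEOUT": "Training exceeded its time limit; no model was uploaded.",
    "OUT_OF_MEMORY": "Training ran out of memory; no model was uploaded.",
    "NODE_FAIL": "Training stopped because a cluster node failed.",
}


def describe_job_outcome(state: str) -> tuple[bool, str]:
    """Return (succeeded, user-facing message) for a final Slurm job state."""
    # sacct may report "CANCELLED by <uid>", so only the first word matters;
    # a trailing "+" marks a value truncated to the column width.
    words = state.strip().upper().split()
    base = words[0].rstrip("+") if words else ""
    if base == "COMPLETED":
        return True, "Training finished."
    if base in FAILURE_MESSAGES:
        return False, FAILURE_MESSAGES[base]
    # Anything unrecognised is treated as a failure rather than a success.
    return False, f"Training ended in unexpected state {state!r}."
```

With a check like this, only `COMPLETED` would lead to the success notification, and every other state would produce a message saying that nothing was uploaded.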
Cancelling in eScriptorium
Cancelling a train job via eScr's GUI leads to entries like the following in the logs:
The job, however, continues to run on the cluster. When the slurm job completes, the resulting model gets passed back to eScriptorium. Everything gets cleaned up on the cluster correctly.
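If cancellation from eScriptorium were ever propagated to the cluster, the remote side would need to run Slurm's `scancel` with the job ID. The sketch below only builds the command; it assumes the Slurm job ID was recorded when the job was submitted and that the cluster is reached over plain `ssh`. The function names, and the choice of `ssh` via an argument list, are assumptions for illustration and not how htr2hpc currently runs remote commands.

```python
import shlex


def build_scancel_command(job_id: int) -> list[str]:
    """Build the Slurm command that cancels a single job."""
    if job_id <= 0:
        raise ValueError(f"invalid Slurm job id: {job_id}")
    return ["scancel", str(job_id)]


def build_remote_scancel(job_id: int, user: str, host: str) -> list[str]:
    """Wrap the scancel command in an ssh invocation (hypothetical transport)."""
    # shlex.join quotes the remote command so it is passed to ssh as one argument.
    remote_command = shlex.join(build_scancel_command(job_id))
    return ["ssh", f"{user}@{host}", remote_command]
```

The resulting list could be passed to `subprocess.run(cmd, check=True, timeout=30)`. Where such a call would be triggered from is the open question, since the cancel action in the eScriptorium web UI is handled by eScriptorium's own code.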
Copying error-handling-related comment from #41:
@cmroughan I've added code to handle the slurm task being cancelled from the mydella web UI, but I don't think there's any easy way to handle the task being cancelled from the eScriptorium web UI. The eScriptorium code calls a celery task. I'm also not sure what kind of handling or error messaging you want when della can't be reached, e.g. during maintenance. We could temporarily configure the hpc remote server to a different hostname and see what happens, but I'm hoping that the connection error handling and improved logging I've added is sufficient for this case.
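For the case where della can't be reached, a cheap pre-flight check before submitting a job could turn a low-level connection error into a readable message. This is a sketch under assumptions, not the connection handling already added: the function name, the default SSH port 22, the timeout, and the message wording are all hypothetical, and the hostname is whatever the hpc remote server setting is configured to.

```python
import socket


def check_host_reachable(
    host: str, port: int = 22, timeout: float = 5.0
) -> tuple[bool, str]:
    """Try to open a TCP connection and return (reachable, message)."""
    try:
        # create_connection resolves the hostname and connects with a timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"{host}:{port} is reachable."
    except socket.timeout:
        # The host did not answer in time, e.g. during a maintenance window.
        return False, f"Timed out connecting to {host}:{port}; it may be down for maintenance."
    except OSError as err:
        # Covers refused connections, DNS failures, and unreachable networks.
        return False, f"Could not connect to {host}:{port}: {err}"
```

Pointing the check at a hostname that does not resolve, as suggested above for testing, would take the `OSError` branch and return a message that can be logged or shown to the user.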
Originally posted by @cmroughan in #35 (comment)
dev todo
test and handle what happens when della is down for monthly maintenance
nice to have
less important
testing and review