modify training script to update the model it starts with #39

Closed
13 tasks done
mnaydan opened this issue Dec 13, 2024 · 3 comments

@mnaydan

mnaydan commented Dec 13, 2024

  • add update flag to script; when specified, the best model is pushed to escriptorium to update that model id
  • when training a new model, the model file is empty; check for that and use no model behavior
  • ~ figure out how to identify the best model generated from segmentation training and use that when update flag is !specified; follow whatever logic current escriptorium uses~
  • use -q early flag to get best segmentation model
  • if no best model, find best by accuracy score and upload if improved on original model file
  • fix bug on checking for model file (need to check if model has a parent before comparing parent file; see Revise training script to update model #41 (comment))
  • add option to control updating model if improved on original (see Revise training script to update model #41 (comment))
  • set new option flag based on overwrite flag in escriptorium form
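
The best-model selection described in the tasks above could look roughly like the following. This is a minimal sketch, not the project's actual script: the function name `pick_best_model` and the `accuracies` mapping are hypothetical, and how accuracy scores are read from the trained models is deliberately left out. Only the `_best.mlmodel` naming and the fallback to accuracy score come from this issue.

```python
from pathlib import Path
from typing import Mapping, Optional


def pick_best_model(
    model_dir: Path, accuracies: Mapping[str, float]
) -> Optional[Path]:
    """Return the best model file in model_dir, or None if there is none.

    `accuracies` maps a model file name to its accuracy score. How those
    scores are obtained is an assumption left outside this sketch.
    """
    candidates = sorted(model_dir.glob("*.mlmodel"))

    # Prefer a model that the training run explicitly marked as best.
    for path in candidates:
        if path.stem.endswith("_best"):
            return path

    # No best model by name: fall back to the highest known accuracy.
    scored = [p for p in candidates if p.name in accuracies]
    if not scored:
        return None
    return max(scored, key=lambda p: accuracies[p.name])
```

Returning `None` when nothing can be ranked leaves the caller free to decide whether to skip the upload or clean up.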

testing and review (round 2)

  • test behavior when training exceeds slurm allotted time
    • if no best model by name, should look based on accuracy score
    • should only upload if improves on original model
    • should delete placeholder model record in eScriptorium if training does not improve
  • confirm nonetype file error is resolved when checking for no file on model without parent model
  • confirm script uses --update-if-improved when using overwrite option in escriptorium and otherwise --update
  • confirm that update / update-if-improved logic behaves correctly
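
As a way to reason about the update / update-if-improved checks above, here is a hedged sketch of the decision logic. The flag names `--update` and `--update-if-improved` come from this issue; everything else (the function names, the returned action strings, treating the two flags as mutually exclusive, and tying placeholder deletion to the update-if-improved case) is an assumption made for illustration, not the real script.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the two update options (assumed mutually exclusive)."""
    parser = argparse.ArgumentParser(description="sketch of the update options")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--update", action="store_true")
    group.add_argument("--update-if-improved", action="store_true")
    return parser


def decide_action(
    update: bool,
    update_if_improved: bool,
    new_accuracy: float,
    original_accuracy: Optional[float],
) -> str:
    """Return a hypothetical action name for the trained model."""
    if update:
        # Plain update: always push the best model to the existing model id.
        return "upload"
    if update_if_improved:
        # A model without a parent has no original score to compare against.
        if original_accuracy is None or new_accuracy > original_accuracy:
            return "upload"
        # No improvement: drop the placeholder record instead of uploading.
        return "delete_placeholder"
    return "no_update"
```

For example, `decide_action(False, True, 0.90, 0.95)` yields `"delete_placeholder"`, while the same scores with `update=True` yield `"upload"`.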
@cmroughan
Collaborator

cmroughan commented Jan 17, 2025

Tested various train tasks from the GUI. Assuming slurm's train job completes successfully and produces a _best.mlmodel, the results:

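| Test | Success? | Job | scratch / refine | overwrite? | which GUI |
|------|----------|---------------|------------------|------------|-----------|
| 1 | | transcription | refining | no | old |
| 2 | | segmentation | refining | no | old |
| 3 | | transcription | refining | yes | old |
| 4 | | segmentation | refining | yes | old |
| 5 | ⚠️ | transcription | refining | no | new |
| 6 | ⚠️ | segmentation | refining | no | new |
| 7 | ⚠️ | transcription | refining | yes | new |
| 8 | ⚠️ | segmentation | refining | yes | new |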

Tests 5-8 (submitting the refining train job using the new GUI) have a major bug in that the model that is selected to be refined on is somehow deleted from eScr's files. The model object in eScr will still exist and will display as if it is present, but the model file will be gone: any attempt to use that model or download it will lead to a FileNotFoundError such as:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/nfs/cdh/htr/media/models/c26deef8/greek_print_math-11.mlmodel'

I have been able to replicate this with both transcription and segmentation train tasks, regardless of whether the "Overwrite" checkbox is clicked. It only happens in the new GUI (see the toggle to switch between the two here). The deletion occurs at the end of the script's runtime, after the slurm job has completed and the script sends the newly trained model back into eScr -- before that, it still exists in the filesystem.

The deletions are also occurring regardless of whether the user is the owner of the model or not -- I encountered disappearing models with ones that this account had only User permissions for.
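
One way to spot affected models is to compare model records against the filesystem. The sketch below is only an illustration: `find_missing_model_files` and the `records` mapping (record name to file path, or `None` when no file is set) are hypothetical and do not use any eScriptorium API.

```python
from pathlib import Path
from typing import List, Mapping, Optional


def find_missing_model_files(records: Mapping[str, Optional[str]]) -> List[str]:
    """Return names of model records whose file is unset or not on disk."""
    missing = []
    for name, file_path in records.items():
        # A record may have no file at all, so guard against None
        # before touching the filesystem.
        if file_path is None or not Path(file_path).is_file():
            missing.append(name)
    return missing
```

Running a check like this before and after a train task would show whether the input model's file disappeared during that run.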

@cmroughan
Collaborator

Additionally, "Overwrite" behavior is functioning incorrectly in the new GUI. Running a train job with "Overwrite" in the new GUI will produce a new model object that is being trained on, instead of the training happening on top of the input model. See attached screenshot -- "override-seg-newGUI" should not be a new item on this list, the training in progress icon should be appearing on "bnseg_complex2" instead. In the old GUI, this works as it should.

I wouldn't have expected the new GUI to impact the underlying submitted task, so I am not sure why this and the above are happening -- I still need to try to find the relevant eScr code.

Image

@rlskoeser
Contributor

Closing based on testing and refinement from @cmroughan

@github-project-automation github-project-automation bot moved this from Under Review to Done in Iteration Planning Board Feb 5, 2025