Revise training script to update model #41

rlskoeser · 2025-01-14T21:56:10Z

work related to #39 #34 and closing out #15

- add update option to htr2hpc-train script
  - set accuracy and file-size when uploading model
- finish implementing train celery task / htr2hpc integration
- delete model on training error (unless pre-existing / overwrite)

work towards #39

Co-authored-by: cmroughan <[email protected]>

cmroughan · 2025-02-03T14:16:32Z

There's a bug causing the script to crash with the error "'NoneType' object has no attribute 'file'". It occurs after the run.py portion. I think it might be happening in the part of tasks.py that refreshes model data from the database, though I have not had time to fully isolate it.

Cases where it's occurring:

- segmentation/transcription jobs training from scratch
- segmentation/transcription jobs refining on a model with "overwrite" checked

It does not occur on seg/trans jobs refining on a model but without "overwrite" checked. It also does not occur if an earlier step of the script returns "No best model found".

Update:

Found the bug in the logs; it is indeed happening in tasks.py, around where I thought:

File "/srv/www/escriptorium/app/env/lib/python3.11/site-packages/htr2hpc/tasks.py", line 383, in train
    if model.file is None or model.file == model.parent.file:
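
A minimal sketch of a guard for that check, assuming the crash comes from model.parent being None (that would fit jobs trained from scratch or run with "overwrite" checked, though the traceback alone does not confirm it). FakeModel and file_differs_from_parent are hypothetical names used only for illustration; they are not code from tasks.py:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeModel:
    """Hypothetical stand-in for the model record; only the two fields the check reads."""

    file: Optional[str] = None
    parent: Optional["FakeModel"] = None


def file_differs_from_parent(model: Optional[FakeModel]) -> bool:
    """Return True when the model has a file that is not simply its parent's file.

    Every attribute access is guarded, so a missing model or a missing
    parent can no longer raise AttributeError.
    """
    if model is None or model.file is None:
        return False
    if model.parent is None:
        # Assumption: a model without a parent cannot share a parent's file.
        return True
    return model.file != model.parent.file
```

Note that for a parentless model this only tells us a file exists, not that training replaced it, so the "overwrite" case may still need a separate signal from the script.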

cmroughan · 2025-02-03T14:17:54Z

I would request adjusting tasks.py and run.py to pass along whether the "overwrite" checkbox is ticked. That information should also be passed into upload_best_model() and get_best_model(). The check behind f"Must be better than original model {original_model.name} accuracy {best_accuracy:0.3f}" etc. should only run if "overwrite" is checked. If it is not checked, the new model can be uploaded into eScriptorium even if its accuracy is lower than that of the original model we're refining on (because it won't replace it). If it is checked and the new model is not better than the original, then the new model should not be uploaded (which is the current behavior).

The reasoning is that someone could select some arbitrary model as a foundation for refining and still want the resulting model even if its accuracy is lower than the foundation model's. For example, if I take an existing Latin model and use it to jumpstart training a Greek model, I don't care if the output Greek model is 96% to the original Latin's 98%, because it's a wholly new model for me. But if "overwrite" is checked, then I am presumably training a Latin model on top of the old Latin model, and in that case I do want to make sure I'm not deleting an older but better model by overwriting it.
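
The rule I'm proposing could be sketched roughly like this. should_upload and its parameters are hypothetical names, not the actual signatures of upload_best_model() or get_best_model(), and accuracy is assumed to be the raw kraken value (a fraction, not a percentage):

```python
from typing import Optional


def should_upload(
    new_accuracy: float,
    original_accuracy: Optional[float],
    overwrite: bool,
) -> bool:
    """Decide whether a newly trained model should be uploaded.

    Hypothetical helper: the accuracy comparison only applies when the
    new model would replace the original one.
    """
    if not overwrite:
        # A separate model is created, so nothing is lost by uploading it.
        return True
    if original_accuracy is None:
        # Assumption: with no recorded accuracy there is nothing to compare against.
        return True
    # Overwriting: only replace the original if the new model is better.
    return new_accuracy > original_accuracy
```

With the Latin/Greek example, should_upload(0.96, 0.98, overwrite=False) returns True, while the same accuracies with overwrite=True return False.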

rlskoeser added 8 commits January 14, 2025 10:13

Preliminary work to support updating eScriptorium model with best result

3efa78f

work towards #39

Add error handling and reporting for initial API connection

186d0ca

Improve model cleanup script

35da2fb

Handle model id with no file to download

42878a6

Update existing model with best result; include accuracy

fde5793

Fix garbled help string for update param

88f8c2a

Add newly exposed training_accuracy field to ocrmodel api result

80657aa

Add timestamp to working directory; use htr2hpc-train update flag

81840b2

rlskoeser changed the base branch from main to develop January 14, 2025 21:56

rlskoeser added 4 commits January 15, 2025 11:38

Drop --model-name parameter when calling htr2hpc-train script

27e97db

Correct output dir to rely on transcription id instead of doc id

03ab3e5

Update train task to trigger run htr2hpc-train command

86d6375

Get user model from user pk for train task

4f5e05c

rlskoeser force-pushed the feature/script-update-model branch 2 times, most recently from ddb581b to 129f94b Compare January 15, 2025 17:20

Set working_dir for train task

103b36d

rlskoeser force-pushed the feature/script-update-model branch from 129f94b to 103b36d Compare January 15, 2025 17:23

rlskoeser and others added 11 commits January 15, 2025 14:32

Run ketos train with -q early option so we get a best segmentation model

940bd81

Co-authored-by: cmroughan <[email protected]>

Preliminary check for error handling on newly created model

bf5eebd

Report kraken accuracy value as-is (don't convert to %)

51891ee

Adjust error handling and updating model training status

4732836

Don't try to output model creation date before we've loaded the model

a3d0b86

Use timezone-aware timestamp for task start time

9726991

Disable progressbar on recognition training task

5d99700

Don't overwrite model updated via api at end of recognition train task

fb23270

Report on task group creation time and difference with model

0bfb05c

Delete model on training error unless it is a pre-existing model

640caa8

Revise check for new/pre-existing model in error handling

ea3e152

rlskoeser force-pushed the feature/script-update-model branch from 5817c6b to ea3e152 Compare January 16, 2025 17:36

rlskoeser requested a review from cmroughan January 16, 2025 17:59

rlskoeser added 3 commits January 28, 2025 14:33

Catch authentication exception

0bd1008

Catch and report generic exception for now

5d85e74

Adjust error reporting and add details to task report

5593f96

rlskoeser force-pushed the feature/script-update-model branch from 6640513 to 5593f96 Compare January 28, 2025 19:33

rlskoeser added 5 commits January 28, 2025 14:46

Handle slurm job cancellation

62c6939

Add script output to task report messages

2f38a33

Log train command on report; set report cancellation status

cfe7291

Move cancel logic to unexpected exit error handling

fdf8b0a

Revise cancellation handling

bf131ab

rlskoeser force-pushed the feature/script-update-model branch from 2fd9719 to bf131ab Compare January 28, 2025 20:23

rlskoeser added 7 commits January 28, 2025 15:29

Catch cancellation error separately

68c762c

Nicer formatting in task report

f0664ba

Handle case where training finishes but no new model is uploaded

d47be9b

Preliminary logic for task report messages

6c7a0fd

Try new --task-report option on script

816fc19

Refresh task report from db after script runs before updating messages

98933d4

Add reporting on best model identification logic

5f6ea94

rlskoeser force-pushed the feature/script-update-model branch from 3b95c4c to 5f6ea94 Compare January 29, 2025 18:36

Change to original working directory before starting slurm monitor

67b105b

rlskoeser force-pushed the feature/script-update-model branch from 6c6b79b to 67b105b Compare January 29, 2025 20:09

rlskoeser added 2 commits January 29, 2025 15:35

Update recognition training task to set model as training

86adbc9

Fix document.pk variable references

cea5829

rlskoeser force-pushed the feature/script-update-model branch from c2842b6 to cea5829 Compare January 29, 2025 20:52

adding slurm output to segtrain msgs

274ab53

adding back in clean-up on error, slurm cancellation

f3d1765

rlskoeser mentioned this pull request Feb 3, 2025

modify training script to update the model it starts with #39

Closed

13 tasks

rlskoeser merged commit 4208473 into develop Feb 3, 2025

rlskoeser deleted the feature/script-update-model branch February 3, 2025 18:14
