Update htr2hpc-train to support recognition training #35
Conversation
@@ -19,7 +20,7 @@
def get_segmentation_data(
Since this is covering acquiring training data for both segmentation and transcription tasks, the name `get_segmentation_data` might be confusing -- that and the name of the variable `segmentation_data` used later could perhaps be tidied up to make it clearer that it is general `training_data` or something. But not something to worry about until the code is more finalized.
agreed - I wasn't certain if I would need separate code for the two modes, but there is quite a bit of overlap. I like the suggestion of renaming to something like `get_training_data`.
Ah, I already have a `get_training_data` method - and this does return a list of kraken `Segmentation` objects. I'm going to leave it for now; maybe if/when we refactor we can figure out better names.
src/htr2hpc/train/data.py
Outdated
build_binary_dataset(
    segmentations,
    output_file=str(output_dir / "train.arrow"),
    num_workers=4,
4 probably works fine for `num_workers` for now, but we might revisit this. I suppose this part of the code is being run on Della but not as the submitted slurm job, so we're working with what resources are available there.
Yes, this is running on della outside of slurm. We can probably use more than 4, but I wasn't sure if we wanted to use the same number as we use for the slurm job (which is configurable via command line args).
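A minimal sketch of threading a caller-supplied worker count through to kraken's `build_binary_dataset` instead of hard-coding 4; the `compile_training_data` name and the `workers` parameter are hypothetical, not the committed code:

```python
from pathlib import Path

from kraken.lib.arrow_dataset import build_binary_dataset


def compile_training_data(segmentations, output_dir: Path, workers: int = 4):
    """Compile kraken Segmentation objects into a binary arrow dataset.

    The worker count comes from the caller (e.g. the same value passed on
    the command line for the slurm job) instead of being hard-coded.
    """
    build_binary_dataset(
        segmentations,
        output_file=str(output_dir / "train.arrow"),
        num_workers=workers,
    )
```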
Co-authored-by: cmroughan <[email protected]>
src/htr2hpc/train/run.py
Outdated
# this is a list of string, relative path, but file does not actually exist
"Recognize": DEFAULT_MODEL[0],
@cmroughan have you run into this or do you have any advice about this? The kraken code has a defined variable for a default model for recognition, but it is just a name, not a path and doesn't actually seem to exist anywhere in my local installation. Do we need to require a model for fine-tuning?
I have never interacted with a default model for transcription (unlike the `blla.mlmodel` for segmentation). When doing transcription training, we will either be training the model from scratch or we will be finetuning a different model that the user has selected in eScriptorium. So no need to track down some sort of default transcription model in kraken!
Thanks for explaining. I've removed the default model for recognition training and made the input model optional; we can test it when della is back up after maintenance today.
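For reference, a rough sketch of how an optional input model might be passed to kraken's `RecognitionModel`. The path names mirror the attributes visible in the traceback later in this thread (`model_file`, `output_modelfile`, `training_data_dir`); the function name and exact keyword set are assumptions, not the committed code:

```python
from pathlib import Path
from typing import Optional

from kraken.lib.train import RecognitionModel


def build_recognition_model(model_file: Optional[Path], output_modelfile: Path,
                            training_data_dir: Path) -> RecognitionModel:
    """Set up recognition training; model_file=None means train from scratch."""
    return RecognitionModel(
        model=str(model_file) if model_file else None,  # None => no fine-tuning base
        output=str(output_modelfile),
        training_data=[str(training_data_dir / "train.arrow")],
        format_type="binary",  # the compiled arrow dataset
    )
```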
✅ Tested recognition training, both refining and from scratch. Refining worked and the model uploaded successfully -- I test-ran the model on a page image with success as well. The five pages of shared doc 30 were not enough for the model to successfully converge when training from scratch, so I received the warning message after the failed training job.
✅ Successfully ran a transcription train job from scratch, with the best model uploaded to eScriptorium, using 45 images of doc 27: htr2hpc-train transcription -t 29 https://test-htr.lib.princeton.edu/ recog_doc27_t29-2 -d 27 -w 3 --model-name recogtest-scratch -p 54377-54421 --no-clean
❌ Testing on the full doc 27 failed, because the current code is not correctly handling multiple pages of API results for the Parts List. Trying to input `-p 54377-54436` to force handling all 60 parts ran into the bug when there are multiple pages of the Line Transcription List.
src/htr2hpc/train/data.py
Outdated
text_lines[text_line.line] = text_line.content
# if there is another page of results, get them
if transcription_lines.next:
    transcription_lines = transcription_lines.next_page()
This is not functioning correctly -- I get the following error for a part whose Line Transcription List has multiple pages:
htr2hpc/src/htr2hpc/api_client.py:57 in next_page

  54 │ # convert result type (e.g. "list_model") into api method
  55 │ # (this may be brittle...)
  56 │ single_rtype = self.result_type.replace("list", "").strip(
  57 │ list_method = getattr(self.api, f"{single_rtype}_list")
  58 │ return list_method(page=next_page_num)

AttributeError: 'eScriptoriumAPIClient' object has no attribute 'transcription_line_list'
See for example https://test-htr.lib.princeton.edu/api/documents/27/parts/54422/transcriptions/
hahaha wow, it even says in my comments that this approach might be brittle. Sure enough! 😂
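One way to make this less brittle is to stop reconstructing the list-method name from `result_type` and pass the list callable itself around instead. A hedged sketch, assuming the result objects expose `next` and `results` the way the traceback above suggests; the `all_pages` helper is illustrative, not the actual client API:

```python
def all_pages(fetch_page):
    """Yield every result from a paginated eScriptorium list endpoint.

    fetch_page is a callable taking a page number and returning one page of
    results; passing the bound list method (or a small lambda closing over
    its other arguments) avoids reverse-engineering method names.
    """
    page = 1
    while True:
        response = fetch_page(page)
        yield from response.results
        if not response.next:
            break
        page += 1


# usage sketch, assuming the list methods accept a page keyword the way
# next_page() already calls them:
#   for part in all_pages(lambda p: api.document_parts_list(document_id, page=p)):
#       ...
```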
src/htr2hpc/train/data.py
Outdated
if part_ids is None:
    doc_parts = api.document_parts_list(document_id)
This currently seems to be getting only the results from the first page of API results, and not accounting for further pages. I tried test running transcription training on the full document in doc 27 -- by default, this should have downloaded all 60 parts of the document. But it only downloaded the first 10 images (the first page of API results).
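A possible fix sketch for the parts listing, accumulating results until the API reports no further page. The `next` / `next_page()` / `results` attribute names follow the ones already used in this PR; treat the exact shapes as assumptions:

```python
if part_ids is None:
    # collect document parts from every page of API results, not just the first
    doc_parts = []
    parts_list = api.document_parts_list(document_id)
    while True:
        doc_parts.extend(parts_list.results)
        if not parts_list.next:
            break
        parts_list = parts_list.next_page()
```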
if best_model:
    print(f"Uploaded {best_model} to eScriptorium")
else:
    # possibly best model found but upload failed?
Connection issues (such as if the VM goes down for an inconvenient moment) could cause the upload to fail, so the fuller version of this should have a greater range of error handling. Not a priority, but not something we want to forget.
yes, good point - this could also happen when we're uploading lots of models. I'll create an issue for error handling and we can add more items to the list as we think of them.
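As a starting point for that issue, a sketch of what retrying a failed upload could look like; the `upload` callable is left abstract on purpose, since the exact client method isn't shown in this thread:

```python
import time


def upload_with_retry(upload, model_path, attempts=3, wait=30):
    """Retry a model upload a few times before giving up.

    `upload` is whatever callable performs the eScriptorium upload; the loop
    only guards against transient connection problems, not bad input.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return upload(model_path)
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            print(f"upload attempt {attempt} failed ({err}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"model upload failed after {attempts} attempts") from last_error
```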
@cmroughan thanks as always for the careful testing and finding these flaws! I've revised how the code retrieves the next page of results from the api and tested the api methods locally before pushing the code. I started the training script on della with the document and transcription ids you noted in your comments above; it successfully downloaded all 60 pages and the transcription lines without errors, and the slurm job is running now.
Test ran transcription and segmentation training on all 60 images of doc 27.
Transcription:
✅ Running the train task from scratch worked; a model was not uploaded only because we hit the time limit for the slurm job, which is okay for testing.
✅ Refining on an input model also worked (the slurm job ran correctly with no issues), though we again hit the slurm time limit, which is okay for testing.
Segmentation:
❌ Running the train task "from scratch" failed, I think because the script did not correctly grab `blla.mlmodel`.
htr2hpc-train segmentation https://test-htr.lib.princeton.edu/ segtrain_doc27-upload --document 27 --model-name segtest
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:355 in main
│
│ 352 │ │ training_mgr.training_prep()
│ 353 │ │ # run training for requested mode
│ 354 │ │ if args.mode == "segmentation":
│ ❱ 355 │ │ │ training_mgr.segmentation_training()
│ 356 │ │ if args.mode == "transcription":
│ 357 │ │ │ training_mgr.recognition_training()
│ 358 │ except (NotFound, NotAllowed) as err:
│
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:142 in
│ segmentation_training
│
│ 139 │ def segmentation_training(self):
│ 140 │ │ # get absolute versions of these paths _before_ changing worki
│ 141 │ │ abs_training_data_dir = self.training_data_dir.absolute()
│ ❱ 142 │ │ abs_model_file = self.model_file.absolute()
│ 143 │ │ abs_output_modelfile = self.output_modelfile.absolute()
│ 144 │ │
│ 145 │ │ # change directory to working directory, since by default,
AttributeError: 'NoneType' object has no attribute 'absolute'
✅ Refining on an input model worked. All epochs completed within the time limit, and all 50 models uploaded to the eScriptorium instance.
Thanks for finding this - when I revised to only supply a default model for segmentation, I wrote the training mode check incorrectly. I've corrected that now.
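For reference, the corrected fallback presumably looks something like the sketch below: only segmentation mode falls back to kraken's bundled blla.mlmodel, while recognition training leaves the model unset so it trains from scratch. The `args.model` name is hypothetical, and this assumes kraken exposes the bundled segmentation model path as `kraken.kraken.SEGMENTATION_DEFAULT_MODEL` (alongside the `DEFAULT_MODEL` list referenced earlier in this PR):

```python
from pathlib import Path

from kraken.kraken import SEGMENTATION_DEFAULT_MODEL  # kraken's bundled blla.mlmodel

# only segmentation has a usable bundled default model; recognition training
# starts from scratch unless the user supplies a model to fine-tune
if args.model:
    model_file = Path(args.model)
elif args.mode == "segmentation":
    model_file = Path(SEGMENTATION_DEFAULT_MODEL)
else:
    model_file = None
```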
Tested the behavior of orphan lines (lines without regions). The script failed, I believe because of how the API represents the region value for these lines. Lines without regions are "orphans" in eScriptorium -- in an export, a user can choose whether or not to filter them out (code). When training a model, we do want to include this data for both segmentation and transcription. The exported ALTO XML file usually handles orphan lines by placing them inside a dummy block element.
The command tested:
The error:
@cmroughan thanks for finding this! I've added handling for lines that have no associated region.
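For context, a rough sketch of what that handling might look like when converting API line records into kraken line objects, keeping orphan lines rather than dropping them. This assumes kraken's `kraken.containers.BaselineLine`; the helper name and the API field names (`pk`, `baseline`, `mask`, `region`) are illustrative, not the committed code:

```python
from kraken.containers import BaselineLine


def lines_for_training(part_lines, text_lines):
    """Convert API line records into kraken BaselineLine objects.

    Orphan lines (no region in the API response) are kept rather than
    dropped or treated as an error.
    """
    baseline_lines = []
    for line in part_lines:
        baseline_lines.append(
            BaselineLine(
                id=f"line_{line.pk}",
                baseline=line.baseline,
                boundary=line.mask,
                text=text_lines.get(line.pk),
                # an orphan line simply gets no region association
                regions=[str(line.region)] if line.region else [],
            )
        )
    return baseline_lines
```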
Excellent! Ran tests on my end to confirm that orphan lines are being handled correctly, both for segmentation and transcription, and both tests were successful. ✅ ✅ I inspected the XML download for segmentation and confirmed that the orphan lines are correctly being input into that format.
Also tested some odd cases, and nothing broke! Just some comments about better handling we might want later, but not a priority.
@cmroughan Thanks again for the thorough and thoughtful testing. I've added a link to your most recent comment to our issue for adding error handling (#36). I'll go ahead and merge this and mark the issue as complete.
towards implementing #27 (`ketos train` command)