Update htr2hpc-train to support recognition training #35

Merged
23 commits merged into develop on Dec 13, 2024

Conversation

@rlskoeser (Contributor) commented Dec 9, 2024

towards implementing #27

  • generate recognition training data; adapts eScriptorium approach and compiles straight to binary pyarrow format
  • start and monitor slurm job to run ketos train command (a rough sketch of this flow follows)
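
A rough sketch of the slurm half of this flow, assuming subprocess calls to sbatch/squeue and a hypothetical job script (not the actual slurm.py implementation):

import subprocess
import time
from pathlib import Path

def submit_ketos_train(work_dir: Path, arrow_file: Path, output_model: str) -> str:
    """Write a minimal batch script and submit it with sbatch; returns the job id."""
    script = work_dir / "ketos_train.sh"
    script.write_text(
        "#!/bin/bash\n"
        "#SBATCH --job-name=ketos-train\n"
        "#SBATCH --time=01:00:00\n"
        f"ketos train -f binary -o {output_model} {arrow_file}\n"
    )
    result = subprocess.run(
        ["sbatch", "--parsable", str(script)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # sbatch --parsable prints the job id

def wait_for_job(job_id: str, poll_seconds: int = 30) -> None:
    """Poll squeue until the job no longer appears (completed or failed)."""
    while subprocess.run(
        ["squeue", "--noheader", "--job", job_id],
        capture_output=True, text=True,
    ).stdout.strip():
        time.sleep(poll_seconds)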

@rlskoeser rlskoeser marked this pull request as draft December 9, 2024 20:58
@rlskoeser rlskoeser requested a review from cmroughan December 9, 2024 20:59
@@ -19,7 +20,7 @@


def get_segmentation_data(
cmroughan (Collaborator):
Since this is covering acquiring both training data for segmentation and for transcription tasks, the name get_segmentation_data might be confusing -- that and the name of the variable segmentation_data used later could perhaps be tidied up to make it clearer that it is general training_data or something. But not something to worry about until the code is more finalized.

rlskoeser (Contributor Author):

agreed - I wasn't certain if I would need separate code for the two modes, but there is quite a bit of overlap. I like the suggestion of renaming to something like get_training_data

rlskoeser (Contributor Author):

Ah, I already have a get_training_data method - and this does return a list of kraken Segmentation objects. I'm going to leave it for now, maybe if/when we refactor we can figure out better names.

build_binary_dataset(
    segmentations,
    output_file=str(output_dir / "train.arrow"),
    num_workers=4,
cmroughan (Collaborator):

4 probably works fine for num_workers for now, but we might revisit this. I suppose this part of the code is being run on Della but not as the submitted slurm job, so we're working with what resources are available there.

rlskoeser (Contributor Author):

Yes, this is running on della outside of slurm. We can probably use more than 4 but wasn't sure if we wanted to use the same number as we use for the slurm job (which is configurable via command line args).
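
For reference, a sketch of how the hard-coded worker count could be threaded through from the existing command-line option (the function name and parameter are assumptions; the build_binary_dataset call mirrors the one shown above):

from kraken.lib.arrow_dataset import build_binary_dataset

def compile_recognition_data(segmentations, output_dir, num_workers=4):
    # compile kraken Segmentation objects straight to a binary arrow dataset
    build_binary_dataset(
        segmentations,
        output_file=str(output_dir / "train.arrow"),
        num_workers=num_workers,
    )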

src/htr2hpc/train/slurm.py (outdated review comment, resolved)
Comment on lines 39 to 40
# this is a list of string, relative path, but file does not actually exist
"Recognize": DEFAULT_MODEL[0],
rlskoeser (Contributor Author):

@cmroughan have you run into this or do you have any advice about this? The kraken code has a defined variable for a default model for recognition, but it is just a name, not a path and doesn't actually seem to exist anywhere in my local installation. Do we need to require a model for fine-tuning?

cmroughan (Collaborator):

I have never interacted with a default model for transcription (unlike the blla.mlmodel for segmentation). When doing transcription training, we will either be training the model from scratch or we will be finetuning on a different model that the user has selected in eScriptorium. So no need to track down some sort of default transcription model in kraken!

rlskoeser (Contributor Author):

Thanks for explaining. I've removed the default model for recognition training and made input model optional, we can test it when della is back up after maintenance today.

@rlskoeser rlskoeser marked this pull request as ready for review December 10, 2024 21:14
@cmroughan (Collaborator) left a comment:

✅ Tested recognition training, both refining and from scratch. Refining worked and the model uploaded successfully -- a test run of the model on a page image also succeeded. The five pages of shared doc 30 were not enough for the model to converge when training from scratch, so the warning message was received after the failed training job.

✅ Successfully ran a transcription train job from scratch, with the best model uploaded to eScriptorium, using 45 images of doc 27: htr2hpc-train transcription -t 29 https://test-htr.lib.princeton.edu/ recog_doc27_t29-2 -d 27 -w 3 --model-name recogtest-scratch -p 54377-54421 --no-clean

❌ Testing on the full doc 27 failed, because the current code is not correctly handling multiple pages of API results for Parts List. Trying to input -p 54377-54436 to force handling all 60 parts ran into the bug when there are multiple pages of Line Transcription List.

text_lines[text_line.line] = text_line.content
# if there is another page of results, get them
if transcription_lines.next:
    transcription_lines = transcription_lines.next_page()
cmroughan (Collaborator):

This is not functioning correctly -- I get the following error for a part whose Line Transcription List has multiple pages:

htr2hpc/src/htr2hpc/api_client.py:57 in next_page

54 │   │   │   # convert result type (e.g. "list_model") into api method  
55 │   │   │   # (this may be brittle...)   
56 │   │   │   single_rtype = self.result_type.replace("list", "").strip( 
57 │   │   │   list_method = getattr(self.api, f"{single_rtype}_list") 
58 │   │   │   return list_method(page=next_page_num) 

AttributeError: 'eScriptoriumAPIClient' object has no attribute 
'transcription_line_list'

See for example https://test-htr.lib.princeton.edu/api/documents/27/parts/54422/transcriptions/

rlskoeser (Contributor Author):

hahaha wow, even says in my comments that this approach might be brittle. sure enough! 😂
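
One possible shape for a less brittle next_page would be to follow the `next` URL that the API already returns instead of re-deriving the list method name (a sketch only, not the actual fix in this PR; the `session` attribute and `_parse_list_response` helper are assumptions):

def next_page(self):
    # follow the `next` URL from the paginated response rather than guessing
    # the name of the corresponding *_list method on the client
    if not self.next:
        return None
    response = self.api.session.get(self.next)
    response.raise_for_status()
    return self.api._parse_list_response(response.json(), self.result_type)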

Comment on lines 177 to 178
if part_ids is None:
doc_parts = api.document_parts_list(document_id)
cmroughan (Collaborator):

This currently seems to get only the first page of API results, without accounting for further pages. I tried test running transcription training on the full document in doc 27 -- by default, this should have downloaded all 60 parts of the document, but it only downloaded the first 10 images (the first page of API results).
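
A sketch of collecting all pages of parts rather than only the first (assumes the paginated result object exposes `results`, `next`, and `next_page()`, and that parts have a `pk` field):

if part_ids is None:
    doc_parts = api.document_parts_list(document_id)
    parts = list(doc_parts.results)
    # keep requesting pages until the API reports no further page
    while doc_parts.next:
        doc_parts = doc_parts.next_page()
        parts.extend(doc_parts.results)
    part_ids = [part.pk for part in parts]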

if best_model:
    print(f"Uploaded {best_model} to eScriptorium")
else:
    # possibly best model found but upload failed?
cmroughan (Collaborator):

Connection issues (such as the VM going down at an inconvenient moment) could cause the upload to fail, so the fuller version of this should have a wider range of error handling. Not a priority, but not something we want to forget.

rlskoeser (Contributor Author):

yes, good point - this could also happen when we're uploading lots of models. I'll create an issue for error handling and we can add more items to the list as we think of them.
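
A minimal sketch of the kind of retry wrapper that issue could cover, assuming a requests-based client (the upload call itself is left as a passed-in function, since the real client method isn't shown here):

import time
import requests

def upload_with_retries(upload_fn, retries=3, delay=30):
    # retry the upload on connection-level failures before giving up
    for attempt in range(1, retries + 1):
        try:
            return upload_fn()
        except requests.exceptions.ConnectionError:
            if attempt == retries:
                raise
            time.sleep(delay)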

@rlskoeser rlskoeser requested a review from cmroughan December 11, 2024 14:37
@rlskoeser (Contributor Author) commented:

@cmroughan thanks as always for the careful testing and for finding these flaws! I've revised how the code retrieves the next page of results from the API and tested the API methods locally before pushing the code. I started the training script on della with the document and transcription ids you noted above; it successfully downloaded all 60 pages and the transcription lines without errors, and the slurm job is running now.

@cmroughan (Collaborator) left a comment:

Test ran transcription and segmentation training on all 60 images of doc 27.

Transcription:

✅ Running the train task from scratch worked; a model was not uploaded only because we hit the time limit for the slurm job, which is okay for testing.

✅ Refining on an input model also worked, but we again hit the slurm time limit, which is okay for testing.

Segmentation:

❌ Running the train task "from scratch" failed, I think because the script did not correctly grab blla.mlmodel.

htr2hpc-train segmentation https://test-htr.lib.princeton.edu/ segtrain_doc27-upload --document 27 --model-name segtest

│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:355 in main 
│
│   352 │   │   training_mgr.training_prep() 
│   353 │   │   # run training for requested mode 
│   354 │   │   if args.mode == "segmentation":
│ ❱ 355 │   │   │   training_mgr.segmentation_training() 
│   356 │   │   if args.mode == "transcription":       
│   357 │   │   │   training_mgr.recognition_training()
│   358 │   except (NotFound, NotAllowed) as err:                           
│                                                                           
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:142 in            
│ segmentation_training                                                     
│                                                                           
│   139 │   def segmentation_training(self):                                
│   140 │   │   # get absolute versions of these paths _before_ changing worki 
│   141 │   │   abs_training_data_dir = self.training_data_dir.absolute()   
│ ❱ 142 │   │   abs_model_file = self.model_file.absolute()                 
│   143 │   │   abs_output_modelfile = self.output_modelfile.absolute()     
│   144 │   │                                                               
│   145 │   │   # change directory to working directory, since by default,  

AttributeError: 'NoneType' object has no attribute 'absolute'

✅ Refining on an input model worked. All epochs completed within the time limit, and all 50 models uploaded to the eScriptorium instance.

@rlskoeser (Contributor Author) commented:

❌ Running the train task "from scratch" failed, I think because the script did not correctly grab blla.mlmodel.

Thanks for finding this - when I revised the code to supply a default model only for segmentation, I wrote the training mode check incorrectly. I've corrected that now.
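
A hypothetical sketch of the corrected check (the variable names are assumptions): the kraken default model is only used as a fallback for segmentation training, and recognition training simply leaves the input model unset.

if args.model is None and args.mode == "segmentation":
    # fall back to kraken's packaged blla.mlmodel for segmentation only
    model_file = SEGMENTATION_DEFAULT_MODEL
else:
    model_file = args.model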

@cmroughan (Collaborator) commented:

Tested behavior of htr2hpc-train when run on a selection of pages where some pages do not have region-level data (only line-level data).

The script failed, I believe because the API returns null as the region value for these lines. See the example in the API here.

Lines without regions are "orphans" in eScriptorium -- in an export, a user can choose whether or not to filter them out (code). When training a model, we do want to include this data for both segmentation and transcription.

The exported ALTO XML file usually handles orphan lines by placing them inside a <TextBlock ID="eSc_dummyblock_"> element, with no attributes beyond the dummyblock ID and no <Shape> or <Polygon> child elements.


The command tested:

htr2hpc-train transcription -t 40 https://test-htr.lib.princeton.edu/ recog_doc33_t40 -d 33 -w 3 --model-name recogtest-scratch40 --no-clean -m 119

The error:

DEBUG:htr2hpc.api_client:get https://test-htr.lib.princeton.edu/api/documents/33/parts/54989/ 200: 0.185237 sec
DEBUG:htr2hpc.api_client:Creating namedtuple with name part
DEBUG:htr2hpc.api_client:Creating namedtuple with name region
DEBUG:htr2hpc.api_client:Creating namedtuple with name line
DEBUG:htr2hpc.api_client:get https://test-htr.lib.princeton.edu/api/documents/33/parts/54989/transcriptions/ 200: 0.161417 sec
DEBUG:htr2hpc.api_client:Creating namedtuple with name transcription_line
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/croughan/.conda/envs/htr2hpc/bin/htr2hpc-train:8 in <module>           │
│                                                                              │
│   5 from htr2hpc.train.run import main                                       │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│ ❱ 8 │   sys.exit(main())                                                     │
│   9                                                                          │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:352 in main          │
│                                                                              │
│   349 │   )                                                                  │
│   350 │   try:                                                               │
│   351 │   │   # prep data for training                                       │
│ ❱ 352 │   │   training_mgr.training_prep()                                   │
│   353 │   │   # run training for requested mode                              │
│   354 │   │   if args.mode == "segmentation":                                │
│   355 │   │   │   training_mgr.segmentation_training()                       │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:68 in training_prep  │
│                                                                              │
│    65 │   │   self.training_data_dir = self.work_dir / "parts"               │
│    66 │   │   if not self.existing_data:                                     │
│    67 │   │   │   self.training_data_dir.mkdir()                             │
│ ❱  68 │   │   get_training_data(                                             │
│    69 │   │   │   self.api,                                                  │
│    70 │   │   │   self.training_data_dir,                                    │
│    71 │   │   │   self.document_id,                                          │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/data.py:212 in              │
│ get_training_data                                                            │
│                                                                              │
│   209 │   document_details = api.document_details(document_id)               │
│   210 │                                                                      │
│   211 │   # get segmentation data for each part of the document that is requ │
│ ❱ 212 │   segmentation_data = [                                              │
│   213 │   │   get_segmentation_data(                                         │
│   214 │   │   │   api, document_details, part_id, output_dir, transcription_ │
│   215 │   │   )                                                              │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/data.py:213 in <listcomp>   │
│                                                                              │
│   210 │                                                                      │
│   211 │   # get segmentation data for each part of the document that is requ │
│   212 │   segmentation_data = [                                              │
│ ❱ 213 │   │   get_segmentation_data(                                         │
│   214 │   │   │   api, document_details, part_id, output_dir, transcription_ │
│   215 │   │   )                                                              │
│   216 │   │   for part_id in part_ids                                        │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/data.py:95 in               │
│ get_segmentation_data                                                        │
│                                                                              │
│    92 │   else:                                                              │
│    93 │   │   text_lines = {}                                                │
│    94 │                                                                      │
│ ❱  95 │   baselines = [                                                      │
│    96 │   │   BaselineLine(                                                  │
│    97 │   │   │   id=line.external_id,                                       │
│    98 │   │   │   baseline=line.baseline,                                    │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/data.py:102 in <listcomp>   │
│                                                                              │
│    99 │   │   │   boundary=line.mask,                                        │
│   100 │   │   │   # eScriptorium api returns a single region pk              │
│   101 │   │   │   # kraken takes a list of string ids                        │
│ ❱ 102 │   │   │   regions=[region_pk_to_id[line.region]],                    │
│   103 │   │   │   # mark as default if type is not in the public list        │
│   104 │   │   │   # db includes more types but they are not marked as public │
│   105 │   │   │   tags={"type": line_types.get(line.typology, "default")},   │
╰──────────────────────────────────────────────────────────────────────────────╯
KeyError: None

@rlskoeser (Contributor Author) commented:

@cmroughan thanks for finding this! I've added handling for when line.region is None. I tested the data export locally with the command you provided: I was able to duplicate your error before this change, and I successfully compiled the binary data after making it.
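
A sketch of the kind of None-handling described here, based on the listcomp shown in the traceback above (a guess at the shape of the fix, not the exact code; the import path may differ by kraken version):

from kraken.containers import BaselineLine

baselines = [
    BaselineLine(
        id=line.external_id,
        baseline=line.baseline,
        boundary=line.mask,
        # orphan lines have no region: pass an empty list instead of
        # looking up None in region_pk_to_id
        regions=[region_pk_to_id[line.region]] if line.region is not None else [],
        tags={"type": line_types.get(line.typology, "default")},
    )
    for line in lines  # line records from the API (iterable name assumed)
]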

@cmroughan (Collaborator) commented:

Excellent! Ran tests on my end to confirm that orphan lines are being handled correctly, both for segmentation and transcription, and both tests were successful.

htr2hpc-train segmentation https://test-htr.lib.princeton.edu/ segtest_doc33 -d 33 -w 3 --model-name segtest-scratch40 --no-clean

htr2hpc-train transcription -t 40 https://test-htr.lib.princeton.edu/ recog_doc33_t40 -d 33 -w 3 --model-name recogtest-scratch40 --no-clean

I inspected the XML download for segmentation and confirmed that the orphan lines are correctly represented in that format.

@cmroughan (Collaborator) commented:

Also tested some odd cases, and nothing broke! Just some comments about better handling we might want later, but not a priority.

  • 🟡 Running recognition training on pages with segmentation data but no transcription data:
    • slurm job submitted but failed (because there was no valid training data).
    • Not a requirement right now, but we might consider adding code to catch when there is no valid training data in the parts/transcription submitted.
  • 🟡 Running recognition training on pages with no segmentation data (& therefore no transcription data):
    • slurm job submitted but failed (because there was no valid training data).
    • Same comment as above.
  • 🟢 Running recognition training on mixed pages, some with seg+trans data, some with no data:
    • slurm job submitted, ran successfully using the training data that does exist
  • 🟡 Running segmentation training on pages with no segmentation data:
    • slurm job submitted, ran... technically successfully, I guess? The model learned that there are indeed no regions to recognize!
    • Really there should be some error handling to catch when seg training data has 0 lines AND 0 regions, since that's useless and we don't need to spend cluster time on it (see the sketch after this list).
  • 🟢 Running segmentation training on mixed pages, some with seg data and some without:
    • slurm job submitted, ran successfully. This is desired behavior (there may be training data with pages where the model should learn that there are no regions/lines!)
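
A hypothetical pre-submission check along the lines suggested above (all names are assumptions); it would refuse to submit a segmentation job when the compiled training data contains no lines and no regions at all:

def has_segmentation_content(segmentations) -> bool:
    # kraken Segmentation objects expose .lines and .regions
    return any(seg.lines or seg.regions for seg in segmentations)

if not has_segmentation_content(segmentations):
    raise SystemExit(
        "No lines or regions found in the training data; not submitting a slurm job."
    )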

@rlskoeser (Contributor Author) commented:

@cmroughan Thanks again for the thorough and thoughtful testing.

I've added a link to your most recent comment on our issue for adding error handling. #36

I'll go ahead and merge this and mark the issue as complete.

@rlskoeser rlskoeser merged commit 6dece38 into develop Dec 13, 2024
@rlskoeser rlskoeser deleted the feature/recognition-data branch December 13, 2024 13:53