Update htr2hpc-train to support recognition training #35

Merged
23 commits merged into develop on Dec 13, 2024

Conversation

@rlskoeser (Contributor) commented Dec 9, 2024

towards implementing #27

  • generate recognition training data; adapts eScriptorium approach and compiles straight to binary pyarrow format
  • start and monitor slurm job to run ketos train command (a rough sketch of this flow follows)
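
A rough sketch of the slurm half of this flow, assuming subprocess calls to sbatch/squeue and a hypothetical job script (not the actual slurm.py implementation):

import subprocess
import time
from pathlib import Path

def submit_ketos_train(work_dir: Path, arrow_file: Path, output_model: str) -> str:
    """Write a minimal batch script and submit it with sbatch; returns the job id."""
    script = work_dir / "ketos_train.sh"
    script.write_text(
        "#!/bin/bash\n"
        "#SBATCH --job-name=ketos-train\n"
        "#SBATCH --time=01:00:00\n"
        f"ketos train -f binary -o {output_model} {arrow_file}\n"
    )
    result = subprocess.run(
        ["sbatch", "--parsable", str(script)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # sbatch --parsable prints the job id

def wait_for_job(job_id: str, poll_seconds: int = 30) -> None:
    """Poll squeue until the job no longer appears (completed or failed)."""
    while subprocess.run(
        ["squeue", "--noheader", "--job", job_id],
        capture_output=True, text=True,
    ).stdout.strip():
        time.sleep(poll_seconds)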

@rlskoeser rlskoeser marked this pull request as draft December 9, 2024 20:58
@rlskoeser rlskoeser requested a review from cmroughan December 9, 2024 20:59
@@ -19,7 +20,7 @@


def get_segmentation_data(
cmroughan (Collaborator):
Since this is covering acquiring both training data for segmentation and for transcription tasks, the name get_segmentation_data might be confusing -- that and the name of the variable segmentation_data used later could perhaps be tidied up to make it clearer that it is general training_data or something. But not something to worry about until the code is more finalized.

rlskoeser (Contributor Author):

agreed - I wasn't certain if I would need separate code for the two modes, but there is quite a bit of overlap. I like the suggestion of renaming to something like get_training_data

rlskoeser (Contributor Author):

Ah, I already have a get_training_data method - and this does return a list of kraken Segmentation objects. I'm going to leave it for now, maybe if/when we refactor we can figure out better names.

build_binary_dataset(
    segmentations,
    output_file=str(output_dir / "train.arrow"),
    num_workers=4,
cmroughan (Collaborator):

4 probably works fine for num_workers for now, but we might revisit this. I suppose this part of the code is being run on Della but not as the submitted slurm job, so we're working with what resources are available there.

rlskoeser (Contributor Author):

Yes, this is running on della outside of slurm. We can probably use more than 4 but wasn't sure if we wanted to use the same number as we use for the slurm job (which is configurable via command line args).
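
For reference, a sketch of how the hard-coded worker count could be threaded through from the existing command-line option (the function name and parameter are assumptions; the build_binary_dataset call mirrors the one shown above):

from kraken.lib.arrow_dataset import build_binary_dataset

def compile_recognition_data(segmentations, output_dir, num_workers=4):
    # compile kraken Segmentation objects straight to a binary arrow dataset
    build_binary_dataset(
        segmentations,
        output_file=str(output_dir / "train.arrow"),
        num_workers=num_workers,
    )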

src/htr2hpc/train/slurm.py (outdated review comment, resolved)
Comment on lines 39 to 40
# this is a list of string, relative path, but file does not actually exist
"Recognize": DEFAULT_MODEL[0],
rlskoeser (Contributor Author):

@cmroughan have you run into this or do you have any advice about this? The kraken code has a defined variable for a default model for recognition, but it is just a name, not a path and doesn't actually seem to exist anywhere in my local installation. Do we need to require a model for fine-tuning?

cmroughan (Collaborator):

I have never interacted with a default model for transcription (unlike the blla.mlmodel for segmentation). When doing transcription training, we will either be training the model from scratch or we will be finetuning on a different model that the user has selected in eScriptorium. So no need to track down some sort of default transcription model in kraken!

rlskoeser (Contributor Author):

Thanks for explaining. I've removed the default model for recognition training and made input model optional, we can test it when della is back up after maintenance today.

@rlskoeser rlskoeser marked this pull request as ready for review December 10, 2024 21:14
@cmroughan (Collaborator) left a comment:

✅ Tested recognition training, both refining and from scratch. Refining worked and the model uploaded successfully -- a test run of the model on a page image also succeeded. The five pages of shared doc 30 were not enough for the model to converge when training from scratch, so the warning message was received after the failed training job.

✅ Successfully ran a transcription train job from scratch, with the best model uploaded to eScriptorium, using 45 images of doc 27: htr2hpc-train transcription -t 29 https://test-htr.lib.princeton.edu/ recog_doc27_t29-2 -d 27 -w 3 --model-name recogtest-scratch -p 54377-54421 --no-clean

❌ Testing on the full doc 27 failed, because the current code is not correctly handling multiple pages of API results for Parts List. Trying to input -p 54377-54436 to force handling all 60 parts ran into the bug when there are multiple pages of Line Transcription List.

text_lines[text_line.line] = text_line.content
# if there is another page of results, get them
if transcription_lines.next:
    transcription_lines = transcription_lines.next_page()
cmroughan (Collaborator):

This is not functioning correctly -- I get the following error for a part whose Line Transcription List has multiple pages:

htr2hpc/src/htr2hpc/api_client.py:57 in next_page

54 │   │   │   # convert result type (e.g. "list_model") into api method  
55 │   │   │   # (this may be brittle...)   
56 │   │   │   single_rtype = self.result_type.replace("list", "").strip( 
57 │   │   │   list_method = getattr(self.api, f"{single_rtype}_list") 
58 │   │   │   return list_method(page=next_page_num) 

AttributeError: 'eScriptoriumAPIClient' object has no attribute 
'transcription_line_list'

See for example https://test-htr.lib.princeton.edu/api/documents/27/parts/54422/transcriptions/

rlskoeser (Contributor Author):

hahaha wow, even says in my comments that this approach might be brittle. sure enough! 😂
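
One possible shape for a less brittle next_page would be to follow the `next` URL that the API already returns instead of re-deriving the list method name (a sketch only, not the actual fix in this PR; the `session` attribute and `_parse_list_response` helper are assumptions):

def next_page(self):
    # follow the `next` URL from the paginated response rather than guessing
    # the name of the corresponding *_list method on the client
    if not self.next:
        return None
    response = self.api.session.get(self.next)
    response.raise_for_status()
    return self.api._parse_list_response(response.json(), self.result_type)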

Comment on lines 177 to 178
if part_ids is None:
doc_parts = api.document_parts_list(document_id)
cmroughan (Collaborator):

This currently seems to get only the first page of API results, without accounting for further pages. I tried test running transcription training on the full document in doc 27 -- by default, this should have downloaded all 60 parts of the document, but it only downloaded the first 10 images (the first page of API results).
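
A sketch of collecting all pages of parts rather than only the first (assumes the paginated result object exposes `results`, `next`, and `next_page()`, and that parts have a `pk` field):

if part_ids is None:
    doc_parts = api.document_parts_list(document_id)
    parts = list(doc_parts.results)
    # keep requesting pages until the API reports no further page
    while doc_parts.next:
        doc_parts = doc_parts.next_page()
        parts.extend(doc_parts.results)
    part_ids = [part.pk for part in parts]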

if best_model:
    print(f"Uploaded {best_model} to eScriptorium")
else:
    # possibly best model found but upload failed?
cmroughan (Collaborator):

Connection issues (such as the VM going down at an inconvenient moment) could cause the upload to fail, so the fuller version of this should have a wider range of error handling. Not a priority, but not something we want to forget.

rlskoeser (Contributor Author):

yes, good point - this could also happen when we're uploading lots of models. I'll create an issue for error handling and we can add more items to the list as we think of them.
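
A minimal sketch of the kind of retry wrapper that issue could cover, assuming a requests-based client (the upload call itself is left as a passed-in function, since the real client method isn't shown here):

import time
import requests

def upload_with_retries(upload_fn, retries=3, delay=30):
    # retry the upload on connection-level failures before giving up
    for attempt in range(1, retries + 1):
        try:
            return upload_fn()
        except requests.exceptions.ConnectionError:
            if attempt == retries:
                raise
            time.sleep(delay)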

@rlskoeser rlskoeser requested a review from cmroughan December 11, 2024 14:37
@rlskoeser (Contributor Author) commented:

@cmroughan thanks as always for the careful testing and for finding these flaws! I've revised how the code retrieves the next page of results from the API and tested the API methods locally before pushing the code. I started the training script on della with the document and transcription ids you noted above; it successfully downloaded all 60 pages and the transcription lines without errors, and the slurm job is running now.

@cmroughan (Collaborator) left a comment:

Test ran transcription and segmentation training on all 60 images of doc 27.

Transcription:

✅ Running the train task from scratch worked; a model was not uploaded only because we hit the time limit for the slurm job, which is okay for testing.

✅ Refining on an input model also worked, but we again hit the slurm time limit, which is okay for testing.

Segmentation:

❌ Running the train task "from scratch" failed, I think because the script did not correctly grab blla.mlmodel.

htr2hpc-train segmentation https://test-htr.lib.princeton.edu/ segtrain_doc27-upload --document 27 --model-name segtest

│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:355 in main 
│
│   352 │   │   training_mgr.training_prep() 
│   353 │   │   # run training for requested mode 
│   354 │   │   if args.mode == "segmentation":
│ ❱ 355 │   │   │   training_mgr.segmentation_training() 
│   356 │   │   if args.mode == "transcription":       
│   357 │   │   │   training_mgr.recognition_training()
│   358 │   except (NotFound, NotAllowed) as err:                           
│                                                                           
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:142 in            
│ segmentation_training                                                     
│                                                                           
│   139 │   def segmentation_training(self):                                
│   140 │   │   # get absolute versions of these paths _before_ changing worki 
│   141 │   │   abs_training_data_dir = self.training_data_dir.absolute()   
│ ❱ 142 │   │   abs_model_file = self.model_file.absolute()                 
│   143 │   │   abs_output_modelfile = self.output_modelfile.absolute()     
│   144 │   │                                                               
│   145 │   │   # change directory to working directory, since by default,  

AttributeError: 'NoneType' object has no attribute 'absolute'

✅ Refining on an input model worked. All epochs completed within the time limit, and all 50 models uploaded to the eScriptorium instance.

@rlskoeser (Contributor Author) commented:

❌ Running the train task "from scratch" failed, I think because the script did not correctly grab blla.mlmodel.

Thanks for finding this - when I revised the code to supply a default model only for segmentation, I wrote the training mode check incorrectly. I've corrected that now.
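
A hypothetical sketch of the corrected check (the variable names are assumptions): the kraken default model is only used as a fallback for segmentation training, and recognition training simply leaves the input model unset.

if args.model is None and args.mode == "segmentation":
    # fall back to kraken's packaged blla.mlmodel for segmentation only
    model_file = SEGMENTATION_DEFAULT_MODEL
else:
    model_file = args.model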

@cmroughan (Collaborator) commented:

Tested behavior of htr2hpc-train when run on a selection of pages where some pages do not have region-level data (only line-level data).

The script failed, I believe because the API returns null as the region value for these lines. See the example in the API here.

Lines without regions are "orphans" in eScriptorium -- in an export, a user can choose whether or not to filter them out (code). When training a model, we do want to include this data for both segmentation and transcription.

The exported ALTO XML file usually handles orphan lines by placing them inside a <TextBlock ID="eSc_dummyblock_"> element, with no attributes beyond the dummyblock ID and no <Shape> or <Polygon> child elements.


The command tested:

htr2hpc-train transcription -t 40 https://test-htr.lib.princeton.edu/ recog_doc33_t40 -d 33 -w 3 --model-name recogtest-scratch40 --no-clean -m 119

The error:

DEBUG:htr2hpc.api_client:get https://test-htr.lib.princeton.edu/api/documents/33/parts/54989/ 200: 0.185237 sec
DEBUG:htr2hpc.api_client:Creating namedtuple with name part
DEBUG:htr2hpc.api_client:Creating namedtuple with name region
DEBUG:htr2hpc.api_client:Creating namedtuple with name line
DEBUG:htr2hpc.api_client:get https://test-htr.lib.princeton.edu/api/documents/33/parts/54989/transcriptions/ 200: 0.161417 sec
DEBUG:htr2hpc.api_client:Creating namedtuple with name transcription_line
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/croughan/.conda/envs/htr2hpc/bin/htr2hpc-train:8 in <module>           │
│                                                                              │
│   5 from htr2hpc.train.run import main                                       │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│ ❱ 8 │   sys.exit(main())                                                     │
│   9                                                                          │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:352 in main          │
│                                                                              │
│   349 │   )                                                                  │
│   350 │   try:                                                               │
│   351 │   │   # prep data for training                                       │
│ ❱ 352 │   │   training_mgr.training_prep()                                   │
│   353 │   │   # run training for requested mode                              │
│   354 │   │   if args.mode == "segmentation":                                │
│   355 │   │   │   training_mgr.segmentation_training()                       │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/run.py:68 in training_prep  │
│                                                                              │
│    65 │   │   self.training_data_dir = self.work_dir / "parts"               │
│    66 │   │   if not self.existing_data:                                     │
│    67 │   │   │   self.training_data_dir.mkdir()                             │
│ ❱  68 │   │   get_training_data(                                             │
│    69 │   │   │   self.api,                                                  │
│    70 │   │   │   self.training_data_dir,                                    │
│    71 │   │   │   self.document_id,                                          │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/data.py:212 in              │
│ get_training_data                                                            │
│                                                                              │
│   209 │   document_details = api.document_details(document_id)               │
│   210 │                                                                      │
│   211 │   # get segmentation data for each part of the document that is requ │
│ ❱ 212 │   segmentation_data = [                                              │
│   213 │   │   get_segmentation_data(                                         │
│   214 │   │   │   api, document_details, part_id, output_dir, transcription_ │
│   215 │   │   )                                                              │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/data.py:213 in <listcomp>   │
│                                                                              │
│   210 │                                                                      │
│   211 │   # get segmentation data for each part of the document that is requ │
│   212 │   segmentation_data = [                                              │
│ ❱ 213 │   │   get_segmentation_data(                                         │
│   214 │   │   │   api, document_details, part_id, output_dir, transcription_ │
│   215 │   │   )                                                              │
│   216 │   │   for part_id in part_ids                                        │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/data.py:95 in               │
│ get_segmentation_data                                                        │
│                                                                              │
│    92 │   else:                                                              │
│    93 │   │   text_lines = {}                                                │
│    94 │                                                                      │
│ ❱  95 │   baselines = [                                                      │
│    96 │   │   BaselineLine(                                                  │
│    97 │   │   │   id=line.external_id,                                       │
│    98 │   │   │   baseline=line.baseline,                                    │
│                                                                              │
│ /scratch/gpfs/croughan/htr2hpc/src/htr2hpc/train/data.py:102 in <listcomp>   │
│                                                                              │
│    99 │   │   │   boundary=line.mask,                                        │
│   100 │   │   │   # eScriptorium api returns a single region pk              │
│   101 │   │   │   # kraken takes a list of string ids                        │
│ ❱ 102 │   │   │   regions=[region_pk_to_id[line.region]],                    │
│   103 │   │   │   # mark as default if type is not in the public list        │
│   104 │   │   │   # db includes more types but they are not marked as public │
│   105 │   │   │   tags={"type": line_types.get(line.typology, "default")},   │
╰──────────────────────────────────────────────────────────────────────────────╯
KeyError: None

@rlskoeser (Contributor Author) commented:

@cmroughan thanks for finding this! I've added handling for when line.region is None. I tested the data export locally with the command you provided: I was able to duplicate your error before this change, and I successfully compiled the binary data after making it.
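
A sketch of the kind of None-handling described here, based on the listcomp shown in the traceback above (a guess at the shape of the fix, not the exact code; the import path may differ by kraken version):

from kraken.containers import BaselineLine

baselines = [
    BaselineLine(
        id=line.external_id,
        baseline=line.baseline,
        boundary=line.mask,
        # orphan lines have no region: pass an empty list instead of
        # looking up None in region_pk_to_id
        regions=[region_pk_to_id[line.region]] if line.region is not None else [],
        tags={"type": line_types.get(line.typology, "default")},
    )
    for line in lines  # line records from the API (iterable name assumed)
]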

@cmroughan (Collaborator) commented:

Excellent! Ran tests on my end to confirm that orphan lines are being handled correctly, both for segmentation and transcription, and both tests were successful.

htr2hpc-train segmentation https://test-htr.lib.princeton.edu/ segtest_doc33 -d 33 -w 3 --model-name segtest-scratch40 --no-clean

htr2hpc-train transcription -t 40 https://test-htr.lib.princeton.edu/ recog_doc33_t40 -d 33 -w 3 --model-name recogtest-scratch40 --no-clean

I inspected the XML download for segmentation and confirmed that the orphan lines are correctly represented in that format.

@cmroughan (Collaborator) commented:

Also tested some odd cases, and nothing broke! Just some comments about better handling we might want later, but not a priority.

  • 🟡 Running recognition training on pages with segmentation data but no transcription data:
    • slurm job submitted but failed (because there was no valid training data).
    • Not a requirement right now, but we might consider adding code to catch when there is no valid training data in the parts/transcription submitted.
  • 🟡 Running recognition training on pages with no segmentation data (& therefore no transcription data):
    • slurm job submitted but failed (because there was no valid training data).
    • Same comment as above.
  • 🟢 Running recognition training on mixed pages, some with seg+trans data, some with no data:
    • slurm job submitted, ran successfully using the training data that does exist
  • 🟡 Running segmentation training on pages with no segmentation data:
    • slurm job submitted, ran... technically successfully, I guess? The model learned that there are indeed no regions to recognize!
    • Really there should be some error handling to catch when seg training data has 0 lines AND 0 regions, since that's useless and we don't need to spend cluster time on it (see the sketch after this list).
  • 🟢 Running segmentation training on mixed pages, some with seg data and some without:
    • slurm job submitted, ran successfully. This is desired behavior (there may be training data with pages where the model should learn that there are no regions/lines!)
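
A hypothetical pre-submission check along the lines suggested above (all names are assumptions); it would refuse to submit a segmentation job when the compiled training data contains no lines and no regions at all:

def has_segmentation_content(segmentations) -> bool:
    # kraken Segmentation objects expose .lines and .regions
    return any(seg.lines or seg.regions for seg in segmentations)

if not has_segmentation_content(segmentations):
    raise SystemExit(
        "No lines or regions found in the training data; not submitting a slurm job."
    )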

@rlskoeser (Contributor Author) commented:

@cmroughan Thanks again for the thorough and thoughtful testing.

I've added a link to your most recent comment on our issue for adding error handling. #36

I'll go ahead and merge this and mark the issue as complete.

@rlskoeser rlskoeser merged commit 6dece38 into develop Dec 13, 2024
@rlskoeser rlskoeser deleted the feature/recognition-data branch December 13, 2024 13:53