test compute/time requirements for different train tasks #26
Consider calculating compile time for producing |
First round of tests done, measuring five epochs of segmentation and transcription training on differently-sized batches of training data, where all the images were normalized to approximately 612.7 KB. Training data batch sizes were 10, 20, 50, 100, 200, and 500 images. Tests were run with a GPU and with 8 CPUs with 6 GB memory each. Full details in GSheets, summary below:
Segmentation
Training "from scratch" (refining on blla.mlmodel):
Training by refining on existing model:
Transcription
Training from scratch, straight XML:
Training by refining on existing model, straight XML:
Training from scratch, binary:
Training by refining on existing model, binary:
Future rounds of testing will examine the impact of different file sizes and of different worker counts. |
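Below is a minimal sketch of how these per-batch-size timings could be collected with a wrapper script. The directory layout, the epoch flag, and the exact ketos invocations are assumptions from memory rather than the setup actually used for the numbers above; flag names should be checked against `ketos segtrain --help` and `ketos train --help`.

```python
import csv
import subprocess
import time
from pathlib import Path

# Hypothetical layout: one directory of XML ground truth per batch size,
# e.g. data/batch_10/, data/batch_20/, ... data/batch_500/.
BATCH_SIZES = [10, 20, 50, 100, 200, 500]
DATA_ROOT = Path("data")        # placeholder path
RESULTS = Path("timings.csv")

def time_training(cmd: list[str]) -> float:
    """Run one training command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start

with RESULTS.open("w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["task", "batch_size", "seconds"])
    for n in BATCH_SIZES:
        xml_files = sorted(str(p) for p in (DATA_ROOT / f"batch_{n}").glob("*.xml"))
        # Segmentation "from scratch": refine the stock blla.mlmodel for five epochs.
        seg_cmd = ["ketos", "segtrain", "-i", "blla.mlmodel", "-f", "xml",
                   "--epochs", "5", *xml_files]
        writer.writerow(["segtrain", n, round(time_training(seg_cmd), 1)])
        # Transcription from scratch on the same straight-XML input.
        rec_cmd = ["ketos", "train", "-f", "xml", "--epochs", "5", *xml_files]
        writer.writerow(["train", n, round(time_training(rec_cmd), 1)])
```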
Another round of tests examined what impact different inputs for the worker count had. Tests were run by changing the parameter that sets the number of workers. Varying the worker count seems to have had no appreciable impact on transcription training, but it does have an impact on segmentation training.
Transcription
Training from scratch, straight XML:
Training from scratch, binary:
Segmentation
Training "from scratch" (refining on blla.mlmodel):
* The fourth entry above never completed an epoch: it was OOM-killed before training began. There seems to be some lower limit on memory per worker required to succeed, depending on the size of the input training data. Relatedly, for 500 images, training "from scratch" (refining on blla.mlmodel):
Review of the kraken code might help illuminate what memory minimums need to be met to avoid the OOM kill. |
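In the meantime, a back-of-the-envelope check of memory per worker before submitting could catch the obvious cases. The sketch below is only illustrative: the 3 GB floor is a placeholder (the figure recalled in the next comment), not a confirmed kraken requirement, and the real minimum appears to scale with the input data.

```python
# Back-of-the-envelope check of a planned Slurm request against an assumed
# per-worker memory floor. FLOOR_GB is a placeholder, not a confirmed number.
FLOOR_GB = 3.0

def mem_per_worker_gb(total_mem_gb: float, workers: int) -> float:
    """Memory available to each data-loader worker under the request."""
    return total_mem_gb / max(workers, 1)

def check_request(total_mem_gb: float, workers: int, floor_gb: float = FLOOR_GB) -> bool:
    """Return True if the request clears the assumed floor, else warn.

    The real minimum seems to depend on the size/complexity of the input
    training data, so this is only a first-pass sanity check.
    """
    per_worker = mem_per_worker_gb(total_mem_gb, workers)
    ok = per_worker >= floor_gb
    status = "OK" if ok else "WARNING: likely OOM kill"
    print(f"{status}: {per_worker:.1f} GB per worker across {workers} workers")
    return ok

# Example: the 8-worker, 6 GB-per-CPU configuration from the first round of tests.
check_request(total_mem_gb=8 * 6, workers=8)
```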
@cmroughan this is great. Could you also test increasing workers without decreasing memory? When I was fighting the OOM errors due to the input data problem, I did find some comments on GitHub issues about that and vaguely recall seeing 3 GB as a minimum for segmentation. I may be able to find them again if that would be helpful. |
Ran the tests: increasing workers without decreasing memory does not produce tangible benefits, as can be seen in the comparisons below. The code does not end up using the extra resources, which makes it an overly expensive Slurm request and, if done too often, can lead to future jobs being ranked lower in the queue (see the request-sizing sketch after this comment).
Transcription
Training from scratch, straight XML:
Segmentation
Training "from scratch" (refining on blla.mlmodel):
|
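To connect this back to the automation goal in the issue description, here is a minimal sketch of generating a right-sized Slurm batch script from the task type and dataset size. All of the resource heuristics, paths, and ketos flags here are placeholder assumptions to be replaced with the fitted numbers from the GSheets data.

```python
from pathlib import Path

# Placeholder heuristics loosely shaped by the test rounds above; these are
# assumptions, to be replaced with values fitted to the measured data.
def resources_for(task: str, n_images: int) -> dict:
    if task == "segtrain":
        cpus = 4 if n_images <= 100 else 8
        mem_per_cpu_gb = 6          # keep above the assumed per-worker floor
    else:  # transcription training
        cpus = 4                    # worker count showed little effect here
        mem_per_cpu_gb = 4
    hours = max(1, n_images // 100 + 1)
    return {"cpus": cpus, "mem_per_cpu_gb": mem_per_cpu_gb, "hours": hours}

TEMPLATE = """#!/bin/bash
#SBATCH --job-name={task}-{n_images}
#SBATCH --cpus-per-task={cpus}
#SBATCH --mem-per-cpu={mem_per_cpu_gb}G
#SBATCH --time={hours}:00:00

{command}
"""

def write_script(task: str, n_images: int, data_glob: str, out: Path) -> None:
    """Write a Slurm batch script for one ketos training job."""
    res = resources_for(task, n_images)
    command = f"ketos {task} -f xml {data_glob}"
    out.write_text(TEMPLATE.format(task=task, n_images=n_images,
                                   command=command, **res))

write_script("segtrain", 200, "data/batch_200/*.xml", Path("segtrain_200.sh"))
```

Generating the `#SBATCH` lines from measured requirements rather than a one-size-fits-all request would avoid both the OOM kills and the queue-priority penalty for over-asking.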
Ran further tests to track how complexity of the training data impacts resource usage, starting with transcription training tasks. For transcription, the line count metric is more relevant than the page count metric (which makes sense, considering the lines are the input training data). Line length, here tracked in character counts, also has an impact.
Tracking line counts
Training from scratch, straight XML, same CPU counts + memory:
Tracking line lengths:
In addition to line counts, I'll have to go back and extract the average line lengths across the datasets that have been tested. |
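For pulling those numbers out of the existing datasets, something like the sketch below should do. It assumes PAGE-style XML where each line's transcription sits in TextLine/TextEquiv/Unicode (ALTO files would need the String/@CONTENT attributes instead), and the paths are placeholders.

```python
import statistics
from pathlib import Path
from xml.etree import ElementTree as ET

def local(tag: str) -> str:
    """Strip any XML namespace prefix from a tag name."""
    return tag.rpartition("}")[2]

def line_texts(xml_path: Path) -> list[str]:
    """Collect line-level transcription text from a PAGE-style XML file."""
    root = ET.parse(str(xml_path)).getroot()
    texts = []
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        # Only direct TextEquiv children, i.e. the line-level transcription.
        for te in elem:
            if local(te.tag) == "TextEquiv":
                for uni in te:
                    if local(uni.tag) == "Unicode" and uni.text:
                        texts.append(uni.text)
    return texts

def dataset_stats(xml_dir: Path) -> tuple[int, float]:
    """Return (total line count, mean characters per line) for one dataset."""
    lengths = [len(t) for p in sorted(xml_dir.glob("*.xml")) for t in line_texts(p)]
    return len(lengths), (statistics.mean(lengths) if lengths else 0.0)

count, avg_len = dataset_stats(Path("data/batch_100"))   # placeholder path
print(f"{count} lines, {avg_len:.1f} characters per line on average")
```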
To better automate the creation of Slurm batch scripts, we will want a clearer sense of how variations in the training data size and types affect the compute and time requirements for the training job. With that in mind, we will want to evaluate the following:
How does X factor impact time requirements for a training task? What is the optimal number of CPU cores? Run tests addressing these questions for the following factors across both segmentation and transcription training tasks: