Implementation of resizing codec #277

Open
wants to merge 1 commit into
base: master
from

Conversation

ChWick

@ChWick ChWick commented Dec 13, 2017

If a pretrained model is used whose codec differs from the target text (e.g. historical documents), the codec has to be adapted to match the desired characters.

This pull request makes it possible to automatically extend or shrink the codec based on the provided ground truth data after loading a pretrained model. This is done by changing the dimension of the output LSTM layer (before Softmax) while keeping the previously trained values. To learn the new characters, the model must of course be retrained on the new data.
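As a rough illustration of the idea (not the code from this pull request), resizing an output layer while keeping trained values can be sketched as follows. The function name `resize_output_rows`, the list-of-rows weight layout, and the uniform initialization range are assumptions made for this example only; the actual patch operates on ocropy's LSTM parameters.

```python
import random

def resize_output_rows(rows, n_new, deleted_positions, init_scale=0.1, seed=None):
    """Resize an output weight matrix (one row per codec entry) to n_new rows.

    Rows at deleted_positions (characters dropped from the codec) are removed,
    trained rows of kept characters are preserved, and rows for newly added
    characters are appended with small random values.
    """
    rng = random.Random(seed)
    deleted = set(deleted_positions)
    n_in = len(rows[0])
    # Keep the trained weights of every character that stays in the codec.
    kept = [list(r) for i, r in enumerate(rows) if i not in deleted]
    n_add = n_new - len(kept)
    if n_add < 0:
        raise ValueError("n_new is smaller than the number of kept rows")
    # New characters start from small random weights and must be trained.
    new_rows = [[rng.uniform(-init_scale, init_scale) for _ in range(n_in)]
                for _ in range(n_add)]
    return kept + new_rows

old = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
resized = resize_output_rows(old, n_new=5, deleted_positions=[1], seed=0)
print(len(resized))  # 5
```

Only the appended rows are untrained, which is why retraining on the new data is still required.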

@amitdo
Contributor

amitdo commented Dec 13, 2017

I like the feature.

Did you test it? How does it compare to training from scratch?

Related:
tmbdev/clstm#106

@ChWick
Author

ChWick commented Dec 13, 2017

Our paper based on this technique applied to historical documents will hopefully be published this month. I will reference it as soon as it is available. Our findings and the improvements compared to training from scratch are documented there.

@amitdo
Contributor

amitdo commented Dec 13, 2017

Sounds promising :-)

ocrolib/lstm.py Outdated
@@ -645,6 +663,8 @@ def states(self):
     def weights(self):
         for w,dw,n in self.net.weights():
             yield w,dw,"Reversed/%s"%n
+    def resizeoutput(self, no, deleted_positions):
+        self.net.resizeOutput(no, deleted_positions)
Contributor

@amitdo amitdo Dec 14, 2017

Python is case-sensitive: 'resizeOutput' != 'resizeoutput'.

Use the standard Python style for naming functions/methods:
https://www.python.org/dev/peps/pep-0008/#method-names-and-instance-variables

@ChWick
Author

ChWick commented Dec 15, 2017

The function is renamed. If you prefer resize_output to resizeOutput, let me know.
In the new commit I added handling of a FloatingPointError exception during training. The codec will be resized in this case.
The paper should be available on arXiv on Monday.

@amitdo
Contributor

amitdo commented Dec 15, 2017

Are you talking about this paper?
https://arxiv.org/abs/1711.09670

I added it to the 'Publications' wiki page:
https://github.com/tmbdev/ocropy/wiki/Publications

One of the authors' names matches your username...

@ChWick
Author

ChWick commented Dec 15, 2017

No, that is a different paper, and it does not use the resizing of the codec. The new paper was submitted to arXiv today and should therefore be available on Monday.

@amitdo
Contributor

amitdo commented Dec 15, 2017

I saw your deleted comment that says 'You already found it'...
:-)

@chreul

chreul commented Dec 18, 2017

The corresponding paper is now available on arXiv: https://arxiv.org/abs/1712.05586

@amitdo
Contributor

amitdo commented Dec 18, 2017

Thanks for sharing your research and code.

Related, Training Tesseract 4.00:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

  • Fine Tuning for Impact(new-font-style)
  • Fine Tuning for ± a few characters
  • Training Just a Few Layers

The second option is similar to what your patch does.
Ocropy does not have the third option.

@amitdo
Contributor

amitdo commented Dec 19, 2017

@chreul, @ChWick

Figure 1. Different example lines from the seven books used
From top to bottom: excerpts from books 1476, 1488, 1495, 1500, 1505, 1509, and 1572.

For example, book 1505 shows the least improvement over the default approach (but still 23% and 8%, respectively). Most likely this is caused by the fact that the distances between two characters in book 1505 are considerably smaller compared to all other books used for training and testing (see Figure 1, line 4).

Update: Fixed in version 2 (v2) of the paper.

@amitdo
Contributor

amitdo commented Dec 19, 2017

The location of the description of this patch in the paper:

3.3 Utilizing Arbitrary Pretrained Models in OCRopus

3.3.1 Extending the Codec

3.3.2 Reducing the Codec

@mittagessen

Fine Tuning for Impact(new-font-style)
Fine Tuning for ± a few characters
Training Just a Few Layers

The second option is similar to what your patch does.

Technically, the second and third options are equivalent. In both cases the final linear projection is sliced off and a new one is trained, although the weights are already somewhat meaningful when just a few rows are deleted. It is possible to add a complete weight reinitialization here, although I'm unsure whether the single LSTM layer learns representations well enough to be worth the effort.

@zuphilip
Collaborator

@ChWick Thank you, this looks very interesting! I have seen that your paper appeared in the journal 027.7: http://0277.ch/ojs/index.php/cdrs_0277/article/view/169. It will take some time to check this in detail and test it thoroughly...

ocrolib/lstm.py Outdated
@@ -839,7 +862,7 @@ def ctc_align_targets(outputs,targets,threshold=100.0,verbose=0,debug=0,lo=1e-5)
     return aligned
 
 def normalize_nfkc(s):
-    return unicodedata.normalize('NFKC',s)
+    return unicodedata.normalize('NFC',s)
Collaborator

Why the change? It is confusing, since the method is called normalize_nfkc.

Collaborator

@kba kba left a comment

LGTM, thank you.

Let's merge this once it is clear whether the NFC/NFKC change was deliberate.

Also, a minimal test for CI would be helpful: train a minimal model, extend and shrink the character set, and make sure it doesn't break. Maybe you have such sample data from developing this?

@ChWick
Author

ChWick commented Feb 20, 2018

The NFC/NFKC change was needed for our purposes but apparently not for this pull request. The change has been reverted, and my branch is rebased onto the current master.
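For readers unfamiliar with the two normalization forms, the following small sketch shows how they differ on characters that are common in historical prints. That this is the reason the change was needed for the authors' purposes is an assumption; the thread does not say so. The sample characters are chosen for illustration only.

```python
import unicodedata

# Long s, the "fi" ligature, and "e" followed by a combining acute accent.
samples = ["\u017f", "\ufb01", "e\u0301"]

for s in samples:
    nfc = unicodedata.normalize("NFC", s)    # canonical composition only
    nfkc = unicodedata.normalize("NFKC", s)  # also applies compatibility mappings
    print(ascii(s), ascii(nfc), ascii(nfkc))
```

NFC leaves the long s and the ligature untouched, while NFKC folds them into "s" and "fi", so a codec built from NFKC-normalized ground truth cannot contain those glyphs as separate classes.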

As a test, I propose two single text lines with different alphabets. Use the --codec argument to generate the appropriate codec for the initial model and for the second model, which loads the first one.
E.g.:

  1. ocropus-rtrain --codec text_line_gt.txt --ntrain 2 -F 1 --output tmp test_file.png
  2. ocropus-rtrain --codec 2nd_text_line_gt.txt --ntrain 2 --load tmp-00000001.pyrnn.gz 2nd_test_file.png

This must also work (default codec in the initial model):

  1. ocropus-rtrain --ntrain 2 -F 1 --output tmp test_file.png
  2. ocropus-rtrain --codec 2nd_text_line_gt.txt --ntrain 2 --load tmp-00000001.pyrnn.gz 2nd_test_file.png

The content of text_line_gt.txt is e.g. ABCDEFG, and that of 2nd_text_line_gt.txt is EFGHIJKL.
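To make the expected outcome of this test concrete, here is a minimal sketch of the bookkeeping for the two example alphabets. The helper `codec_diff` is hypothetical and not part of ocropy; it also ignores any reserved entries (such as a blank label) that a real codec may hold.

```python
def codec_diff(old_chars, new_chars):
    """Compare two character sets given as strings.

    Returns (deleted_positions, added_chars, resized_codec), where
    deleted_positions are indices into old_chars that are no longer needed,
    added_chars are characters missing from old_chars, and resized_codec
    lists the kept characters followed by the added ones.
    """
    old_set, new_set = set(old_chars), set(new_chars)
    deleted_positions = [i for i, c in enumerate(old_chars) if c not in new_set]
    kept = [c for c in old_chars if c in new_set]
    # dict.fromkeys removes duplicates while preserving order.
    added = [c for c in dict.fromkeys(new_chars) if c not in old_set]
    return deleted_positions, added, kept + added

deleted, added, codec = codec_diff("ABCDEFG", "EFGHIJKL")
print(deleted)         # [0, 1, 2, 3]
print("".join(added))  # HIJKL
print("".join(codec))  # EFGHIJKL
```

In this example the second training run would have to drop four outputs (A to D), keep three trained ones (E to G), and add five untrained ones (H to L).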

6 participants