Implementation of resizing codec #277
base: master
Conversation
I like the feature. Did you test it? How does it compare to training from scratch? Related:
Our paper based on this technique applied to historical documents will hopefully be published this month. I will reference it as soon as it is available. Our findings and the improvements compared to training from scratch are documented there.
Sounds promising :-)
ocrolib/lstm.py
Outdated
@@ -645,6 +663,8 @@ def states(self):
     def weights(self):
         for w,dw,n in self.net.weights():
             yield w,dw,"Reversed/%s"%n
+    def resizeoutput(self, no, deleted_positions):
+        self.net.resizeOutput(no, deleted_positions)
Python is case sensitive. 'resizeOutput' != 'resizeoutput'
Use the standard Python style for naming functions/methods
https://www.python.org/dev/peps/pep-0008/#method-names-and-instance-variables
The function has been renamed. If you prefer resize_output to resizeOutput, let me know.
Are you talking about this paper? I added it to the 'Publications' wiki page. One of the authors' names matches your user name...
No, that is another paper, and it does not use the resizing of the codec. The new paper was submitted to arXiv today, so it should be available on Monday.
I saw your deleted comment that says 'You already found it'...
The corresponding paper is now available at arXiv: https://arxiv.org/abs/1712.05586
Thanks for sharing your research and code. Related, from Training Tesseract 4.00:
The second option is similar to what your patch does.
Update: Fixed in version 2 (v2) of the paper.
The description of this patch is located in the paper under: 3.3 Utilizing Arbitrary Pretrained Models in OCRopus; 3.3.1 Extending the Codec; 3.3.2 Reducing the Codec.
Technically, the second and third options are equivalent. In both cases the final linear projection is sliced off and a new one is trained, although the weights are already somewhat meaningful when only a few rows are deleted. A complete weight reinitialization could be added here, although I'm unsure whether the single LSTM layer learns representations well enough for that to be worth the effort.
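In matrix terms the point is just this (a made-up numpy sketch, not the patch's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
nhidden, noutput = 100, 60
W = rng.standard_normal((noutput, nhidden))  # final projection before the softmax

# "Slice off": drop the rows of deleted characters; the surviving rows
# keep their trained values.
deleted = [3, 17, 42]
W_sliced = np.delete(W, deleted, axis=0)

# "Reinitialize": same resulting shape, but all trained values are discarded.
W_reinit = rng.uniform(-0.1, 0.1, W_sliced.shape)

# Both yield a (noutput - len(deleted)) x nhidden projection; they differ
# only in whether the surviving rows start from their trained values.
```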
@ChWick Thank you, this looks very interesting! I have seen that your paper appeared in the journal 027.7: http://0277.ch/ojs/index.php/cdrs_0277/article/view/169. This will need some time to check in detail and to test thoroughly...
ocrolib/lstm.py
Outdated
@@ -839,7 +862,7 @@ def ctc_align_targets(outputs,targets,threshold=100.0,verbose=0,debug=0,lo=1e-5)
     return aligned

 def normalize_nfkc(s):
-    return unicodedata.normalize('NFKC',s)
+    return unicodedata.normalize('NFC',s)
Why the change? This is confusing, since the method is called normalize_nfkc.
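For reference, NFC only applies canonical composition, while NFKC additionally applies compatibility mappings, e.g. folding ligatures:

```python
import unicodedata

s = "\ufb01"  # 'ﬁ', LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", s))   # 'ﬁ'  (unchanged)
print(unicodedata.normalize("NFKC", s))  # 'fi' (compatibility-decomposed)
```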
LGTM, thank you.
Let's merge this once it's clear whether the NFC/NFKC change was deliberate.
Also, a minimal test for CI would be helpful: train a minimal model, extend and shrink the character set, and make sure nothing breaks. Maybe you have such sample data from developing this?
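Something along these lines might do (a rough, untested sketch; the constructor arguments and the resize method's final name/signature are assumptions based on this diff):

```python
import numpy as np
import ocrolib.lstm as lstm

def test_codec_resize_smoke():
    # minimal model over a tiny charset (index 0 is the CTC blank)
    codec = lstm.Codec().init(["", "a", "b", "c"])
    net = lstm.SeqRecognizer(48, 100, codec=codec)
    xs = np.random.rand(20, 48)   # one fake "text line" (time x features)
    net.trainString(xs, "ab")     # one training step on the old alphabet

    # shrink ('c' deleted) and extend (two fresh outputs) in one call;
    # signature assumed from the diff: resize_output(no, deleted_positions);
    # the codec must be updated to match the resized output layer
    net.resize_output(5, [3])
    net.codec = lstm.Codec().init(["", "a", "b", "d", "e"])
    net.trainString(xs, "ad")     # must still train without breaking
    assert net.predictString(xs) is not None
```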
The NFC/NFKC change was needed for our purposes but apparently not for this pull request. The change is undone and my branch is rebased onto the current master. As a test I propose two single text lines with different alphabets. Use the
Also, this must work (default codec in the initial model)
And the content of
If a pretrained model is used whose codec differs from the target text (e.g. historical documents), the codec has to be adapted to match the desired characters.
This pull request allows the codec to be extended or shrunk automatically, based on the provided ground-truth data, after loading a pretrained model. This is done by changing the dimension of the LSTM's output layer (before the softmax) while keeping the previously trained values. Obviously, the model must be retrained on the new data to learn the new characters.
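In weight-matrix terms, the mechanism looks roughly like this (a minimal numpy sketch with illustrative shapes and initialization, not the actual patch code):

```python
import numpy as np

def resize_projection(W, deleted_positions, new_size, rng=None):
    """Resize the output projection W (noutput x nhidden) feeding the softmax.

    Rows of deleted characters are removed, surviving rows keep their
    trained values, and rows for new characters are freshly initialized.
    """
    rng = rng or np.random.default_rng()
    W = np.delete(W, deleted_positions, axis=0)  # shrink: drop deleted chars
    n_new = new_size - W.shape[0]
    if n_new > 0:                                # extend: append new chars
        scale = 1.0 / np.sqrt(W.shape[1])
        W = np.vstack([W, rng.uniform(-scale, scale, (n_new, W.shape[1]))])
    return W

W = np.random.randn(4, 100)        # old codec: blank + 'a', 'b', 'c'
W = resize_projection(W, [3], 5)   # drop 'c', add two new characters
assert W.shape == (5, 100)
```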