Couple of errors with Tesseract v3.02.02 #3
Comments
Hi, One other thing: TesseractTrainer was initially written for v3.0.1. I do Cheers
|
Thanks for your reply! |
Tesseract 3.02 has introduced a new clustering command: It seems to be important, as the following message appears in your traceback:
I'll add an automatic version check, and if tesseract >= 3.02, then the |
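The version check mentioned above could look roughly like the sketch below. It is not TesseractTrainer's actual code: the function names are made up for illustration, and it assumes that `tesseract -v` prints a first line of the form `tesseract 3.02.02` (captured from either stdout or stderr).

```python
import re
import subprocess


def parse_version(output):
    """Extract the version from `tesseract -v` output as a tuple of ints."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError("no tesseract version found in: %r" % output)
    # "3.02.02" -> (3, 2, 2), so versions compare numerically, not as strings
    return tuple(int(part) for part in match.group(1).split("."))


def needs_extra_clustering_step(output):
    """True when the reported version is 3.02 or later."""
    return parse_version(output) >= (3, 2)


def installed_version_output():
    """Run `tesseract -v` (hypothetical helper; merges stderr into stdout)."""
    raw = subprocess.check_output(["tesseract", "-v"], stderr=subprocess.STDOUT)
    return raw.decode("utf-8", "replace")
```

Comparing integer tuples avoids the pitfalls of comparing version strings directly; for instance, `parse_version("tesseract 3.01")` gives `(3, 1)`, which sorts below `(3, 2)`.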
It seems that we'll have to wait a little more for 3.02 support. All I get is a super-long error log looking like this
I've found reports of people experiencing the same behaviour with 3.02 and tried to contribute. See here. As this bug is a pure tesseract one, I hope you understand I cannot guarantee when I'll be able to support tesseract 3.02. As an alternative solution, I suggest you fall back on tesseract 3.01, which seems to work fairly well with TesseractTrainer. |
Hi! |
Well, I believe that the aforementioned errors (couldn't find matching blob...etc) originate from using training text with very, very long words (words that can't be wrapped in one line) and have nothing to do with the Tesseract version. |
I don't believe so... My text is 26 letters, double spaced... And the author himself suggests "Couldn't find a matching blob" errors are pure "tesseract ones"... |
@marcolino: you are welcome to test your belief with evidence ;-): https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16 |
Hi, About v3.02: As you've both read https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16, you've seen that it was suggested to increase the resolution (>72 DPI) or increase the inter-character spacing. @marcolino your example tif suggests that increasing the inter-character spacing does not have any effect either. The only solution I could offer now is trying to compile tesseract 3.01. I recall that you needed leptonica-lib to compile it. Maybe they are not shipped anymore (wild undocumented guess here)? Thanks for your reports. B |
@marcolino: I wrote about evidence that "Couldn't find a matching blob" errors are pure "tesseract ones"... This is not true, at least for Latin-script inputs (the situation for hieroglyphic, Arabic, and Asian scripts is different IMO). When I tested reported issues, it always turned out that the problem is in: If you post your files somewhere (image -> try to use a 2-color png ;-) & box file), I can analyse them and hopefully offer you some suggestions. The 3.02 version is not (so much ;-) ) sensitive to spacing. BTW: you are aware that 26 single letters do not meet the requirements, right? @BaltoRouberol: The root of the problem in 698#c16 is not in the DPI, but in the boxes. DPI is just a minor issue IMO. You have to be aware that tesseract will convert (binarize) images to 2 colours, and then will run training. Maybe if you visualize "your" and tesseract's box files, you can see what makes the difference. |
@BaltoRouberol: no problem about your slow response, of course... @zdenop: thanks for your support... I'm not an OCR expert... My goal is to digitize as well as I can a bunch of old books (really old and precious Italian books :-)... So I'm trying to automate the training process with TesseractTrainer... I had hoped not to have to "dirty my hands" with box files and input images, but just to:
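One way to see how a generated box file differs from the one tesseract produces is to compare them glyph by glyph. The sketch below is only an illustration: the function names are hypothetical, and it assumes the usual box file layout of one `char left bottom right top page` entry per line, with both files listing the glyphs in the same order.

```python
def parse_box(text):
    """Parse box file text into a list of (char, left, bottom, right, top)."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(value) for value in parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes


def compare_boxes(mine, reference):
    """Return (index, my_box, reference_box) for every entry that differs.

    Entries are paired by position, so extra entries at the end of the
    longer list are ignored.
    """
    return [
        (index, a, b)
        for index, (a, b) in enumerate(zip(mine, reference))
        if a != b
    ]
```

A box that is consistently taller in one file than in the other would show up here as a difference in the `bottom` or `top` field for the same character.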
I try to post here all the data I use:
I hope it's enough... :-) Thanks again for your interest, everybody! |
@marcolino: the problem is (in) the box file. I posted a correct one on pastebin (it will expire in one month). I suggest you compare it with your version, e.g. in kdiff3. I created it with tesseract and just needed to correct one "1" to "l" and an m-dash to a minus. Are you sure you need to run training if there is such a result? The experience of Tesseract users is that users are not able to create language data as good as what Google did for the supported languages (e.g. training is reasonable only for uncommon fonts like fraktur). Instead of training, it makes sense to focus on input image quality and image preprocessing. @BaltoRouberol: The problem with TesseractTrainer is that PIL.ImageFont always returns the same height for different chars ('T', 'g', '.', 'x'). This is not correct. Tesseract 3.02 requires that each box be the rectangle of the char only, without empty space. I think you are not able to create such a box file with PIL. |
Thanks, zdenop. You are right, I'm not sure I need training... What I lack is an understanding of the kind of work I have to do to perform OCR on many books with different fonts: should I build a box file for each font I have? |
@marcolino: this is off-topic for this issue. I suggest you post an example image and ask on the tesseract user forum for suggestions. In my opinion scantailor is the most complex (with a simple user interface) among free software. You should not expect a 100% result (even commercial OCR will not provide it). |
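To illustrate what "rectangle of char only without empty space" means, here is a minimal sketch that computes a tight box from the glyph's ink pixels instead of from font metrics. It is not how TesseractTrainer works: the helper names are invented, the glyph is assumed to be available as a binarized bitmap (rows listed top to bottom, truthy values meaning ink), and the output assumes the box file convention of `char left bottom right top page` with the origin at the bottom-left corner of the image.

```python
def tight_bbox(bitmap):
    """Return (left, top, right, bottom) around the ink pixels, or None.

    Coordinates use a top-left origin; right and bottom are exclusive.
    """
    if not bitmap or not bitmap[0]:
        return None
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    cols = [x for x in range(len(bitmap[0])) if any(row[x] for row in bitmap)]
    if not rows:
        return None  # blank image: no ink, so no box
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def box_line(char, bbox, image_height, page=0):
    """Format a box file entry, flipping y to a bottom-left origin."""
    left, top, right, bottom = bbox
    return "%s %d %d %d %d %d" % (
        char, left, image_height - bottom, right, image_height - top, page)
```

With this approach a '.' gets a much shorter box than a 'T', because the box follows the pixels that were actually drawn rather than the line height reported for the font.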
@zdenop That's very interesting, thanks for your input. I guess that would mean that the whole tif+boxfile generation would have to be re-written using another Image Processing tool (eg: ImageMagick). See http://www.imagemagick.org/Usage/text/#font_info At this point, I would be happy to assist anyone willing to fork TesseractTrainer and fix this issue, but I feel I currently do not have the time to fix this (and I'm really sorry about that). Thanks again! B |
FYI: For anyone looking for further information into this (one interested in forking the project, perhaps?), another post was made in the Tesseract bug listing related to this particular issue. https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c17 |
Thanks for this handy tool, it's really helpful except that I couldn't get it to work :).
I'm trying to train Tesseract with a new English font called KidKosmic with the following command
And here is the output
Any clues?
FYI, I'm running the script on Mac OS X 10.8 with the dependencies installed.