Multi-page support (TIFF) #43

darklajid · 2022-05-08T08:01:53Z

Hey.

Most OCR work I've seen so far uses (b/w, CCITT compressed) multi-page documents. I'd like to make these work with leptess, but it seems (unless I'm missing something?) that there's only support for Pix (not PixA), and no mapping for direct TIFF I/O (say pixaReadMultipageTiff from Leptonica). The high-level wrapper (leptess::LepTess) also doesn't expose a method to set_image a Pix directly, but that would be the most trivial thing to change.

In other words: I was hoping for a Rust (leptess) workflow that allows

- reading a multi-page TIFF as PixA
- iterating over each page -> Pix and collecting the recognition results
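
The two steps above can be sketched in plain Rust. This is only a model of the control flow, not leptess code: `Page`, `Recognizer`, `recognize_all` and `FakeEngine` are hypothetical names invented for illustration, and `Page` merely stands in for a Leptonica `Pix`.

```rust
/// Stand-in for one decoded page (the role a Leptonica `Pix` would play).
#[derive(Debug, Clone, PartialEq)]
pub struct Page {
    pub index: usize,
    pub pixels: Vec<u8>,
}

/// Hypothetical abstraction over an OCR engine that takes one page at a time.
pub trait Recognizer {
    type Error;
    fn set_page(&mut self, page: &Page) -> Result<(), Self::Error>;
    fn text(&mut self) -> Result<String, Self::Error>;
}

/// Runs the engine over every page and collects the text in page order.
/// Stops at the first failing page and returns its error.
pub fn recognize_all<R: Recognizer>(
    engine: &mut R,
    pages: &[Page],
) -> Result<Vec<String>, R::Error> {
    let mut results = Vec::with_capacity(pages.len());
    for page in pages {
        engine.set_page(page)?;
        results.push(engine.text()?);
    }
    Ok(results)
}

/// Toy engine used only to exercise the loop; it performs no real OCR.
pub struct FakeEngine {
    current: Option<usize>,
}

impl FakeEngine {
    pub fn new() -> Self {
        Self { current: None }
    }
}

impl Recognizer for FakeEngine {
    type Error = String;

    fn set_page(&mut self, page: &Page) -> Result<(), String> {
        if page.pixels.is_empty() {
            return Err(format!("page {} is empty", page.index));
        }
        self.current = Some(page.index);
        Ok(())
    }

    fn text(&mut self) -> Result<String, String> {
        match self.current {
            Some(i) => Ok(format!("text of page {}", i)),
            None => Err("no page set".to_string()),
        }
    }
}
```

In a real integration the `Recognizer` role would be filled by whatever Pix-accepting setter the wrapper ends up exposing; the loop itself would stay the same.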

Is that something you'd be willing to support? Am I missing a way this would already work today? I could offer to look into this, but I admit that I'm a Rust beginner at this point in time.


ccouzens · 2022-05-08T11:04:54Z

Hey, I might be able to look at this, but it wouldn't be until next weekend.

I think this might be possible today using set_image_from_mem and the image crate, but I haven't tried it.

Some notes for myself:
https://tpgit.github.io/Leptonica/pix_8h_source.html#l00363
https://github.com/DanBloomberg/leptonica/blob/5aaf1c187deeef7f47288c6b0833a07021940da7/src/tiffiostub.c#L99-L103

darklajid · 2022-05-09T05:35:21Z

Thanks a ton for the reply. Looking at the linked image crate and its into_bytes, it probably should NOT copy for this to be a decent workaround? Otherwise my naive understanding is that the image would be read once, then copied for each page (and re-read by Leptonica anyway).

Leptonica does provide the required functionality already, right? PixA is a collection of Pix (an "A"rray of Pix) that allows access to the individual entries, which could be passed to tess_api.set_image directly if that were exposed in the high-level LepTess. This is already what's happening in set_image_from_mem anyway: reading a buffer into a Pix, then handing that to Tesseract.

My armchair idea - and I would be willing to help where I can - is therefore that

- LepTess gets an overload for set_image_* that accepts a Pix
- the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)
- for this particular use case (which I argue is common for OCR though?), being able to directly read a multi-page TIFF into a PixA (either from file or from memory/a buffer) would be cool, like the existing pix_read and pix_read_mem

In this case there would be no need for another crate and it would probably avoid re-reading (and potentially copying) the image(s) around?
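
To make the "count plus entries, maybe as an iterator" idea concrete, here is a minimal read-only sketch. It does not touch Leptonica at all: `Pages`, `PagesIter` and `Page` are hypothetical names, and a `Vec` stands in for the underlying `PixA` storage.

```rust
/// Stand-in for a Leptonica `Pix`; only the dimensions are modelled.
#[derive(Debug, Clone, PartialEq)]
pub struct Page {
    pub width: u32,
    pub height: u32,
}

/// Hypothetical read-only view of a page collection (the role `PixA` plays).
pub struct Pages {
    items: Vec<Page>,
}

impl Pages {
    pub fn new(items: Vec<Page>) -> Self {
        Self { items }
    }

    /// Number of pages in the collection.
    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    /// Borrow one page, or `None` if `index` is out of range.
    pub fn get(&self, index: usize) -> Option<&Page> {
        self.items.get(index)
    }

    /// Iterate over the pages in order, built only on `len` and `get`.
    pub fn iter(&self) -> PagesIter<'_> {
        PagesIter { pages: self, next: 0 }
    }
}

/// Iterator that walks the collection by index.
pub struct PagesIter<'a> {
    pages: &'a Pages,
    next: usize,
}

impl<'a> Iterator for PagesIter<'a> {
    type Item = &'a Page;

    fn next(&mut self) -> Option<&'a Page> {
        let page = self.pages.get(self.next)?;
        self.next += 1;
        Some(page)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // `next` only advances after a successful `get`, so it never exceeds `len`.
        let remaining = self.pages.len() - self.next;
        (remaining, Some(remaining))
    }
}
```

Because the iterator only needs `len` and `get`, a wrapper could ship those two accessors first and add iteration later without changing callers.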

My comment about TIFF and Windows is because of this documentation: https://tpgit.github.io/Leptonica/leptprotos_8h.html#a027a927dc3438192e3bdae8c219d7f6a

> On Windows, this will only read TIFF formatted files from memory. For other formats, it requires fmemopen(3). Attempts to read those formats will fail at runtime.

Whilst it won't resolve the issue, this is my first step at tackling #43. The next step will be to add and use the Leptonica methods that support TIFF from disk and `PixA`.

ccouzens · 2022-05-16T07:54:41Z

Hi,

I haven't forgotten about this.

I'm going to try and get to this step tonight:

> the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)


I'm somewhat suspicious that calling `pixaDestroy` afterwards doesn't change a lot according to valgrind:

```
valgrind --leak-check=yes --error-exitcode=1 --trace-children=yes cargo test read_multipage_tiff_test 2>&1
```

I believe I'm doing the right thing even if the tooling doesn't confirm it. houqp/leptess#43

Evince (GNOME PDF viewer) says the TIFF file has 3 pages. GIMP (and Leptonica) say that the TIFF file has 2 pages. I created it with 2, so I believe this is correct.

I noticed an off-by-one error in Pixa::get_pix, which I also corrected in Boxa::get. If an array has n elements, we can't access the nth element (valid indices run from 0 to n-1). It looks like Leptonica follows the C string convention of adding a null element at the end, so the array has the space, but we can't dereference it.
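
The off-by-one can be illustrated with a safe toy model. This is not the leptonica-plumbing code: `PtrArray` and `Access` are hypothetical, and the trailing `None` slot encodes the assumption stated above that the array has one spare null entry after its n elements.

```rust
/// Outcome of an indexed lookup in the toy array.
#[derive(Debug, PartialEq)]
pub enum Access<'a> {
    /// A live element was found.
    Element(&'a str),
    /// The index was rejected by the bounds check.
    OutOfRange,
    /// The check passed but the slot holds the trailing null.
    /// In C this is where a null pointer would be dereferenced.
    NullSlot,
}

/// Toy model of a C-style pointer array: `n` live entries followed by one
/// allocated but empty slot (modelled as `None`).
pub struct PtrArray {
    slots: Vec<Option<String>>,
    n: usize,
}

impl PtrArray {
    pub fn from_items(items: &[&str]) -> Self {
        let mut slots: Vec<Option<String>> =
            items.iter().map(|s| Some(s.to_string())).collect();
        slots.push(None); // trailing null slot: storage exists, element does not
        Self { slots, n: items.len() }
    }

    pub fn len(&self) -> usize {
        self.n
    }

    fn slot(&self, index: usize) -> Access<'_> {
        match &self.slots[index] {
            Some(s) => Access::Element(s.as_str()),
            None => Access::NullSlot,
        }
    }

    /// Correct check: valid indices are 0..n, so `index == n` is rejected.
    pub fn get(&self, index: usize) -> Access<'_> {
        if index >= self.n {
            return Access::OutOfRange;
        }
        self.slot(index)
    }

    /// Off-by-one variant: `index == n` slips through and hits the null slot.
    pub fn get_off_by_one(&self, index: usize) -> Access<'_> {
        if index > self.n {
            return Access::OutOfRange;
        }
        self.slot(index)
    }
}
```

With two elements, `get(2)` is rejected up front, while `get_off_by_one(2)` passes the check and reaches the null slot, which is the case the corrected bounds check rules out.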

ccouzens · 2022-05-16T22:25:37Z

You may be interested in this PR. Github won't let me assign you as a reviewer.

ccouzens/leptonica-plumbing#2

ccouzens mentioned this issue May 15, 2022

Document using the image crate to load images #44

Merged

ccouzens mentioned this issue May 16, 2022

Add Pixa::read_multipage_tiff support ccouzens/leptonica-plumbing#2

Merged
