Multi-page support (TIFF) #43

Open
darklajid opened this issue May 8, 2022 · 4 comments

@darklajid

Hey.

Most OCR work I've seen so far uses multi-page documents (b/w, CCITT-compressed). I'd like to make these work with leptess, but unless I'm missing something, there is only support for Pix (not PixA), and no mapping for direct TIFF I/O (say, pixaReadMultipageTiff from Leptonica). The high-level wrapper (leptess::LepTess) also doesn't expose a method to set_image a Pix directly, but that would be the most trivial thing to change.

In other words: I was hoping for a Rust (leptess) workflow that allows

  • reading a multi-page TIFF as PixA
  • iterating over each page -> Pix and collecting the recognition results

Is that something you'd be willing to support? Am I missing a way to do this today already? I could offer to look into this, but I admit that I'm a Rust beginner at this point.

@ccouzens
Collaborator

ccouzens commented May 8, 2022

Hey, I might be able to look at this, but it wouldn't be until next weekend.

I think this might be possible today using set_image_from_mem and the image crate, but I haven't tried it.

Some notes for myself:
https://tpgit.github.io/Leptonica/pix_8h_source.html#l00363
https://github.com/DanBloomberg/leptonica/blob/5aaf1c187deeef7f47288c6b0833a07021940da7/src/tiffiostub.c#L99-L103

@darklajid
Author

darklajid commented May 9, 2022

Thanks a ton for the reply. Looking at the linked image crate and into_bytes: for this to be a decent workaround, it should probably NOT copy? Otherwise, my naive understanding is that the image would be read once, then copied for each page (and then re-read by Leptonica anyway).

Leptonica does provide the required functionality already, right? PixA is a collection of Pix (an "A"rray of Pix) that allows access to the individual entries, which could be passed to tess_api.set_image directly if that were exposed in the high-level LepTess. This is already what's happening in set_image_from_mem anyway: reading a buffer into a Pix, then handing that to Tesseract.

My armchair idea - and I would be willing to help where I can - is therefore that

  • LepTess gets an overload for set_image_* that accepts a Pix
  • the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)
  • for this particular use case (which I'd argue is common for OCR), being able to read a multi-page TIFF directly into a PixA, either from a file or from memory/a buffer, would be cool, like the existing pix_read and pix_read_mem

In this case there would be no need for another crate and it would probably avoid re-reading (and potentially copying) the image(s) around?
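
The "count plus entries, maybe as an iterator" idea from the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Page`, `PageArray`, `Pages`, `MockTiff`, `recognize` and `recognize_all` are hypothetical names, not the leptess or leptonica-plumbing API, and a mock stands in for Leptonica and Tesseract so the sketch runs on its own.

```rust
/// Hypothetical stand-in for a Leptonica `Pix`; here just a page label.
#[derive(Debug, Clone, PartialEq)]
struct Page(String);

/// Hypothetical read-only view of a `PixA`: a count plus indexed access.
trait PageArray {
    fn count(&self) -> usize;
    /// Returns `None` when `index >= count()`.
    fn get(&self, index: usize) -> Option<Page>;
}

/// Iterator adapter built only from `count` and `get`.
struct Pages<'a, A: PageArray> {
    array: &'a A,
    next: usize,
}

impl<'a, A: PageArray> Iterator for Pages<'a, A> {
    type Item = Page;

    fn next(&mut self) -> Option<Page> {
        if self.next >= self.array.count() {
            return None;
        }
        let page = self.array.get(self.next);
        self.next += 1;
        page
    }
}

fn pages<A: PageArray>(array: &A) -> Pages<'_, A> {
    Pages { array, next: 0 }
}

/// Mock backend (assumption) so the sketch needs no native libraries.
struct MockTiff(Vec<Page>);

impl PageArray for MockTiff {
    fn count(&self) -> usize {
        self.0.len()
    }
    fn get(&self, index: usize) -> Option<Page> {
        self.0.get(index).cloned()
    }
}

/// Stand-in for "set the image, then fetch the text" on an OCR engine.
fn recognize(page: &Page) -> String {
    format!("text of {}", page.0)
}

/// Collect one recognition result per page.
fn recognize_all<A: PageArray>(array: &A) -> Vec<String> {
    pages(array).map(|p| recognize(&p)).collect()
}

fn main() {
    let tiff = MockTiff(vec![Page("page 1".into()), Page("page 2".into())]);
    for text in recognize_all(&tiff) {
        println!("{text}");
    }
}
```

In a real binding, `get` would presumably hand out a reference-counted clone of the underlying Pix rather than a deep copy, which is what would avoid the re-reading and copying concern.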

ccouzens added a commit that referenced this issue May 15, 2022
My comment about tiff and windows is because of this documentation
https://tpgit.github.io/Leptonica/leptprotos_8h.html#a027a927dc3438192e3bdae8c219d7f6a

> On windows, this will only read tiff formatted files from memory. For other formats, it requires fmemopen(3). Attempts to read those formats will fail at runtime. (3)

Whilst it won't resolve the issue, this is my first step at tackling #43.
The next step will be to add and use the leptonica methods that support
tiff from disk and `PixA`.
@ccouzens
Collaborator

Hi,

I haven't forgotten about this.

I'm going to try to get to this step tonight:

> the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)

ccouzens added a commit to ccouzens/leptonica-plumbing that referenced this issue May 16, 2022
I'm somewhat suspicious that calling `pixaDestroy` afterwards doesn't change a
lot according to valgrind:

```
valgrind --leak-check=yes --error-exitcode=1 --trace-children=yes cargo test read_multipage_tiff_test 2>&1
```

I believe I'm doing the right thing even if the tooling doesn't confirm
it.

houqp/leptess#43

Evince (Gnome PDF viewer) says the tiff file has 3 pages.
GIMP (and leptonica) say that the tiff file has 2 pages.
I created it with 2, so I believe this is correct.
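
One way to cross-check a page count independently of any viewer is to walk the chain of top-level image file directories (IFDs) in the file. The sketch below is an assumption-laden illustration, not what Leptonica, GIMP or Evince actually do: it handles classic TIFF only (magic number 42, not BigTIFF), ignores sub-IFDs, and `count_tiff_ifds` is a hypothetical helper name.

```rust
/// Count top-level IFDs ("pages") in a classic TIFF byte buffer.
/// Returns `None` if the header is malformed or an offset is out of bounds.
fn count_tiff_ifds(data: &[u8]) -> Option<usize> {
    // Byte-order mark: "II" = little-endian, "MM" = big-endian.
    let little = match data.get(0..2)? {
        b"II" => true,
        b"MM" => false,
        _ => return None,
    };
    let u16_at = |off: usize| -> Option<u16> {
        let b = data.get(off..off.checked_add(2)?)?;
        let raw = [b[0], b[1]];
        Some(if little { u16::from_le_bytes(raw) } else { u16::from_be_bytes(raw) })
    };
    let u32_at = |off: usize| -> Option<u32> {
        let b = data.get(off..off.checked_add(4)?)?;
        let raw = [b[0], b[1], b[2], b[3]];
        Some(if little { u32::from_le_bytes(raw) } else { u32::from_be_bytes(raw) })
    };
    // Magic number 42 identifies classic TIFF.
    if u16_at(2)? != 42 {
        return None;
    }
    let mut offset = u32_at(4)? as usize;
    let mut count = 0;
    while offset != 0 {
        // Each IFD: 2-byte entry count, 12 bytes per entry, 4-byte next offset.
        let entries = u16_at(offset)? as usize;
        offset = u32_at(offset.checked_add(2 + entries * 12)?)? as usize;
        count += 1;
        // Guard against cyclic offset chains in corrupt files.
        if count > data.len() {
            return None;
        }
    }
    Some(count)
}

fn main() {
    // Minimal in-memory little-endian TIFF skeleton with two one-entry IFDs.
    let mut tiff: Vec<u8> = vec![b'I', b'I', 42, 0, 8, 0, 0, 0];
    tiff.extend_from_slice(&[1, 0]); // IFD 1: one entry
    tiff.extend_from_slice(&[0u8; 12]); // dummy entry
    tiff.extend_from_slice(&[26, 0, 0, 0]); // next IFD at offset 26
    tiff.extend_from_slice(&[1, 0]); // IFD 2: one entry
    tiff.extend_from_slice(&[0u8; 12]); // dummy entry
    tiff.extend_from_slice(&[0, 0, 0, 0]); // end of chain
    println!("{:?}", count_tiff_ifds(&tiff));
}
```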

I noticed an off-by-one error in Pixa::get_pix, which I also corrected in
Boxa::get. If an array has n elements, we can't access the nth element
(valid indices are 0 to n-1). It looks like Leptonica follows the C string
convention of adding a null element at the end, so the array has the space,
but we can't dereference it.
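
A minimal model of that off-by-one, assuming the "n entries plus one trailing null slot" layout described above. `PtrArray`, `from_items`, `get_off_by_one` and `get` are hypothetical names for illustration; this is not the leptonica-plumbing code.

```rust
/// Hypothetical model of a pointer array: `n` live entries followed by a
/// spare slot that stays null (modelled here as `None`).
struct PtrArray<T> {
    slots: Vec<Option<T>>, // allocated storage, one longer than `n`
    n: usize,              // number of valid entries
}

impl<T> PtrArray<T> {
    fn from_items(items: Vec<T>) -> Self {
        let n = items.len();
        let mut slots: Vec<Option<T>> = items.into_iter().map(Some).collect();
        slots.push(None); // trailing null slot
        PtrArray { slots, n }
    }

    /// Buggy bounds check: `index == n` passes and yields the null slot.
    fn get_off_by_one(&self, index: usize) -> Option<&Option<T>> {
        if index > self.n {
            None
        } else {
            self.slots.get(index)
        }
    }

    /// Corrected bounds check: only indices `0..n` are valid.
    fn get(&self, index: usize) -> Option<&T> {
        if index >= self.n {
            return None;
        }
        self.slots[index].as_ref()
    }
}

fn main() {
    let arr = PtrArray::from_items(vec!["page 1", "page 2"]);
    // The buggy accessor hands back the null slot at index n.
    println!("buggy get(2): {:?}", arr.get_off_by_one(2));
    // The corrected accessor rejects index n.
    println!("fixed get(2): {:?}", arr.get(2));
}
```

In the safe model the stray slot is just a `None`; with raw pointers the same index would produce a null pointer that must not be dereferenced.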
@ccouzens
Collaborator

You may be interested in this PR. GitHub won't let me assign you as a reviewer.

ccouzens/leptonica-plumbing#2
