Add PdfParagraph to allow for more natural processing of multi-line text. #29

Open
ajrcarey opened this issue Jun 28, 2022 · 14 comments
@ajrcarey
Owner

Follow-on from #17, #22, #25. Add a PdfParagraph object that allows for easier handling of multi-line text with embedded character formatting changes.

Ideally, it would be possible to generate a PdfParagraph from an existing set of PdfPageTextObject objects, each one containing a formatted fragment of a paragraph.

@ajrcarey ajrcarey self-assigned this Jun 28, 2022
@ajrcarey
Owner Author

PdfParagraph object construction under way. Hidden from crate prelude for now.

ajrcarey pushed a commit that referenced this issue Jun 28, 2022
@russellwmy

I have a PDF file that renders each character with separate Td and Tj operations. On some pages, extracting text takes more than 10 seconds.
When I tested page-level extraction, it took under a second.
Do you think PdfParagraph would solve this problem?

@ajrcarey
Owner Author

ajrcarey commented Apr 18, 2023

Hi @russellwmy, if you are just looking to extract text, then no, PdfParagraph will not be useful to you. The goal of PdfParagraph is to make it easier to work with the formatting and justification of multiple text objects. If you just want to extract the raw text, then page-level extraction via PdfPage::text()?.all() is the fastest way. See https://github.com/ajrcarey/pdfium-render/blob/master/examples/text_extract.rs for an example.

@russellwmy

@ajrcarey Good to know. I've found a way to speed it up:

  • first, call PdfPage::text()?.all()
  • then iterate over the text objects, map the font, location, etc., and use PdfPageText.for_object(text_object)
    This way, the page doesn't need to be loaded again and again for each object.
    It is 100x faster. :)

Just an idea: do you think this could be implemented internally?

@ajrcarey
Owner Author

ajrcarey commented Apr 18, 2023

Every time you create PdfPageText, Pdfium analyses all the text on the page. So you're right, the most efficient way is to create PdfPageText once, then reuse it:

let page_text = page.text()?; // this creates PdfPageText once

// now can use page_text.for_object(...) in an iterator, etc.

However, I would expect this to still be slower than page_text.all(). Calling all() avoids the need to iterate.
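
The cost difference between the two patterns can be sketched without Pdfium at all. The types below (Page, AnalysedText, and the analyses counter) are hypothetical stand-ins invented for illustration, not the pdfium-render API; the sketch only assumes, as described above, that each call to text() repeats the whole-page analysis.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the analysed text of a page (not the real PdfPageText).
struct AnalysedText {
    fragments: Vec<String>,
}

/// Hypothetical stand-in for a page. It counts how often the analysis runs.
struct Page {
    fragments: Vec<String>,
    analyses: Cell<usize>,
}

impl Page {
    fn new(fragments: &[&str]) -> Self {
        Page {
            fragments: fragments.iter().map(|f| f.to_string()).collect(),
            analyses: Cell::new(0),
        }
    }

    /// Models the expensive step: every call analyses the whole page again.
    fn text(&self) -> AnalysedText {
        self.analyses.set(self.analyses.get() + 1);
        AnalysedText { fragments: self.fragments.clone() }
    }
}

impl AnalysedText {
    /// Text of a single object, identified here by its index.
    fn for_object(&self, index: usize) -> &str {
        &self.fragments[index]
    }

    /// All text on the page in one call, with no per-object iteration.
    fn all(&self) -> String {
        self.fragments.concat()
    }
}

/// Slow pattern: a fresh analysis for every object on the page.
fn extract_per_object(page: &Page) -> Vec<String> {
    (0..page.fragments.len())
        .map(|i| page.text().for_object(i).to_string())
        .collect()
}

/// Fast pattern: analyse once, then reuse the result for every object.
fn extract_reusing(page: &Page) -> Vec<String> {
    let text = page.text();
    (0..page.fragments.len())
        .map(|i| text.for_object(i).to_string())
        .collect()
}

fn main() {
    let page = Page::new(&["H", "e", "l", "l", "o"]);

    let slow = extract_per_object(&page);
    println!("per-object analyses: {}", page.analyses.get());

    page.analyses.set(0);
    let fast = extract_reusing(&page);
    println!("reused analyses: {}", page.analyses.get());

    // Both patterns return the same fragments; only the analysis count differs.
    assert_eq!(slow, fast);
    println!("all text: {}", page.text().all());
}
```

In this model the per-object pattern performs one analysis per text object, while the reuse pattern performs exactly one, which is consistent with the large speed-up reported for a page that draws every character as its own object.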

ajrcarey pushed a commit that referenced this issue Nov 10, 2023
ajrcarey pushed a commit that referenced this issue Nov 10, 2023
@ajrcarey
Owner Author

Made improvements to segment detection. Implemented a prototype lines-to-paragraphs accumulator. Made PdfParagraph public in response to #121, although it isn't part of the crate prelude.

@ajrcarey
Owner Author

Consider also adding handling of tables as suggested in #149.

@ajrcarey
Owner Author

ajrcarey commented Aug 4, 2024

Moved PdfParagraph behind new feature flag paragraph. The change will take effect in release 0.8.23.

@ziimakc

ziimakc commented Nov 11, 2024

@ajrcarey Thank you for this library and this super useful feature. Currently, something seems to be broken in the imports:

use crate::page::PdfPage;
   |            ^^^^
   |            |
   |            unresolved import
   |            help: a similar path exists: `pdf::document::page`

.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdfium-render-0.8.26/src/pdf/document/page/paragraph.rs:635:20
    |
242 | pub struct PdfParagraph<'a> {
    | --------------------------- doesn't satisfy `PdfParagraph<'_>: Sized`
...
635 |         paragraphs.push(Self::paragraph_from_lines(
    |         -----------^^^^ method cannot be called on `Vec<PdfParagraph<'_>>` due to unsatisfied trait bounds

@ajrcarey
Owner Author

ajrcarey commented Nov 11, 2024

Hi @ziimakc, thank you for reporting the issue. Because PdfParagraph is hidden behind a feature flag, it wasn't being included in the normal test runs, and so it was easy for it to fall out of sync with the code reorganisation that took place as part of #153.

I've pushed a commit that updates the import paths and adjusts the GitHub workflow to include the paragraph feature when running tests, to prevent this from happening again.

ajrcarey pushed a commit that referenced this issue Nov 11, 2024
@ziimakc

ziimakc commented Nov 12, 2024

@ajrcarey Thanks, but how do I import it? use pdfium_render::paragraph::PdfParagraph; doesn't seem to work.

@ajrcarey
Owner Author

ajrcarey commented Nov 12, 2024

Because PdfParagraph is an in-development, pre-release feature, it is not included in the pdfium-render prelude. You must import it directly from its crate path location, which is currently:

use crate::pdf::document::page::paragraph::PdfParagraph;

@ziimakc

ziimakc commented Nov 12, 2024

Would it be possible to include image objects, so you could extract something like this:

text_paragraph
text_paragraph
image_placeholder_filename
text_paragraph
image_placeholder_filename
image_placeholder_filename

This would be helpful when generating JSON structures from the text with the help of AI, then replacing image_placeholder_filename with the actual images later on.
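
As a rough illustration of the requested output, here is a minimal sketch of how paragraphs and image placeholders might be interleaved in reading order. Everything in it is hypothetical (PageItem, Positioned, reading_order, render, and the file names); none of it exists in pdfium-render. It assumes each item carries the vertical position of its top edge in default PDF coordinates, where larger y values are higher on the page.

```rust
/// Hypothetical page item: a text paragraph or a placeholder naming an image file.
#[derive(Debug, Clone, PartialEq)]
enum PageItem {
    Paragraph(String),
    ImagePlaceholder(String),
}

/// Hypothetical pairing of an item with the y coordinate of its top edge.
struct Positioned {
    top: f32,
    item: PageItem,
}

/// Orders items top to bottom. Assumes larger y means higher on the page,
/// so the sort is descending by `top`.
fn reading_order(mut items: Vec<Positioned>) -> Vec<PageItem> {
    items.sort_by(|a, b| b.top.total_cmp(&a.top));
    items.into_iter().map(|positioned| positioned.item).collect()
}

/// Renders one line per item, using the placeholder file name for images.
fn render(items: &[PageItem]) -> String {
    items
        .iter()
        .map(|item| match item {
            PageItem::Paragraph(text) => text.clone(),
            PageItem::ImagePlaceholder(name) => name.clone(),
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let items = vec![
        Positioned { top: 300.0, item: PageItem::ImagePlaceholder("image_0.png".into()) },
        Positioned { top: 700.0, item: PageItem::Paragraph("First paragraph".into()) },
        Positioned { top: 500.0, item: PageItem::Paragraph("Second paragraph".into()) },
    ];

    // Items are emitted from the top of the page to the bottom.
    println!("{}", render(&reading_order(items)));
}
```

A single vertical sort only suits single-column layouts; multi-column pages would need a more careful ordering.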

@ajrcarey
Owner Author

There is no intention to develop this feature further before crate release 1.0.

Of course, nothing is stopping you from doing so yourself, and making a PR.
