Add PdfParagraph to allow for more natural processing of multi-line text. #29

ajrcarey · 2022-06-28T17:54:38Z

Follow-on from #17, #22, #25. Add a PdfParagraph object that allows for easier handling of multi-line text with embedded character formatting changes.

Ideally, it would be possible to generate a PdfParagraph from an existing set of PdfPageTextObject objects, each one containing a formatted fragment of a paragraph.

ajrcarey · 2022-06-28T17:55:00Z

PdfParagraph object construction under way. Hidden from crate prelude for now.

russellwmy · 2023-04-18T12:17:35Z

I have a PDF file that renders each character with its own Td and Tj operations. On some pages, extracting text takes more than 10 seconds.
I tested page-level extraction, and it takes under a second.
Do you think PdfParagraph would solve this problem?

ajrcarey · 2023-04-18T12:22:58Z

Hi @russellwmy, if you are just looking to extract text, then no, PdfParagraph will not be useful to you. The goal of PdfParagraph is to make it easier to work with the formatting and justification of multiple text objects. If you just want to extract the raw text, then page-level extraction via PdfPage::text()?.all() is the fastest way. See https://github.com/ajrcarey/pdfium-render/blob/master/examples/text_extract.rs for an example.

russellwmy · 2023-04-18T12:31:54Z

@ajrcarey Good to know. I have found a way to speed it up:

First, call PdfPage::text()?.all().
Then iterate over the text objects, map the font, location, etc., and use PdfPageText.for_object(text_object) for each one.
This way, we don't need to load the page text again and again for each object.
It is 100x faster. :)

Just an idea: do you think this could be implemented internally?

ajrcarey · 2023-04-18T12:42:35Z

Every time you create PdfPageText, Pdfium analyses all the text on the page. So you're right, the most efficient way is to create PdfPageText once, then reuse it:

let page_text = page.text()?; // this creates PdfPageText once

// now can use page_text.for_object(...) in an iterator, etc.

However, I would expect this to still be slower than page_text.all(). Calling all() avoids the need to iterate.
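
To make the cost difference concrete, here is a minimal, self-contained sketch of the "create once, then reuse" pattern. It does not use pdfium-render: `Page`, `PageText`, and the `analyses` counter are hypothetical stand-ins that only model the behaviour described above, where every call to `text()` re-analyses the whole page.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a page; counts how often its text is analysed.
struct Page {
    objects: Vec<String>,
    analyses: Cell<usize>,
}

/// Hypothetical stand-in for the result of analysing a page's text.
struct PageText<'a> {
    page: &'a Page,
}

impl Page {
    fn new(objects: &[&str]) -> Self {
        Page {
            objects: objects.iter().map(|s| s.to_string()).collect(),
            analyses: Cell::new(0),
        }
    }

    /// Models the expensive step: each call analyses the whole page again.
    fn text(&self) -> PageText<'_> {
        self.analyses.set(self.analyses.get() + 1);
        PageText { page: self }
    }
}

impl<'a> PageText<'a> {
    /// Text belonging to a single object.
    fn for_object(&self, index: usize) -> &'a str {
        self.page.objects[index].as_str()
    }

    /// All text on the page in one call, with no per-object iteration.
    fn all(&self) -> String {
        self.page.objects.concat()
    }
}

/// Slow variant: a fresh analysis for every object.
fn extract_per_object(page: &Page) -> Vec<String> {
    (0..page.objects.len())
        .map(|i| page.text().for_object(i).to_string())
        .collect()
}

/// Fast variant: analyse once, then reuse the result for every object.
fn extract_reusing(page: &Page) -> Vec<String> {
    let page_text = page.text();
    (0..page.objects.len())
        .map(|i| page_text.for_object(i).to_string())
        .collect()
}

fn main() {
    let slow = Page::new(&["H", "e", "l", "l", "o"]);
    let fast = Page::new(&["H", "e", "l", "l", "o"]);

    // Both variants return the same fragments.
    assert_eq!(extract_per_object(&slow), extract_reusing(&fast));

    println!("analyses when calling text() per object: {}", slow.analyses.get());
    println!("analyses when reusing one PageText: {}", fast.analyses.get());
    println!("all text: {}", fast.text().all());
}
```

In this model the per-object variant performs one analysis per text object, while the reusing variant performs exactly one. That is the same shape as the slowdown reported for pages where every character is its own object.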

ajrcarey · 2023-11-10T16:55:43Z

Made improvements to segment detection. Implemented a prototype lines-to-paragraphs accumulator. Made PdfParagraph public in response to #121, although it isn't part of the crate prelude.

ajrcarey · 2024-06-18T21:53:22Z

Consider also adding handling of tables as suggested in #149.

ajrcarey · 2024-08-04T15:51:23Z

Moved PdfParagraph behind new feature flag paragraph. The change will take effect in release 0.8.23.
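
For anyone wanting to try it, enabling an optional Cargo feature generally looks like the fragment below. This is a sketch that assumes the feature name is exactly paragraph and uses the release number mentioned above; adjust the version to whichever release you depend on.

```toml
[dependencies]
# Opt in to the in-development PdfParagraph support.
pdfium-render = { version = "0.8.23", features = ["paragraph"] }
```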

ziimakc · 2024-11-11T13:07:02Z

@ajrcarey Thank you for this library and this super useful feature. Currently it seems like something is broken in the imports:

use crate::page::PdfPage;
   |            ^^^^
   |            |
   |            unresolved import
   |            help: a similar path exists: `pdf::document::page`

.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdfium-render-0.8.26/src/pdf/document/page/paragraph.rs:635:20
    |
242 | pub struct PdfParagraph<'a> {
    | --------------------------- doesn't satisfy `PdfParagraph<'_>: Sized`
...
635 |         paragraphs.push(Self::paragraph_from_lines(
    |         -----------^^^^ method cannot be called on `Vec<PdfParagraph<'_>>` due to unsatisfied trait bounds

ajrcarey · 2024-11-11T20:51:36Z

Hi @ziimakc , thank you for reporting the issue. Because PdfParagraph is hidden behind a feature flag, it wasn't being included in the normal test runs, and so it was easy for it to fall out-of-sync with the code reorganisation that took place as part of #153.

I've pushed a commit that updates the import paths and adjusts the GitHub workflow to include the paragraph feature when running tests, to prevent this from happening again in the future.

ziimakc · 2024-11-12T08:40:09Z

@ajrcarey Thanks, but how do I import it? use pdfium_render::paragraph::PdfParagraph; does not seem to work.

ajrcarey · 2024-11-12T12:35:15Z

Because PdfParagraph is an in-development, pre-release feature, it is not included in the pdfium-render prelude. You must import it directly from its crate path location, which is currently:

use crate::pdf::document::page::paragraph::PdfParagraph;

ziimakc · 2024-11-12T14:37:09Z

Would it be possible to include image objects, so you could extract something like this:

text_paragraph
text_paragraph
image_placeholder_filename
text_paragraph
image_placeholder_filename
image_placeholder_filename

This would be helpful when generating JSON structures from the text with the help of AI, and replacing image_placeholder_filename with the actual images later on.
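
As a rough illustration of the requested output shape (not an existing pdfium-render API), the interleaving could be modelled like this. `PageItem`, `Positioned`, `reading_order`, and `render` are all hypothetical names, and the sketch assumes that each item's top edge is already known and that sorting top to bottom is an acceptable reading order.

```rust
/// Hypothetical item in reading order; not a pdfium-render type.
#[derive(Debug, Clone, PartialEq)]
enum PageItem {
    Paragraph(String),
    Image { filename: String },
}

/// An item together with the top edge of its bounding box, in PDF points.
struct Positioned {
    top: f32,
    item: PageItem,
}

/// Orders items from the top of the page to the bottom.
/// PDF coordinates grow upwards, so a larger `top` is nearer the top of the page.
fn reading_order(mut items: Vec<Positioned>) -> Vec<PageItem> {
    items.sort_by(|a, b| b.top.total_cmp(&a.top));
    items.into_iter().map(|p| p.item).collect()
}

/// Renders paragraphs as text and images as filename placeholders, one per line.
fn render(items: &[PageItem]) -> String {
    items
        .iter()
        .map(|item| match item {
            PageItem::Paragraph(text) => text.clone(),
            PageItem::Image { filename } => format!("[image: {filename}]"),
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let items = vec![
        Positioned { top: 400.0, item: PageItem::Image { filename: "figure_1.png".into() } },
        Positioned { top: 700.0, item: PageItem::Paragraph("First paragraph".into()) },
        Positioned { top: 300.0, item: PageItem::Paragraph("Second paragraph".into()) },
    ];
    println!("{}", render(&reading_order(items)));
}
```

The placeholders can then be swapped for real image data after the text has been processed.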

ajrcarey · 2024-11-12T15:29:12Z

There is no intention to develop this feature further before crate release 1.0.

Of course, nothing is stopping you from doing so yourself, and making a PR.

ajrcarey self-assigned this Jun 28, 2022

ajrcarey pushed a commit that referenced this issue Jun 28, 2022

Progressing #29

89c0f11

ajrcarey pushed a commit that referenced this issue Nov 10, 2023

Progressing #29

a117be9

ajrcarey pushed a commit that referenced this issue Nov 10, 2023

Progressing #29

88c9169

ajrcarey mentioned this issue Nov 10, 2023

Page segment text extracted as individual letter #121

Closed

ajrcarey mentioned this issue Jun 18, 2024

Example of extracting tables? #149

Closed

ajrcarey mentioned this issue Jul 2, 2024

Can we modify the text of a PDF in-place? #150

Closed

ajrcarey pushed a commit that referenced this issue Nov 11, 2024

Progressing #29

4eb4eb4
