Add PdfParagraph to allow for more natural processing of multi-line text. #29
Comments
PdfParagraph object construction under way. Hidden from crate prelude for now. |
I have a PDF file which renders each character with Td and Tj operations. Some pages take more than 10 seconds to extract text. |
Hi @russellwmy, if you are just looking to extract text, then no, |
@ajrcarey Good to know. I have found a way to speed it up now.
Just an idea: do you think this could be implemented internally? |
Every time you create
However, I would expect this to still be slower than |
Made improvements to segment detection. Implemented prototype lines to paragraphs accumulator. Made |
Consider also adding handling of tables as suggested in #149. |
Moved PdfParagraph behind new feature flag paragraph. The change will take effect in release 0.8.23. |
@ajrcarey Thank you for this library and this super useful feature. Currently it seems like something is broken in the imports:
|
Hi @ziimakc , thank you for reporting the issue. Because I've pushed a commit that updates the import paths, and adjusts the github workflow to include the paragraph feature when running tests to prevent this happening again in the future. |
@ajrcarey thanks, but how do I import it? |
Because
|
Would it be possible to include image objects, so you could extract something like this:
This would be helpful for generating JSON structures from text with the help of AI and replacing |
There is no intention to develop this feature further before crate release 1.0. Of course, nothing is stopping you from doing so yourself, and making a PR. |
Follow-on from #17, #22, #25. Add a PdfParagraph object that allows for easier handling of multi-line text with embedded character formatting changes.

Ideally, it would be possible to generate a PdfParagraph from an existing set of PdfPageTextObject objects, each one containing a formatted fragment of a paragraph.
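
To make the idea of assembling paragraphs from positioned fragments concrete, here is a rough sketch of one possible approach. It is not the crate's implementation and does not use the pdfium-render API: the `Fragment` struct, both function names, and the two thresholds (half a font size for "same line", 1.5 font sizes for "same paragraph") are assumptions chosen for illustration only.

```rust
/// Hypothetical stand-in for a positioned text fragment.
/// This is NOT a pdfium-render type; it only carries what the sketch needs.
#[derive(Debug, Clone)]
struct Fragment {
    text: String,
    x: f32,         // left edge, in points
    baseline: f32,  // baseline y, in PDF coordinates (y grows upward)
    font_size: f32, // in points
}

/// Groups fragments into lines: fragments whose baselines differ by less
/// than half a font size (an assumed tolerance) are treated as one line.
fn group_into_lines(mut fragments: Vec<Fragment>) -> Vec<Vec<Fragment>> {
    // Top of the page first (largest baseline), then left to right.
    fragments.sort_by(|a, b| {
        b.baseline
            .total_cmp(&a.baseline)
            .then(a.x.total_cmp(&b.x))
    });

    let mut lines: Vec<Vec<Fragment>> = Vec::new();
    for f in fragments {
        let same_line = lines.last().map_or(false, |line| {
            (line[0].baseline - f.baseline).abs() < 0.5 * f.font_size
        });
        if same_line {
            lines.last_mut().unwrap().push(f);
        } else {
            lines.push(vec![f]);
        }
    }

    // Baseline jitter can leave a line out of reading order, so re-sort by x.
    for line in &mut lines {
        line.sort_by(|a, b| a.x.total_cmp(&b.x));
    }
    lines
}

/// Accumulates lines into paragraphs: a vertical gap larger than 1.5 font
/// sizes (an assumed threshold) starts a new paragraph; otherwise the line
/// is appended to the current paragraph, separated by a single space.
/// Fragments within a line are concatenated as-is, so they are expected to
/// carry their own spaces.
fn group_into_paragraphs(lines: Vec<Vec<Fragment>>) -> Vec<String> {
    let mut paragraphs = Vec::new();
    let mut current = String::new();
    let mut previous: Option<(f32, f32)> = None; // (baseline, font size)

    for line in lines {
        let baseline = line[0].baseline;
        let size = line[0].font_size;
        let text: String = line.iter().map(|f| f.text.as_str()).collect();

        if let Some((prev_baseline, prev_size)) = previous {
            if prev_baseline - baseline > 1.5 * prev_size.max(size) {
                paragraphs.push(std::mem::take(&mut current));
            } else {
                current.push(' ');
            }
        }
        current.push_str(&text);
        previous = Some((baseline, size));
    }

    if !current.is_empty() {
        paragraphs.push(current);
    }
    paragraphs
}
```

Calling `group_into_lines` and then passing its result to `group_into_paragraphs` yields one `String` per detected paragraph. A real implementation would also need to retain per-fragment formatting, infer spaces between per-character fragments from horizontal gaps, and cope with columns and rotated text, none of which this sketch attempts.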