Page segment text extracted as individual letter #121
Comments
I'm not sure what you're asking, @reyjexter - you say there is an issue, but you haven't clearly explained what it is.

Text segments != text. Separate lines, words, and characters can be placed into distinct text segments by Pdfium based on positioning, font characteristics, or other properties. If you examine your source document in a tool such as PDF Explorer, you will see that, instead of coalescing all the characters in your text together into a single

If your complaint is that Pdfium should be collapsing the separate segments down into one, then you need to raise an issue upstream with Pdfium, not here.

I am closing the issue, pending you explaining in sufficient detail exactly the problem you are reporting with
Thanks, understood. The expectation is for some text, for example "Font Demo", to be in a single segment, because it was created in Canva with a single text layer tool, using the same font family and size. We have been running into issues with PDFs created with Canva, although Adobe Acrobat handles them without much trouble. For comparison, we also checked the same PDF document against PDF.js, which extracts the correct letters for each text segment. Here's the demo viewer we uploaded the PDF file into: https://mozilla.github.io/pdf.js/web/viewer.html

We need this capability because we are implementing both text highlighting and text-to-speech. If each extracted segment is an individual letter, the text-to-speech feature pronounces it as F-O-N-T-D-E-M-O instead of "Font Demo".
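For readers facing the same per-letter segments, one possible application-side workaround is to merge segments geometrically before handing text to a speech engine. The sketch below is only an illustration of that idea: `Segment`, `glyph`, `coalesce`, and both tolerance parameters are hypothetical names, not part of pdfium-render or Pdfium, and the fields would have to be filled in from whatever bounds the library reports for each segment. It assumes horizontal, left-to-right text in PDF page space, where y increases upward.

```rust
/// Hypothetical stand-in for one extracted text segment (not a pdfium-render type).
#[derive(Debug, Clone)]
struct Segment {
    text: String,
    left: f32,
    right: f32,
    baseline: f32,
}

/// Hypothetical convenience constructor used for illustration only.
fn glyph(text: &str, left: f32, width: f32, baseline: f32) -> Segment {
    Segment {
        text: text.to_string(),
        left,
        right: left + width,
        baseline,
    }
}

/// Merge segments into one string per line of text.
/// Segments whose baselines differ by at most `line_tolerance` share a line;
/// a horizontal gap wider than `word_gap` between neighbours becomes a space.
fn coalesce(segments: &[Segment], word_gap: f32, line_tolerance: f32) -> Vec<String> {
    // Order top-to-bottom: a higher baseline comes first in PDF page space.
    let mut sorted: Vec<&Segment> = segments.iter().collect();
    sorted.sort_by(|a, b| b.baseline.total_cmp(&a.baseline));

    // Group segments into lines by comparing against the first segment of the current line.
    let mut lines: Vec<Vec<&Segment>> = Vec::new();
    for segment in sorted {
        let starts_new_line = match lines.last() {
            Some(line) => (line[0].baseline - segment.baseline).abs() > line_tolerance,
            None => true,
        };
        if starts_new_line {
            lines.push(vec![segment]);
        } else if let Some(line) = lines.last_mut() {
            line.push(segment);
        }
    }

    // Within each line, read left to right and insert spaces at wide gaps.
    lines
        .into_iter()
        .map(|mut line| {
            line.sort_by(|a, b| a.left.total_cmp(&b.left));
            let mut text = String::new();
            let mut previous_right: Option<f32> = None;
            for segment in line {
                if let Some(right) = previous_right {
                    if segment.left - right > word_gap {
                        text.push(' ');
                    }
                }
                text.push_str(&segment.text);
                previous_right = Some(segment.right);
            }
            text
        })
        .collect()
}
```

Calling `coalesce` with the per-letter segments of a page yields one string per visual line, which is a more natural unit for text-to-speech, while the original segments remain available for highlighting. Suitable tolerance values depend on the font size and would need tuning per document.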
I understand, but nothing you have said here indicates a bug in

If you believe Pdfium is not correctly collapsing the text into the correct text segments, then you need to file a bug report/feature request upstream with the Pdfium authors.

I can understand why you would want to use text segments for highlighting - that is its intended purpose - but I don't know why you would want to use it for text-to-speech. Surely retrieving the complete text via the
There is a feature in development as part of #29 ,

    use pdfium_render::paragraph::PdfParagraph; // in-development struct not included in prelude
    use pdfium_render::prelude::*;

    pub fn main() -> Result<(), PdfiumError> {
        let pdfium = Pdfium::default();
        let document = pdfium.load_pdf_from_file("font-extract-test.pdf", None)?;
        let page = document.pages().first()?;
        let objects = page.objects().iter().collect::<Vec<_>>();
        let paragraphs = PdfParagraph::from_objects(objects.as_slice());
        for paragraph in paragraphs.iter() {
            println!("{}", paragraph.text())
        }
        Ok(())
    }

This outputs the two paragraphs, "Font Demo" and "Hello World of Fonts", as expected. Feel free to play around with it further by taking
Thanks, we will look into using this experimental feature.
Hello,
There seems to be an issue where, depending on the document, extracting text segments returns individual letters instead of complete runs of text.
Calling `page.text().unwrap().all()` returns the correct text, but `page.text().unwrap().segments()` returns the result as individual letters.
Here's the PDF that we tested with: https://github.com/reyjexter/pdfium-render-wasm/blob/master/www/font-extract-test.pdf
And the results:
Thanks