Page segment text extracted as individual letter #121

Closed
reyjexter opened this issue Nov 9, 2023 · 6 comments

@reyjexter

Hello,

There seems to be an issue where, depending on the document, extracting text segments returns individual letters instead of whole runs of text.

Calling page.text().unwrap().all() returns the correct text, but page.text().unwrap().segments() returns the result as individual letters.

Here's the PDF that we tested with: https://github.com/reyjexter/pdfium-render-wasm/blob/master/www/font-extract-test.pdf

And the results:

image

Thanks

@ajrcarey
Owner

ajrcarey commented Nov 9, 2023

I'm not sure what you're asking, @reyjexter - you say there is an issue, but you haven't clearly explained what it is.

Text segments != text. Separate lines, words, and characters can be placed into distinct text segments by Pdfium based on positioning, font characteristics, or other properties. If you examine your source document in a tool such as PDF Explorer, you will see that, instead of coalescing all the characters in your text together into a single \Tj function call, your document creator has output each letter as a separate \Tj function call. This is presumably why Pdfium considers each letter a distinct text segment.

If your complaint is that Pdfium should be collapsing the separate segments down into one, then you need to raise an issue upstream with Pdfium, not here. pdfium-render simply returns the text segments reported by Pdfium.

I am closing the issue, pending you explaining in sufficient detail exactly the problem you are reporting with pdfium-render.

@ajrcarey ajrcarey closed this as completed Nov 9, 2023
@ajrcarey ajrcarey self-assigned this Nov 9, 2023
@reyjexter
Author

reyjexter commented Nov 10, 2023

Thanks and understood.

The expectation is that some text, for example "Font Demo", would be in a single segment, because it was created in Canva with a single text layer tool, using the same font family and size. We seem to have been running into issues with PDFs created with Canva, though Adobe Acrobat doesn't have much trouble with them.

For comparison, we also checked the same PDF document in PDF.js, which is able to extract the correct letters for a text segment. Here's the demo viewer we uploaded the PDF file into:

https://mozilla.github.io/pdf.js/web/viewer.html

We need this capability because we are implementing both text highlighting and text-to-speech. If each extracted segment is an individual letter, text-to-speech pronounces it as F-O-N-T-D-E-M-O instead of "Font Demo".

@ajrcarey
Owner

I understand, but nothing you have said here indicates a bug in pdfium-render. The bug would be if pdfium-render did not report the same text segments as Pdfium.

If you believe Pdfium is not correctly collapsing the text into the correct text segments, then you need to file a bug report/feature request upstream with the Pdfium authors.

I can understand why you would want to use text segments for highlighting - that is their intended purpose - but I don't know why you would want to use them for text-to-speech. Surely retrieving the complete text via page.text()?.all(), or on a text object by text object basis using text() or chars(), would be better for that purpose.

@reyjexter
Author

reyjexter commented Nov 10, 2023

I think it's indeed not a bug in pdfium-render but more of a feature request, though this feature could also be implemented in our own application code. Initially I thought it was a bug because the behavior seems like something one would normally use and expect.

When we checked the bounding boxes of the text segments, we found that segments lined up on the same line are very close to each other: their bounding boxes differ by only about 1-2pt at the top and bottom, with similarly small gaps on the left and right. Given a certain threshold, multiple segments could be merged into a single segment, for example through a function such as segments_merged(threshold_x: f64, threshold_y: f64).
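
The merging idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Segment struct is a hypothetical stand-in for a text segment (it is not pdfium-render's own type), the input is assumed to already be in reading order, and the function takes the segments as an extra argument instead of being a method on the page text.

```rust
/// Hypothetical stand-in for a text segment: its text plus a bounding box
/// in PDF points. Not a pdfium-render type; used to keep the sketch self-contained.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub left: f64,
    pub bottom: f64,
    pub right: f64,
    pub top: f64,
}

/// Merges consecutive segments that sit on the same line and nearly touch.
/// `threshold_x` is the largest horizontal gap (or overlap) tolerated between
/// neighbours; `threshold_y` is the largest difference tolerated between
/// their top edges and between their bottom edges.
/// Assumes `segments` is already in reading order.
pub fn segments_merged(segments: &[Segment], threshold_x: f64, threshold_y: f64) -> Vec<Segment> {
    let mut merged: Vec<Segment> = Vec::new();

    for seg in segments {
        if let Some(last) = merged.last_mut() {
            let same_line = (seg.top - last.top).abs() <= threshold_y
                && (seg.bottom - last.bottom).abs() <= threshold_y;
            let gap = seg.left - last.right;
            let adjacent = gap.abs() <= threshold_x;

            if same_line && adjacent {
                // Grow the previous segment to cover this one.
                last.text.push_str(&seg.text);
                last.right = last.right.max(seg.right);
                last.top = last.top.max(seg.top);
                last.bottom = last.bottom.min(seg.bottom);
                continue;
            }
        }
        // No mergeable predecessor: start a new segment.
        merged.push(seg.clone());
    }

    merged
}
```

The sketch does not insert spaces between words: if the space between "Font" and "Demo" is not itself reported as a segment, the two words would either stay separate or need a larger threshold_x plus an inserted space.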

As for why we cannot use text()?.all(): we would like to achieve something like what PDF.js does, where transparent HTML text is placed on top of the image or segment. This does two things:

  • Allows the user to highlight and copy specific text.
  • Enables text-to-speech, which in some browsers can announce either all of the text or only the highlighted part.

It looks something like this in PDF.js:

image
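
A transparent text layer of this kind could be generated along the following lines. This is a rough sketch, not how PDF.js or pdfium-render does it: the PositionedText struct, the text_layer_html function, the text-layer class name, and the scale parameter (CSS pixels per PDF point) are all hypothetical names introduced for illustration. The one fixed point is the coordinate flip: PDF coordinates start at the bottom-left of the page, while CSS positions start at the top-left.

```rust
/// Hypothetical record for a piece of text and its bounding box in PDF points
/// (origin at the bottom-left corner of the page).
pub struct PositionedText {
    pub text: String,
    pub left: f64,
    pub bottom: f64,
    pub right: f64,
    pub top: f64,
}

/// Escapes the characters that are significant in HTML text content.
fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Builds an HTML overlay with one absolutely positioned, transparent span
/// per text box. `page_height` is in PDF points; `scale` is CSS pixels per point.
pub fn text_layer_html(boxes: &[PositionedText], page_height: f64, scale: f64) -> String {
    let mut html = String::from("<div class=\"text-layer\">");

    for b in boxes {
        // Flip the vertical axis: CSS `top` is measured from the top of the page.
        let left = b.left * scale;
        let top = (page_height - b.top) * scale;
        let width = (b.right - b.left) * scale;
        let height = (b.top - b.bottom) * scale;

        html.push_str(&format!(
            "<span style=\"position:absolute;left:{:.1}px;top:{:.1}px;width:{:.1}px;height:{:.1}px;color:transparent\">{}</span>",
            left,
            top,
            width,
            height,
            escape_html(&b.text)
        ));
    }

    html.push_str("</div>");
    html
}
```

The overlay only behaves well for selection and text-to-speech if each span holds a whole word or line, which is why per-letter segments would first need to be merged.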

We will also need the same approach to meet accessibility compliance requirements; for example, bulleted content needs some sort of hidden ul and li tags.

@ajrcarey
Owner

ajrcarey commented Nov 10, 2023

There is a feature in development as part of #29, PdfParagraph, that should eventually make extracting entire blocks of text like this much simpler, but it's still in development and there is no firm timeframe for delivery. It is partially complete, so you could try playing around with it if you like. I have pushed a small change that makes the PdfParagraph struct available publicly (it was previously crate private). Try the following sample code:

use pdfium_render::paragraph::PdfParagraph; // in-development struct not included in prelude
use pdfium_render::prelude::*;

pub fn main() -> Result<(), PdfiumError> {
    let pdfium = Pdfium::default();
    let document = pdfium.load_pdf_from_file("font-extract-test.pdf", None)?;
    let page = document.pages().first()?;
    let objects = page.objects().iter().collect::<Vec<_>>();
    let paragraphs = PdfParagraph::from_objects(objects.as_slice());

    for paragraph in paragraphs.iter() {
        println!("{}", paragraph.text())
    }

    Ok(())
}

This outputs the two paragraphs, "Font Demo" and "Hello World of Fonts", as expected. Feel free to play around with it further by taking pdfium-render as a git dependency in your Cargo.toml.

@reyjexter
Author

Thanks, we will look into using this experimental feature.
