Page segment text extracted as individual letter #121

reyjexter · 2023-11-09T20:13:20Z

Hello,

There seems to be an issue where, depending on the document, extracting text segments returns individual letters instead of whole runs of text.

Calling page.text().unwrap().all() returns the correct text, but page.text().unwrap().segments() returns the result as individual letters.

Here's the PDF that we tested with: https://github.com/reyjexter/pdfium-render-wasm/blob/master/www/font-extract-test.pdf

And the results:

Thanks

ajrcarey · 2023-11-09T20:37:16Z

I'm not sure what you're asking, @reyjexter - you say there is an issue, but you haven't clearly explained what it is.

Text segments != text. Separate lines, words, and characters can be placed into distinct text segments by Pdfium based on positioning, font characteristics, or other properties. If you examine your source document in a tool such as PDF Explorer, you will see that, instead of coalescing all the characters in your text together into a single \Tj function call, your document creator has output each letter as a separate \Tj function call. This is presumably why Pdfium considers each letter a distinct text segment.

If your complaint is that Pdfium should be collapsing the separate segments down into one, then you need to raise an issue upstream with Pdfium, not here. pdfium-render simply returns the text segments reported by Pdfium.

I am closing the issue, pending you explaining in sufficient detail exactly the problem you are reporting with pdfium-render.

reyjexter · 2023-11-10T04:30:42Z

Thanks and understood.

The expectation is that some text, for example "Font Demo", would be in a single segment, because it was created in Canva using a single text layer with the same font family and size. We seem to have been running into issues with PDFs created with Canva, though Adobe Acrobat doesn't have much trouble with them.

For comparison, we also checked the same PDF document against PDF.js, and it is able to extract the correct letters for a text segment. Here's the demo link we used to upload the PDF file:

https://mozilla.github.io/pdf.js/web/viewer.html

The reason we need this capability is that we are implementing both text highlighting and text-to-speech features. If the extracted segments are individual letters, text-to-speech pronounces them as F-O-N-T-D-E-M-O instead of "Font Demo".

ajrcarey · 2023-11-10T13:24:00Z

I understand, but nothing you have said here indicates a bug in pdfium-render. The bug would be if pdfium-render did not report the same text segments as Pdfium.

If you believe Pdfium is not correctly collapsing the text into the correct text segments, then you need to file a bug report/feature request upstream with the Pdfium authors.

I can understand why you would want to use text segments for highlighting - that is its intended purpose - but I don't know why you would want to use it for text-to-speech. Surely retrieving the complete text via the page.text()?.all(), or on a text object by text object basis using text() or chars(), would be better for that purpose.

reyjexter · 2023-11-10T14:03:34Z

I think it's indeed not a bug in pdfium-render but more of a feature request, though this feature could also be implemented in our own application code. Initially I thought it was a bug because the behavior seems like something one would normally use and expect.

When we checked the bounding boxes of the text segments, we found that segments on the same line are very close to each other: only about 1-2pt difference at the top and bottom, and a similarly small difference on the left and right. Given a certain threshold, we could merge multiple segments into a single segment, for example through a function such as segments_merged(threshold_x: f64, threshold_y: f64).
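The merging idea above could be sketched in application code roughly as follows. This is only an illustration under stated assumptions: the `Segment` struct is a hypothetical stand-in for a text segment (it is not a pdfium-render type), the input is assumed to already be in reading order, and a `segments` slice parameter is added to the proposed signature so the function can stand alone. It does not insert spaces between words, so word gaps would need separate handling unless spaces arrive as their own segments.

```rust
/// Hypothetical stand-in for a text segment: its text plus a bounding box in
/// PDF points (origin at the bottom-left of the page, so `top > bottom`).
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub left: f64,
    pub bottom: f64,
    pub right: f64,
    pub top: f64,
}

/// Merges neighbouring segments that sit on the same line.
///
/// A segment is joined to the previous merged segment when both its top and
/// bottom edges differ by at most `threshold_y`, and the horizontal gap
/// between the two boxes is at most `threshold_x` (in either direction).
/// Assumes `segments` is already in reading order.
pub fn segments_merged(
    segments: &[Segment],
    threshold_x: f64,
    threshold_y: f64,
) -> Vec<Segment> {
    let mut merged: Vec<Segment> = Vec::new();

    for seg in segments {
        if let Some(last) = merged.last_mut() {
            let same_line = (seg.top - last.top).abs() <= threshold_y
                && (seg.bottom - last.bottom).abs() <= threshold_y;
            let gap = seg.left - last.right;

            if same_line && gap.abs() <= threshold_x {
                // Extend the previous segment to cover this one.
                last.text.push_str(&seg.text);
                last.right = last.right.max(seg.right);
                last.top = last.top.max(seg.top);
                last.bottom = last.bottom.min(seg.bottom);
                continue;
            }
        }
        // Either the first segment, or too far from the previous one.
        merged.push(seg.clone());
    }

    merged
}
```

In practice the `Segment` values would be filled from whatever text and bounds each extracted segment reports, and the thresholds tuned per document, since letter spacing varies with font size.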

As for why we cannot use text()?.all(): we would like to achieve something like what PDF.js does, where transparent HTML text is placed on top of the image or segment. That does two things:

Allows the user to highlight text and copy the highlighted text.
Enables text to speech, where some browsers can announce either all of the text or only the highlighted part.

Something that looks like this on PDF.js:

We will also need the same approach to meet accessibility compliance requirements; for example, bulleted content needs some sort of hidden ul and li tags.
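A transparent text layer of the kind described above could be generated along these lines. This is a rough sketch, not how PDF.js or pdfium-render does it: `Bounds`, `overlay_span`, `page_height`, and `scale` are all hypothetical names, the bounding box is assumed to be in PDF points with the origin at the bottom-left of the page, and the containing HTML element is assumed to be positioned so that absolute offsets are relative to the rendered page image.

```rust
/// Hypothetical bounding box in PDF points, origin at the bottom-left of the page.
pub struct Bounds {
    pub left: f64,
    pub bottom: f64,
    pub right: f64,
    pub top: f64,
}

/// Escapes the characters that are unsafe inside HTML text content.
fn escape_html(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Builds one absolutely positioned, transparent span for a piece of text.
///
/// `page_height` is the page height in PDF points; `scale` converts points
/// to CSS pixels of the rendered page image.
pub fn overlay_span(text: &str, b: &Bounds, page_height: f64, scale: f64) -> String {
    let left = b.left * scale;
    // PDF y grows upwards, CSS y grows downwards, so flip the axis.
    let top = (page_height - b.top) * scale;
    let width = (b.right - b.left) * scale;
    let height = (b.top - b.bottom) * scale;

    format!(
        "<span style=\"position:absolute;left:{:.2}px;top:{:.2}px;width:{:.2}px;height:{:.2}px;color:transparent\">{}</span>",
        left,
        top,
        width,
        height,
        escape_html(text)
    )
}
```

Emitting one span per merged segment, rather than per letter, is what would let a screen reader or browser speech feature read "Font Demo" as words.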

ajrcarey · 2023-11-10T16:53:47Z

There is a feature in development as part of #29 , PdfParagraph, that should eventually make extracting entire blocks of text like this much simpler, but it's still in development and there is no firm timeframe for delivery. It is partially complete, so you could try playing around with it if you like. I have pushed a small change that makes the PdfParagraph struct available publicly (it was previously crate private). Using the following sample code:

use pdfium_render::paragraph::PdfParagraph; // in-development struct not included in prelude
use pdfium_render::prelude::*;

pub fn main() -> Result<(), PdfiumError> {
    let pdfium = Pdfium::default();
    let document = pdfium.load_pdf_from_file("font-extract-test.pdf", None)?;
    let page = document.pages().first()?;
    let objects = page.objects().iter().collect::<Vec<_>>();
    let paragraphs = PdfParagraph::from_objects(objects.as_slice());

    for paragraph in paragraphs.iter() {
        println!("{}", paragraph.text())
    }

    Ok(())
}

This outputs the two paragraphs, "Font Demo" and "Hello World of Fonts", as expected. Feel free to play around with it further by taking pdfium-render as a git dependency in your Cargo.toml.

reyjexter · 2023-11-14T13:44:49Z

Thanks, we will look into using this experimental feature.

ajrcarey closed this as completed Nov 9, 2023

ajrcarey self-assigned this Nov 9, 2023

ajrcarey mentioned this issue Nov 10, 2023

Add PdfParagraph to allow for more natural processing of multi-line text. #29

Open
