Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide memory-safe options for moving PdfPageObjects from one PdfPage to another. #60

Open
N3xed opened this issue Dec 1, 2022 · 31 comments
Assignees

Comments

@N3xed
Copy link

N3xed commented Dec 1, 2022

Hi there.

I've been trying hard to get moving objects from one page to another to work, without success.

On the one hand, using PdfPageObjects::remove_object gives borrow checker errors because getting a PdfPage object borrows PdfPageObjectsimmutable and then you can't borrow the samePdfPageObjectsmutably to call theremove_object` function.

I've also tried it with good old index access and with that I can remove them fine, but re-adding them to another PdfPage produces segfaults.

For example:

        let mut objs = page.objects();
        let new_objs = new_page.objects_mut();
        let mut i = 0;
        while i < objs.len() {
            let obj = objs.get(i).map_err(PdfiumErr)?;
            let Ok(bounds) = obj.bounds() else {continue};
            if bounds.bottom >= y {
                log::debug!("item = {:?}", obj.object_type());
                new_objs.take_object_from_page(&mut page, I); // <-- segfault here
                objs = page.objects();
            } else {
                i += 1;
            }
        }

produces segfaults.

There is also the PdfPageGroupObject, but this doesn't help since calling remove_objects_from_page also deletes the objects (and somehow mirrors the source page horizontally).

So, am I doing this wrong? Is it even possible?

Either way, many thanks for your great work.

@ajrcarey
Copy link
Owner

ajrcarey commented Dec 1, 2022

Hi @N3xed, thank you for raising the issue!

This is a problematic area; I will reply more fully tomorrow when I have more time, but for now you may wish to have a read through #18 because we will probably be referring to that a lot in our discussion :)

@ajrcarey
Copy link
Owner

ajrcarey commented Dec 1, 2022

(The best way forward may be to try to clone the objects, place the clone on the new page, and then delete the original from its containing page. Cloning an object exactly wasn't plausible at the time of #18 but is much closer to being plausible now because pdfium-render implements much more of the Pdfium API now. That said, there are still certain properties that Pdfium doesn't expose via API, so it will never be possible to clone those properties.)

@ajrcarey
Copy link
Owner

ajrcarey commented Dec 3, 2022

Apologies for the delay. I spent quite a bit of time today examining this problem. It is indeed the same basic problem as #18, although you have discovered an additional related problem. There are two separate issues here:

  1. Transferring memory ownership of a page object from one page's content stream to another crashes Pdfium. This is discussed in Pdfium crashes during content regeneration after transferring page object memory ownership from one page to another. #18.
  2. Removing (not transferring) memory ownership of a page object from one page's content stream results in a curious transformation of the remaining page objects (or possibly the entire page itself, I'm not sure yet). This is what you reported when you said "There is also the PdfPageGroupObject, but this doesn't help since calling remove_objects_from_page also deletes the objects (and somehow mirrors the source page horizontally)."

While I'm sure these two problems are related, I propose we focus on the first one. The second problem, while certainly weird, can probably be worked around by applying transformations, either to the page objects remaining on the affected page, or to the entire page itself. (Perhaps the fact that the page is transformed is somehow a hint as to the cause of the memory ownership problem, but if it is, I'm not clever enough to see it.)

Changing remove_objects_from_page() so it doesn't delete the objects is easy enough. Transferring memory ownership of those objects to a different page, either in the same document or a new in-memory document, reliably segfaults Pdfium. I haven't found a way to work around this yet, despite several hours of trying. The segfault actually occurs during content stream regeneration; in your sample code, this occurs when your loop gets to the objs = page.objects(); line. It is possible to deactivate automatic content stream regeneration and control it manually, in which case I can get your loop to work successfully; but calling page.regenerate_content() after that (which is necessary in order for the changes to take effect) will reliably segfault Pdfium.

While my instinct is that this is an upstream problem in Pdfium, I'd like to completely eliminate any possibility of a bug in pdfium-render being the cause of this. I think the next thing for me to do is to write a sequence of FPDF_*() function calls that reproduce the problem, i.e. avoiding the pdfium-render high-level interface entirely.

If we're lucky, then the cause of this problem will turn out to be some sort of subtle bug in pdfium-render, in which case it's just a matter of tracking it down and fixing it.

Assuming the bug is definitely in Pdfium, not pdfium-render, then I think we have no choice but to resort to attempting to clone the page objects. As I mentioned previously, Pdfium does not actually expose every attribute of a page object, so producing exact clones is probably not possible. For instance, cloning a path object exactly is not really possible because Pdfium provides no way to retrieve the bezier curve control points of any bezier curve segments in a subpath. That's pretty fundamental information needed in order to accurately clone any path object containing bezier curves.

@ajrcarey
Copy link
Owner

ajrcarey commented Dec 13, 2022

I am able to reproduce the segmentation fault using pure FPDF_*() functions with the following test code:

use pdfium_render::prelude::*;

fn main() -> Result<(), PdfiumError> {
    let pdfium = Pdfium::new(
        Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("../pdfium/"))
            .or_else(|_| Pdfium::bind_to_system_library())?,
    );

    let bindings = pdfium.bindings();

    let document = bindings.FPDF_LoadDocument("../pdfium/test/export-test.pdf", None);

    assert!(bindings.get_pdfium_last_error().is_none());

    let page_count = bindings.FPDF_GetPageCount(document);

    let page = bindings.FPDF_LoadPage(document, 0);

    assert!(bindings.get_pdfium_last_error().is_none());

    println!(
        "{} page objects on page",
        bindings.FPDFPage_CountObjects(page)
    );

    let object = bindings.FPDFPage_GetObject(page, 0);

    assert!(bindings.get_pdfium_last_error().is_none());

    println!("1");
    let result = bindings.FPDFPage_RemoveObject(page, object);

    assert!(bindings.is_true(result));

    println!("2");
    let result = bindings.FPDFPage_GenerateContent(page);

    assert!(bindings.is_true(result));

    println!("3");
    bindings.FPDF_ClosePage(page);

    assert!(bindings.get_pdfium_last_error().is_none());

    println!("4");
    let new_page = bindings.FPDFPage_New(document, page_count, 600.0, 600.0);

    assert!(bindings.get_pdfium_last_error().is_none());

    println!("5");
    bindings.FPDFPage_InsertObject(new_page, object);

    assert!(bindings.get_pdfium_last_error().is_none());

    println!("6");
    let result = bindings.FPDFPage_GenerateContent(new_page); // <---- segfaults here

    assert!(bindings.is_true(result));

    println!("7");
    bindings.FPDF_ClosePage(new_page);

    assert!(bindings.get_pdfium_last_error().is_none());

    println!("8");
    bindings.FPDF_CloseDocument(document);

    assert!(bindings.get_pdfium_last_error().is_none());

    bindings.FPDF_DestroyLibrary();

    Ok(())
}

This suggests that the problem is indeed in Pdfium, not pdfium-render. (I was sort of hoping that carefully sequencing the opening and closing of pages would bypass the problem, indicating a bug in pdfium-render rather than Pdfium, but no such luck.) I want to continue experimenting with this a little more tomorrow to see if I can find a work around, but it seems likely that the problem is indeed upstream and so the question becomes how to work around it in pdfium-render - or if a work around is even possible.

@ajrcarey
Copy link
Owner

ajrcarey commented Dec 26, 2022

I have spent more time on this, experimenting with the sequencing of page opens, page closes, and content regeneration, but I can find no way around Pdfium segfaulting when it regenerates content on the destination page. I think at this point we can be confident that the problem is not a bug in pdfium-render but is definitely upstream in Pdfium itself.

The only work-around possible that I can see is to attempt to clone each object, rather than moving them. This is problematic because Pdfium does not provide access to every property of every type of page object, making exact cloning impossible in certain situations. For instance, cloning a path object exactly is not possible if the path includes bezier curves because Pdfium provides no way to retrieve the bezier curve control points of any bezier curve segments in a subpath. However, for path objects not including curves, and for text objects and image objects, it should be possible to clone the objects pretty precisely.

I propose implementing a new try_clone() function on the PdfPageObject struct. Your sample code will then look something like this:

    let mut objs = page.objects();
    let new_objs = new_page.objects_mut();
    let mut i = 0;
    while i < objs.len() {
        let obj = objs.get(i).map_err(PdfiumErr)?;
        let Ok(bounds) = obj.bounds() else {continue};
        if bounds.bottom >= y {
            log::debug!("item = {:?}", obj.object_type());
            // OLD: new_objs.take_object_from_page(&mut page, i);
            // NEW:
            new_objs.add_page_object(obj.try_clone()?)?; // <-- uses proposed new .try_clone() method
            objs = page.objects();
        } else {
            i += 1;
        }
    }

The take_*() methods in PdfPageObjects unfortunately must be removed, since I have no way to guarantee the stability of the underlying upstream functionality.

ajrcarey pushed a commit that referenced this issue Dec 28, 2022
@ajrcarey
Copy link
Owner

ajrcarey commented Dec 28, 2022

Removed PdfPageObjects::take_*() functions. Added PdfPageObject::is_cloneable() and PdfPageObject::try_clone() functions, along with initial implementations of cloning for all page object types. Added PdfPageGroupObject::retain(), PdfPageGroupObject::retain_if_cloneable(), PdfPageGroupObject::is_cloneable(), and PdfPageGroupObject::try_clone_onto_page() functions.

The new functionality is demonstrated in examples/clone.rs. This example moves all page objects from the bottom half of a test document onto a new page, without segfaulting. The source page ends up mirrored when its content stream is regenerated, but we will deal with that separately; for now, the key thing is that it is possible to move a page object (within the bounds of certain limitations) from one page to another.

While working on this, it did occur to me that there might be another way: using the PdfPages::copy_page_from_document() function. The basic idea is that, for page object x on page y, we copy page y to a new in-memory document, delete every page object in that in-memory document except page object x, then copy the page back into the source document. It has the disadvantage that it only allows copying objects to a new page, not an existing one; but in theory it would clone the objects exactly, without any limitations. I will experiment with this approach later this week.

ajrcarey pushed a commit that referenced this issue Jan 28, 2023
@ajrcarey
Copy link
Owner

ajrcarey commented Jan 28, 2023

My sincere apologies for the delay in progressing this; I really did mean to work on this after Christmas, but unfortunately I was rather ill over that period. The last few days have been the next opportunity.

I have pushed a commit that makes the following changes:

  • Mitigates the Pdfium reflection bug by "unreflecting" all page objects remaining on a page after a group of objects is removed.
  • Renames PdfPageGroupObject::try_clone_onto_page() to PdfPageGroupObject::try_clone_onto_existing_page() to better differentiate it from the new functions added in this commit.
  • Adds new cloning functions that use the PdfPages::copy_page_from_document() technique mentioned in my previous comment. More details below.
  • Fixes a bug in examples/clone.rs.

If you run the clone.rs example now, you will see that it correctly moves objects from one page to another, and no objects are incorrectly reflected.

In addition to improving the PdfPageGroupObject::try_clone_onto_existing_page() function previously added, this commit adds the following new functions that leverage the PdfPages::copy_page_from_document() technique mentioned in my previous comment:

  • PdfPageObjectGroup::clone_onto_new_page_at_start(): clones objects onto a new page at the start of the document containing the object group.
  • PdfPageObjectGroup::clone_onto_new_page_at_start_in_document(): clones objects onto a new page at the start of a different document.
  • PdfPageObjectGroup::clone_onto_new_page_at_end(): clones objects onto a new page at the end of the document containing the object group.
  • PdfPageObjectGroup::clone_onto_new_page_at_end_in_document(): clones objects onto a new page at the end of a different document.
  • PdfPageObjectGroup::clone_onto_new_page_at_index(): clones objects onto a new page at a given index position in the document containing the object group.
  • PdfPageObjectGroup::clone_onto_new_page_at_index_in_document(): clones objects onto a new page at a given index position in a different document.

Generally speaking, I am a bit disappointed in the results from these new functions. I was expecting Pdfium to make perfect clones of each page object, since (unlike pdfium-render) it has complete access to all object properties. In practice, this is not the case: text objects, for instance, are not cloned with per-character horizontal scaling; since this is the technique typically used to achieve paragraph justification, this means that justification is lost. This makes Pdfium's text object cloning no more effective than pdfium-render's try_clone() function. This is disappointing. However, Pdfium can still clone objects that pdfium-render cannot, such as path objects containing Bézier curves. So there is still some utility to these functions, even if they aren't perfect.

None of the provided functions offer a truly satisfying solution, but I believe this is likely the best that can be done within the constraints imposed by Pdfium.

Careful testing is required to make sure that my approach to "unreflecting" objects isn't overzealous. I suspect it may be. If you have sample documents you can try, I'd be keen to hear your results.

ajrcarey pushed a commit that referenced this issue Jan 30, 2023
@ajrcarey
Copy link
Owner

ajrcarey commented Jan 30, 2023

Relaxed mutability requirements in PdfPageObjectGroup::clone_*() functions to reduce API surface from six functions to three. Renamed all *_clone_*() functions to *_copy_*(), since "clone" implies exact duplicates, a promise we cannot categorically make in all circumstances. Replaced use of lazy_static! macro with once_cell::sync::Lazy. Improved source page index lookup in PdfPageObjectGroup::copy_to_new_page_at_*() functions with a new dedicated PdfPageIndexCache cache. Updated documentation.

@ajrcarey ajrcarey self-assigned this Jan 30, 2023
@ajrcarey ajrcarey changed the title Moving PdfPageObjects from one PdfPage to another Provide memory-safe options for moving PdfPageObjects from one PdfPage to another. Feb 4, 2023
@ajrcarey
Copy link
Owner

ajrcarey commented Feb 4, 2023

Completed unit tests for PdfPageIndexCache. Renamed examples/clone.rs to examples/copy_objects.rs. Minor tweaks to documentation. Packaged for release as crate version 0.7.28.

ajrcarey pushed a commit that referenced this issue Feb 4, 2023
ajrcarey pushed a commit that referenced this issue Feb 4, 2023
@ajrcarey
Copy link
Owner

ajrcarey commented Feb 4, 2023

Released as crate version 0.7.28.

This issue will remain open to allow time for @N3xed to provide feedback, and to allow more investigation as to whether my approach to "unreflecting" objects isn't overzealous.

@ajrcarey
Copy link
Owner

ajrcarey commented Feb 5, 2023

Interestingly, the examples/concat.rs example uses the PdfPages::append() and PdfPages::copy_pages_from_document() functions and the output produced by that example has text that's correctly justified, unlike the examples/copy_objects.rs example. Look into this further - why is there a difference?

@ajrcarey
Copy link
Owner

ajrcarey commented Mar 1, 2023

Noticed that PdfPageObjectPrivate trait still includes references to try_clone() rather than try_copy(). Correct this, and perform an exhaustive check for other dangling references to cloning rather than copying.

@ajrcarey
Copy link
Owner

ajrcarey commented Mar 7, 2023

Renamed PdfPageObjectPrivate::is_cloneable() and PdfPageObjectPrivate::try_clone() to is_copyable() and try_copy() respectively. Updated all references in documentation from "clone" to "copy".

@mara004
Copy link

mara004 commented Mar 21, 2023

I haven't read all of this due to lack of time, but I ran into the same problem as described in the issue title while developing helpers for pypdfium2. Apparently it's a bug/limitation of pdfium. Is upstream aware of this? Has someone already filed an issue?

@ajrcarey
Copy link
Owner

Hi @mara004 , yes, I am confident this is an upstream problem in Pdfium. I am not aware of any upstream issue being filed. Did you find a suitable way of working around the problem in pypdfium?

I have a couple of approaches to mitigation implemented in pdfium-render but none of them are perfect and overall I'm not super happy with the results.

@mara004
Copy link

mara004 commented Mar 21, 2023

pypdfium2's support model currently just prohibits moving page objects across pages, which seemed like the only reasonable thing I could do until this is made possible upstream. I don't know any c++, unfortunately...
From roughly skimming pdfium's bug tracker, I was so far unable to find a matching report. Maybe no one's informed them about it yet. I suppose we should do that.
@ajrcarey Did you find a way to actually move page objects, or are you also just preventing it?

@ajrcarey
Copy link
Owner

ajrcarey commented Mar 21, 2023

I didn't find a memory-safe way to move objects with Pdfium, no. pdfium-render implements two work-arounds:

  • For most (but not all) types of objects, it is possible to duplicate the object by reading its properties from Pdfium and creating a new object with the same properties. By deleting the originals from the source page and applying the duplicates to the destination page, it is possible to effect a move of sorts.
  • Using Pdfium's page import feature, it is possible to copy the source page containing the object(s) to be moved or copied into a new in-memory document, delete everything from that copied page except the target objects, then copy that page back into the source document at the destination page index. This works for all types of objects, but can only be used when copying or moving objects onto a new destination page, not an existing page.

Neither of these approaches is entirely satisfactory, but they do cover many use cases and they are memory safe. pdfium-render supports both moving and copying objects using these two approaches.

The copy methods are part of the object group interface, a flexible page object container provided by pdfium-render, and the documentation is at https://docs.rs/pdfium-render/0.7.34/pdfium_render/page_object_group/struct.PdfPageGroupObject.html.

@mara004
Copy link

mara004 commented Mar 21, 2023

Thanks for the hints! Is approach 1 possible with pdfium's public API, or are you also using non-public APIs for that?
I see that approach 2 will work, but as you mention, this can only be used to isolate single page objects onto new pages, which hardly helps.

@ajrcarey
Copy link
Owner

ajrcarey commented Mar 21, 2023

I am not aware of any non-public FPDF_* API functions. (If you are, please tell me - especially if you have a way of reading an object dictionary! I would love to know.)

Public FPDF_* API functions are used to retrieve property values for supported object types. New objects of the same type are created with the same value. As you know, Pdfium does not provide a way to retrieve all property values for all object types - for example, there is no API (that I know of) to retrieve the Bezier control points of a Bezier curve path segment in a path object. So not all objects can be copied successfully in this way. Each object provides an is_copyable() function that indicates whether the object can be duplicated, and a try_copy() method that actually performs the duplication (or returns an error if the object cannot be duplicated).

Actually, method 2 can copy any number of source page objects, not just one - but it is limited to placing them all on a new page. So it can be very powerful and useful in specific circumstances, but it's useless if you want to merge objects onto an existing page.

@mara004
Copy link

mara004 commented Mar 21, 2023

especially if you have a way of reading an object dictionary! I would love to know.

This is https://crbug.com/pdfium/1694. I'm sorely missing that, too, and hope they'd get a move on.
Dictionary access would theoretically be available through pdfium's non-public backend (the lower-level code layer behind the public APIs), but unfortunately they're C++ (not plain C), so we cannot bind to them on ABI level with ctypes (and non-public members would not be exported into the ABI in any case), which means I can't use this in pypdfium2.
That said, using non-public API would not be ideal anyway.

(edit: fixed bug link, sorry)

@mara004
Copy link

mara004 commented Mar 22, 2023

A further hack amending your second approach came to my mind:
When the pageobject in question is isolated on a page in a temporary document, you could crop down the page to the area of the pageobject (FPDFPageObj_GetBounds(), FPDFPage_SetCropBox() or FPDFPage_TransFormWithClip()), capture it as a new XObject attached to the initial document (FPDF_NewXObjectFromPage()), and then spawn new pageobjects from the XObject (FPDF_NewFormObjectFromXObject()), which may then be inserted into existing pages.
From a technical point of view that would be a really nasty thing to do, but I believe it might work as POC.

@mara004
Copy link

mara004 commented Mar 22, 2023

@ajrcarey FYI, I reported this as https://crbug.com/pdfium/2015.
Very short description due to lack of time; I just saw related commits happening in pdfium lately and thus wanted to report it. Feel free to comment and add more information if you like.

@mara004
Copy link

mara004 commented Mar 22, 2023

@ajrcarey Small hint on your code sample (#60 (comment)):

assert!(bindings.get_pdfium_last_error().is_none());, which calls FPDF_GetLastError() under the hood, only makes sense after failed document loading. It doesn't work with the other APIs (see post by pdfium dev Lei Zhang). So unless you actually have indication for a failure by return code, such assertions can be considered harmful as they threaten to re-raise a previous error that may have been caught by the caller (FPDF_GetLastError() doesn't clear its status).

@ajrcarey
Copy link
Owner

ajrcarey commented Mar 22, 2023

If that post by the platform dev is correct then it is a serious problem in the Pdfium documentation because the documentation explicitly states that any failing SDK call, not just FPDF_LoadDocument(), can set an error value.

Many Pdfium API functions return null or -1 to indicate they failed, with the expectation being that FPDF_GetLastError() will give some indication as to the cause of the error. This is a standard C and C++ design pattern (albeit an old-fashioned one by today's standards), it's mentioned in the Pdfium documentation that several functions communicate error values via FPDF_GetLastError(), and I woud be baffled if they genuinely didn't do this. It would be an astounding oversight in the documentation that flies in the face of any C or C++ programmer's expectations.

Your suggestion of capturing page objects for duplication inside an XObject is an intriguing one and I will experiment with this another time.

@mara004
Copy link

mara004 commented Mar 22, 2023

then it is a serious problem in the Pdfium documentation because the documentation explicitly states that any failing SDK call, not just FPDF_LoadDocument(), can set an error value.

IIRC, that is the old Foxit docs, which were just wrong.
But yes, it's really confusing and I initially made the same mistake before being informed by Lei.

@mara004
Copy link

mara004 commented Mar 22, 2023

// Function: FPDF_GetLastError
//          Get last error code when a function fails.
// Parameters:
//          None.
// Return value:
//          A 32-bit integer indicating error code as defined above.
// Comments:
//          If the previous SDK call succeeded, the return value of this
//          function is not defined. This function only works in conjunction
//          with APIs that mention FPDF_GetLastError() in their documentation.

In the headers, the only function docs that mention FPDF_GetLastError() are the document loading APIs.

@ajrcarey
Copy link
Owner

ajrcarey commented Mar 22, 2023

Indeed. Looking at the git log, they snuck that change into the docs mid last year, so it's relatively recent compared to the lifetime of the project. (It doesn't appear in the copy of the header files I have on my local clone of Pdfium, for instance.) I suppose it's good that the documentation is updated, but I wish they'd done more to publicise this; this is effectively a massive breaking change which, as you rightly point out, can actually be harmful when misused.

I will spawn a new issue to examine all pdfium-render code that calls FPDF_GetLastError() and see what needs to be changed.

@mara004
Copy link

mara004 commented Mar 22, 2023

Ahh, now I remember. That improvement to the docs was probably an aftermath of the confusion on the mailing list.
I agree it was misleading before that, but pdfium's just a large project and any developer can overlook something.

(Note that there was no actual code change, AFAIK pdfium has always behaved this way -- it was just the docs that changed.)

@mara004
Copy link

mara004 commented Apr 2, 2023

@ajrcarey Found something out: The issue is limited to moving objects to new pages.
Exchanging objects between existing pages actually does work.
See https://bugs.chromium.org/p/pdfium/issues/detail?id=2015#c9

@ajrcarey
Copy link
Owner

ajrcarey commented Jan 1, 2025

Possible XFormObject based approach provided by @vmiklos as part of #181 and PR #182. Look to rework this into a more general solution if possible.

@vmiklos
Copy link

vmiklos commented Jan 20, 2025

@ajrcarey I wonder, given that pdfium-render is meant to be a Rust wrapper around pdfium, would it be an option to first just expose the underlying FPDF_NewFormObjectFromXObject() as-is, and then incrementally provide some more general solution later, if you see how that would be possible?

I'm asking as I just realized that this issues is primarily about moving objects, while FPDF_NewFormObjectFromXObject() is about copying, so perhaps having both would be fine. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants