Provide memory-safe options for moving PdfPageObjects from one PdfPage to another. #60
Comments
(The best way forward may be to try to clone the objects, place the clone on the new page, and then delete the original from its containing page. Cloning an object exactly wasn't plausible at the time of #18 but is much closer to being plausible now because |
Apologies for the delay. I spent quite a bit of time today examining this problem. It is indeed the same basic problem as #18, although you have discovered an additional related problem. There are two separate issues here:
While I'm sure these two problems are related, I propose we focus on the first one. The second problem, while certainly weird, can probably be worked around by applying transformations, either to the page objects remaining on the affected page, or to the entire page itself. (Perhaps the fact that the page is transformed is somehow a hint as to the cause of the memory ownership problem, but if it is, I'm not clever enough to see it.) Changing While my instinct is that this is an upstream problem in Pdfium, I'd like to completely eliminate any possibility of a bug in If we're lucky, then the cause of this problem will turn out to be some sort of subtle bug in Assuming the bug is definitely in Pdfium, not |
I am able to reproduce the segmentation fault using pure

    use pdfium_render::prelude::*;

    fn main() -> Result<(), PdfiumError> {
        let pdfium = Pdfium::new(
            Pdfium::bind_to_library(Pdfium::pdfium_platform_library_name_at_path("../pdfium/"))
                .or_else(|_| Pdfium::bind_to_system_library())?,
        );

        let bindings = pdfium.bindings();

        let document = bindings.FPDF_LoadDocument("../pdfium/test/export-test.pdf", None);
        assert!(bindings.get_pdfium_last_error().is_none());

        let page_count = bindings.FPDF_GetPageCount(document);

        let page = bindings.FPDF_LoadPage(document, 0);
        assert!(bindings.get_pdfium_last_error().is_none());

        println!(
            "{} page objects on page",
            bindings.FPDFPage_CountObjects(page)
        );

        let object = bindings.FPDFPage_GetObject(page, 0);
        assert!(bindings.get_pdfium_last_error().is_none());
        println!("1");

        let result = bindings.FPDFPage_RemoveObject(page, object);
        assert!(bindings.is_true(result));
        println!("2");

        let result = bindings.FPDFPage_GenerateContent(page);
        assert!(bindings.is_true(result));
        println!("3");

        bindings.FPDF_ClosePage(page);
        assert!(bindings.get_pdfium_last_error().is_none());
        println!("4");

        let new_page = bindings.FPDFPage_New(document, page_count, 600.0, 600.0);
        assert!(bindings.get_pdfium_last_error().is_none());
        println!("5");

        bindings.FPDFPage_InsertObject(new_page, object);
        assert!(bindings.get_pdfium_last_error().is_none());
        println!("6");

        let result = bindings.FPDFPage_GenerateContent(new_page); // <---- segfaults here
        assert!(bindings.is_true(result));
        println!("7");

        bindings.FPDF_ClosePage(new_page);
        assert!(bindings.get_pdfium_last_error().is_none());
        println!("8");

        bindings.FPDF_CloseDocument(document);
        assert!(bindings.get_pdfium_last_error().is_none());

        bindings.FPDF_DestroyLibrary();

        Ok(())
    }

This suggests that the problem is indeed in Pdfium, not |
I have spent more time on this, experimenting with the sequencing of page opens, page closes, and content regeneration, but I can find no way around Pdfium segfaulting when it regenerates content on the destination page. I think at this point we can be confident that the problem is not a bug in pdfium-render but is definitely upstream in Pdfium itself.

The only workaround I can see is to attempt to clone each object rather than move it. This is problematic because Pdfium does not provide access to every property of every type of page object, making exact cloning impossible in certain situations. For instance, cloning a path object exactly is not possible if the path includes Bézier curves, because Pdfium provides no way to retrieve the control points of any Bézier curve segments in a subpath. However, for path objects that do not include curves, and for text objects and image objects, it should be possible to clone the objects pretty precisely.

I propose implementing a new try_clone() method, used as in the code below:

    let objs = page.objects();
    let new_objs = new_page.objects_mut();
    let mut i = 0;
    while i < objs.len() {
        let obj = objs.get(i).map_err(PdfiumErr)?;
        // Skip objects without bounds, advancing the index so the loop cannot stall.
        let Ok(bounds) = obj.bounds() else {
            i += 1;
            continue;
        };
        if bounds.bottom >= y {
            log::debug!("item = {:?}", obj.object_type());
            // OLD: new_objs.take_object_from_page(&mut page, i);
            // NEW:
            new_objs.add_page_object(obj.try_clone()?)?; // <-- uses proposed new .try_clone() method
        }
        // Cloning leaves the original on the source page, so always step past it.
        i += 1;
    }

The |
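The clone-then-delete idea discussed above can be sketched without Pdfium at all. The block below is a minimal model only: `Page`, `PageObject`, and `move_objects_above` are hypothetical stand-ins invented for illustration, not pdfium-render types or functions. It shows the index handling that a "move" built from a clone plus a removal needs.

```rust
// Toy stand-ins for illustration only; these are NOT pdfium-render types.
#[derive(Clone, Debug, PartialEq)]
struct PageObject {
    label: String,
    bottom: f32, // bottom edge of the object's bounding box, in points
}

#[derive(Debug, Default)]
struct Page {
    objects: Vec<PageObject>,
}

/// Moves every object whose bottom edge is at or above `y` from `src` to `dst`
/// by cloning it onto `dst` and then deleting the original from `src`.
/// Returns the number of objects moved.
fn move_objects_above(src: &mut Page, dst: &mut Page, y: f32) -> usize {
    let mut moved = 0;
    let mut i = 0;
    while i < src.objects.len() {
        if src.objects[i].bottom >= y {
            // Clone first, so the destination owns an independent copy...
            dst.objects.push(src.objects[i].clone());
            // ...then delete the original. The next object shifts into slot `i`,
            // so the index must not advance in this branch.
            src.objects.remove(i);
            moved += 1;
        } else {
            i += 1;
        }
    }
    moved
}
```

Because the destination only ever receives a fresh copy, no object is owned by two pages at once, which is the property the direct move cannot guarantee.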
Removed The new functionality is demonstrated in While working on this, it did occur to me that there might be another way: using the |
My sincere apologies for the delay in progressing this; I really did mean to work on this after Christmas, but unfortunately I was rather ill over that period. The last few days have been the next opportunity. I have pushed a commit that makes the following changes:
If you run the In addition to improving the
Generally speaking, I am a bit disappointed in the results from these new functions. I was expecting Pdfium to make perfect clones of each page object, since (unlike None of the provided functions offer a truly satisfying solution, but I believe this is likely the best that can be done within the constraints imposed by Pdfium. Careful testing is required to make sure that my approach to "unreflecting" objects isn't overzealous. I suspect it may be. If you have sample documents you can try, I'd be keen to hear your results. |
Relaxed mutability requirements in |
Completed unit tests for |
Released as crate version 0.7.28. This issue will remain open to allow time for @N3xed to provide feedback, and to allow more investigation as to whether my approach to "unreflecting" objects isn't overzealous. |
Interestingly, the |
Noticed that |
Renamed |
I haven't read all of this due to lack of time, but I ran into the same problem as described in the issue title while developing helpers for pypdfium2. Apparently it's a bug/limitation of pdfium. Is upstream aware of this? Has someone already filed an issue? |
Hi @mara004, yes, I am confident this is an upstream problem in Pdfium. I am not aware of any upstream issue being filed. Did you find a suitable way of working around the problem in pypdfium2? I have a couple of approaches to mitigation implemented in pdfium-render but none of them are perfect and overall I'm not super happy with the results. |
pypdfium2's support model currently just prohibits moving page objects across pages, which seemed like the only reasonable thing I could do until this is made possible upstream. I don't know any C++, unfortunately... |
I didn't find a memory-safe way to move objects with Pdfium, no.
Neither of these approaches is entirely satisfactory, but they do cover many use cases and they are memory safe. The copy methods are part of the object group interface, a flexible page object container provided by |
Thanks for the hints! Is approach 1 possible with pdfium's public API, or are you also using non-public APIs for that? |
I am not aware of any non-public Public Actually, method 2 can copy any number of source page objects, not just one - but it is limited to placing them all on a new page. So it can be very powerful and useful in specific circumstances, but it's useless if you want to merge objects onto an existing page. |
This is https://crbug.com/pdfium/1694. I'm sorely missing that, too, and hope they'd get a move on. (edit: fixed bug link, sorry) |
A further hack amending your second approach came to my mind: |
@ajrcarey FYI, I reported this as https://crbug.com/pdfium/2015. |
@ajrcarey Small hint on your code sample (#60 (comment)):
|
If that post by the platform dev is correct then it is a serious problem in the Pdfium documentation because the documentation explicitly states that any failing SDK call, not just Many Pdfium API functions return null or -1 to indicate they failed, with the expectation being that Your suggestion of capturing page objects for duplication inside an XObject is an intriguing one and I will experiment with this another time. |
IIRC, that is the old Foxit docs, which were just wrong. |
In the headers, the only function docs that mention |
Indeed. Looking at the git log, they snuck that change into the docs mid last year, so it's relatively recent compared to the lifetime of the project. (It doesn't appear in the copy of the header files I have on my local clone of Pdfium, for instance.) I suppose it's good that the documentation is updated, but I wish they'd done more to publicise this; this is effectively a massive breaking change which, as you rightly point out, can actually be harmful when misused. I will spawn a new issue to examine all |
Ahh, now I remember. That improvement to the docs was probably an aftermath of the confusion on the mailing list. (Note that there was no actual code change, AFAIK pdfium has always behaved this way -- it was just the docs that changed.) |
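If, as the comments above suggest, the last-error value is only meaningful after calls documented to set it, then each call's own return value is the thing to check. The block below is a rough sketch of that idea in plain Rust. `CallError`, `check_handle`, and `check_count` are hypothetical helpers invented for this example; they are not part of Pdfium or pdfium-render.

```rust
/// Hypothetical error type for this sketch; not part of pdfium-render.
#[derive(Debug, PartialEq)]
enum CallError {
    NullHandle(&'static str),
    NegativeCount(&'static str),
}

/// Converts a raw handle returned by a C-style call into a `Result`,
/// treating a null pointer as failure of the named call.
fn check_handle<T>(call: &'static str, handle: *mut T) -> Result<*mut T, CallError> {
    if handle.is_null() {
        Err(CallError::NullHandle(call))
    } else {
        Ok(handle)
    }
}

/// Converts a C-style count into a `Result`,
/// treating a negative value such as -1 as failure of the named call.
fn check_count(call: &'static str, count: i32) -> Result<usize, CallError> {
    if count < 0 {
        Err(CallError::NegativeCount(call))
    } else {
        Ok(count as usize)
    }
}
```

Checks of this kind depend only on what a call returned, so they stay valid whether or not a global error code was updated by that call.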
@ajrcarey Found something out: The issue is limited to moving objects to new pages. |
@ajrcarey I wonder, given that pdfium-render is meant to be a Rust wrapper around pdfium, would it be an option to first just expose the underlying FPDF_NewFormObjectFromXObject() as-is, and then incrementally provide some more general solution later, if you see how that would be possible? I'm asking as I just realized that this issue is primarily about moving objects, while FPDF_NewFormObjectFromXObject() is about copying, so perhaps having both would be fine. Thanks. |
Hi there.
I've been trying hard to get moving objects from one page to another to work, without success.
On the one hand, using PdfPageObjects::remove_object gives borrow checker errors, because getting a PdfPage object borrows PdfPageObjects immutably and then you can't borrow the same PdfPageObjects mutably to call the remove_object function.
I've also tried it with good old index access and with that I can remove them fine, but re-adding them to another PdfPage produces segfaults. For example:
produces segfaults.
There is also the PdfPageGroupObject, but this doesn't help since calling remove_objects_from_page also deletes the objects (and somehow mirrors the source page horizontally).
So, am I doing this wrong? Is it even possible?
Either way, many thanks for your great work.
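The borrow checker conflict described in this post (a shared borrow taken to inspect an object, followed by a mutable borrow to remove it) is a general Rust pattern and can be shown without Pdfium. The following is a minimal sketch under that assumption: `Object`, `Objects`, and `remove_above` are toy names invented for illustration, not the pdfium-render API. It splits the work into a read-only pass that yields plain indices and a second pass that mutates.

```rust
// Toy stand-ins for illustration only; these are NOT pdfium-render types.
#[derive(Debug, PartialEq)]
struct Object {
    bottom: f32,
}

struct Objects {
    items: Vec<Object>,
}

impl Objects {
    fn len(&self) -> usize {
        self.items.len()
    }

    fn get(&self, index: usize) -> Option<&Object> {
        self.items.get(index)
    }

    fn remove_at(&mut self, index: usize) -> Object {
        self.items.remove(index)
    }
}

/// Removes and returns every object whose bottom edge is at or above `y`.
fn remove_above(objects: &mut Objects, y: f32) -> Vec<Object> {
    // Phase 1: read-only pass. Only plain indices leave this statement,
    // so the shared borrow of `objects` ends before any mutation starts.
    let indices: Vec<usize> = (0..objects.len())
        .filter(|&i| objects.get(i).map_or(false, |o| o.bottom >= y))
        .collect();

    // Phase 2: mutate. Walking the indices in reverse keeps the
    // remaining (smaller) indices valid after each removal.
    let mut removed: Vec<Object> = indices
        .into_iter()
        .rev()
        .map(|i| objects.remove_at(i))
        .collect();
    removed.reverse(); // restore the original ordering
    removed
}
```

This only addresses the borrow checker error; it does nothing about the segfault that follows when the removed objects are re-added to another page.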