Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistency of generated Zugferd PDF files #201

Closed
Disservin opened this issue Dec 5, 2024 · 12 comments
Closed

Consistency of generated Zugferd PDF files #201

Disservin opened this issue Dec 5, 2024 · 12 comments
Assignees
Labels
question Further information is requested

Comments

@Disservin
Copy link

Hi,

I've noticed two features in the zugferd PDF creation that make it difficult to achieve deterministic output:

  1. Random bits are being added here: https://github.com/horstoeko/zugferd/blob/master/src/ZugferdPdfWriter.php#L97
    What's the purpose of this randomization?

  2. Are the ModifyDate/CreationDate fields strictly required by the specification?

Context: I'm trying to write unit tests where input files A & B should consistently produce the same output PDF C. The random bits and timestamps make this challenging. While timestamps could potentially be handled by mocking system time, this feels like a hacky solution.

@Disservin Disservin added the question Further information is requested label Dec 5, 2024
@Disservin
Copy link
Author

& do we really need the /ID trailer ? Encrypt isn't even given?

@horstoeko
Copy link
Owner

horstoeko commented Dec 5, 2024

Hi!

Unfortunately, I don't understand the question because I can't see the problem you seem to have with it either. How can I help you? The implementation works. So I don't know what's bothering you? You may be able to adapt your unit tests so that you really only test the critical areas.

In a PDF document, the random bytes after the PDF version are not a mandatory technical requirement, but a common practice to support certain compatibility and security aspects.

And you are right: Neither CreateDate nor ModifyDate are defined as mandatory fields in the PDF standard. They are part of the optional information dictionary, which can contain metadata such as title, author, keywords, etc. However, I will not remove this information as it is useful for the recipient of such an e-invoice for user and archiving purposes.

Kind regards?

@Disservin
Copy link
Author

Disservin commented Dec 5, 2024

Well, automated testing becomes difficult...

Non-deterministic PDF generation significantly complicates testing.
In unit tests, you often validate PDF generation by comparing file content with a known good reference document. However, since identical inputs produce different binary outputs, simple validations like diff file1 file2 or checksum comparisons fail.

This kinda forces you to:

  1. Extract and validate the XML separately
  2. Use some PDF parsing tool to skip the non deterministic output, to verify the human readable content
  3. Overall more complex test assertions

It makes testing more complex and time consuming, especially if you treat this as a blackbox test. Simple assertions like "given these inputs, the output should exactly match this reference file" become impossible.

In a PDF document, the random bytes after the PDF version are not a mandatory technical requirement, but a common practice to support certain compatibility and security aspects.

First time that I'm hearing about this, do you have any reference for this?

If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the
beginning of a file to determine whether to treat the file’s contents as text or as binar

Ah, but no reason to have those be random?

And you are right: Neither CreateDate nor ModifyDate are defined as mandatory fields in the PDF standard. They are part of the optional information dictionary, which can contain metadata such as title, author, keywords, etc. However, I will not remove this information as it is useful for the recipient of such an e-invoice for user and archiving purposes.

Maybe make those optional fields optional for a user who's consuming this library too?

@PurHur
Copy link

PurHur commented Dec 5, 2024

+1 for deterministic Testing. In the near future i will build the same as disservin does here for us. We absolutly need to ensure the viewable PDF is the same as the XML.

@horstoeko
Copy link
Owner

Hi,

but I can't change an existing implementation just so that your unit tests go through.

Kind regards

@Disservin
Copy link
Author

This has like no breaking changes though ? The random bytes are random and aren't used for anything. It makes no difference if they are not random?
The timestamps can be optionally disabled.

@horstoeko
Copy link
Owner

Unfortunately, I find it a little disrespectful the way you are hammering me here. I develop and maintain the library in my spare time - and I love doing it. Perhaps a small thank you would have been appropriate at the beginning. I want to see what I can do for you. At the moment I'm very tied up with my “real” job.

Kind regards

@Disservin
Copy link
Author

I wasn't trying to hammer you into anything - I simply wanted to understand the technical reasoning for these things and if you were open for changes. As someone who maintains open source software myself, I appreciate your work on this library.

I'd be happy to implement this myself if you'll approve the general direction of these changes.

@horstoeko
Copy link
Owner

Hi @Disservin,
Hi @PurHur,

Of course I am open to changes - no question. But because this library is already quite widespread and widely used, I have to make sure that I don't make any changes that I'll regret at some point. As I said: I want to see what I can do for you. If you like, you are very welcome to submit a PR.

Kind regards

@horstoeko
Copy link
Owner

Hi @Disservin,
Hi @PurHur,

I have created a PR #203 and tried to meet your requirements. I ask you for a review and feedback.

Thank you very much and have a nice weekend.

@Disservin
Copy link
Author

This looks good from a consistency standpoint (which this issue was about) I think, thank you.
Though I'd like to say that the benefit of having it be deterministic is not solely beneficial for testing as stated in the variable comment

this mode should only be used for testing purposes

@horstoeko
Copy link
Owner

Hi @Disservin,

Okay. Thanks for the feedback. I would then transfer this to the productive branch and am pleased to have satisfied you with it.

Though I'd like to say that the benefit of having it be deterministic is not solely beneficial for testing as stated in the variable comment

I don't think so. This metadata has made sense for decades and therefore has a right to exist... ;-)

Kind regards

Repository owner locked as resolved and limited conversation to collaborators Dec 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants