Deterministic output: a way to specify the PDF File Identifier (/ID) property

Chris Thorman
Hi folks,

I would like to produce a PDF with a deterministic /ID file property, so that two runs on the same document with the same inputs produce identical output.

Option 1: A command line param where I could give the ID to be used: I would specify the /ID property of the generated PDF based on (my own) hash of its input files. I am responsible for determining what inputs are relevant.

Option 2: A command line param asking Prince to use only a hash of the actual input resources (HTML stream, CSS files, images, fonts, anything else?) used to render the document, disregarding ephemera such as timestamps, file names, etc. These inputs would have to be processed in a deterministic order to produce a single resulting hash. It would be OK if new versions of Prince generated different document IDs. But the same version of Prince, running on the same type of OS, with the same input resources and the same command line parameters (and any other hidden input parameters it may use, such as environment params), should generate the same hash. For extra credit, it could be a more modern hash such as SHA-512 instead of MD5, as there seems to be no length limit specified in the PDF spec.
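
To make the idea concrete, here is a rough sketch of the kind of content-only hash I have in mind. It is not how Prince works internally, just an illustration; the function name resource_digest and its params argument are made up for this example. It hashes each resource's bytes, sorts the per-resource digests so that file names and discovery order don't matter, and then mixes in the command line parameters in order.

```python
import hashlib
from typing import Iterable, Sequence


def resource_digest(resources: Iterable[bytes], params: Sequence[str] = ()) -> str:
    """Hash resource contents and parameters into one SHA-512 hex digest.

    Hypothetical helper: only the bytes of each resource are used, never
    its file name or timestamp.
    """
    # Hash each resource on its own, then sort the digests so that neither
    # file names nor the order of discovery influence the result.
    digests = sorted(hashlib.sha512(data).digest() for data in resources)

    combined = hashlib.sha512()
    # Record how many resources there are, then feed in their digests.
    combined.update(len(digests).to_bytes(8, "big"))
    for digest in digests:
        combined.update(digest)

    # Parameters are order-sensitive. Each one is length-prefixed so that
    # ("ab", "c") and ("a", "bc") produce different hashes.
    for param in params:
        encoded = param.encode("utf-8")
        combined.update(len(encoded).to_bytes(8, "big"))
        combined.update(encoded)

    return combined.hexdigest()
```

Sorting the per-resource digests is one possible way to get a deterministic order; if the order of resources is itself significant (for example, stylesheet order), it would need to be hashed in sequence instead.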

I prefer option 2 because the Prince engine knows the total set of parameters that can affect the output produced by its engine, whereas as a user, I can't always know that.

There are many use cases, but here is one example: imagine I have a source project in which it is desirable to commit a PDF built as an intermediate step from other project resources. (Yes, I realize that committing built outputs is an anti-pattern, but there are reasonable exceptions.) I would like git to be able to ignore a rebuilt artifact if its contents happen to be byte-for-byte identical. Conversely, I would like an engineer to question why a built artifact changed during a full rebuild when it should not have.

Thanks for considering!

-c

P.S. In the meantime I'll fake this feature myself by post-processing the PDF: I have observed that, so far, Prince seems to produce deterministic output except for this field. (For example, it seems not to include the datestamp metadata unless asked.)
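
For anyone wanting to try the same workaround, a minimal sketch of the post-processing step might look like the following. It rests on several assumptions that I have not verified for every output: the PDF is not encrypted, the /ID is written as an array of two hex strings, and the replacement has the same number of hex digits as the original, so that no byte offsets in the file shift. The name set_pdf_id is just something I made up for this example.

```python
import re

# Matches an /ID array made of two hex strings, e.g. /ID [<ab12...> <cd34...>].
# The surrounding delimiters and whitespace are captured so they can be kept.
ID_PATTERN = re.compile(
    rb"(/ID\s*\[\s*<)([0-9A-Fa-f]+)(>\s*<)([0-9A-Fa-f]+)(>\s*\])"
)


def set_pdf_id(pdf_bytes: bytes, new_id_hex: str) -> bytes:
    """Overwrite both /ID strings with new_id_hex without changing file length.

    Hypothetical helper: assumes an unencrypted PDF whose /ID entries are
    hex strings of the same length as new_id_hex.
    """
    if not re.fullmatch(r"[0-9A-Fa-f]+", new_id_hex):
        raise ValueError("new_id_hex must contain only hex digits")
    new_id = new_id_hex.encode("ascii")

    def swap(match: "re.Match[bytes]") -> bytes:
        # Refuse to change the length, since that would invalidate the
        # byte offsets recorded in the cross-reference data.
        if len(match.group(2)) != len(new_id) or len(match.group(4)) != len(new_id):
            raise ValueError("new ID length differs from the existing /ID")
        return match.group(1) + new_id + match.group(3) + new_id + match.group(5)

    result, count = ID_PATTERN.subn(swap, pdf_bytes)
    if count == 0:
        raise ValueError("no hex-string /ID array found")
    return result
```

The new ID could be the first 32 hex digits of any hash of the inputs, since the IDs I have seen are 16 bytes long. Both strings are set to the same value here, which I believe matches what a freshly created file looks like.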

Edited by Chris Thorman