Metadata in generated PDF

Morten
We are new to Prince, and we have a requirement to have some metadata in the document that is used by our printing and mailing provider.

How can we do this in Prince?

Here is a sample of this metadata:


8 0 obj
<</Length 585/Type/Metadata/Subtype/XML>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:sbank="http://schemas.somebank.no/2015/11/">
<sbank:invoiceMetadata>PAGE1;55555555;SomeBank;Morten Middlename Lastname;444444;5555555;3333;N;;;1980-01-01;P;</sbank:invoiceMetadata>
<sbank:documentRef>DREF-00123456789</sbank:documentRef>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/" />
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>
endstream
endobj
mikeday
You cannot do this in Prince yet, although we hope to add a feature for this in the future. In the meantime it may be possible to achieve with a third-party PDF processing tool.
Morten
Hmmm... This may be a showstopper for us using Prince. With the volume of PDFs we send to print, we cannot go for a half-and-half solution; we need a real one.
mikeday
Perhaps you could provide me with a bit more information about the metadata, then I can take a look. Is it specified as just a chunk of text that should be embedded as is? Does it have varying IDs, etc.?
Morten
Our specific need is basically what you see in my original post. There are basically two variables here: one is the invoiceMetadata and the other is the documentRef. Both are simple strings. The rest is static as far as I am concerned. I am no PDF expert, but I suspect that the length needs to be calculated, and I have no idea what the purpose of the "id" attribute is. But the whole chunk is a standard xmp metadata thingie, so the interwebs should have some info on that.
Morten
Looks like the ID is static as well, so it is just the Length that needs to be calculated: https://www.crossref.org/blog/w5m0mpcehihzreszntczkc9d/
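On the Length question: in a PDF stream dictionary, /Length is the number of bytes in the stream data, so for a packet containing non-ASCII characters it differs from the character count. A PDF producer normally computes this itself. As a minimal sketch of the arithmetic (utf8ByteLength is a hypothetical helper name, and the snippet assumes a runtime that provides TextEncoder, such as Node.js or a browser, not necessarily Prince's own script engine):

```javascript
// Sketch: a PDF stream's /Length counts bytes, not characters.
// utf8ByteLength is a hypothetical helper name, not part of Prince.
function utf8ByteLength(text) {
  // TextEncoder always encodes to UTF-8.
  return new TextEncoder().encode(text).length;
}

// Pure ASCII: one byte per character.
console.log(utf8ByteLength("<sbank:documentRef>DREF-00123456789</sbank:documentRef>"));

// "\u00f8" (o with stroke) takes two bytes in UTF-8, so 5 characters become 6 bytes.
console.log(utf8ByteLength("Bj\u00f8rn")); // 6
```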
howcome
So, you will supply a string starting with the '<?xpacket ...' and ending with '<?xpacket end="r"?>' and expect Prince to insert it into the PDF file?
mikeday
I've attached a PDF file that includes the XMP metadata packet, does this work for you?

One issue we need to resolve is how to combine this metadata with any other XMP metadata required by specific PDF profiles like PDF/A or PDF/X. I don't think we can just blindly concatenate the two XML chunks, so we will either need to parse the file and merge them ourselves or embed them in two separate places.
  1. test.pdf (14.9 kB)
Morten
As I suggested to Howcome in an email, one solution would be to imitate some other PDF libraries, which let you build the XMP metadata programmatically.

Perhaps something like this?

var ns = PDF.registerXmpNamespace(namespacetag, namespaceuri);
ns.setValue(tag, value);


To build my example from the original post, the html would then contain something like this:

var ns = PDF.registerXmpNamespace("sbank", "http://schemas.somebank.no/2015/11/");
ns.setValue("invoiceMetadata", "PAGE1;55555555;SomeBank;Morten Middlename Lastname;444444;5555555;3333;N;;;1980-01-01;P;");
ns.setValue("documentRef", "DREF-00123456789");
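As a rough illustration of what such an interface would have to produce, the sketch below builds the XMP string from a prefix, a namespace URI and a set of simple string values. buildXmp and escapeXml are hypothetical names for this example only; they are not part of Prince or of any PDF library.

```javascript
// Hypothetical helpers, not a Prince API.

// Escape characters that are special in XML text and attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build an <x:xmpmeta> element holding simple string properties in one namespace.
function buildXmp(prefix, namespaceUri, values) {
  var lines = [
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">',
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">',
    '<rdf:Description rdf:about="" xmlns:' + prefix + '="' + escapeXml(namespaceUri) + '">'
  ];
  Object.keys(values).forEach(function (tag) {
    var name = prefix + ":" + tag;
    lines.push("<" + name + ">" + escapeXml(values[tag]) + "</" + name + ">");
  });
  lines.push("</rdf:Description>", "</rdf:RDF>", "</x:xmpmeta>");
  return lines.join("\n");
}

var xmp = buildXmp("sbank", "http://schemas.somebank.no/2015/11/", {
  invoiceMetadata: "PAGE1;55555555;SomeBank;Morten Middlename Lastname;444444;5555555;3333;N;;;1980-01-01;P;",
  documentRef: "DREF-00123456789"
});
console.log(xmp);
```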

mikeday
Would you wish to call this from JavaScript within the document itself, or from the Java/C# wrappers?
Morten
BTW, I sent the test PDF to our printer to get a verification that the metadata looks OK, but from what I can see it is golden.
Morten
We are using a rich template engine with data binding, so it would probably be easiest to just include this as data binding elements in the template, which means that having this in javascript in the document would be the way to go. That way we don't have to send along the data model when we go to render the PDF.
mikeday
Yes, I think we can do something like this; we will investigate.
mikeday
The latest build of Prince has a new --pdf-xmp option which can be used to include additional XMP metadata in the PDF. Currently it is taken from an external file, but it is also possible to specify it in JavaScript as a data URL string via the PDF.xmp property. Since encoding the XMP as a data URL isn't very convenient we plan to add a simpler interface for this in future.
Morten
Thanks :)

Is there documentation anywhere as to how this metadata needs to be formatted?
mikeday
It's an XMP file, so basically the <x:xmpmeta> element and its contents (the xpacket processing instructions are ignored as Prince generates those itself when it produces the PDF file).
Morten
OK. Will try this shortly. Thanks again!
Morten
It took a while before I got a chance to test this.

I am trying now, and I am unable to get it to work. Can you please check if you can see anything obvious that I am missing? I have included the entire script tag in my html template here.

	<script>
		PDF.embedFonts(true);
		PDF.subsetFonts(true);
		//PDF.artificialFonts (boolean)

		PDF.compress(false);

		PDF.encrypt(false);
		//PDF.userPassword, ownerPassword (string, can be null)
		PDF.allowPrint(true);
		PDF.allowModify(false);
		PDF.allowCopy(true);
		PDF.allowAnnotate(false);
		//PDF.keyBits (40 | 128)

		//PDF.script (string, can be null)
		//PDF.openAction (eg. "print")
		//PDF.pageLayout (single-page | one-column | two-column[-left/right])
		//PDF.pageMode (auto | show-bookmarks | fullscreen | show-attachments)
		//PDF.printScaling (auto | none)

		//PDF.profile (string, can be null)
		//PDF.outputIntent (URL string, can be null)

        @{
            var xmp = @"<x:xmpmeta xmlns:x=""adobe:ns:meta/"">
                        <rdf:RDF xmlns:rdf=""http://www.w3.org/1999/02/22-rdf-syntax-ns#"">
                        <rdf:Description rdf:about="""" xmlns:sbank=""http://schemas.somebank.no/2015/11/"">
                        <sbank:invoiceMetadata>PAGE1;55555555;SomeBank;Morten Middlename Lastname;444444;5555555;3333;N;;;1980-01-01;P;</sbank:invoiceMetadata>
                        <sbank:documentRef>DREF-00123456789</sbank:documentRef>
                        </rdf:Description>
                        <rdf:Description rdf:about="""" xmlns:xmp=""http://ns.adobe.com/xap/1.0/"" />
                        </rdf:RDF>
                        </x:xmpmeta>";
        }
        PDF.xmp = @xmp;
        //PDF.xmp(@xmp);
	</script>
mikeday
What are the @ symbols in the JavaScript?
Morten
Sorry, that is Razor syntax. Inside the @{ } is C# code. The same with the @xmp, that is just accessing the C# variable.

I tweaked the xml a little bit to remove line breaks to make sure that wasn't the issue. But it still does not work. I have included the rendered html template here (script part only) so that you can have a look at exactly what goes into Prince.

	<script>
		PDF.embedFonts(true);
		PDF.subsetFonts(true);
		//PDF.artificialFonts (boolean)

		PDF.compress(false);

		PDF.encrypt(false);
		//PDF.userPassword, ownerPassword (string, can be null)
		PDF.allowPrint(true);
		PDF.allowModify(false);
		PDF.allowCopy(true);
		PDF.allowAnnotate(false);
		//PDF.keyBits (40 | 128)

		//PDF.script (string, can be null)
		//PDF.openAction (eg. "print")
		//PDF.pageLayout (single-page | one-column | two-column[-left/right])
		//PDF.pageMode (auto | show-bookmarks | fullscreen | show-attachments)
		//PDF.printScaling (auto | none)

		//PDF.profile (string, can be null)
		//PDF.outputIntent (URL string, can be null)

        PDF.xmp('<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="" xmlns:sbank="http://schemas.somebank.no/2015/11/"><sbank:invoiceMetadata>PAGE1;55555555;SomeBank;Morten Middlename Lastname;444444;5555555;3333;N;;;1980-01-01;P;</sbank:invoiceMetadata><sbank:documentRef>DREF-00123456789</sbank:documentRef></rdf:Description><rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/" /></rdf:RDF></x:xmpmeta>');
	</script>
mikeday
Thanks. The PDF.xmp property expects a URL, but you can give it an XML string by encoding this as a data URL like so:
var x = '<x:xmpmeta...';

PDF.xmp = "data:application/xml," + encodeURIComponent(x);
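Wrapped up as a small function, the same approach might look like the sketch below. toXmpDataUrl is a hypothetical helper name; the typeof guard is only there so the snippet can also be run outside Prince, where the PDF object does not exist.

```javascript
// toXmpDataUrl is a hypothetical helper name; PDF.xmp is the Prince property discussed above.
function toXmpDataUrl(xml) {
  return "data:application/xml," + encodeURIComponent(xml);
}

var xml = '<x:xmpmeta xmlns:x="adobe:ns:meta/">\r\n</x:xmpmeta>';
var url = toXmpDataUrl(xml);

// Only assign when running inside Prince, where the PDF object is defined.
if (typeof PDF !== "undefined") {
  PDF.xmp = url;
}

// The URL encoding itself is lossless: decoding gives back the original string,
// including the \r\n line break.
var decoded = decodeURIComponent(url.slice("data:application/xml,".length));
console.log(decoded === xml); // true
```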
Morten
I tried your code and modified my template like this:

        PDF.xmp = "data:application/xml," + encodeURIComponent('<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="" xmlns:sbank="http://schemas.somebank.no/2015/11/"><sbank:invoiceMetadata>PAGE1;55555555;SomeBank;Morten Middlename Lastname;444444;5555555;3333;N;;;1980-01-01;P;</sbank:invoiceMetadata><sbank:documentRef>DREF-00123456789</sbank:documentRef></rdf:Description><rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/" /></rdf:RDF></x:xmpmeta>');



but the XMP is still not included in the PDF.
mikeday
Is JavaScript enabled? If you run Prince like this on the attached HTML document:
prince --javascript xmp.html

It should produce xmp.pdf, which has the XMP included.
  1. xmp.html (0.6 kB)
  2. xmp.pdf (1.3 kB)
Morten
Adding the --javascript did the trick. Now for the $64,000 question: How do we enable JavaScript when we use Prince within C#?

Here is our code:
                var events = new PdfGenerationEvents();
                var prn = new Prince(@"Prince/bin/prince.exe", events);
                lock(outputStream)
                    prn.ConvertMemoryStream(html, outputStream);
mikeday
prn.SetJavaScript(true);
Morten
Of course... This was a bit of a palm to forehead moment...

It seems to work fine now. Have sent a sample to our printer for verification.

Thanks for implementing this and for assisting me in getting it to run.
Morten
I have a reply from our printer that they need the metadata to look exactly like the output of our current production setup, which includes line breaks in the correct places. But it looks like this encodeURIComponent function is stripping away \r\n characters.
Is there a way to send in metadata and keep the line breaks?
mikeday
Oh, now that's just nasty. This is RDF/XML, and whitespace characters between the tags are semantically irrelevant and should be completely ignored!

How are they processing the metadata and why does it need linebreaks in specific places? This sounds like their software is not conforming to the XMP specification at all.
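To illustrate the point about whitespace, here is a rough sketch that compares a line-broken fragment with its compact form after dropping the whitespace between tags. stripInterTagWhitespace is a hypothetical helper, and the regular expression is only a string-level approximation of what an XML parser does; it is adequate for simple packets like the one in this thread, not for XML in general.

```javascript
// Rough string-level approximation (not a real XML parser): drop whitespace
// that sits between a ">" and the next "<".
function stripInterTagWhitespace(xml) {
  return xml.replace(/>\s+</g, "><").trim();
}

// The same fragment, once with \r\n line breaks and indentation...
var formatted = [
  '<rdf:Description rdf:about="" xmlns:sbank="http://schemas.somebank.no/2015/11/">',
  '  <sbank:documentRef>DREF-00123456789</sbank:documentRef>',
  '</rdf:Description>'
].join("\r\n");

// ...and once on a single line.
var compact =
  '<rdf:Description rdf:about="" xmlns:sbank="http://schemas.somebank.no/2015/11/">' +
  '<sbank:documentRef>DREF-00123456789</sbank:documentRef>' +
  '</rdf:Description>';

// Once inter-tag whitespace is ignored, both forms are identical.
console.log(stripInterTagWhitespace(formatted) === compact); // true
```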
Morten
I agree, it sounds like they are doing some half-assed string parsing. But it's what I have to work with right now. If there is no way to keep the linebreaks in your solution as it is, I will work with them to see what they can do on their end.
mikeday
Currently Prince is parsing the XML so that it can check correctness and merge in any additional metadata properties needed for the PDF profile (eg. PDF/A or PDF/X, which require additional XMP).

Copying the text through verbatim is not compatible with this approach.

We could add newlines after the elements, but I would be very nervous about maintaining this in the future; what if a different printer requires \n and chokes on \r\n, or some other weirdness?

If everyone sticks to the spec it should guarantee compatibility with the widest range of vendors.
Morten
Thanks. I will pursue this with our printer.