UTF-8 HTML title gives broken PDF title

davidbalch
27 Jan 2021

Hi,

I think this may be a Prince issue, but I don't have the low-level skills to prove it. (I'm using Prince 12.5.1.)

I have an HTML file with non-Latin characters (e.g. ☙ – ❧) in the head/title element. After running it through Prince they show up in the PDF metadata title, but they may be in the wrong encoding: after processing with Ghostscript [1], the title is mangled.

Some web searching suggested that the title needs to be either PDFDocEncoding or UTF-16BE with a Byte Order Mark (page 158 of the 1.7 PDF Reference Manual). [2]
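To make the UTF-16BE-with-BOM form concrete, here is a minimal Python sketch of how such a title could be written as a PDF hex string. The function name pdf_text_string_hex is made up for illustration, and the sample title is just the three characters mentioned above, not the real title from test.xhtml.

```python
def pdf_text_string_hex(title: str) -> str:
    """Return the body of a PDF hex string: BOM (FE FF) followed by UTF-16BE."""
    # "utf-16-be" does not emit a BOM by itself, so it is prepended explicitly.
    raw = b"\xfe\xff" + title.encode("utf-16-be")
    return raw.hex().upper()


# Sample title: U+2619, space, U+2013, space, U+2767
print("<" + pdf_text_string_hex("\u2619 \u2013 \u2767") + ">")
# prints <FEFF26190020201300202767>
```

A title that only uses characters available in PDFDocEncoding would not need this form; the sketch covers only the UTF-16BE case.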

Files from a reduced test case attached.

optimised.pdf was generated from test.pdf with the command:

gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=optimised.pdf test.pdf

Any ideas?

Cheers,
Dave.

[1] Using gs to linearize, per https://www.princexml.com/doc/prince-output/#pdf-compression
[2] https://stackoverflow.com/questions/9188189/wrong-encode-when-update-pdf-meta-data-using-ghostscript-and-pdfmark

optimised.pdf 18.1 kB
test.pdf 16.7 kB
test.xhtml 0.3 kB

mikeday
27 Jan 2021

Looks like Ghostscript might be generating a broken XMP metadata packet? The actual document title still looks fine in Evince though.

davidbalch
28 Jan 2021

Are there any tools you can suggest that would help me inspect PDF metadata to try and figure this out?

mikeday
28 Jan 2021

You can actually open the PDF file in a text editor and take a look, or you could use pdftk with the drop_xmp option to remove it perhaps?
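If a text editor is awkward for this, a rough Python sketch along the following lines could pull the XMP packet out for inspection. It assumes the metadata stream is stored uncompressed (which is common but not guaranteed) and simply scans for the standard xpacket begin/end markers. The function name extract_xmp_packets is hypothetical.

```python
import re
from typing import List

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>
_XPACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def extract_xmp_packets(pdf_bytes: bytes) -> List[bytes]:
    """Return every uncompressed XMP packet found in the raw PDF bytes."""
    return _XPACKET.findall(pdf_bytes)


# Possible usage (file name taken from this thread):
#   with open("optimised.pdf", "rb") as fh:
#       for packet in extract_xmp_packets(fh.read()):
#           print(packet.decode("utf-8", errors="replace"))
```

Decoding with errors="replace" is deliberate here: if the packet contains invalid UTF-8, the replacement characters show where it is broken instead of raising an exception.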

davidbalch
28 Jan 2021

Ah, I've found pdf-parser.py (https://blog.didierstevens.com/programs/pdf-tools/), which seems helpful...

On the Prince PDF:

$ ./pdf-parser.py -s Title -w test.pdf 
obj 14 0
 Type: 
 Referencing: 
 
<</Producer (Prince 12.5.1 \(www.princexml.com\))
/Title <FEFF005400690074006C00650020007400610067002000770069007400680020005500540046002D0038003A0020261920132767>>>

That FEFF looks like a BOM to me, but only from what I read on Wikipedia.
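One way to check this is to decode the hex string by hand. The Python sketch below assumes the string is exactly as pasted above and treats a leading FE FF as the UTF-16BE byte order mark. The helper name decode_pdf_hex_string is made up, and the Latin-1 fallback is only a rough stand-in for PDFDocEncoding.

```python
def decode_pdf_hex_string(hex_body: str) -> str:
    """Decode the body of a PDF hex string (the part between < and >)."""
    digits = "".join(hex_body.split())  # whitespace inside hex strings is ignored
    if len(digits) % 2:
        digits += "0"  # an odd final digit is treated as if followed by 0
    raw = bytes.fromhex(digits)
    if raw[:2] == b"\xfe\xff":
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")  # approximation of PDFDocEncoding


PRINCE_TITLE_HEX = (
    "FEFF005400690074006C00650020007400610067002000770069007400680020"
    "005500540046002D0038003A0020261920132767"
)

title = decode_pdf_hex_string(PRINCE_TITLE_HEX)
# Print with escapes so the result does not depend on the terminal encoding.
print(title.encode("unicode_escape").decode("ascii"))
# prints Title tag with UTF-8: \u2619\u2013\u2767
```

U+2619, U+2013 and U+2767 are ☙, – and ❧, so under these assumptions the title in the Prince PDF decodes cleanly as UTF-16BE.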

On the Ghostscript PDF:

$ ./pdf-parser.py -s Title -w optimised.pdf 
obj 6 0
 Type: 
 Referencing: 4 0 R, 5 0 R
 
<< /Title(See document properties)
/Dest [4 0 R /XYZ 0 738.0 0]
/Parent 5 0 R
>>


  <<
    /Title (See document properties)
    /Dest [4 0 R /XYZ 0 738.0 0]
    /Parent 5 0 R
  >>


obj 2 0
 Type: 
 Referencing: 
 
<</Producer(GPL Ghostscript 9.27)
/CreationDate(D:20210127165844Z00'00')
/ModDate(D:20210127165844Z00'00')
/Title(\376\377\000T\000i\000t\000l\000e\000 \000t\000a\000g\000 \000w\000i\000t\000h\000 \000U\000T\000F\000-\0008\000:\000 &\031 \023'g)>>

  <<
    /Producer (GPL Ghostscript 9.27)
    /CreationDate "(D:20210127165844Z00'00')"
    /ModDate "(D:20210127165844Z00'00')"
    /Title "(\\376\\377\\000T\\000i\\000t\\000l\\000e\\000 \\000t\\000a\\000g\\000 \\000w\\000i\\000t\\000h\\000 \\000U\\000T\\000F\\000-\\0008\\000:\\000 &\\031 \\023'g)"
  >>

I don't know how to interpret that. Does it confirm that gs is doing it wrong?

mikeday
28 Jan 2021

\376\377 is just the BOM bytes 0xFE 0xFF written as octal escapes, but the special characters at the end of the string don't look like they are encoded correctly.
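For what it's worth, the escaped string can be turned back into bytes and compared with the hex string from the Prince PDF. The Python sketch below is a simplified decoder for PDF literal strings (octal escapes and the single-character escapes only, no line continuations). It assumes both strings are exactly as pasted in this thread, and the name decode_pdf_literal is made up for illustration.

```python
def decode_pdf_literal(body: str) -> bytes:
    """Turn the body of a PDF literal string (between the parentheses) into bytes.

    Assumes the dump is plain ASCII, as in the pdf-parser output above.
    """
    simple = {"n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08, "f": 0x0C,
              "(": 0x28, ")": 0x29, "\\": 0x5C}
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        if body[i] in "01234567":
            j = i
            # an octal escape has one to three digits
            while j < len(body) and j < i + 3 and body[j] in "01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:
            # known single-character escape, otherwise the backslash is dropped
            out.append(simple.get(body[i], ord(body[i])))
            i += 1
    return bytes(out)


GS_TITLE = (r"\376\377\000T\000i\000t\000l\000e\000 \000t\000a\000g\000 "
            r"\000w\000i\000t\000h\000 \000U\000T\000F\000-\0008\000:\000 "
            r"&\031 \023'g")
PRINCE_TITLE_HEX = (
    "FEFF005400690074006C00650020007400610067002000770069007400680020"
    "005500540046002D0038003A0020261920132767"
)

gs_bytes = decode_pdf_literal(GS_TITLE)
print(gs_bytes.hex().upper())
print(gs_bytes == bytes.fromhex(PRINCE_TITLE_HEX))
```

With the strings as pasted, the tail `&\031 \023'g` corresponds to the bytes 26 19 20 13 27 67, i.e. U+2619, U+2013 and U+2767, and the comparison comes out equal. That would suggest the Info dictionary title itself survives Ghostscript, and that any mangling is more likely in the XMP packet or in how a given viewer reads the metadata.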

davidbalch
28 Jan 2021

Ok, I've opened a bug with gs: https://bugs.ghostscript.com/show_bug.cgi?id=703428

Thanks for your help.
