Font issue, PDF nothing but gibberish
I'm testing out Prince 9.0 rev 2 on Mac OS X 10.7.5 with the standard set of fonts from the initial OS install. When I run Prince on the site I want to use the product for (a client's private web site, so I can't include the path, credentials, etc.), I get nothing but gibberish and question marks.
I'm attaching the debug log. I'm hoping someone can steer me in the right direction as to why the PDF is garbled. There aren't any non-Roman type characters in the page I'm loading, so I'm not sure why Prince is having such difficulty in converting it to a PDF.
- prince.txt 91.8 kB
By the look of the debug output, Prince appears to think that the page contains Chinese and Hindi text. Is the document really UTF-8 encoded? Do you get the same results if you download the HTML and run Prince on it as a local file?
It definitely has something to do with the encoding. If I open the page directly in Safari from the website, the Encoding setting in Safari is 'Default' and the page displays fine. If I then save the page source locally, and reopen the local html, the Encoding is still 'Default', but the page has some garbled characters. If I set the "Encoding" in Safari to UTF-8, then the page displays the same way locally from source as it does directly from the website.
Running Prince on the locally saved HTML source, whether fetched with curl or saved manually from Safari, gives the same garbled-looking PDF.
There are some non-ASCII Spanish characters like: PRECAUCIÓN or ¿PREGUNTAS displayed on the page. These are the characters that don't look right from source if the Encoding in Safari is not set to UTF-8. When set to Default, they look like: PRECAUCIÃ“N or Â¿PREGUNTAS.
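The garbled forms above are what UTF-8 bytes look like when they are decoded with a single-byte Western encoding. A minimal Python sketch to reproduce this, assuming Safari's 'Default' behaved like Windows-1252 here (that is a guess; the helper name `misdecode` is just for illustration):

```python
def misdecode(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode those bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


# Each non-ASCII character becomes two bytes in UTF-8, and each of those
# bytes is then shown as a separate character.
for word in ("PRECAUCIÓN", "¿PREGUNTAS"):
    print(word, "->", misdecode(word))
```

If the output matches what the browser shows, the bytes on disk are almost certainly UTF-8 and only the interpretation is wrong.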
Does the document contain a <meta charset> directive at the top? Would you be able to upload the beginning of the document so I can try it here? (Better to do this as an attachment rather than pasting it into the post, as that will preserve the current encoding, whatever it is).
Yes, there is a META tag with content="text/html; charset=utf-16". I'll strip down a version and send it to you offline.
Okay, that appears to be causing the problem: the document claims to be UTF-16, but actually is UTF-8. Prince trusts the meta charset declaration, while the browser realises it's wrong.
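To see why a wrong UTF-16 declaration could make ordinary markup look like Chinese or Hindi text, here is a small Python sketch that reads UTF-8 bytes in pairs as UTF-16. It assumes little-endian byte order, and it is only an illustration of the effect, not a description of Prince's decoder:

```python
def decode_as_utf16(utf8_text):
    """Decode the UTF-8 bytes of utf8_text as little-endian UTF-16."""
    data = utf8_text.encode("utf-8")
    # UTF-16 consumes two bytes per code unit; drop a trailing odd byte.
    data = data[: len(data) // 2 * 2]
    return data.decode("utf-16-le", errors="replace")


# Pairs of ASCII letters land in the CJK ideograph blocks.
print([f"U+{ord(ch):04X}" for ch in decode_as_utf16("<html>")])
# ['U+683C', 'U+6D74', 'U+3E6C']

# A space followed by a tab lands in the Devanagari block (U+0900 to U+097F).
print([f"U+{ord(ch):04X}" for ch in decode_as_utf16(" \t")])
# ['U+0920']
```

Every pair of ASCII bytes turns into one unrelated character, so none of the installed Latin fonts is of any use for the result.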
The best fix for this is to correctly declare that the document is UTF-8. The UTF-16 encoding is almost never used on the web, even though it is commonly used internally on Microsoft systems.
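If the declaration cannot be corrected on the server right away, a saved local copy can be patched before it is passed to Prince. A rough Python sketch, assuming the declaration looks like the `charset=utf-16` form quoted earlier in the thread; the function name and the file names in the usage note are hypothetical:

```python
import re

# Matches charset=utf-16 (optionally quoted, optionally with an LE/BE suffix).
DECLARED_UTF16 = re.compile(rb'(charset\s*=\s*["\']?)utf-16(?:le|be)?', re.IGNORECASE)


def fix_charset_declaration(html_bytes):
    """Rewrite a wrong UTF-16 declaration to UTF-8 in raw HTML bytes."""
    # Refuse to touch the file unless it really is valid UTF-8;
    # this raises UnicodeDecodeError otherwise.
    html_bytes.decode("utf-8")
    return DECLARED_UTF16.sub(rb"\g<1>utf-8", html_bytes)
```

A possible use is to read `page.html` in binary mode, pass the bytes through `fix_charset_declaration`, write them to `page-fixed.html`, and run Prince on that file. Working on bytes rather than text avoids re-encoding the document by accident.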
We will double-check our encoding detection and see if we can handle this situation better in the future.
This issue has been fixed in Prince 9 rev 4, which checks the true document encoding and uses UTF-8 even if it is incorrectly declared as UTF-16. Thanks for letting us know about this issue.
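The thread does not say how the new check works. As a rough idea of what such a check might look like, here is a Python sketch of one plausible heuristic: ignore a UTF-16 declaration when the bytes have no UTF-16 byte order mark, contain no NUL bytes, and decode cleanly as UTF-8. The function name and the exact rules are assumptions, not Prince's actual logic:

```python
import codecs

UTF16_NAMES = ("utf-16", "utf-16le", "utf-16be")


def effective_encoding(data, declared):
    """Return the encoding to use for data, given the declared charset."""
    if declared.lower().replace("_", "-") not in UTF16_NAMES:
        return declared  # only second-guess UTF-16 declarations
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return declared  # a byte order mark backs up the declaration
    if b"\x00" in data:
        return declared  # real UTF-16 markup is full of NUL bytes
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return declared  # not valid UTF-8 either, so keep the declaration
    return "utf-8"
```

With the page from this thread, `effective_encoding(raw_bytes, "utf-16")` would return `"utf-8"`, while a genuine UTF-16 file would keep its declared encoding.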