
Prince behavior with Unix Pipes

mikepk
I've found an odd behavior with Prince and I'm not sure if it's a bug or not. It seems that passing UTF-8 encoded files to Prince through a Unix pipe is broken in some way and the data is not being read properly.

I have a test file "test.html", which is a UTF-8 encoded file containing non-Latin (in this case Chinese) characters. I have the requisite fonts installed to generate the proper characters in the output.

Running it like this:
prince --input=html -v --debug -i test.html -o test.pdf

works fine and produces the proper output PDF. However, if I pipe UTF-8 into Prince like so:

cat test.html | prince -v --debug --input=html - -o test.pdf
- or -
prince -v --debug --input=html - -o test.pdf < test.html

I get a garbled / mangled file with lots of "no glyphs for character" messages.

I am trying to use a Perl process with pipes to pass UTF-8 data into Prince and I've been getting garbled data. I thought it was my Perl app until I found that directly piping data into Prince on the command line seems to be broken as well.

Is there something I need to do to enable this, a hidden command line option or something? Is UTF-8 piped into the Prince process not supported?


Thanks!
-Mike

Using: Ubuntu Linux (Dapper), Intel x86
mikeday
The HTML parser we are using has problems with UTF-8 documents that include a byte order mark (BOM) at the beginning. A BOM can be added when the file is edited in Notepad on Windows. Other than that, there should be no problems piping data into Prince. Would you be able to email me a small test document that demonstrates the problem?
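One quick way to check whether a file starts with a UTF-8 BOM (the three bytes EF BB BF) is to dump its first bytes as hex. The following is only a rough sketch: the function name has_utf8_bom is made up for illustration, and it assumes a head that supports -c, as the GNU and BSD versions do.

```shell
# Hypothetical helper (not part of Prince): succeeds if the file named
# by $1 begins with the UTF-8 byte order mark EF BB BF.
has_utf8_bom() {
    # Take the first three bytes, print them as hex, strip whitespace.
    first_bytes=$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first_bytes" = "efbbbf" ]
}

# Example use:
#   if has_utf8_bom test.html; then echo "test.html starts with a BOM"; fi
```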
mikepk
Aha! Guess it just takes a new day. It turns out the encoding problems arise when the --input=html option is used. We're using XHTML files, so technically they're XML files, but we'd been using that option for our English files and it seemed to work. For Latin characters the option works, but for extended characters like Chinese it garbles the output. So maybe something about that switch is stepping on the file encoding inside Prince (a different parser?). I tried some normal HTML files with UTF-8 encoding, as well as some experiments with BOMs, and I get the garbled output with --input=html, so it seems to be something about that switch in particular.

I'll go ahead and e-mail you my test files so you can try to reproduce the issue. I'm sure you already know, but you'll need a Unicode TTF font with Chinese characters to see the correct output.

I know I can now get the correct output by making sure to use XML (XHTML) files and either no --input= option or --input=xml, so that fixes the issue for me.
mikeday
Right, this does seem to be the UTF-8 BOM issue that affects the HTML parser. The XML parser is a lot more reliable when it comes to character encoding issues, so switching to XHTML is one workaround.
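If switching to XHTML is not an option, another possible workaround is to strip a leading BOM from the stream before it reaches the HTML parser. This is a minimal sketch, not something tested against Prince itself: the function name strip_utf8_bom is hypothetical, and it only helps if the BOM really is the cause of the garbling.

```shell
# Hypothetical filter (not part of Prince): copies stdin to stdout,
# dropping a UTF-8 byte order mark if one starts the first line.
strip_utf8_bom() {
    # The BOM is the byte sequence EF BB BF (octal 357 273 277).
    bom=$(printf '\357\273\277')
    # LC_ALL=C makes sed treat the input as raw bytes.
    LC_ALL=C sed "1s/^${bom}//"
}

# Example use, with the same Prince options as in the original post:
#   strip_utf8_bom < test.html | prince -v --debug --input=html - -o test.pdf
```

Input without a BOM passes through unchanged, so the filter can be left in the pipeline unconditionally.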
mikeday
Today we have released Prince 6.0 rev 8, which correctly processes HTML documents beginning with a UTF-8 byte order mark.