
Support for ISO-8859-1 control characters.

dtognazzini
I have a document that contains the character \u0012. Prince's handling of this character across various DOCTYPEs, charsets, encodings, and document content is confusing. The results differ in the messages written to STDERR, the exit status, and whether a PDF file is generated.

I'm invoking prince via the following command line:

prince --media=print --no-xinclude --ssl-blindly-trust-server --no-embed-fonts #{file_name} --output #{file_name}.pdf

Description of files:
example1 - xhtml.strict DOCTYPE; using charset=iso-8859-1; character encoded as 
example2 - xhtml.strict DOCTYPE; using charset=utf-8; character encoded as 
example3 - html DOCTYPE; using charset=utf-8; character encoded as 
example4 - html DOCTYPE; using charset=utf-8; raw character
example5 - no DOCTYPE; using charset=utf-8; raw character
example6 - no DOCTYPE; no charset=utf-8; raw character
example7 - xhtml.strict DOCTYPE; using charset=utf-8; raw character
example8 - xhtml.strict DOCTYPE; using charset=utf-8; using <style>; raw character
example9 - xhtml.strict DOCTYPE; using charset=utf-8; no <style>; character encoded as &#x0012;
example10 - xhtml.strict DOCTYPE; using charset=utf-8; using <style>; character encoded as &#x0012;

Result for each file:
example1-3, example9:
exitstatus 0
PDF generated with character stripped
STDERR:
prince: example1.html:7: error: htmlParseCharRef: invalid xmlChar value 18

example4-7:
exitstatus 0
PDF generated with character stripped
STDERR:
prince: example4.html:7: error: Invalid char in CDATA 0x12

example8, example10:
exitstatus 1
PDF not generated
STDERR:
prince: example8.html:20: error: PCDATA invalid Char value 18
prince: example8.html: error: could not load input file
prince: error: no input documents to process


Questions:
Does Prince support this character? If so, how?
Why are there error messages in every case, but in some cases a PDF file is still generated?
Why are the error messages different?
Why is the exitstatus different between the cases?
Why is the exitstatus success (0) for cases where there are error messages?
Is there a way to tell Prince to strip characters it can't handle?

Running tests in attached .zip file:
Run: 'ruby run.rb'
Clean: 'sh clean.sh'
Attachment: tests.zip (7.5 kB)
Contains all the test files, with scripts for running them. See the post for instructions.
mikeday
The character U+0012 is the control character "DEVICE CONTROL 2", which has no business at all being in a textual HTML or XML document, and is invalid in this context. The HTML parser might allow it with a warning, as HTML is a sloppy format, but the XML parser will reject it outright, as it should.
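
To make the validity rule concrete, here is a minimal Ruby sketch of the XML 1.0 "Char" production, which allows tab, line feed, and carriage return but no other C0 control characters. The helper names `xml10_char?` and `invalid_xml_chars` are hypothetical, the input is assumed to be valid UTF-8, and this is not how Prince itself performs the check.

```ruby
# XML 1.0 "Char" production: tab, LF, CR and the ranges below are allowed;
# every other C0 control character (including U+0012) is not.
def xml10_char?(codepoint)
  case codepoint
  when 0x9, 0xA, 0xD then true
  when 0x20..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF then true
  else false
  end
end

# Hypothetical helper: returns [line_number, codepoint] pairs for every raw
# character in the text that fails the check above.
def invalid_xml_chars(text)
  hits = []
  text.each_line.with_index(1) do |line, line_number|
    line.each_char do |ch|
      hits << [line_number, ch.ord] unless xml10_char?(ch.ord)
    end
  end
  hits
end
```

Running `invalid_xml_chars(File.read("example7.html", encoding: "UTF-8"))` would list where the raw character sits. It does not detect numeric character references such as `&#x0012;`.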
dtognazzini
Thanks for the answer. So, to be safe, I'll strip these characters from the document before passing it to Prince. It's still surprising to me that inputs like example9 and example10 result in different exit statuses. My expectation is that either all of the examples would fail (i.e. non-zero exit status), or all of them would succeed (i.e. zero exit status) with the character stripped.
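
The pre-processing step described above could be sketched in Ruby roughly as follows. It assumes UTF-8 input and only targets C0 control characters other than tab, LF, and CR, both as raw characters and as numeric character references like `&#x0012;`. The names `strip_control_chars`, `forbidden_control?`, `C0_CONTROLS`, and `NUMERIC_REF` are made up for illustration.

```ruby
# Raw C0 control characters, excluding tab (0x09), LF (0x0A) and CR (0x0D).
C0_CONTROLS = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

# Numeric character references: hexadecimal (&#x12;) or decimal (&#18;).
NUMERIC_REF = /&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/

def forbidden_control?(codepoint)
  codepoint < 0x20 && ![0x9, 0xA, 0xD].include?(codepoint)
end

# Removes forbidden control characters, whether written raw or as a
# numeric character reference. Other references are left untouched.
def strip_control_chars(text)
  without_refs = text.gsub(NUMERIC_REF) do |ref|
    codepoint = $1 ? $1.to_i(16) : $2.to_i
    forbidden_control?(codepoint) ? "" : ref
  end
  without_refs.gsub(C0_CONTROLS, "")
end
```

This is a plain text substitution, so it also rewrites matches inside scripts, comments, or CDATA sections. That is probably acceptable for these test files but may not be for arbitrary documents.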

One more question: Is there a way to tell Prince to strip invalid characters?
mikeday
If the files have an extension of .html, Prince will use various heuristics to try to guess whether the file should be parsed as XML or as HTML. The HTML parser is more forgiving, whereas the rules of XML dictate that any error is fatal. You can control which parser is used by specifying --input=xml or --input=html on the command line.

In future we will be changing Prince to use a new HTML5 parser, and enabling that parser by default for all .html files unless explicitly overridden.
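
For a harness like the run.rb mentioned earlier in the thread, the `--input` option could be passed explicitly so that every test file goes through the same parser. The sketch below is an assumption-laden illustration: `prince_command`, `conversion_failed?`, and `run_prince` are hypothetical names, and treating any STDERR line containing "error:" as a failure is a policy choice of this sketch, not documented Prince behaviour.

```ruby
require "open3"

# Builds the argument list for Prince. --input and --output are the options
# that appear in this thread; input_format should be "xml" or "html".
def prince_command(file_name, input_format)
  ["prince", "--input=#{input_format}", file_name, "--output", "#{file_name}.pdf"]
end

# Hypothetical policy: a run counts as failed if the exit status is non-zero
# or if any STDERR line contains "error:", so that a PDF produced alongside
# parser errors is not silently accepted.
def conversion_failed?(exit_status, stderr_text)
  exit_status != 0 || stderr_text.each_line.any? { |line| line.include?("error:") }
end

# Runs Prince and reports the outcome. Requires Prince on the PATH, so it is
# only defined here, not called.
def run_prince(file_name, input_format)
  _stdout, stderr_text, status = Open3.capture3(*prince_command(file_name, input_format))
  { failed: conversion_failed?(status.exitstatus, stderr_text), stderr: stderr_text }
end
```

With this policy, a call such as `run_prince("example1.html", "html")` would be reported as failed whenever the error messages quoted above appear, even if the exit status is 0.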