8-bit characters rendered wrong when HTML title is 8-bit

dbarrett
I'm having difficulty with Prince-rendered PDFs when the HTML <title> contains an 8-bit character. Here's an HTML page that displays the word "Résumé":

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" dir="ltr">
<head>
<title>X</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<p>Résumé</p>
</body>
</html>


This converts correctly with Prince, creating a PDF that contains "Résumé".

$ /usr/bin/prince --input=html good.html -o good.pdf


However, if you add an 8-bit character to the HTML title, the word Résumé gets rendered incorrectly in the PDF, as "RÃ©sumÃ©".

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" dir="ltr">
<head>
<title>Xé</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<p>Résumé</p>
</body>
</html>


Is this a bug? Am I doing something wrong with encodings? Thanks.
mikeday
Put the <meta> tag giving the charset before the <title>, or add an <?xml?> declaration.
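
If the HTML comes from a generator whose tag order is hard to change, one possible workaround is to post-process the page before handing it to Prince. The following is only a rough sketch: `move_charset_meta_first` is a hypothetical helper (not part of Prince or any other tool), and it assumes the charset is declared with an `http-equiv="Content-Type"` meta tag written exactly as in the example above.

```python
import re

# Hypothetical helper for illustration; not part of Prince.
# Assumes the charset is declared via <meta http-equiv="Content-Type" ...>.
META_CHARSET = re.compile(
    r'<meta\s+http-equiv="Content-Type"[^>]*>\s*', re.IGNORECASE
)
HEAD_OPEN = re.compile(r'<head(\s[^>]*)?>\s*', re.IGNORECASE)


def move_charset_meta_first(html: str) -> str:
    """Return html with the Content-Type meta tag moved to the start of <head>."""
    meta = META_CHARSET.search(html)
    head = HEAD_OPEN.search(html)
    # Leave the page alone if either tag is missing or the meta tag
    # already sits directly after <head> (or before it).
    if meta is None or head is None or meta.start() <= head.end():
        return html
    # Cut the meta tag out, then re-insert it right after <head>.
    without_meta = html[:meta.start()] + html[meta.end():]
    insert_at = head.end()
    return (
        without_meta[:insert_at]
        + meta.group(0).rstrip()
        + "\n"
        + without_meta[insert_at:]
    )


page = (
    "<html>\n<head>\n<title>X\u00e9</title>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
    "</head>\n<body>\n<p>R\u00e9sum\u00e9</p>\n</body>\n</html>\n"
)
fixed = move_charset_meta_first(page)
```

After this, `fixed` has the `<meta>` line immediately after `<head>` and before `<title>`. A regular expression is a blunt tool for HTML, so anything beyond simple, predictable generator output would probably want a real parser.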
dbarrett
Thanks. The title and meta tags are generated by MediaWiki in the given order. Is it wrong for MediaWiki to do this?
mikeday
It's something of a gray area, honestly, but it is best for the charset indication to come as early in the document as possible, as it determines how the rest of the document will be decoded. In the future we will switch to an HTML5-compliant parser that will bring Prince more closely into line with how browsers handle this case.
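
To see why the decoding choice matters, here is a small illustration of the same bytes read two ways. It assumes the fallback encoding is Latin-1, which is a guess for the sake of the example; the thread does not say which encoding Prince actually falls back to.

```python
# The word as it is stored in a UTF-8 file: each "é" becomes two bytes (C3 A9).
text = "R\u00e9sum\u00e9"  # "Résumé"
raw = text.encode("utf-8")

# Decoded with the declared charset, the text round-trips cleanly.
as_utf8 = raw.decode("utf-8")

# Decoded as Latin-1 (an assumed fallback), every byte becomes its own
# character, so each "é" turns into the two characters "Ã" and "©".
as_latin1 = raw.decode("latin-1")

print(as_utf8)
print(as_latin1)
```

The second result is longer than the original word, because each two-byte sequence has been split into two separate characters.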