
Character encoding problems (UTF-8, accented characters)

gpian
Hello everyone,

I'm experiencing some problems with character encoding, specifically with a UTF-8 encoded HTML file containing accented characters such as à, è, ì, ò, ù and the like.

I'm using a Python script to produce an HTML 5 file, writing the data to disk as UTF-8 (the output file is opened with the encoding='utf-8' argument). The final HTML 5 file contains "<meta charset=utf-8>" as its encoding declaration. I then produce a PDF file from that HTML file using Prince, but every single accented character is printed as a pair of different characters (clearly a sign of an encoding problem). Since the document is written in Italian, the problem is quite annoying.
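
The symptom described above (one accented character turning into two) is what typically happens when UTF-8 bytes are decoded with a single-byte encoding. The following is a minimal sketch of that effect; the choice of Latin-1 as the wrong encoding is an assumption for illustration, and the thread does not say which fallback encoding Prince's old HTML parser uses. The function name misread_utf8 is hypothetical.

```python
def misread_utf8(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong encoding.

    Assumption: the reader falls back to a single-byte encoding such as
    Latin-1 when it does not recognise the encoding declaration.
    """
    return text.encode("utf-8").decode(wrong_encoding)


if __name__ == "__main__":
    for ch in "àèìòù":
        garbled = misread_utf8(ch)
        # Each accented letter occupies two bytes in UTF-8, so it comes
        # back as two characters when read as Latin-1.
        print(ch, "->", repr(garbled), "length", len(garbled))
```

Plain ASCII text survives the round trip unchanged, which is why only the accented characters appear broken.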

Note that it's not a font problem: in fact, if I substitute e.g. à with &agrave; in the HTML file, the à character gets correctly printed in the final PDF file produced by Prince.
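
As a stopgap, the substitution described above can be done automatically in the generating script. This is only a sketch: it relies on Python's built-in "xmlcharrefreplace" error handler, which emits numeric references such as &#224; rather than named entities such as &agrave; (both refer to the same character). The function name escape_non_ascii is hypothetical.

```python
def escape_non_ascii(html_text):
    """Replace every non-ASCII character with a numeric character reference.

    The result is pure ASCII, so it renders the same regardless of which
    encoding the consumer assumes for the file.
    """
    return html_text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(escape_non_ascii("<p>Città</p>"))
```

This sidesteps the encoding declaration entirely, at the cost of a less readable HTML source.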

How can I solve this issue? I'm running Prince 7.0b1 on Ubuntu Linux 9.04 and on Microsoft Windows 2000 SP4. Both environments give me the same problem.
mikeday
Do you have a simple test document you can email me? (mikeday@yeslogic.com) It certainly sounds like an encoding problem.
jim_albright
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
.......


This is the start of my html file that works.

You can verify that your file is really UTF-8 encoded by opening it in Notepad and seeing which encoding is suggested in the Save As dialog.
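
The same check can be scripted, which is handy on systems without Notepad. The sketch below simply tries to decode the file's bytes as UTF-8 and reports whether a byte order mark is present; the function names are hypothetical. Note that a successful decode is strong evidence rather than proof, and that a pure-ASCII file is also valid UTF-8.

```python
import codecs


def check_utf8(data):
    """Return (is_valid_utf8, has_bom) for a bytes object."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False, has_bom
    return True, has_bom


def check_utf8_file(path):
    """Read a file in binary mode and run check_utf8 on its contents."""
    with open(path, "rb") as f:
        return check_utf8(f.read())
```

For example, check_utf8_file("test.html") returns a pair of booleans for a file named test.html (a placeholder name).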

Jim Albright
Wycliffe Bible Translators

gpian
mikeday wrote:
Do you have a simple test document you can email me? (mikeday@yeslogic.com) It certainly sounds like an encoding problem.


Sent! Thanks for looking at it.
gpian
jim_albright wrote:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
.......


This is the start of my html file that works.


Yup. But my file is not an XML file, just HTML, and in particular HTML 5.

Indeed, I tried to use your code at the beginning of my file (changing it to HTML 4.01 Transitional) and the result was correct. Does this imply that the problem is the HTML 5 DOCTYPE and/or encoding declaration?

You can verify that your file is really UTF-8 encoded by opening it in Notepad and seeing which encoding is suggested in the Save As dialog.


Emacs says "utf8-unix" (i.e. UTF-8 with Unix EOL, I believe) so I'm fine with it.
gpian
jim_albright wrote:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
.......


This is the start of my html file that works.


OK, so apparently I have solved the problem by using just the "classic" meta element from HTML 4.x and XHTML 1.x that you included in this snippet. The rest of the file is untouched, i.e. it still contains the HTML 5 DOCTYPE. Thank you very much.
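
If the HTML is generated by a script, the workaround just described can be applied as a post-processing step. The following is a rough sketch, not a general HTML rewriter: it uses a regular expression that only handles a simple <meta charset=...> element with no other attributes, and the names META_CHARSET and use_legacy_meta are hypothetical.

```python
import re

# Matches <meta charset=utf-8>, <meta charset="utf-8"> and <meta charset="utf-8" />.
META_CHARSET = re.compile(
    r"<meta\s+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)[\"']?\s*/?>",
    re.IGNORECASE,
)


def use_legacy_meta(html_text):
    """Rewrite the first short HTML 5 charset declaration to the http-equiv form."""

    def replace(match):
        charset = match.group(1)
        return (
            '<meta http-equiv="Content-Type" '
            'content="text/html; charset=%s">' % charset
        )

    return META_CHARSET.sub(replace, html_text, count=1)


if __name__ == "__main__":
    print(use_legacy_meta("<head><meta charset=utf-8></head>"))
```

A document that already uses the http-equiv form is left unchanged.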

However, I'd like to point out to the Prince developers that "<meta charset=utf-8>" is valid HTML 5 syntax; if they want to support HTML 5 in the future, the software will need to detect the content encoding even when it is set through that "unusual" form of the meta element.
mikeday
The encoding issue does seem to be due to the HTML 5 meta tag, which is not recognised by the HTML parser we are using in Prince.

We are planning to switch to an HTML 5 parser in the future, once the HTML 5 specification settles down and we find a suitable parser library.
nico
I am also experiencing encoding issues with Prince 8.1 rev 4 that are caused by the encoding declaration.

The HTML 5 specification allows a shorter encoding declaration:
<meta charset="utf-8" />


This was the traditional form in HTML 4:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />


Unfortunately, Prince does not interpret the short declaration correctly. It would be nice if you could fix this issue.

It seems a bit strange to me that Prince does not use UTF-8 as the default. This encoding is the de facto standard today.

See also:
http://www.w3.org/International/questions/qa-html-encoding-declarations#html5charset

You can try this example. Prince behaves correctly only with --input xml, not with --input html:

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8" />
	<title>test utf8</title>
</head>
<body>
	<p>This is é</p>
	<p>This is &#233;</p>
</body>
</html>



This example works in either case:

<!DOCTYPE html>
<html>
<head>
	<meta http-equiv="content-type" content="text/html;charset=utf-8" />
	<title>test utf8</title>
</head>
<body>
	<p>This is é</p>
	<p>This is &#233;</p>
</body>
</html>

mikeday
Try --input=html5 to use our new HTML5 parser, which does default to UTF-8.
nico
It works fine, but "--input html5" is not documented in "prince -h". Is it an unofficial feature?
mikeday
It's experimental, although we have mentioned it on the website under the command-line interface documentation.

In the next release of Prince the HTML5 parser will become the default option, and the old parser will be available using --input=html4.
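
For anyone driving Prince from a script, as in the original post, the parser choice can be passed on the command line. The sketch below only builds and runs the command; it assumes the prince executable is on the PATH and uses the -o option for the output file. The function names prince_command and run_prince are hypothetical, and the input and output paths are placeholders.

```python
import subprocess


def prince_command(html_path, pdf_path, input_mode="html5"):
    """Build the argument list for a Prince run with an explicit input mode.

    The thread mentions the modes html, xml, html5 and html4; which ones
    are accepted depends on the Prince release.
    """
    return ["prince", "--input=" + input_mode, html_path, "-o", pdf_path]


def run_prince(html_path, pdf_path, input_mode="html5"):
    """Run Prince and raise CalledProcessError if it exits with an error."""
    subprocess.run(prince_command(html_path, pdf_path, input_mode), check=True)


if __name__ == "__main__":
    print(" ".join(prince_command("test.html", "test.pdf")))
```

Keeping the input mode as a parameter makes it easy to switch back to the old parser for comparison.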