Creepy Rendering of German Umlauts like Ä, Ü, Ö

Lindemann
I tried to render my Bachelor's thesis with Prince, but I get a poor result because the letters Ä, Ü, Ö do not look the way they should.

I use Georgia (on a Mac) and declare UTF-8 encoding in the HTML head.
mikeday
Hilarious! Do you know if the accented characters are being represented as one single Unicode character, eg. "A with umlaut", or two Unicode characters, eg. "A followed by combining umlaut character"? Perhaps you could email me (mikeday@yeslogic.com) a small section of the document that demonstrates the problem.
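
One way to answer that question without guessing is to list the code points in the affected text. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper name `describe_codepoints` and the sample strings are made up for illustration.

```python
import unicodedata

def describe_codepoints(text):
    """Return one (code point, Unicode name) pair per character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

precomposed = "\u00dc"   # one character: U with diaeresis
decomposed = "U\u0308"   # two characters: U followed by a combining diaeresis

# Both samples look the same on screen, but the listings differ.
for label, sample in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, describe_codepoints(sample))
```

If the listing shows a plain letter followed by COMBINING DIAERESIS, the text is in the two-character form.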
Lindemann
I found out that this happens when I copy text from a PDF. When I type the letters on the keyboard, everything works.

It seems that Prince can't handle, for example, this representation of ü:

[screenshot]

This is very confusing, because most applications, like my text editor, display it as a correct ü.

[screenshot]

The first image was taken in Sublime Text 2 and the second in Chocolat.

You can try it yourself with this PDF: http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf
mikeday
I'm still having trouble reproducing this problem. Would you be able to email me a short snippet of HTML?
mikeday
Thanks for the test document. There are two ways of representing accented Latin characters: either as a single precomposed character, or as two separate characters, eg. "U" + combining umlaut. It seems that the combining umlaut is not working with Prince at the moment.

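You can work around this temporarily using this CSS rule:

body { prince-text-replace: 'U\308' 'Ü' 'A\308' 'Ä' }

where the first string of each pair is the decomposed form that Prince can't handle (the base letter followed by U+0308 COMBINING DIAERESIS, written here as the CSS escape \308 so that it is visible), and the second string is the precomposed character that it can handle. This is a bit awkward, but it can fix the problem until we can fix Prince. Sorry for the inconvenience.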
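
If the HTML can be preprocessed before it reaches Prince, an alternative to listing every pair by hand is to normalize the text to Unicode NFC, which folds a base letter plus combining mark into the precomposed character wherever Unicode defines one. This is a sketch under that assumption, not something Prince does itself; the function name `to_precomposed` and the sample word are hypothetical.

```python
import unicodedata

def to_precomposed(text):
    """Normalize text to NFC so that base letter + combining mark pairs
    become single precomposed characters where Unicode defines one."""
    return unicodedata.normalize("NFC", text)

# "Prüfung" with a decomposed u-umlaut, as it might arrive from a PDF paste.
pasted = "Pru\u0308fung"
fixed = to_precomposed(pasted)

# The decomposed string is one code point longer than the normalized one.
print(len(pasted), len(fixed))
```

Combinations that have no precomposed character are left in their two-character form, so this only helps for accented letters that Unicode encodes directly.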
mikeday
We have investigated this issue further and come to the conclusion that it is a font problem. :)

Some modern fonts such as the DejaVu family have proper support for Unicode combining marks, and use the OpenType GPOS table to position them properly over the preceding vowel.

Other fonts such as the Calibri and Candara fonts that come with Windows Vista use the OpenType GSUB table to substitute high or low glyphs depending on whether the preceding vowel is uppercase or lowercase. This is better than nothing, but the horizontal position is not aligned very well, so it doesn't look great.

Older fonts such as Times New Roman and Georgia don't even have glyphs for the Unicode combining marks, so it is almost impossible to render the text correctly with these fonts, and the only solution is to use the precomposed characters (e.g. the single character "u with umlaut") that they do support.

For this reason our recommendation is to use modern fonts that have full OpenType support for Unicode combining characters, or use precomposed Unicode characters that work with older fonts.
JonoB
This is really interesting.

Do you have a list of "modern" fonts, or a reference by any chance?
mikeday
To be more precise, "modern fonts that take Unicode seriously and support more than just English text" is a more accurate description. :)

Initially I only found the DejaVu font family and the SIL Gentium font to support these Unicode combining characters. Surprisingly, high-quality fonts such as Adobe Arno Pro did not.

Checking Google Web Fonts shows a mix: some fonts support them well, such as Roboto Condensed; some have the glyphs but don't position them very elegantly; and some don't have glyphs for them at all.

So it's still not a standard feature across all fonts. This is probably because most European text uses the precomposed characters, not the combining accent characters, which are generally used for less common linguistic purposes.
jim_albright
The SIL fonts are freely available. Supporting combining characters is a must in our fonts. Fonts have been designed for various scripts and styles. Andika is especially designed for literacy.

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=FontDownloads

Jim Albright
Wycliffe Bible Translators

  1. SIL Fonts.PNG (73.6 kB)
    Last update 2012
nico
Hello Mike, I don't really understand why accented characters are sometimes represented with two characters (e.g. "e" "´") and sometimes with only one (e.g. "é"). Do you know where I can find more information on this topic?

This is a big problem when you copy and paste French text from a PDF. Any advice on how to solve it would be welcome.
mikeday
Unicode is the universal character set that is used by all modern software. Unicode includes separate characters for all the different accents, e.g. acute, grave, macron, and so on. These are all combining characters that combine with the character that precedes them. So you could apply an acute accent to any other character, although in practice it will probably look strange with most fonts unless you apply it to something typical like a vowel.

However, programmers are lazy and don't like dealing with the complexities of combining characters. For this and other historical reasons, Unicode also encodes a bunch of precomposed characters, such as "e with acute accent", "e with grave accent", and so on; basically all the accented characters that you need for common Western European languages such as French.
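
To check whether a given letter and accent have a precomposed form, one rough test is to see whether NFC normalization folds the pair into a single code point. This is only a sketch; the helper `has_precomposed` is a made-up name, and a few precomposed characters are deliberately excluded from NFC composition, so a negative result is a hint rather than proof.

```python
import unicodedata

ACUTE = "\u0301"      # combining acute accent
DIAERESIS = "\u0308"  # combining diaeresis (umlaut)

def has_precomposed(base, mark):
    """True if NFC folds base + combining mark into one code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

print(has_precomposed("e", ACUTE))      # common in French
print(has_precomposed("u", DIAERESIS))  # common in German
print(has_precomposed("q", ACUTE))      # not a usual combination
```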

This means that some accented characters can be represented in two different ways: either with a single precomposed character, or as a base character followed by a combining accent character. This difference can cause problems with software that does not expect it.
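
A small illustration of the kind of problem this causes: two strings that look identical can compare as different unless both are normalized first. The sketch below uses Python's standard `unicodedata` module; `same_text` and the sample strings are hypothetical names chosen for the example.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

typed = "caf\u00e9"    # precomposed "e with acute accent"
pasted = "cafe\u0301"  # "e" followed by a combining acute accent

print(typed == pasted)           # False: the code points differ
print(same_text(typed, pasted))  # True: equal once normalized
```

Search, sorting, and spell checking can all be affected in the same way if the software compares raw code points.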

So the trouble you are having with copy-paste depends upon several issues: the software that produced the PDF, the software that is displaying the PDF, the operating system you are running on, and the original input document that the PDF is based on.

Did you create the PDF file with Prince? If so, are you able to attach a sample HTML input and PDF output file here?
nico
Wonderful, I understand much better now. Thank you Mike.

I prepared an example that you can download and play with.
  1. post_2150.zip (46.7 kB)
mikeday
In your example, the HTML and PDF seem to be using single precomposed characters / glyphs for accented letters, so I wasn't able to get two characters back with copy and paste.
nico
I installed Adobe Reader on my Mac and the problem with the accented characters disappeared, but a new problem appeared: some hyphens are lost. The result is consistent with Adobe Reader on Windows: I obtained exactly the same text file after copy and paste, which I verified with an MD5 checksum.

OK, so it seems that OS X Preview is the problem and that Adobe Reader is a bit better, but not perfect. Do you have any suggestions for a better PDF reader?
mikeday
When I copy and paste from Adobe Reader on Linux, it seems to automatically strip hyphens just before linebreaks, so that "ra-conter" split across two lines becomes "raconter". This doesn't seem so bad though, as if you paste the text somewhere with a different width, you don't necessarily want to preserve the exact hyphenation.
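
For text copied from a viewer that keeps the hyphens, a similar effect can be approximated after pasting. The following is a rough sketch and not Adobe Reader's actual algorithm; the function name `join_hyphenated_lines` is hypothetical. It assumes that a hyphen directly before a line break, with a letter on each side, marks a word split by hyphenation.

```python
import re

# A hyphen directly before a newline, with a word character on each side.
_HYPHEN_BREAK = re.compile(r"(?<=\w)-\n(?=\w)")

def join_hyphenated_lines(text):
    """Remove end-of-line hyphens and rejoin the two halves of the word."""
    return _HYPHEN_BREAK.sub("", text)

print(join_hyphenated_lines("ra-\nconter une histoire"))
```

The obvious weakness is that a genuinely hyphenated compound that happens to break at its hyphen loses the hyphen as well, which may be what happened to the lost hyphens mentioned above.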