Bengali text and ligatures

dennis.leas
I am using Prince 11 with the Vrinda font on Windows 10.

I have an HTML page containing some Bengali text. It renders properly in Firefox 53.0 (32-bit) and Chrome 58.0.3029.81 (64-bit). When converted to a PDF using Prince, some of the characters do not display properly.

To display properly in the browsers, I have used ZWNJ characters to control the combining of certain Bengali characters, in particular U+09B0 and U+09AE when followed by U+09CD, U+09C7, and U+09C1.
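
To check where a ZWNJ actually sits in the source text, a minimal Python sketch like the following can list the code points of a string. It is only an illustration and not part of the attached test case: the `describe` helper and the sample string (RA, ZWNJ, vowel sign U) are assumptions made for the example.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Hypothetical sample, not taken from BengaliTest.html:
# BENGALI LETTER RA, then ZWNJ, then BENGALI VOWEL SIGN U.
sample = "\u09b0" + ZWNJ + "\u09c1"

for code, name in describe(sample):
    print(code, name)
```

Running this on text copied from the HTML source makes invisible characters such as ZWNJ visible by name, which helps confirm what the browser and Prince are each being given.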

I am following the guidelines (as best I understand them) found in:
1. http://unicode.org/review/pr-9.pdf
2. https://en.wikipedia.org/wiki/Zero-width_non-joiner
3. http://www.unicode.org/L2/L2006/06053-zwj-bengali-lig.pdf
4. http://unicode.org/review/pr-30.pdf
5. https://en.wikipedia.org/wiki/Bengali_alphabet

Prince does not seem to handle the ZWNJ character in the same manner as the browsers.
Following some suggestions on the Forum, I have added:
prince-text-replace: '\200C' '\200B'
to the CSS file (substituting ZWSP for ZWNJ).

This fixes most of the display issues. However, one character still does not display properly.

Reference (3) especially discusses non-ligature and ligature forms of combined characters. I think the troublesome character is one of these ligature characters whose "normal form" is as non-ligatured.

I have attached:
1. BengaliTest.html - the sample HTML page
2. TroublesomeText.png - shows which characters are not shown properly in PDF
3. PrintIt.css - the CSS file
4. BengaliTest-NoTextReplace.pdf - resulting PDF without the text replace
5. BengaliTest-TextReplace.pdf - resulting PDF with the text replace

TroublesomeText.png is a screenshot of the browser output, showing the proper display. The outlined characters do not show properly in the PDFs.
  1. BengaliTest-NoTextReplace.pdf (12.7 kB)
  2. BengaliTest-TextReplace.pdf (12.6 kB)
  3. BengaliTest.html (0.4 kB)
  4. PrintIt.css (0.2 kB)
  5. TroublesomeText.png (18.8 kB)
mikeday
Thank you for the very detailed test case! We will investigate this issue.
dennis.leas
Do you have an updated status on this issue?

At this point we are shipping documents with images of Bengali text because we can't use Prince to render the Bengali.
mikeday
We are still investigating. Currently ZWNJ characters are stripped before Indic shaping takes place, and preserving them until later in the pipeline will require some care.
mikeday
We have identified a possible solution, which we can release for testing in our next build in two weeks' time.

One aspect I am still a bit confused by is exactly what effects a ZWNJ can have depending on where it appears, for example just before a U+09B0 reph character at the end of a word.
dennis.leas
From what I can tell, the "Proposed Solution" section of http://www.unicode.org/L2/L2006/06053-zwj-bengali-lig.pdf gives this information:

******
"Whereas Bengali consonant conjuncts are formed using virama, virama is not appropriate in this case: the inherent vowel is not killed but is overridden by the vowel mark, and to introduce consonant + virama + vowel sequences would potentially destabilize the encoding model for Indic scripts.

Instead, these consonant-vowel conjoined forms can be treated as ligatures, and the general function of ZWJ and ZWNJ can be used for requesting or blocking the formation of ligatures. Thus, a given font implementation can choose whether or not to treat the “ligature” forms as defaults. If the non-ligated form is the default, then ZWJ can be used to request the ligature; for example: [example in table]

But if the ligated form is the default, then ZWNJ can be used to block the ligature: [another example in table]"
******

So in my example, the ZWNJ serves to block the ligature. I've attached an HTML file to illustrate.
  1. BengaliTest - ZWNJ.html (0.6 kB)
    Ligated vs. Non-ligated forms
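
The request/block behaviour quoted above can be sketched in Python as follows. This is a hedged illustration, not the contents of the attached file: the `consonant_vowel` helper is hypothetical, and placing the joiner between the consonant and the vowel sign is an assumption based on the quoted proposal. How the result renders still depends on the font and the shaping engine.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER: requests the ligature
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER: blocks the ligature


def consonant_vowel(consonant, vowel_sign, ligate=None):
    """Build a consonant + vowel sign sequence.

    ligate=None  -> no joiner, the font's default form is used
    ligate=True  -> insert ZWJ to request the ligated form
    ligate=False -> insert ZWNJ to request the non-ligated form

    Assumption: the joiner goes between the consonant and the vowel sign.
    """
    if ligate is None:
        return consonant + vowel_sign
    joiner = ZWJ if ligate else ZWNJ
    return consonant + joiner + vowel_sign


RA = "\u09b0"            # BENGALI LETTER RA
VOWEL_SIGN_U = "\u09c1"  # BENGALI VOWEL SIGN U

default_form = consonant_vowel(RA, VOWEL_SIGN_U)
ligated = consonant_vowel(RA, VOWEL_SIGN_U, ligate=True)
non_ligated = consonant_vowel(RA, VOWEL_SIGN_U, ligate=False)
```

Writing the three strings into a test page side by side gives a quick way to compare how a browser and Prince treat each variant.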
mikeday
Great, thanks. I'm getting that example right at least. :D
dennis.leas
How is the work proceeding on this issue? I will be pleased to run any tests if that would be helpful.
mikeday
Sorry, there was some delay due to travel, but I think the issues are all fixed now and we should be able to release an updated build early next week.
mikeday
The latest build is now available with support for ZWNJ characters in Bengali and other Indic scripts. Please let me know if you encounter any issues! :)
dennis.leas
It works!

We also ran some bare-bones tests on languages that use the Devanagari script, Hindi and Nepali, in addition to Bengali.

Many thanks to you and your dev team for the rapid turnaround.
mikeday
Great, thanks for your help with this issue!
twardoch
The fontkit OpenType Layout engine (written in JS) now supports Indic scripts:
https://www.princexml.com/forum/topic/3638/plans-for-modernized-opentype-font-format-support?p=1#18285

It might be possible to use it within Prince instead of the built-in engine if problems persist.