
Japanese Line Breaks

joeh
I’m working on a Japanese manual in Prince, and I have received feedback from our translators that there are a lot of bad line breaks in the PDF. From what I understand, Prince allows line breaks between any characters (other than punctuation) since spaces aren’t used in Japanese content. Because of this we’re seeing places where there are breaks in the middle of a word that could easily be avoided. For example, one of the titles in the manual has a line break that looks like this:

御霊によって教え,学
ぶ

Since “学ぶ” is one word, it would be much better if the line break looked like this:

御霊によって教え,
学ぶ

I’ve been working on a script that uses regular expressions to control line breaks, but it relies on counting characters (for example, not allowing line breaks within the 4 characters before or after a comma). This solution works well in some cases, but since I’m just counting characters and it’s not linguistically based, there are still going to be undesirable breaks inside words that could otherwise be avoided.

I’m wondering if anyone else has had similar issues working with Japanese content in Prince, and if there are any solutions that don’t involve manually editing the source content.
mikeday
I think this will require a Japanese dictionary to get it right, much like the support for Thai line-breaking which we have added recently. Until we can do this it will be necessary to edit the source document, possibly with JavaScript, e.g. to wrap words in span elements to make them unbreakable.
pjrm
A complication is that some words end in 学 (e.g. 大学), and many words start with ぶ. I wonder whether the best approach would be to have the translators explicitly mark places that shouldn't be broken. One way to make that convenient would be for the translators to type a stand-in (some character that's easily accessible on a keyboard but rarely used in text), and then convert it to the U+FEFF character (zero-width non-breaking space), whether using a preprocessor (perhaps as simple as sed) or using Prince-specific CSS (prince-text-replace: '=' '\feff').
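For the preprocessor route, here is a minimal sketch of the marker substitution in JavaScript (run under Node, for instance). It assumes '=' is the marker the translators type and that it never occurs as real content; the function name is made up for illustration. It is meant for text content only, since '=' also appears in HTML attributes.

```javascript
// Hypothetical preprocessor step: turn a typed marker into U+FEFF
// (zero-width no-break space) so that no line break is allowed there.
// Assumption: the marker ('=' by default) never occurs as real content.
function markersToNoBreak(text, marker) {
  marker = marker || '=';
  // split/join replaces every occurrence without any regex escaping concerns
  return text.split(marker).join('\uFEFF');
}
```

A translator would then type 学=ぶ in the source, and the string handed to Prince would contain U+FEFF between the two characters.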
pjrm
To be more explicit: One obvious approach to fixing this would be to implement a rule of never breaking between 学 and ぶ, or in the middle of any other sequence of characters that exists as a word. A problem is that this rule would wrongly forbid some breaks, such as where the sentence happens to contain 大学 followed by a word beginning with ぶ: in general, one needs to be a human to know whether 学 should be attached to 大 or to ぶ (or neither), and similarly for other words in the dictionary. One could try gluing all of them together when this happens, though wrongly gluing things together would produce seemingly inexplicably short or unusually spaced lines. (I would also expect that there would still be a few bad breaks just because of words not in whatever word list was used.) If the software did use guessing then the translators would still need to make corrections where it guessed wrong, presumably involving not just U+FEFF but also U+200B (zero-width space) for phrases that the software wrongly glues together.

All that said, I can see that there's a convenience argument for software that makes guesses and needs a few corrections after checking over the formatted output.
pjrm
I see that Prince has an embarrassing bug of not actually honouring U+FEFF. I've written a fix for this, so I hope we can give you a build that contains that soon.

If you want to experiment with the idea of erring towards gluing things together rather than erring towards breaking, and if you have access to a Unix machine, then you could try the attached shell script, which takes as input a list of Japanese words (one per line), and creates on its output a sed script to run on HTML documents (assuming utf-8).

The list of input words could be either something that your translators create as they go, or it could be a huge word list from a dictionary. (In the latter case, the resulting sed script might run slowly.)
Attachment: phrase-feff.sh (1.1 kB), shell script to create sed script
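For readers without a Unix machine, the same "err towards gluing" idea can be sketched in JavaScript. This is not the attached script (its contents aren't reproduced in this thread), just an illustration under the assumption that the input is plain text and the word list is an array; the function name is made up.

```javascript
// Sketch: glue every occurrence of a listed word by inserting U+FEFF
// between its characters. Overlapping matches are all glued, which errs
// towards too few break opportunities rather than too many.
function glueWords(text, words) {
  var chars = Array.from(text);                      // split by code point
  var glueAfter = new Array(chars.length).fill(false);
  words.forEach(function (word) {
    var w = Array.from(word);
    if (w.length < 2) return;                        // nothing to glue
    for (var i = 0; i + w.length <= chars.length; i++) {
      var match = true;
      for (var j = 0; j < w.length; j++) {
        if (chars[i + j] !== w[j]) { match = false; break; }
      }
      if (match) {
        // forbid a break after every character of the match except the last
        for (var k = 0; k < w.length - 1; k++) glueAfter[i + k] = true;
      }
    }
  });
  return chars.map(function (c, i) {
    return glueAfter[i] ? c + '\uFEFF' : c;
  }).join('');
}
```

With a word list containing both 大学 and 学ぶ, the string 大学ぶどう ends up with 大, 学 and ぶ all glued together, which is exactly the over-gluing case described earlier in the thread.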
hallvord
Hi joeh,
I wonder if a small JS hack based on this Node.js library might help:
https://www.npmjs.com/package/kuromoji

I've tested this quickly on the demo page here: http://takuyaa.github.io/kuromoji.js/demo/tokenize.html and with input that's potentially tricky like "大学ぶ" it seems to do the right thing. (Given the complexity of Japanese I'd not be surprised if there are words that are legitimately ambiguous, where a string of characters can have two different interpretations if words are split differently, but I hope that's really rare.. ;))

I have not tested if Prince's JavaScript engine is up for running the kuromoji lib. Actually, it would be fun testing that..

Announcement: repos for tests/utils

hallvord
I'm still experimenting here. I wrote this helper script to try to use the Kuromoji library:
https://gist.github.com/hallvors/c8803a2bf533b5f21f85b20647b86fcb
I'm calling Prince with arguments like --script node_modules/kuromoji/built/kuromoji.js --script kuromoji-helper.js but don't get your hopes too high yet - it's not working. I'm investigating why..


hallvord
Prince lacks support for parts of the new JS ArrayBuffer feature, namely constructors like Int8Array, Int16Array and Int32Array. This is currently the reason why the Kuromoji library fails. Since the ArrayBuffer constructor itself is defined, this seems like work in progress..


mikeday
Ah we only support the Uint versions! Any luck if you define Int8Array to Uint8Array, etc. ?
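A minimal sketch of that aliasing, written as a function that takes the scope object so it can be tried anywhere. The function name is made up, and passing the global object in is an assumption about how the script would be loaded. Note the caveat in the comment: this only makes the constructors exist, it does not give signed behaviour.

```javascript
// Sketch of the suggested workaround: alias missing signed typed arrays to
// their unsigned counterparts. Caveat: stored values are NOT reinterpreted
// as negative numbers, so code that relies on negative values may misbehave.
function aliasSignedTypedArrays(scope) {
  var pairs = [
    ['Int8Array', 'Uint8Array'],
    ['Int16Array', 'Uint16Array'],
    ['Int32Array', 'Uint32Array']
  ];
  pairs.forEach(function (p) {
    // only fill in constructors that are missing; never overwrite real ones
    if (typeof scope[p[0]] === 'undefined' && typeof scope[p[1]] !== 'undefined') {
      scope[p[0]] = scope[p[1]];
    }
  });
  return scope;
}
```

It would need to run in a script loaded before kuromoji.js, with the global object passed as the scope.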
hallvord
That got us a little step further indeed, but now it complains about
return bytes.subarray(0, j);

- I guess this method isn't implemented? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray/subarray


mikeday
Good point, we can include this method in the build next week. :)
joeh
Thank you all for your responses!
I think Kuromoji is a great option for what we are trying to do, so if you're able to get it working in Prince please let me know. Otherwise I will probably try to write something that will use Kuromoji to wrap spans around each word and then save it out as a new HTML file. But running the script inside Prince would be ideal. :)
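A rough sketch of what that standalone preprocessing step might look like under Node. The span class name "nobr" and the function names are made up for illustration, the dictionary path assumes a default npm install of kuromoji, and the input is assumed to be plain text rather than markup.

```javascript
// Escape text for safe inclusion in HTML.
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Wrap each multi-character token in a span that can be styled as
// unbreakable. `tokens` is an array of objects with a `surface_form`
// string, which is the shape kuromoji's tokenize() returns.
function wrapTokens(tokens) {
  return tokens.map(function (t) {
    var text = escapeHtml(t.surface_form);
    return Array.from(t.surface_form).length > 1
      ? '<span class="nobr">' + text + '</span>'
      : text;
  }).join('');
}

// Not called automatically: requires `npm install kuromoji`.
function wrapJapanese(text, callback) {
  var kuromoji = require('kuromoji');
  kuromoji.builder({ dicPath: 'node_modules/kuromoji/dict' })
    .build(function (err, tokenizer) {
      if (err) return callback(err);
      callback(null, wrapTokens(tokenizer.tokenize(text)));
    });
}
```

The stylesheet would then need a rule along the lines of .nobr { white-space: nowrap; } so that the wrapped words stay on one line.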
hallvord
You could probably use my script (see gist linked above) as a starting point for that "something" - I hope I understood the Kuromoji API correctly ;) - but I agree that running it inside Prince is better. I'll keep tinkering when mikeday gives us the promised .subarray() support, didn't see a way to fake that convincingly ;)


mikeday
The latest build supports the subarray() method and the signed integer typed arrays, although there may still be AJAX limitations which prevent the script from running completely out of the box.
neochief
@mikeday, I've been typesetting a Japanese version of my book lately, and I have discovered this whole issue. Just FYI, all of the mentioned issues are mitigated with CSS property "line-break: strict" on the web. PrinceXML doesn't support it at the moment, but should you guys implement it, working with Japanese in PrinceXML would become a breeze.

So, while there's no line-break support, here's what I do for my PDF file.

0. I enable JavaScript processing.

1. I include the following library on my HTML page: https://github.com/niklasvh/css-line-break (this is a polyfill for the line-break property)

2. Then, I include the following JS code (where "pdf-pages" is element ID that contains your main text):

<script>
    window.addEventListener('load', function() {
        var LineBreaker = window['css-line-break'].LineBreaker;

        function updateLineBreaks(text) {
            var breaker = LineBreaker(text, {
                lineBreak: 'strict',
                wordBreak: 'normal',
            });

            var str = '';
            var bk; // declared locally to avoid leaking an implicit global
            while (!(bk = breaker.next()).done) {
                var item = bk.value.slice();


                // If the returned chunk contains more than one Japanese character, this means that there should be
                // no line breaks between the characters. Therefore, we insert unicode word joiner character between
                // these characters to prevent line breaks between them.
                //
                // We can't simply write a condition of the item's length here, as the item may contain spaces.
                item = item.replace(new RegExp('[' +
                    '\u3041-\u3096'+ // Hiragana
                    '\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A\u2E80-\u2FD5'+ // Kanji
                    '\uFF5F-\uFF9F'+ // Katakana (Half Width)
                    '\u30A0-\u30FF'+ // Katakana (Full Width)
                    ']{2,}', 'g'), function(matches) {
                    return matches.split('').join('\u2060'); // U+2060 WORD JOINER
                });

                str += item;
            }
            return str;
        }

        var walkTextNodes = function(node) {
            var isNotEmptyTextNode = function(node) {
                return !/^\s+$/.test(node.data); // \s already covers newlines
            };

            var execute = function(node) {
                var child = node.firstChild;
                while (child) {
                    switch (child.nodeType) {
                        case Node.TEXT_NODE:
                            if (isNotEmptyTextNode(child)) {
                                child.textContent = updateLineBreaks(child.textContent);
                            }
                            break;
                        case Node.ELEMENT_NODE:
                            execute(child);
                            break;
                    }
                    child = child.nextSibling;
                }
            }
            if (node) {
                execute(node);
            }
        }

        walkTextNodes(document.getElementById('pdf-pages'));
    });
</script>


This code walks the text nodes of your page, finds the sequences of characters that should be kept together, and inserts a word joiner character (U+2060) between them.