
Prince and Hyphenation

dauwhe
A book I've been working on just came back from the proofreaders with hundreds of bad word breaks marked. We're using Prince 7.1 with the supplied English-language hyphenation dictionary (and 2-up, 3-down settings).

Editing the dictionary directly looks rather risky, even after figuring out the format (thank goodness for Appendix H of Knuth's TeXbook!). Having a TeX-style hyphenation exceptions dictionary would be ideal.

What happens when Prince can't follow any of the internal rules on justification? Will words be hyphenated at places not allowed by the dictionary? The book in question is admittedly a difficult situation, as it is a large-print edition and thus has relatively few characters per line.

Many entries in the dictionary look problematic to me. Some examples (hyphenation shown is Prince's actual output, not the preferred hyphenation):

coincid-ental

When I look at the dictionary, I find:

n1c2i2
ncid5en
nci2d

which means that they really want this word to hyphenate after the “d”. But this seems quite clearly wrong.

dir-ectly (I find 3dir2e and ir3ec so I guess the latter wins.)

nuc-lear (I find .nuc2le5 so I guess it never wants nu-clear which is the proper hyphenation according to Webster's (which most major publishers follow))

* * *

Some other bad breaks:

erec-ted
des-troyed
repres-ented
respon-ded
loc-ation
contemplat-ively
defens-ively
Co-lony
soph-isticated
infilt-rated
so-mething
comman-ded
detect-ives
fam-ous
biograph-ies
cap-able
commun-ists
pa-cing
acquaint-ance
prosec-uting
Un-ited
pois-oning
ima-gined
fin-ancial
conduc-ted
prosec-ution
insis-ted
evid-ence
prosec-utor’s
pro-secutor
dir-ect
defend-ant
import-ant
re-lative
deman-ded
patho-logists
se-cond
trus-ted
secur-ity
reas-onable
test-ified
defin-itely
presid-ent
differen-ces
blur-ted
pat-riotic
chan-ging
probab-ility
ima-gine
pro-gress
prear-ranged
hein-ous
evid-ence
incrimin-ate
explos-ives
apo-calyptic
independ-ence
afteref-fects
ambi-valence
res-isted
coordin-ate
star-ted
Americ-ans
ra-cing
revel-ation
accep-ted
commen-ted
dir-ection
preten-ded
agit-ated
interrog-ator
adam-antly
comprom-ised
deton-ate
lament-ations
operat-ive
deve-loped
deton-ating
interrog-ation
emer-ging

Thanks,

Dave
mikeday
From your examples it would seem that Prince is correctly following the rules in the dictionary, but perhaps the dictionary itself is incorrect? We will take a look at this.
mikeday
Switching to the standard TeX file hyphen.tex seems to produce better results, e.g. "nu-clear" instead of "nuc-lear". Does this fix most of the issues you listed? Note that you need to chop off the first and last few lines in the file, as Prince only reads the pattern lines, not the comments, \patterns, or \hyphenation directives.

I'm not sure why the current hyphenation dictionary we are using performs so much worse. We will need to investigate this further, and perhaps provide support for generating new pattern dictionaries with exception lists.
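
For anyone who would rather not trim hyphen.tex by hand, here is a minimal Python sketch of that step. It assumes the patterns sit inside a single \patterns{ ... } group, that comments start with %, and that writing one pattern per line is acceptable; the function name is made up for illustration, and the sample text only imitates the layout of hyphen.tex.

```python
def extract_patterns(tex_source):
    """Return the patterns found inside the \\patterns{...} group.

    Comments (from % to end of line), the \\patterns wrapper and
    everything after its closing brace (such as \\hyphenation{...})
    are dropped.
    """
    patterns = []
    inside = False
    for raw_line in tex_source.splitlines():
        line = raw_line.split("%", 1)[0].strip()  # strip TeX comments
        if not inside:
            if not line.startswith("\\patterns{"):
                continue  # still in the header before the patterns
            inside = True
            line = line[len("\\patterns{"):]
        if "}" in line:
            # Closing brace: keep anything before it, then stop.
            patterns.extend(line.split("}", 1)[0].split())
            break
        patterns.extend(line.split())
    return patterns


# Small sample laid out like hyphen.tex (not the real file).
SAMPLE = """% A comment header
\\patterns{ % comment after the opening brace
.ach4
.ad4der
}
\\hyphenation{
as-so-ciate
}
"""

if __name__ == "__main__":
    print("\n".join(extract_patterns(SAMPLE)))
```

The output can be saved as a plain pattern file; whether Prince needs anything beyond one pattern per line is not confirmed here.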
jim_albright
.thisIsAWordWithNoHyphens.
.thisWord3HyphenatesAtTheThree.

These can be added to the main hyphenation file.

Jim Albright
Wycliffe Bible Translators

mikeday
Good point, Jim! :D
jim_albright
As I work on Bibles with a fixed set of words, I first create a wordlist and then add odd numbers where hyphenation is allowed. I start and end each word with a period, which ensures that my rules apply only to full words. I know this doesn't work for people who have changing texts. It does work for my needs. :D

Jim Albright
Wycliffe Bible Translators

dauwhe
We'll give that a try tomorrow and I'll let you know how it goes. Thanks very much for all the advice. This is a huge issue for us!

Dave
mikeday
Another handy tool for debugging hyphenation issues is to specify "hyphens: prince-expand-all" on a paragraph; every hyphenation point will then be shown with a dot. For the standard TeX hyphenation dictionary I get "nu.cle.ar" and with our current (OpenOffice?) dictionary I get "nuc.le.ar".
jim_albright
Mike,

Thanks for sharing about seeing the hyphenation points. Prince product support rocks!

Jim Albright
Wycliffe Bible Translators

dauwhe
Replacing the supplied hyphenation dictionary resulted in vastly improved output--going from 400+ errors per book to fewer than 10. We used version 2.5 of the Hunspell dictionary... http://sourceforge.net/projects/hunspell/files/Hyphen/

I would recommend anyone using Prince for English switch to this dictionary file.

Dave Cramer
mikeday
Thanks for the tip Dave, we'll take a look at this. :)
howcome
It would be interesting to see your list of words -- both those that are correctly handled by the hunspell files, and those that are still problematic. I've been testing Prince with the files found here:

http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/

For English, the "hyph-en-us.pat.txt" seems to work well with Prince -- I'd be interested in comparing it with your findings.

-h&kon
dauwhe
I'll see if I can dig up some of our lists...

Dave
mikeday
We have updated the hyphenation dictionaries used in Prince 8.0, which should fix these issues.
bookdev
I can't seem to get Ukrainian hyphenation working in Prince 8. Could you please be so kind as to suggest what I may be doing wrong, if you don't mind?

Since Prince doesn't come with Ukrainian hyphenation, I downloaded the four Ukrainian files from http://www.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt

As you suggest, I checked that there were no comments or extraneous information at the beginning or end of the files. I checked that they were saved in utf-8. I removed the ".txt" extension and put them in my Prince hyph folder.

I created a utf-8 HTML file with:
<body lang="uk">
<p style="hyphens: prince-expand-all; text-align:justify; hyphens: auto; columns: 4; column-gap: 1em; column-fill: balance;">Ukrainian text</p>

But the resulting PDF (see below) has neither hyphenation nor bullets.

I haven't had any problem with the hyphenation files supplied with Prince. Is there something wrong with the ctan files?
Attachment: uk.jpg (58.2 kB)
mikeday
You will also need to edit style/hyph.css to add an entry for Ukrainian.
bookdev
That worked. Thanks, Mike.
eluikaplan
Beginner-level question: I'd like to experiment with modifying the default hyphenation dictionary, since we deal with a fair bit of medical and/or scientific terminology that isn't covered by the existing hyphenation rules, but I don't understand the syntax of the hyphenation dictionary file—can someone point me to a guide? (e.g., use of periods and numbers in specifying preferred/allowed/unwanted word breaks) Thanks!
pjrm
Even numbers are places where a hyphen is forbidden; odd numbers are places where a hyphen is permitted; bigger numbers override smaller numbers; dot (as in ‘.’) stands for the beginning or end of a word.
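
As a rough illustration of these rules, here is a minimal Python sketch of TeX-style pattern matching. It is not Prince's implementation: the function names are made up, the defaults of 2 letters before and 3 after a break are taken from the settings mentioned at the start of the thread, and it uses only the few patterns quoted there rather than a full dictionary.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'ncid5en' into its letters and gap values.

    values[k] is the digit that sits just before letters[k]
    (0 when no digit is written there).
    """
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    values = [0] * (len(letters) + 1)
    position = 0
    for ch in pattern:
        if ch.isdigit():
            values[position] = int(ch)
        else:
            position += 1
    return letters, values


def hyphenation_points(word, patterns, left_min=2, right_min=3):
    """Return the indexes in `word` before which a hyphen is permitted."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[k] is the gap before padded[k]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                # Bigger numbers override smaller numbers.
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    # Odd means permitted, even means forbidden; respect the minimums.
    return [i for i in range(left_min, len(word) - right_min + 1)
            if scores[i + 1] % 2 == 1]


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Insert '-' at every permitted hyphenation point."""
    pieces, previous = [], 0
    for point in hyphenation_points(word, patterns, left_min, right_min):
        pieces.append(word[previous:point])
        previous = point
    pieces.append(word[previous:])
    return "-".join(pieces)


if __name__ == "__main__":
    # Patterns quoted in the first post of the thread.
    print(hyphenate("directly", ["3dir2e", "ir3ec"]))                 # dir-ectly
    print(hyphenate("coincidental", ["n1c2i2", "ncid5en", "nci2d"]))  # coin-cid-ental
```

With only those patterns loaded, the 3 in ir3ec beats the 2 in 3dir2e, which matches the "dir-ectly" break reported above. A real dictionary has thousands of patterns, so actual results will differ.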

If you want to update an existing patterns file and don't have the original list of words that was used to generate it (as is the case with hyph-en.pat), you can force specific results by appending one line per word, with a 9 at every place where you want to force a hyphenation opportunity and an 8 at every place where you want to prevent a hyphen.

If instead you already have a list of words each in the form "uni-ver-si-ties" (i.e. giving every hyphenation opportunity for that word), and if you have access to a unix-like computer, then you can create a patterns file with the following commands:

(cat hyph-en-us.pat; sed 's/-/9/g;s/\([^0-9]\)/8\1/g;s/^8/./;s/$/./;s/98/9/g' exceptions.txt) > custom-en.pat

You will also need to tell Prince to use the new file, for example by changing "hyph-en-us.pat" to "custom-en.pat" in both places in hyph.css.
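
For those without sed, the same conversion can be sketched in Python. This is meant as an equivalent of the sed expression above for plain words with no digits in them; the function names are hypothetical, and reading and writing the actual files is left out.

```python
def exception_to_pattern(hyphenated_word):
    """Turn 'uni-ver-si-ties' into a whole-word pattern line.

    Each '-' becomes 9 (force a hyphenation opportunity), every other
    gap between letters gets 8 (forbid a hyphen), and the result is
    wrapped in dots so it only matches the complete word.
    """
    parts = hyphenated_word.strip().split("-")
    return "." + "9".join("8".join(part) for part in parts) + "."


def build_custom_patterns(base_lines, exception_words):
    """Append one forced pattern per exception word to the base patterns."""
    extra = [exception_to_pattern(word) for word in exception_words if word.strip()]
    return list(base_lines) + extra


if __name__ == "__main__":
    print(exception_to_pattern("uni-ver-si-ties"))  # .u8n8i9v8e8r9s8i9t8i8e8s.
```

Writing the list returned by build_custom_patterns out one entry per line gives a file laid out like the custom-en.pat produced by the shell command.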
eluikaplan
Exactly what I needed, thanks! One last question—is Prince able to pull from more than one patterns file at a time? For example, could I have it use both the base hyph-en-us.pat and a custom file? I ask because I only really need to employ the custom medical hyphenation for specific books, not all. I think there may also be some value in being able to modularize our custom hyphenation for other subject matter, as well.
jim_albright
yes

Jim Albright
Wycliffe Bible Translators

jim_albright
Look for TeX hyphenation. Prince uses TeX hyphenation rules.

Jim Albright
Wycliffe Bible Translators

pjrm
It depends what you mean by “at a time”: for a given span of text, only a single patterns file is applicable (even if different spans of text can be marked up as different languages or more generally styled with different values of 'prince-hyphenate-resource'). That's why I suggested creating a file that starts with the existing content of a patterns file such as hyph-en-us.pat.

Some options for specialized hyphenations are:
  • Use all the specialized hyphenations everywhere, just in case a medical term is a plot point in a novel or an application for an engineering technique or what have you. (This assumes the usual case that the specialized hyphenation is still correct in other contexts.)
  • Use a stylesheet containing a ruleset like
    * { prince-hyphenate-resource: url(...); }
  • If the desired hyphenations actually differ between different documents (according to audience), then it might be appropriate to mark up that text with a more specific language tag (say en-us-medical) and add a line to hyph.css.
pjrm
I've just started working on Prince for Books' hyphenation of English, and I'll try to include medical and chemical words within that work. However, note that different books or publishers choose different hyphenations of the same word (as evidenced by a few of the "bad" hyphenations in the list at the start of this thread, some of which are the suggested hyphenation of a different authority, or even a different year of Webster). For chemical and medical words, it should be relatively easy to give purely sense-based hyphenations (lact-ose, carcin-oma), but Webster's hyphenations are more of a mixture (e.g. it wants lac-tose but malt-ose).