October 03, 2004

The Horrors of Hyphenation

As anyone who has ever produced a large document knows, writing it is just the beginning. In our last installment, I listed some of the things left to do before I can offer bound copies of Through Darkest Zymurgia! for sale on-line. I've got much of that work done now; in particular, I've chosen the font and the page style, set the page size to 5" by 8", and got the front matter of the book almost completely ready to go. (You can download a preview of the front of the book in (what else) PDF format if you're interested.)

But there was one big step which I had forgotten--or suppressed, I'm not sure. And that big step is policing line-breaks or, in a word, hyphenation.

The soul of TeX is its justification algorithm. TeX is extraordinarily good at producing high-quality fully justified output that looks as though it were set by hand by a skilled typesetter. Unfortunately, that beautiful output comes at a cost--by TeX's standards, not all text is capable of being beautifully typeset. This usually results in what TeX calls an "overfull hbox"--that is, a line that it simply can't break without introducing "too much" whitespace into the paragraph. In such cases TeX reports the problem and allows the line to run a little long and stick out into the margin. If desired, it will also mark the offending line with a big black box, so that it will be easier to find visually.
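For the curious, here is a minimal sketch of how the black box can be switched on, assuming a LaTeX document built on one of the standard classes. The book class and the sample sentence are placeholders, not the actual Zymurgia source.

```latex
% Sketch only: the class and the text are placeholders.
% The draft option makes the standard classes mark every
% overfull line with a black rule in the margin.
\documentclass[draft]{book}
\begin{document}
A paragraph containing an unbreakable word such as
Thintwhistleandcompanyincorporated may stick out into the margin.
\end{document}
```

The draft option can change other things as well (some packages alter their behavior when they see it), so setting \overfullrule=5pt in the preamble is a narrower way to get only the marker.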

There are several ways to solve the "overfull hbox" problem. TeX is good at hyphenation, but of course it doesn't know anything about made-up words and names, nor is it aware of all of the possible word-breaks even in standard English. Often it's possible to solve the problem by inserting an explicit hyphen here or there, which tells TeX where it is allowed to break a particular word.
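An explicit hyphen of this kind might look like the following sketch, assuming plain LaTeX markup. The sentence is invented for illustration, and the break points shown are my guess at reasonable ones.

```latex
% Sketch only: the sentence and the break points are illustrative.
% \- marks a place where TeX may break the word; nothing is
% printed there unless the break is actually used.
\documentclass{book}
\begin{document}
The expedition set out at dawn for the far borders of Zy\-mur\-gia.
\end{document}
```

Once a word contains \-, TeX will break that occurrence only at the marked places, so it pays to mark every acceptable break and not just one.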

In more serious cases the appropriate words in the errant paragraph simply do not admit of hyphenation. You can't hyphenate the word "good", for example. In such cases, you can tell TeX to be "sloppy" about formatting the paragraph; this allows it to add more interword space than it would ordinarily do, and usually solves the problem.
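In LaTeX terms, one way to be "sloppy" about a single paragraph is the sloppypar environment. This is a sketch under that assumption; the paragraph text is a placeholder.

```latex
% Sketch only: the paragraph text is a placeholder.
% sloppypar relaxes TeX's spacing limits for the enclosed
% paragraph only, leaving the rest of the book untouched.
\documentclass{book}
\begin{document}
\begin{sloppypar}
It was a good day, a very good day, and nothing in the good
professor's long experience had prepared him for it.
\end{sloppypar}
\end{document}
```

The \sloppy command does the same thing for everything that follows it, which is usually more than one wants.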

Sloppy formatting has its own perils, however--once in a while it results in the dreaded "underfull hbox" warning. This means that TeX has had to add too much whitespace between the words of a line, and that its poetic soul has rebelled. One can ignore "underfull hbox" warnings, as TeX inserts the space anyway, but the annoying thing is that TeX is usually right. Too much whitespace sticks out like a sore thumb. In this case, you generally have to modify the text in some way. Sometimes you can split the paragraph in two; other times, you actually have to change the wording slightly.

There's an additional problem associated with hyphenation, which is that people's names shouldn't be hyphenated if it can possibly be avoided. It's possible to specify that a word is not to be hyphenated, but all too often so specifying leads to all of the problems listed above.
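As a sketch of what "not to be hyphenated" can look like, assuming LaTeX: wrapping a word in \mbox keeps that one occurrence in one piece. The sentence is invented for illustration.

```latex
% Sketch only: the sentence is a placeholder.
% \mbox sets its contents as a single unbreakable box, so this
% occurrence of the name can never be split across lines.
\documentclass{book}
\begin{document}
The letter was signed by Professor \mbox{Thintwhistle} himself.
\end{document}
```

The price is exactly the one described above: an unbreakable box gives TeX one less place to break the line, so overfull lines become more likely.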

TeX has no idea whether a word is a person's name or not; and sometimes, even when hyphenation can't be avoided, it will hyphenate names in the wrong place. Consider the narrator of Zymurgia, Professor Leon Thintwhistle. The good professor's last name is pronounced "Thint-whistle", yet TeX decided that it could hyphenate it "Thin-twhistle". It's possible to educate TeX about such matters, but it requires looking through the finished PDF file for hyphenation problems.
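Educating TeX about a name can be done once for the whole document with the \hyphenation command. A minimal sketch, assuming a LaTeX preamble; the sample sentence is made up.

```latex
% Sketch only: the sentence is a placeholder.
% \hyphenation lists the only places where TeX may break the
% word, here between "Thint" and "whistle".
\documentclass{book}
\hyphenation{Thint-whistle}
\begin{document}
Professor Leon Thintwhistle raised his glass to the assembled company.
\end{document}
```

Listing a word with no hyphens at all, as in \hyphenation{Leon}, tells TeX never to break it.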

All of this, I may say, is slow going. I've now spent two or three hours at it, and I've made it through chapter 10 (of 41).

It's not all bad, though. I'm taking the opportunity to add drop caps at the beginning of each chapter, and as I read through the output looking for bad line-breaks I'm finding a number of other small errors.

In the next installment--I'm not sure yet. We'll see.

Posted by Will Duquette at October 3, 2004 03:58 PM