Bertelsmann's Wikipedia

Håkon Wium Lie

Chairman, YesLogic; CTO Opera Software

howcome@yeslogic.com, howcome@opera.com

Abstract: The content and layout of Bertelsmann's printed version of German Wikipedia are discussed.

Keywords: Wikipedia, Bertelsmann, printing, book, book production

2009-01-26

The German media house Bertelsmann has published a paper edition of the German Wikipedia. The resulting «Lexikon» is impressive in several ways. It is a sizable hardback volume of 992 pages that feels solid when you hold it, and it probably fits better on your reference bookshelf at home than in your backpack while travelling. It contains around 20k articles and 1000 figures. The layout is nice, and the binding is sturdy. It costs around 20 euros, which I think is less than other «commercial» books of this size. While admiring the three-column layout (see sample on the right), you can feel good about the fact that 20k euros were donated to the German Wikipedia efforts by the publisher.

The book is pioneering in the sense that it changes the flow of publishing from a paper-to-web model to a web-to-paper model. I believe most books will be made this way in the future, and I am therefore more interested than most in the production process. I should also note that I have co-authored another book which also went from web files to paper.

Some questions come to mind when I see Bertelsmann's book. What was the production process like? How much processing, manual or automated, took place to produce the book? What software did they use? How much effort would it take to produce a similar book for other languages? Can it be done automatically?

I'll discuss possible answers to these questions by looking at two important aspects in the creation of this book: the content and the layout.

The content

The book contains around 20k short articles. In comparison, the German Wikipedia site now (January 2009) has around 850k articles. Somehow, someone must have picked out the small subset of articles that were to be printed in the book. To some extent, this can be done automatically, for example by looking at the number of links into the article, the number and frequency of edits, the length, etc. An automated first pass, combined with a manual second pass, seems like a viable approach. The automated step should be able to rank articles so that one could ask for an arbitrary number of articles. For example, it should be possible to ask for the 100 most important articles.
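The automated first pass could be sketched roughly as follows. This is only an illustration, not the method Bertelsmann used: the `Article` fields, the logarithmic damping, and the weights are all assumptions chosen for the example, and real values would have to be tuned against a manually selected sample.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    inbound_links: int  # number of links pointing at the article
    edits: int          # total number of edits
    length: int         # length of the article text in characters


def score(a: Article, w_links: float = 1.0, w_edits: float = 0.5,
          w_length: float = 0.25) -> float:
    """Combine the three signals into one number (weights are assumptions)."""
    # log1p dampens very large counts so that one signal cannot dominate.
    return (w_links * math.log1p(a.inbound_links)
            + w_edits * math.log1p(a.edits)
            + w_length * math.log1p(a.length))


def top_articles(articles, n):
    """Return the n highest-scoring articles, best first."""
    return sorted(articles, key=score, reverse=True)[:n]
```

Because the result is a full ranking, asking for the 100 most important articles is just `top_articles(articles, 100)`; the manual second pass would then review that list.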

Once the set of articles has been determined, the exact text and figures must be selected. The printed articles are typically shorter than the original Wikipedia entries. The printed version seems to copy the text from the start of the original article and continue for 1-10 sentences, with 1 or 2 being the most common length. Some heuristics can probably be employed in determining how many sentences to copy. It seems reasonable that popular articles (by links, edits, or length) should have a longer excerpt than less popular ones. In some articles, the structure will also constrain the choices. For example, in the article about «Die fabelhafte Welt der Amélie», the introduction consists of two sentences. Thereafter starts the synopsis (Handlung) which does not fit naturally with the introduction. (In the printed edition, only the first sentence is copied from that article.)
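One possible heuristic for the excerpt length is sketched below. It assumes that each article already has a popularity rank (0 for the most popular) and that only the introduction, i.e. the text before the first section heading, is passed in, so that the excerpt never runs into a synopsis. The function names, the naive sentence splitter, and the fourth-power curve are assumptions made for illustration.

```python
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # Abbreviations such as "z. B." will be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_budget(rank, total, max_sentences=10):
    """Map a popularity rank (0 = most popular) to a count in 1..max_sentences."""
    if total <= 1:
        return max_sentences
    fraction = 1.0 - rank / (total - 1)  # 1.0 for the top article, 0.0 for the last
    # The fourth power skews the result so that most articles get 1 or 2 sentences.
    return max(1, round(1 + fraction ** 4 * (max_sentences - 1)))


def excerpt(intro, rank, total):
    """Copy leading sentences of the introduction, never more than it contains."""
    sentences = split_sentences(intro)
    return " ".join(sentences[:sentence_budget(rank, total)])
```

With these assumptions, the least popular article gets a single sentence, an article in the middle of the ranking gets two, and only the very top of the ranking approaches ten.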

The book also contains figures and some tables. Figures typically consist of an image with a short textual description. Tables are only used to present countries; key statistics for each country are displayed in a tabular fashion.

The layout

The book uses a three-column layout. Most figures are contained within the columns, but some span across two columns and some extend to the edge of the page (full bleed). There are running headers indicating, in three-letter codes, the start and end subjects described on the page. The page number is displayed at the bottom of the page, on the outside edge. Page numbers are rotated 90°.
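The running headers and the page number placement are simple enough to derive mechanically. The sketch below assumes that the titles on a page are available in printed order, that the two codes are joined with a hyphen (the separator used in the book may differ), and that odd-numbered pages are right-hand pages, as is conventional in books.

```python
def running_header(titles_on_page, width=3):
    """Build a header such as 'Aac-Abe' from the first and last titles on a page."""
    if not titles_on_page:
        return ""
    first, last = titles_on_page[0], titles_on_page[-1]
    # Slicing is safe even when a title is shorter than `width`.
    return f"{first[:width]}-{last[:width]}"


def outside_edge(page_number):
    """Return the side of the page that faces away from the binding."""
    # Assumes the usual convention: odd pages are right-hand (recto) pages.
    return "right" if page_number % 2 else "left"
```

A formatter would call `running_header` once per page after pagination, since the first and last titles are only known once the text has been laid out.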

On the right, four pages are shown. Three of them are scanned pages from the book, which show different types of figures. The last page has been generated by copying text and images from the English Wikipedia into a sample document that shows how an HTML document can be turned into PDF by way of Prince. The resulting layout resembles the one used in the book. However, Prince is not able to reproduce the following:

Prince is able to place SVG figures into margin boxes, so the Wikipedia logo can be reproduced.

Conclusion

Much of the production of a Wikipedia-based printed encyclopedia can be automated. A script should be able to extract articles and content. Prince is able to format the content into a suitable layout. However, the exact layout of Bertelsmann's book cannot be replicated.