Printing Wikipedia

Håkon Wium Lie

Chairman, YesLogic; CTO Opera Software

howcome@yeslogic.com, howcome@opera.com

Abstract: Wikipedia is a fascinating repository of information. Being able to reuse it in many contexts would make it even more valuable for mankind. Printing Wikipedia articles is obviously a meaningful reuse. However, the HTML code that Wikipedia currently generates is presentational and does not easily lend itself to reformatting for printing. This paper discusses some problems and suggests some solutions.

Keywords: Wikipedia, printing, web, structured documents, style sheets

Working draft as of 2009-01-17

I believe all information will be available on the web in the not-so-distant future, and if we want to print books we must learn how to produce them from the web. HTML is a good starting point for printing. When written correctly, HTML documents are not tied to screen presentations but contain enough structure and semantics to be presented on many types of media, e.g., a printer. The presentation of HTML is described in a CSS style sheet which can be optimized for different media types.

One of the largest repositories of interesting information on the web is Wikipedia. Wikipedia also holds content that many people would like to see on paper. In Germany, Bertelsmann has published a book derived from German Wikipedia articles. That production process was not fully automatic, I believe. (At least, Bertelsmann claims copyright on the «Layout» of their book.) My approach to printing Wikipedia is slightly different: I've written a CSS style sheet that can be combined with the HTML code from Wikipedia to form a PDF file generated by Prince. That is, I remove all style sheets that come with Wikipedia articles (called "author" style sheets in CSS) and add the one I have written (called a "user" style sheet) instead. I've published some examples that show how articles can look when formatted into a two-column layout. The results so far may seem successful, but they also expose serious problems in the HTML code that Wikipedia generates.
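
A minimal sketch of what a user style sheet for a two-column print layout might contain follows. It is not the style sheet used for the published examples: the page size, margins, font size, and column gap are assumed values, and support for multi-column properties depends on the formatter.

```css
/* Hypothetical user style sheet for a two-column print layout.
   All values are illustrative assumptions. */
@page {
  size: A4;
  margin: 2cm;
}

body {
  font-family: serif;
  font-size: 10pt;
  column-count: 2;    /* lay the article out in two columns */
  column-gap: 1.5em;  /* space between the columns */
}
```

Because the author style sheets are removed, rules like these are the only source of layout, which is why the structure of the underlying HTML matters so much.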

Contributors to Wikipedia do not author in HTML. Instead they use a «Wiki» language which is, generally, at a higher level of abstraction than the resulting HTML pages. A complex set of templates is used to convert the source wiki documents to the resulting HTML code. Problems in the resulting HTML code can only be fixed by changing one or more of the templates. Just like articles in Wikipedia, templates can be edited by users. However, many of the commonly used templates are locked from editing because changing them will impact many pages.

In this paper, which for the moment is a work in progress, I will outline the problem areas one by one: infoboxes, columns, notes and references, and tables and figures.

Infoboxes

One important group of templates generates «infoboxes» in articles that belong to certain groups. For example, all countries have an infobox that lists the number of inhabitants and other easily comparable information. Although the information in the infoboxes is highly structured, the resulting HTML code is not. Instead, the code is very presentation-oriented and most information about structure and semantics is removed. Let us look at some simple examples.

FixBunching

The FixBunching template is used to bunch together several infoboxes. As such, it generates the «oldest» element in the hierarchy of infoboxes. The wiki code to call the template looks like this:

{{FixBunching|beg}}

(You can find it in the source code of the Winter War article.)

The template, when expanded, generates this HTML code:

<table style="background:transparent; float:right; margin:0 0 0.5em 1em;&#160;;">

This code does not say why the element was created (to bunch infoboxes together). Instead, it creates a new table and sets some CSS declarations on it. As said above, I remove all author style sheets when I create my two-column layouts. This includes the style attribute, and I thereby lose information about the "bunch". One solution to this problem is to add a class name to the table element which is created:

<table class="fixbunching" style="background:transparent; float:right; margin:0 0 0.5em 1em;&#160;;">

By adding the class attribute, it becomes possible to write a user style sheet that replicates, or improves upon, the styling.
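
As a sketch, assuming the class name fixbunching proposed above is actually emitted by the template, a user style sheet could restore the declarations from the removed style attribute. The print-specific override is an assumption about what an "improved" layout might look like, not a tested recommendation.

```css
/* Restore the layout that the removed style attribute used to provide */
table.fixbunching {
  background: transparent;
  float: right;
  margin: 0 0 0.5em 1em;
}

/* Possible print variation (assumption): keep the bunch in the text flow
   so it stays inside a single column */
@media print {
  table.fixbunching {
    float: none;
    margin: 0 0 0.5em 0;
  }
}
```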

Infobox Country

Another example can be found in the Infobox_Country template. As the name indicates, it records information about countries. One of the "fields" recorded is the commonly used name of the country; another is its ethnic composition. In the source code for the Norway article, the ethnic composition is described like this:

|ethnic_groups = 90.3% [[Norwegian people|Norwegian]], [[Sami people|Sami]], 
  9.7% other<ref>[http://www.ssb.no/english/subjects/00/00/10/innvandring_en/ 
  Statistics Norway - Immigration and immigrants]</ref>(2009)

When this information is converted to HTML it comes out as:

<tr>
<td colspan="2"><b><a href="/wiki/Ethnic_group" title="Ethnic group">Ethnic groups</a></b> </td>
<td>90.3% <a href="/wiki/Norwegian_people" title="Norwegian people">Norwegian</a>, <a href="/wiki/Sami_people" title="Sami people">Sami</a>, 9.7% other<sup id="cite_ref-0" class="reference"><a href="#cite_note-0" title=""><span>[</span>1<span>]</span></a></sup>(2009)</td>
</tr>

The above HTML code creates a row in the infobox table. The row is started with the <tr> element. However, software that processes the HTML cannot know that this row is about ethnicity, as there is no metadata on the <tr> element. This could easily have been fixed by adding a class:

<tr class="ethnic_groups">

In the above code, the field name ("ethnic_groups") has been inserted as a class name. This makes it possible to write a style sheet that, for example, highlights the ethnicity row in all infoboxes.
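
A sketch of such a style sheet, which only works if the proposed class="ethnic_groups" is present on the row; the background color is an arbitrary choice:

```css
/* Highlight every cell in the ethnicity row of any infobox */
tr.ethnic_groups td {
  background: #ffffcc;
}
```

Because the selector is based on the field name rather than on the visible label, the same rule would apply regardless of the language of the article, provided the field names stay the same.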

(It is possible to write a CSS selector that matches the link in the first td element by way of its title attribute ("Ethnic group"). However, this is not a good solution as (a) the selector is language-dependent; and (b) it only selects content within one cell, not the whole row.)
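
For comparison, the title-based workaround could be sketched as follows. The rule matches the link that carries the title attribute; CSS 2.1 selectors offer no way to move from that link up to its enclosing cell or row.

```css
/* Works only where the title is the English string "Ethnic group",
   and styles only the link, not the row it sits in */
td a[title="Ethnic group"] {
  background: #ffffcc;
}
```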

Some class names are used consistently in the Infobox_Country template: mergedtoprow, mergedbottomrow, mergedrow. The purpose of these seems to be presentational. That's fine; this practice allows the style to be set in external style sheets, and the use of style attributes can be avoided.
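
To illustrate how such classes let an external style sheet carry the presentation, here is a hypothetical set of rules. The three class names come from the template, but the declarations are guesses made for illustration and are not copied from Wikipedia's own style sheets.

```css
/* Assumed meaning: a group of rows drawn as one visual block */
tr.mergedtoprow td,
tr.mergedtoprow th {
  border-top: 1px solid #aaa;  /* rule above the first row of the group */
  padding-top: 0.4em;
}

tr.mergedrow td,
tr.mergedrow th {
  border: 0;                   /* no rules between rows inside the group */
}

tr.mergedbottomrow td,
tr.mergedbottomrow th {
  border-bottom: 1px solid #aaa;  /* rule below the last row of the group */
  padding-bottom: 0.4em;
}
```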

Infobox Military Conflict

An infobox used to describe military conflicts will be my third example of a template. Consider the source code for the Winter War article — the format is very similar to the country example discussed above. For example:

{{Infobox Military Conflict
|conflict=Winter War
|partof=[[World War II]]
|image=[[Image:Winter war.jpg|300px|]]

Here is an excerpt from the resulting HTML code:

<tr>
<th style="padding-right: 1em;">Location</th>
<td>Eastern <a href="/wiki/Finland" title="Finland">Finland</a></td>
</tr>
<tr>
<th style="padding-right: 1em;">Result</th>
<td><a href="/wiki/Interim_Peace" title="Interim Peace">Interim Peace</a></td>
</tr>

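Again, we see two common problems: presentational details are set through style attributes, and the rows carry no class names that say which field they hold.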

Infobox recommendations

I'm not quite done analyzing the infobox conglomerate in Wikipedia. Still, a conclusion is emerging; to improve printing of Wikipedia pages in general and infoboxes in particular, I suggest:

Columns

[difficult to print two-column layouts as some of the Wikipedia elements establish two-column layouts on their own]

This wiki markup:

{|
|valign=top|
* [[Akershus]]
* [[Aust-Agder]]
* [[Buskerud]]
* [[Finnmark]]
* [[Hedmark]]
* [[Hordaland]]
* [[Møre og Romsdal]]
* [[Nord-Trøndelag]]
* [[Nordland]]
* [[Oppland]]
|valign=top|
* [[Oslo]]
* [[Østfold]]
* [[Rogaland]]
* [[Sogn og Fjordane]]
* [[Sør-Trøndelag]]
* [[Telemark]]
* [[Troms]]
* [[Vest-Agder]]
* [[Vestfold]]
|}

results in this HTML code:

<table>
<tr>
<td valign="top">
<ul>
<li><a href="/wiki/Akershus" title="Akershus">Akershus</a></li>
<li><a href="/wiki/Aust-Agder" title="Aust-Agder">Aust-Agder</a></li>
<li><a href="/wiki/Buskerud" title="Buskerud">Buskerud</a></li>
<li><a href="/wiki/Finnmark" title="Finnmark">Finnmark</a></li>
<li><a href="/wiki/Hedmark" title="Hedmark">Hedmark</a></li>
<li><a href="/wiki/Hordaland" title="Hordaland">Hordaland</a></li>
<li><a href="/wiki/M%C3%B8re_og_Romsdal" title="Møre og Romsdal">Møre og Romsdal</a></li>
<li><a href="/wiki/Nord-Tr%C3%B8ndelag" title="Nord-Trøndelag">Nord-Trøndelag</a></li>
<li><a href="/wiki/Nordland" title="Nordland">Nordland</a></li>
<li><a href="/wiki/Oppland" title="Oppland">Oppland</a></li>
</ul>
</td>
<td valign="top">
<ul>
<li><a href="/wiki/Oslo" title="Oslo">Oslo</a></li>
<li><a href="/wiki/%C3%98stfold" title="Østfold">Østfold</a></li>
<li><a href="/wiki/Rogaland" title="Rogaland">Rogaland</a></li>
<li><a href="/wiki/Sogn_og_Fjordane" title="Sogn og Fjordane">Sogn og Fjordane</a></li>
<li><a href="/wiki/S%C3%B8r-Tr%C3%B8ndelag" title="Sør-Trøndelag">Sør-Trøndelag</a></li>
<li><a href="/wiki/Telemark" title="Telemark">Telemark</a></li>
<li><a href="/wiki/Troms" title="Troms">Troms</a></li>
<li><a href="/wiki/Vest-Agder" title="Vest-Agder">Vest-Agder</a></li>
<li><a href="/wiki/Vestfold" title="Vestfold">Vestfold</a></li>
</ul>
</td>
</tr>
</table>
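
One conceivable alternative, offered only as a sketch and not something the current templates produce, would be to emit a single list and leave the column layout to the style sheet. The class name county-list is hypothetical, and the multi-column properties are subject to formatter support.

```css
/* Hypothetical: one <ul class="county-list"> instead of a two-cell table */
ul.county-list {
  column-count: 2;
  column-gap: 2em;
}

/* In a print layout that already has two columns, let the list
   flow normally instead of creating columns of its own */
@media print {
  ul.county-list {
    column-count: auto;
  }
}
```

With a layout table, by contrast, the two halves of the list are fixed in the markup, and a user style sheet has no reliable way to tell this table from a table that holds real tabular data.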

Notes and references

[Most of the problems here can be solved, even with the current markup]
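
As one example of what the current markup already allows, the reference markers seen in the Infobox Country excerpt (a sup element with class "reference", with the brackets wrapped in span elements) can be restyled directly. The specific values below are assumptions.

```css
/* Set reference markers as small superscripts */
sup.reference {
  font-size: 0.7em;
  vertical-align: super;
  line-height: 0;        /* avoid pushing lines apart */
}

/* The brackets sit in their own span elements, so they can be hidden */
sup.reference span {
  display: none;
}

/* On paper the link itself carries no function */
sup.reference a {
  color: inherit;
  text-decoration: none;
}
```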

Tables and Figures

[Some tables and figures need to span the whole page. Which ones?]
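
If the elements in question could be identified, for instance through a class name, spanning might be expressed with the column-span property from CSS multi-column layout. The class names wide and wide-figure are hypothetical, and whether a given formatter honors column-span is an assumption to verify.

```css
/* Hypothetical classes marking content that should span all columns */
table.wide,
div.wide-figure {
  column-span: all;
  margin: 1em 0;
}

table.wide {
  width: 100%;
}
```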