Provide a mechanism not to generate PDF tags for HTML tags with no semantic meaning

David J Prokopetz
This might be more of a how-do-I than a feature request, but I'm *reasonably* certain I'm not overlooking it this time: it would be handy to have a way to ignore HTML tags that have no semantic significance and exist only for styling purposes when generating tagged PDFs.

To provide an example, we've got a document where level two headers have very complex border art that requires a couple of nested spans to correctly apply it - something like this:

<h2><span><span>Introduction</span></span></h2>


(Note that the nested spans don't actually exist in the HTML document in this case - they're inserted via JavaScript immediately before PDF conversion using the --script command line argument, if that makes any difference.)
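
For anyone trying to reproduce this, here is a rough sketch of what such a span-inserting script might look like. It is not the actual script used for this document: the function name wrapHeadingContents and its parameters are made up for illustration, and it assumes only basic DOM calls (getElementsByTagName, createElement, appendChild) are available when the script runs.

```javascript
// Hypothetical helper: wrap the contents of every <tagName> element in
// `depth` nested <span> elements, purely as styling hooks.
function wrapHeadingContents(doc, tagName, depth) {
    var headings = doc.getElementsByTagName(tagName);
    for (var i = 0; i < headings.length; i++) {
        var heading = headings[i];
        for (var d = 0; d < depth; d++) {
            var span = doc.createElement("span");
            // appendChild moves an existing node, so this empties the heading
            // into the new span one child at a time.
            while (heading.firstChild) {
                span.appendChild(heading.firstChild);
            }
            heading.appendChild(span);
        }
    }
    return headings.length;
}

// Only run automatically when a real document is present.
if (typeof document !== "undefined") {
    wrapHeadingContents(document, "h2", 2);
}
```

Run against an h2 that contains only the text "Introduction", this should produce the nested markup shown above.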

In a tagged PDF, this results in the following tag hierarchy:

<H2>
+--<Span>
    +--<Span>
         +--Introduction


That's not wrong, per se, but some of the pickier validators may not like the fact that there are a couple of semantically null Span tags floating around in there when the document semantics would be more accurately reflected by:

<H2>
+--Introduction


I'm not sure what an appropriate mechanism for handling cases like this would look like. Possibly some sort of "none" or "exclude" value for the prince-pdf-tag-type CSS property?
mikeday
Perhaps yes, we will investigate this.
David J Prokopetz
Thanks - I really appreciate the attention you're giving to this!

While reviewing the document I'm currently working with, I ran into what I believe is a related scenario. The HTML source uses <cite> elements for the names of cited works, but since our style guide calls for the titles of short works to be displayed with quotation marks rather than italics (and we've got a mix of short and long works involved), the stylesheet does something like this:

<p>Robert Bolaño, <cite class="short-work">Exiles</cite></p>
<style>
    cite.short-work { font-style: normal; }
    cite.short-work::before { content: "\201C"; }
    cite.short-work::after { content: "\201D"; }
</style>


The resulting tag hierarchy in the tagged PDF ends up looking like this:

<P>
+--Robert Bolaño, 
+--<Span>
   +--<Span>
      +--"
   +--Exiles
   +--<Span>
      +--"


i.e., owing to the ::before and ::after pseudoelements receiving their own tags, the citation ends up sandwiched between a couple of nested Spans, each containing a single quotation mark.

A more favourable tag hierarchy would look something like this:

<P>
+--Robert Bolaño, 
+--<Span>
   +--"Exiles"


i.e., with only the parent <cite> element producing a Span tag.

Edited by David J Prokopetz

David J Prokopetz
Upon further consideration, I've realised that the second issue could be worked around by adding a print-only stylesheet that sets the ::before and ::after pseudoelements of short-work citations to "display: none", and adding a scripting step that directly inserts opening and closing quotes around the contents of <cite> elements for short works before handing the document off to Prince for PDF conversion. Still, having a pure CSS solution would be convenient!
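
For reference, a minimal sketch of that scripting step might look something like the following. The function name quoteShortWorks is hypothetical, and the sketch assumes the script is run through the same --script mechanism mentioned earlier, with basic DOM calls (getElementsByTagName, createTextNode, insertBefore, appendChild) available. I haven't verified how the resulting text nodes end up being tagged, so treat it as a starting point rather than a confirmed fix.

```javascript
// Hypothetical helper: put literal curly quotes inside each
// <cite class="short-work"> so the quotes are ordinary text content
// rather than ::before/::after generated content.
function quoteShortWorks(doc) {
    var cites = doc.getElementsByTagName("cite");
    var count = 0;
    for (var i = 0; i < cites.length; i++) {
        var cite = cites[i];
        // Pad with spaces so "short-work" matches only as a whole class name.
        var classes = " " + (cite.className || "") + " ";
        if (classes.indexOf(" short-work ") === -1) {
            continue;
        }
        // U+201C and U+201D are the same characters the stylesheet used.
        cite.insertBefore(doc.createTextNode("\u201C"), cite.firstChild);
        cite.appendChild(doc.createTextNode("\u201D"));
        count++;
    }
    return count;
}

// Only run automatically when a real document is present.
if (typeof document !== "undefined") {
    quoteShortWorks(document);
}
```

The print-only stylesheet would still need to set cite.short-work::before and cite.short-work::after to "display: none", otherwise the quotes would appear twice.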