prince accessibility

Anonymous
Does Prince produce accessible PDFs using tagged PDF (for example, the alt text of an image becoming the image description in the PDF, etc.)?
mikeday
Prince does not yet support tagged PDF, although it's something that we would like to add in the future, so we would be interested in hearing what kind of requirements that you have for it.
Anonymous
If you have a well-formatted XHTML page, generate a well-tagged PDF:
<p> is <p>, alt becomes the image description, th is th, h1...h6 is h1...h6.
For now, even Adobe Acrobat Pro's import from web generates untagged PDF, and, for example, if I take one of your sample PDFs generated from HTML source, Acrobat can't add tags automatically; an error occurs.
mikeday
Right, we will need to investigate adding support for this in a future release of Prince. Thanks for the information! :)
Anonymous
Please alert me when it's done, because it will be huge progress for the accessibility of PDF on the web.
pkj
This is something that I am interested in as well.

Have you had any further thoughts about it?

Regards

Peter
mikeday
Support for tagged PDF would be good to have in Prince, and I've added it to the roadmap for 7.0. The reason we still haven't done it yet is that we haven't had the time, due to the demand for other features and improvements. Would you be able to give us some examples of situations in your work that would benefit from tagged PDF support?
pkj
It's purely for accessibility. Many of our clients are in local government and take accessibility rather seriously. Up until now it's been focused on the HTML side of things, but yesterday we came across this requirement:

"It would also be desirable for PDFs produced by the system to include accessibility options. They should be correctly structured and tagged, include a reading order, and a language specification."

Peter
Florent V.
I would like to support goetsu's and pkj's requests. The ability to produce accessible PDF with Prince would be great.

Client requests for accessible PDF are bound to increase in number here in France due to the evolution of legislation. And I reckon the situation is not so different in many other countries.

Moreover, I'd rather produce accessible PDF than not (because that would increase its overall quality), especially if it requires no extra work on my part. If producing accessible (X)HTML and outputting to PDF through Prince is all it takes to get quite-accessible PDF, I'm all for it. :P
goetsu
Any news on that topic?
mikeday
Not yet, as supporting tagged PDF requires some fairly major changes to Prince. However, we are adding support for Arabic and Indic scripts in Prince 7.0, which will require some basic support for span tags to preserve the accessibility of the original text, so this may be a good starting point.
mikeday
It seems that in most cases Adobe Reader is smart enough to handle copying and pasting text in complex scripts even without any tagging in the PDF file, so this feature may not be a requirement for Prince 7.0 after all. However, we would be interested to know of any specific requirements for tagged PDF that are mandated by accessibility legislation.
goetsu
See WCAG 2.0 success criterion 1.3.1 Info and Relationships: "Information, structure, and relationships conveyed through presentation can be programmatically determined or are available in text." (Level A)

For me, the minimum tags needed to meet this guideline and national accessibility legislation (such as RGAA in France and SGQRI in Quebec) are: headings, lists, links, table, tr, th, td and paragraphs.

A plain text version is not an accessible version.
goetsu
Can we hope to see this feature integrated some day?
mikeday
Yes, we will get there. Right now the focus is on JavaScript support for the next major release of Prince, but we will return to layout and tagging after that. Sorry for the delay, I know this has been on the roadmap for quite a while now. :D
mbrandon
Hi Mike,
Have there been any updates on making accessible PDFs for 508? We really need this.
Thanks
mikeday
Do you know of any tool or validator to check if a PDF file complies with 508?
Louis
Before you get sidetracked with the rest of my post, here's some classic A List Apart from fellow Torontonian Joe Clark on PDF accessibility: http://www.alistapart.com/articles/pdf_accessibility

Slightly dated since it's from 2005, but a good read to put everything in perspective. Don't miss the link in "Tags and structure" on Predefined PDF Tags. A fantastic look at PDFs and accessibility circa 2005, it even covers a tutorial on tagging and TouchUp in Acrobat. And that Predefined PDF Tags link has, after the long list, definitions of each tag in a nice HTML excerpt from Creating Accessible PDF Documents with Adobe Acrobat 7.0: A Guide for Publishing PDF Documents for Use by People with Disabilities, Appendix A (PDF), © 2005 Adobe Systems.

mikeday wrote:
Do you know of any tool or validator to check if a PDF file complies with 508?

Some background on PDF/UA in a push for its 508 standard recognition: http://65.181.155.238/blogs/duffjohnson/testimony-us-access-board-public-hearing-section-508

After some googling on PDF accessibility:
  • Acrobat Pro does validation (versions 7-X, with tags supported since Acrobat 5's XML integration)
  • Contrast Analyser (supports PDF, freeware, limited to "colour visibility" obviously)
  • pdfreactor, a competitor to Prince, seems to support tagged PDFs if the option is enabled, but is per-CPU server-licensed only and the academic price is higher than Prince. (And its typography looks terrible!)
  • An informative and at times funny presentation of TeX support for tagged PDF generation: http://www.fi.muni.cz/~sojka/dml-2009-moore.pdf

The following is retyped from http://www.cs.nott.ac.uk/~dfb/Publications/Download/2002/Hardy02.pdf:

A Tagged PDF must conform to a set of rules, which allow the document to be more accessible. These properties can be separated into three categories:

1. Page Content
  • All represented text is in a form that can be converted to
    Unicode.
  • Word breaks are explicitly represented.
  • Actual content is distinguished from artifacts of layout
    and pagination.
  • Content must be given an order related to its appearance
    on the page.

2. Structure Types
  • Standard structure types are used within the structure
    tree to convey the semantics of the structure.
  • When using customised tagsets, these must be mapped
    to their closest equivalent standard structure types.

3. Structure Attributes
  • Standard structure attributes used to preserve styling information from authoring applications.

Table 1: List of Standard Structure Types
  • P, H, H(1-6) Paragraph and Heading tags containing textual content.
  • L, LI, LBody List tags describing a List, List Item and List Body respectively.
  • Table, TH, TR, TD Table tags for displaying a Table, Table Headings, Rows and Data respectively.
  • Document, Art, Part, Sect, Div Standard structure types used for grouping content.
  • Figures, Form Tags representing figures and interactive form elements. [Ed. note, by forms do they mean XFDF since v5 or XFA since v7?]
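
To make the table above concrete, here is a minimal Python sketch of how common HTML elements might be mapped onto the standard structure types. The `HTML_TO_PDF_TAG` table and the `pdf_tag_for` helper are hypothetical names invented for this illustration; they are not part of Prince or of any PDF library, and the exact choices (for instance `section` to `Sect`) are assumptions rather than a mandated mapping.

```python
# Hypothetical mapping from HTML element names to PDF standard structure types.
# The right-hand values are standard structure type names from ISO 32000-1, 14.8.4;
# which HTML element maps to which type is an assumption made for this sketch.
HTML_TO_PDF_TAG = {
    "p": "P",
    "h1": "H1", "h2": "H2", "h3": "H3",
    "h4": "H4", "h5": "H5", "h6": "H6",
    "ul": "L", "ol": "L", "li": "LI",
    "table": "Table", "tr": "TR", "th": "TH", "td": "TD",
    "img": "Figure",
    "a": "Link",
    "div": "Div", "section": "Sect", "article": "Art",
}


def pdf_tag_for(html_name, default="Div"):
    """Return the structure type for an HTML element name.

    Unknown or custom elements fall back to a grouping type, which mirrors
    the rule that customised tagsets must be role-mapped to the closest
    standard structure type.
    """
    return HTML_TO_PDF_TAG.get(html_name.lower(), default)
```

A real writer would also need to wrap list item content in LBody and carry the alt text of images into the Figure element; this sketch only covers the naming step.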

For those with more of a technical or programming bent, you can check out the full ISO 32000-1 PDF standard:

Adobe has an agreement with ISO that it can post the standard for free provided it isn’t the “official” ISO version. So the running headings and footings have been changed and the introductory pages are different, but the technical chapters are identical including page and section numbering. If you need an official version please pay for it at the ISO site. This is one source of income for this important standards organization.

http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf (like 700+ pages!)

Section 14.8 of that document covers Tagged PDF. In the Adobe version, it begins on page 582 of 756.

Tagged PDF (PDF 1.4) is a stylized use of PDF that builds on the logical structure framework described in 14.7, “Logical Structure.” It defines a set of standard structure types and attributes that allow page content (text, graphics, and images) to be extracted and reused for other purposes. A tagged PDF document is one that conforms to the rules described in this sub-clause. A conforming writer is not required to produce tagged PDF documents; however, if it does, it shall conform to these rules.

NOTE 1 It is intended for use by tools that perform the following types of operations:
  • Simple extraction of text and graphics for pasting into other applications
  • Automatic reflow of text and associated graphics to fit a page of a different size than was assumed for the original layout
  • Processing text for such purposes as searching, indexing, and spell-checking
  • Conversion to other common file formats (such as HTML, XML, and RTF) with document structure and basic styling information preserved
  • Making content accessible to users with visual impairments (see 14.9, “Accessibility Support”)

A tagged PDF document shall conform to the following rules:
  • Page content (14.8.2, “Tagged PDF and Page Content”). Tagged PDF defines a set of rules for representing text in the page content so that characters, words, and text order can be determined reliably. All text shall be represented in a form that can be converted to Unicode. Word breaks shall be represented explicitly. Actual content shall be distinguished from artifacts of layout and pagination. Content shall be given in an order related to its appearance on the page, as determined by the conforming writer.
  • A basic layout model (14.8.3, “Basic Layout Model”). A set of rules for describing the arrangement of structure elements on the page.
  • Structure types (14.8.4, “Standard Structure Types”). A set of standard structure types define the meaning of structure elements, such as paragraphs, headings, articles, and tables.
  • Structure attributes (14.8.5, “Standard Structure Attributes”). Standard structure attributes preserve styling information used by the conforming writer in laying out content on the page.
A Tagged PDF document shall also contain a mark information dictionary (see Table 321) with a value of true for the Marked entry.
NOTE 2 The types and attributes defined for Tagged PDF are intended to provide a set of standard fallback roles and minimum guaranteed attributes to enable conforming readers to perform operations such as those mentioned previously. Conforming writers are free to define additional structure types as long as they also provide a role mapping to the nearest equivalent standard types, as described in 14.7.3, “Structure Types.” Likewise, conforming writers can define additional structure attributes using any of the available extension mechanisms.


I'll stop copying and pasting there: with all the tables, 14.8 alone runs from page 582 to 610, and 14.9 on Accessibility Support continues for another 6 pages, for a total of 35 pages. Actually, I'm surprised it's only 35 pages. I suppose it relies on already understanding PostScript, as I'm completely lost in half the examples.

I wonder why there's so little documentation on tagged PDFs? I mean, I'd use wkhtmltopdf but it borrows Qt libraries and those don't appear to output tagged PDFs either. And it's only been what -- eight years -- since such things were first introduced? I don't want to have to write LaTeX to automatically produce and tag my PDFs ...

It appears even if I swapped to native (Mac) support of PDF, it *too* doesn't write tags. Only OpenOffice, Word, and other more common/official converters seem to -- and even then, you'll still have to confirm its accessibility in Acrobat, renaming or cleaning up tags.

It's sad then that Prince doesn't support accessibility tagging, as this could give this kind of document generation a leg up on other methods, which usually try to handle tagging 100% automatically and then you'd end up having to re-tag parts manually. Having instead more control over the input/output process and tagging could allow for automation, along with automatic mapping of common HTML tags to PDF ones, alt tags for images, etc. There's a real gap here, and if you guys don't address it, perhaps someone will modify QPrinter to support it, and then wkhtmltopdf might have the leg up.

Just my 2 cents. (Oh who am I kidding, I wasted a couple hours on Google. Hehe.)

Not giving up, I found one last alternative for accessible PDF output, Apache FOP: http://xmlgraphics.apache.org/fop/1.0/accessibility.html (or a few of the commercial XSL-FO alternatives)

If only I wasn't so allergic to XSL-FO. Maybe I can use CSSToXSLFO 1.6.2 from http://sourceforge.net/projects/css2xslfo/
There are examples here: http://www.re.be/css2xslfo/examples.xhtml ... but somehow again I feel the typography isn't up to snuff. Maybe it's just me. At this rate I'm looking at automating Adobe InDesign. C'est la vie.
mikeday
Thank you for the detailed post! We do hope to get this done, it's just not as simple as it first appears, as Prince needs to keep track of the relationship between input and output in ways that it currently doesn't. Still, we'll get there eventually. :)
Louis
mikeday wrote:
Thank you for the detailed post! We do hope to get this done, it's just not as simple as it first appears, as Prince needs to keep track of the relationship between input and output in ways that it currently doesn't. Still, we'll get there eventually. :)

Any idea on a timeline? What bugs me the most is that even if I rewrite the XML to Adobe's tags on my own, there's afaik no tool to match an input XML doc to an output PDF. Seems like adding and editing tags after the fact could be quite necessary for those for whom accessibility is more than a bullet point or buzzword. Perhaps I should re-examine Acrobat's tagging features.
mikeday
Say, 2011? Hopefully! :D
Boris
Hi,

If it helps you prioritize your tasks, I'm eagerly waiting for this improvement as well. ;)
goetsu
"mikeday" wrote:
Do you know of any tool or validator to check if a PDF file complies with 508?


Here is one: http://www.access-for-all.ch/en/pdf-wer ... ecker.html
RigorMortis
I've run into this issue too. While I can't yet find an easy way to add working replacement text for images (at least in a way that Adobe will use), I have written up a quick and dirty Perl script that adds Language and Text Direction entries to the PDF generated by PrinceXML (this version also rewrites the xref section and updates the startxref entry):

# Usage: rejigger.pl LANG TDIR INFILE OUTFILE
# LANG is a language identifier (eg en-US or en-AU) (this is NOT CHECKED!)
# TDIR is the text direction - either L2R or R2L (this is NOT CHECKED!)
# INFILE is the Prince generated PDF
# OUTFILE is where to write the altered PDF
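use strict;
use warnings;
die "Usage: rejigger.pl LANG TDIR INFILE OUTFILE\n" unless @ARGV == 4;
my ($lang, $dir, $inf, $outf) = @ARGV;
open(my $infh, '<', $inf) or die "Bad infile '$inf': $!";
open(my $outfh, '>', $outf) or die "Bad outfile '$outf': $!";
# PDF is a binary format: disable newline translation so byte offsets stay exact
binmode($infh);
binmode($outfh);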
my $done = 0;
my $repl = "/Lang ($lang) \n/ViewerPreferences <</Direction /$dir>> \n";
my $offsets = {};
my $len = 0; # Running amount of data written to outfile
my $xlen = 0; # Amount of data written before the "xref" line
while (defined(my $line = <$infh>)) {
  if ($line =~ /^(\d+)\s+(\d+)\s+obj\s*\r?\n\z/) {
    my $objn = $1;
    $offsets->{$objn + 0} = $len;
  } elsif ($line eq "xref\n") {
    $xlen = $len;
  }
  $len += length($line);
  print $outfh $line;
  if ($line eq "xref\n") {
    rewrite_xref($infh, $outfh, $offsets);
  } elsif ($line eq "startxref\n") {
    rewrite_startxref($outfh, $xlen);
    last;
  }
  # Insert the new keys once, directly after the line that opens the Catalog dictionary
  next if ($done or $line !~ /^<<\/Type\s*\/Catalog/);
  print $outfh $repl;
  $len += length($repl);
  $done = 1;
}
close($infh);
close($outfh);

sub rewrite_xref {
  my ($ifh, $ofh, $offsets) = @_;
  my $line = <$ifh>; # get max obj
  my ($max) = $line =~ /^\d+\s+(\d+)/;
  print $ofh $line;
  $line = <$ifh>; # get zeroth offset
  print $ofh $line;
  for my $i (1..($max - 1)) {
    if (not exists $offsets->{$i}) {
      print $ofh "0000000000 65535 f \n";
    } else {
      printf $ofh "%010d 00000 n \n", $offsets->{$i};
    }
  }
  # Skip the original entries; stop at the first non-entry line (or at EOF)
  while (defined($line = <$ifh>)) {
    last if $line !~ /^\d+\s*\d+\s*\w/;
  }
  print $ofh $line if defined $line;
}

sub rewrite_startxref {
  my ($ofh, $xlen) = @_;
  print $ofh $xlen, "\n%%EOF\n";
}


Note that this expects exactly the output from the current version of Prince (7.1), which splits the Catalog dictionary to have one key/value pair per line. If that changes (for instance if the Catalog dictionary is put all on one line) this script will corrupt your PDF.
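
The xref rewriting in the script above depends on the fixed-width cross-reference entry format: a 10-digit byte offset, a 5-digit generation number, a one-letter type, and a two-character end of line, for 20 bytes in total. As a minimal sketch of just that formatting step in Python (the `xref_entry` helper is a hypothetical name for this illustration, and it assumes a space followed by a line feed as the end-of-line pair, as the Perl script does):

```python
def xref_entry(offset, generation=0, in_use=True):
    """Format one cross-reference table entry.

    Each entry is exactly 20 bytes: 10-digit offset, space, 5-digit
    generation number, space, 'n' (in use) or 'f' (free), then a
    two-character end of line (space + LF here).
    """
    kind = "n" if in_use else "f"
    entry = f"{offset:010d} {generation:05d} {kind} \n"
    if len(entry) != 20:
        # An offset or generation number too large for its field breaks the table
        raise ValueError("xref entry does not fit the fixed 20-byte format")
    return entry
```

Because every entry has the same length, a reader can seek straight to the entry for a given object number, which is why any byte inserted earlier in the file forces all later offsets to be recomputed.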

It should be easy to incorporate this directly into PrinceXML (obviously in the right language and prefixed with the code to parse out the language/direction)... ;)

Edited by RigorMortis

mikeday
Technically this will break the cross-reference dictionary at the end of the PDF file, but some PDF viewers will not complain about this. Does it even make sense to have a language / direction for the entire PDF file? What if the file contains text in multiple languages or scripts?
RigorMortis
mikeday wrote:
Technically this will break the cross-reference dictionary at the end of the PDF file, but some PDF viewers will not complain about this. Does it even make sense to have a language / direction for the entire PDF file? What if the file contains text in multiple languages or scripts?


Ah, so that's why Acrobat Pro complains about fixing a broken PDF (but reader is silent) - I'll have to look into that.

I'm running off the PDF 1.7 spec. For the Lang key of the Catalog dictionary (section 7.7.2) it has:

Lang (text string):
(Optional; PDF 1.4) A language identifier that shall specify the natural language for all text in the document except where overridden by language specifications for structure elements or marked content (see 14.9.2, "Natural Language Specification"). If this entry is absent, the language shall be considered unknown.

And for the Direction key of the Catalog >> ViewerPreferences dictionary:

Direction (name):
(Optional; PDF 1.3) The predominant reading order for text:
L2R Left to right
R2L Right to left (including vertical writing systems, such as Chinese, Japanese, and Korean)
This entry has no direct effect on the document’s contents or page numbering but may be used to determine the relative positioning of pages when displayed side by side or printed n-up. Default value: L2R.

Essentially they're both global defaults. Direction makes little difference unless you're actually using an R2L language, but the lack of a Lang key is "bad" as far as accessibility is concerned. Lang is only a document-wide default, so it can still be overridden in parts of the PDF if necessary.
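
As a small illustration of the two entries quoted above, here is a Python sketch that builds the same Catalog lines the Perl script inserts, but with the checks the script skips. The `catalog_entries` function is a hypothetical helper, and the language pattern is a deliberately loose assumption (it accepts shapes like "en" or "en-US"), not a full language-tag validator.

```python
import re

# Loose shape check for identifiers such as "en", "fr", or "en-AU" (an assumption,
# not a complete validation of language tags).
_LANG_RE = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*")


def catalog_entries(lang, direction="L2R"):
    """Return the /Lang and /ViewerPreferences lines for a Catalog dictionary."""
    if direction not in ("L2R", "R2L"):
        raise ValueError("direction must be 'L2R' or 'R2L'")
    if not _LANG_RE.fullmatch(lang):
        raise ValueError(f"unexpected language identifier: {lang!r}")
    return f"/Lang ({lang})\n/ViewerPreferences <</Direction /{direction}>>\n"
```

Rejecting unexpected input matters here because the value is written inside a PDF literal string, where a stray parenthesis or backslash would change the meaning of the dictionary.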
mikeday
We could grab the language and direction from the root element, or the HTML <body> element, and use those. But that won't necessarily be the best choice, and things start to get complicated. The most tempting approach I feel would be to extend the JavaScript API for specifying PDF metadata, then it will be easy for users to customise it in whatever way they like.
RigorMortis
mikeday wrote:
We could grab the language and direction from the root element, or the HTML <body> element, and use those. But that won't necessarily be the best choice, and things start to get complicated. The most tempting approach I feel would be to extend the JavaScript API for specifying PDF metadata, then it will be easy for users to customise it in whatever way they like.


The root html element was where I was expecting that info to be picked up (my auto-generated html code already adds xml:lang, lang and dir attributes to the <html> tag), and that leaves the ability to add lang attributes to <span>'s and whatnot. I haven't looked at the JS API yet though, so that might be a good place to at least override things, but I would have thought that as long as the info is there it might as well be used without a duplicate definition elsewhere.
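
For illustration, picking up that information from the root element could look roughly like the following Python sketch, which uses only the standard library's html.parser. The `RootLangParser` class and `root_language` function are hypothetical names for this example, and preferring xml:lang over lang is an assumption, not a description of how Prince behaves.

```python
from html.parser import HTMLParser


class RootLangParser(HTMLParser):
    """Record the language and direction declared on the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.direction = None
        self._seen_root = False

    def handle_starttag(self, tag, attrs):
        if tag != "html" or self._seen_root:
            return
        self._seen_root = True
        attributes = dict(attrs)
        # Assumption: xml:lang wins over lang when both are present
        self.lang = attributes.get("xml:lang") or attributes.get("lang")
        # Map the HTML dir value to the PDF Direction names; default is left to right
        html_dir = (attributes.get("dir") or "ltr").lower()
        self.direction = "R2L" if html_dir == "rtl" else "L2R"


def root_language(html_text):
    """Return (lang, direction) from the root element, or (None, None) if absent."""
    parser = RootLangParser()
    parser.feed(html_text)
    return parser.lang, parser.direction
```

The result could then feed the Lang and Direction entries as document-wide defaults, leaving lang attributes on inner elements such as <span> to be handled separately.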

BTW - I modified my script above to include updating the startxref count. Untested of course :)
RigorMortis
RigorMortis wrote:
BTW - I modified my script above to include updating the startxref count. Untested of course :)


Wow, and in doing so ignored the xref table completely. More coding for me :)
RigorMortis
Another thing I noticed on this same topic is that the "title" attribute for images doesn't seem to be getting added as mouse-over text. The little "P" image in the corner of PDFs generated with an unlicensed version of Prince _does_ have this, though, as an annotation (/Annot entry). While this is not strictly the correct way of handling replacement text for images, it should be a very easy way of increasing the accessibility of generated PDFs by reusing existing code.

BTW, just so you don't think I'm being deliberately negative, I'm pushing very hard to get approval for a licensed copy of Prince going forward for my work's site - compared to the alternatives we've been offered it rocks :)
pronik
@mikeday: PDF tags appeared on my radar in connection with government-related accessibility so I'm interested in this feature too! Thanks!
henning
Hi,

I was wondering if you are going to put accessible PDFs on your roadmap. Accessibility issues are becoming more and more important, for example for government websites. All our documents have to comply with WCAG 2.0.

Here is a W3C document that outlines how to make PDF documents accessible:
http://www.w3.org/WAI/GL/WCAG20-TECHS/pdf.html

There is also a lot of information available on the Adobe website.

There is also a free tool that checks PDFs for accessibility issues:
http://www.access-for-all.ch/en/pdf-werkstatt/pdf-accessibility-checker-pac.html

Thanks,

Henning
mikeday
Thanks for the pointers, the conformance checker looks helpful.
joelmeador
Any update on when tagging might be released?
mikeday
Not in Prince 8.1, as it requires some fundamental changes to the internals of the layout engine.
fbrzvnrnd
Hi,

Any news about accessibility? We are building documents for government, and Prince's accessibility is still very poor. Running PAC (http://www.access-for-all.ch/en/pdf-lab/pdf-accessibility-checker-pac/download-pac.html) on a Prince PDF we made, I got a bad result...


f.
mikeday
Yes, we are actively working on tagged PDF, and plan to include support for it in Prince 9.0.
fbrzvnrnd
Thank you for the answer.

f.
fbrzvnrnd
Hi, is there any news on tagged PDF and accessibility in Prince 9?
mikeday
Prince 9 supports PDF/A-1b, which includes various metadata, but not tagging. We have experimental support for PDF/A-1a, which does include tagging, and will release it this year.
rpilkey
Thanks for working on accessibility, we are looking forward to using it in our group.
Louis
Yay, glad to hear progress on this. I haven't had a project that specifically required Prince for a while now, but I'll upgrade to show my support for the accessibility features when that experimental build ships.

Thanks for continuing to build out all this new functionality. I can't remember the last time I ever felt that way about a new copy of Acrobat ;-)
BGrenon
Any more news on the experimental build for this? Will it be available to those of us willing to experiment?
Srinivas
Does the latest version of Prince XML support annotations? I am looking to convert HTML to PDF using Prince XML. How do I specify annotations in HTML? Are there any specific tags?
mikeday
Prince does not support the creation of arbitrary PDF annotations or forms at this time.

We're testing support for tagged PDF, please send me an email (mikeday@yeslogic.com) and let me know if you would like to try it out, and which platform you are running Prince on.