Forum Bugs

Some exported PDFs have issues with ToUnicode CMap

stig
We produce PDFs for our customers using Prince, and some of our customers have had issues with the PDFs when they needed to convert them to HTML.

While we have no access to the converter our customer uses, we suspect it may be based on pdf2htmlEX, and we have managed to reproduce a similar issue ourselves.

Attached are two HTML documents that differ only in the presence of a third p element. Both are processed by Prince 15.2 without issues. However, the document with only two p elements produces the following warning when reconverted by pdf2htmlEX:

ToUnicode CMap is not valid and got dropped for font: 1


When this warning is produced, the text in the converted HTML is corrupted, and copying and pasting it results only in a mess of missing Unicode characters.

I have used a Google Font for better reproducibility, but have also seen the same warning with a sans-serif system font.
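
For completeness, the round trip I'm doing is just the two tools run back to back. A rough sketch of it as a Python script (assuming prince and pdf2htmlEX are both on PATH and using the attached file names):

# Rough sketch of the round trip: Prince renders the HTML to PDF, then
# pdf2htmlEX converts the PDF back to HTML. Assumes both tools are on PATH
# and uses the attached file names.
import subprocess

for name in ("three-para-success", "two-para-fail"):
    subprocess.run(["prince", f"{name}.html", "-o", f"{name}.pdf"], check=True)
    # The two-para document triggers the "ToUnicode CMap is not valid"
    # warning at this step; the three-para document converts cleanly.
    subprocess.run(["pdf2htmlEX", f"{name}.pdf"], check=True)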

This could be a bug in pdf2htmlEX and not Prince, so I'm thankful for any help to find the root of the issue.

PS: even the successful conversion to HTML leaves some slightly corrupted text where ligatures occur: "fling, fish, and affinity" becomes "$ing, #sh, and a%nity".
  1. three-para-success.html (0.8 kB)
    Three p elements
  2. two-para-fail.html (0.7 kB)
    Two p elements
wezm
If I run pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.AppImage from the latest release, I can replicate the warning. However, the generated HTML does seem to render properly in my browser (Firefox).

Looking at the ToUnicode object in the generated PDFs shows the following differences between the successful one and the failed one:

--- success.txt 2024-03-06 09:20:20.170008615 +1000
+++ fail.txt    2024-03-06 09:20:10.190008558 +1000
@@ -11,10 +11,12 @@
 1 begincodespacerange
 <00> <FF>
 endcodespacerange
-1 beginbfrange
-<09> <22> <0061>
+3 beginbfrange
+<09> <11> <0061>
+<12> <16> <006c>
+<17> <1c> <0072>
 endbfrange
-11 beginbfchar
+12 beginbfchar
 <01> <0020>
 <02> <002c>
 <03> <002e>
@@ -23,9 +25,10 @@
 <06> <004c>
 <07> <004f>
 <08> <0054>
-<23> <00660069>
-<24> <0066006c>
-<25> <006600660069>
+<1d> <0079>
+<1e> <00660069>
+<1f> <0066006c>
+<20> <006600660069>
 endbfchar
 endcmap
 CMapName currentdict /CMap defineresource pop
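
To make that easier to read, here is a quick hand-written sketch (not pdf2htmlEX code) that expands the failing CMap's entries, as far as they are visible in the diff, into a plain code-to-text table:

# Quick sketch: expand the bfrange/bfchar entries of the failing CMap (only
# the entries visible in the diff above; <04> and <05> are not shown there).
bfranges = [  # (first code, last code, first Unicode scalar)
    (0x09, 0x11, 0x0061),  # <09>..<11> -> 'a'..'i'
    (0x12, 0x16, 0x006C),  # <12>..<16> -> 'l'..'p'
    (0x17, 0x1C, 0x0072),  # <17>..<1C> -> 'r'..'w'
]
bfchars = {  # single code -> Unicode text
    0x01: "\u0020", 0x02: "\u002C", 0x03: "\u002E",
    0x06: "\u004C", 0x07: "\u004F", 0x08: "\u0054",
    0x1D: "y",
    0x1E: "fi",   # <1E> -> U+0066 U+0069
    0x1F: "fl",   # <1F> -> U+0066 U+006C
    0x20: "ffi",  # <20> -> U+0066 U+0066 U+0069
}

table = {}
for lo, hi, dst in bfranges:
    for code in range(lo, hi + 1):
        # Within a bfrange, the destination scalar increments with the code.
        table[code] = chr(dst + (code - lo))
table.update(bfchars)

for code in sorted(table):
    print(f"<{code:02x}> -> {table[code]!r}")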


This is the specific place where the error originates:

https://github.com/pdf2htmlEX/pdf2htmlEX/blob/6f85c88b1df66b1658bef6a8c478fd5e0ed684af/pdf2htmlEX/src/HTMLRenderer/font.cc#L663

Looking at the code, it reports the error when more than one PDF character maps to more than one Unicode string. In the case of the failing document it seems that the issue is <20>: it's contained within the range '<17> <1c> <0072>', yet there's a later mapping '<20> <006600660069>'.

I suspect this is a bug in pdf2htmlEX. https://github.com/pdf2htmlEX/pdf2htmlEX is the continuation of https://github.com/coolwanglu/pdf2htmlEX, and the latter, archived repo has a number of issues reporting this error.

Page 52 of the CMap file specification (which ToUnicode uses) notes:

> Code mappings (unlike codespace ranges) may overlap, but succeeding maps supersede preceding maps.

So I think what we're generating is valid, which is also bolstered by the fact that the PDFs render fine in PDF readers.
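
As a toy illustration of how I read that rule (the codes and values below are hypothetical, not the actual entries): a consumer that hits an overlapping mapping should simply let the later entry win rather than drop the whole CMap.

# Toy illustration of the "succeeding maps supersede preceding maps" rule.
# The codes and values here are hypothetical, not taken from the actual CMap.
mappings = [
    (0x1C, "w"),    # e.g. produced by an earlier bfrange
    (0x1C, "ffi"),  # a later bfchar mapping the same code again
]

tounicode = {}
for code, text in mappings:
    # A later mapping simply replaces an earlier one; the CMap stays valid.
    tounicode[code] = text

assert tounicode[0x1C] == "ffi"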
wezm
Oops, I am mistaken: <20> is not in the <17> <1c> range, so I'm not sure what repeat entry pdf2htmlEX is hitting.
wezm
If you pass '--tounicode 1' to pdf2htmlEX it will ignore duplicate CMap entries. Is the output suitable then?

I haven't been able to work out where the duplicate entry is coming from. It's trying to map PDF char 0x20 to Unicode 0x20, but that mapping is not in the CMap; PDF char 0x20 is mapped to U+0066 U+0066 U+0069.
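
For reference, rerunning the failing conversion with that option would look something like this (assuming the same file name as the attachment):

# Re-run the failing conversion, telling pdf2htmlEX to ignore duplicate
# ToUnicode CMap entries instead of dropping the whole CMap.
import subprocess

subprocess.run(["pdf2htmlEX", "--tounicode", "1", "two-para-fail.pdf"], check=True)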
stig
> I can replicate the warning. However, the generated HTML does seem to render properly in my browser

It renders fine visually for me too, but the text cannot be copied. Did you try that?

> If you pass '--tounicode 1' to pdf2htmlEX it will ignore duplicate CMap entries. Is the output suitable then?

When pdf2htmlEX is run without options, the HTML output looks fine visually, but selected text cannot be copied; rather, it copies as a mess of missing glyphs. See the attached "corrupted-copy" image.

When run with the '--tounicode 1' option, the visual output is somewhat degraded: the 'ffi' ligature is missing. However, the text can be selected and copied, except for the ligatures. See the attached "missing-ffi" image. Also, the "CMap is not valid" error log becomes an "encoding confliction" error.

Thanks for investigating this! If you're confident that Prince's PDF output is correct, then I'll attempt to report the bug to the converter solution our customer is using. But I suspect we'll have to keep OpenType features turned off for a while.
  1. corrupted-copy.png (42.5 kB)
    pdf2htmlEX without options
  2. missing-ffi.png (41.1 kB)
    pdf2htmlEX with --tounicode 1
wezm
> the visual output is somewhat degraded: the 'ffi' ligature is missing.

Yes, that makes sense, since the <20> <006600660069> entry is the one for ffi, and it's the one being seen as the duplicate and skipped.

> If you're confident that Prince's PDF output is correct

I'm not 100% confident, as I haven't been able to pinpoint where the issue is in pdf2htmlEX, but I do think Prince's output is right.