
Large XHTML file error

crunchygumps
I have a 20 MB XHTML file that I am trying to convert. The log reports the following:
error: Excessive depth in document: 256 use XML_PARSE_HUGE option


I'm not sure how to set this option.

I am using the trial version of the Windows installer to see whether the tool will be a good fit for our needs.

Thanks
mikeday
Interesting, we haven't hit this one before! It seems that libxml2 has some built-in limits to document size and maximum element nesting, primarily for security reasons. We can adjust this for the release of Prince 7.0.

Does your document really have over 256 levels of nested elements? That's quite impressive for XHTML. :)
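
One way to check this is to count how deep the open tags go when the markup is read the way a strict XML parser reads it, where a tag that is never closed keeps every following element nested inside it. The following is only a rough sketch using Python's standard `html.parser` module; it is not part of Prince or libxml2, and the names `DepthCounter` and `xml_style_depth` are made up for illustration.

```python
from html.parser import HTMLParser


class DepthCounter(HTMLParser):
    """Track open-tag depth roughly the way a strict XML parser would see it."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        # In XML, every start tag opens a new level until it is closed,
        # so an unclosed <br> pushes all later content one level deeper.
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br /> open and close at once,
        # so they do not change the depth.
        pass

    def handle_endtag(self, tag):
        # Never go below zero on stray close tags.
        self.depth = max(0, self.depth - 1)


def xml_style_depth(markup):
    """Return the deepest open-tag level reached in the markup (hypothetical helper)."""
    counter = DepthCounter()
    counter.feed(markup)
    counter.close()
    return counter.max_depth
```

Calling `xml_style_depth` on the text of the file gives a quick indication: a result in the hundreds for an ordinary XHTML page suggests missing close tags rather than real nesting.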
crunchygumps
When you mentioned 256 levels of nested elements I knew something must be wrong with the HTML.

I checked the HTML source in the browser and it all looked good, but when I saved the output from the browser to a file, the save stripped off the closing part of the BR tags in the document. :shock:

I fixed the output to be valid XML once again and Prince worked beautifully.
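
For anyone hitting the same thing, a repair of this kind can be sketched in a few lines of Python. This is an assumption-laden illustration, not the exact fix used above: the helper name `self_close_void_tags` and the list of tags are hypothetical, and the regular expression will misfire on attribute values that contain a `>` character.

```python
import re

# Void elements that XHTML requires to be self-closed (an illustrative subset).
VOID_TAGS = ("br", "hr", "img", "input", "meta", "link")

# Match an opening void tag, with or without an existing trailing slash.
# Assumption: attribute values do not contain "<" or ">".
_VOID_RE = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(VOID_TAGS), re.IGNORECASE
)


def self_close_void_tags(markup):
    """Rewrite <br> or <br/> as <br />, and likewise for the other void tags."""
    return _VOID_RE.sub(lambda m: "<%s%s />" % (m.group(1), m.group(2)), markup)
```

Tags that are already self-closed are normalised rather than doubled, so running the function twice gives the same result as running it once.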
kandersteg
We are running into this problem as well, with HTML files that have a huge number of nested elements (more than 1,000):

error: Excessive depth in document: 256 use XML_PARSE_HUGE option

Unfortunately, we cannot modify the HTML because these are official government documents.

Thanks

kandersteg
mikeday
Are the elements really nested, or just missing close tags? It is normally difficult to nest that many elements in HTML; are they <div> or table elements, or something else?
kandersteg
Mike,

Here is the link

http://www.sec.gov/Archives/edgar/data/12601/000001260109000017/pvcncsrsfinal.htm

It's ugly - not sure if there is a fix. Tidy destroys it.

Thanks

kandersteg
mikeday
The linked document works for me, and produces an 828-page PDF file. There are many problems with the markup: tables are nested inside paragraphs, which is not valid in HTML, and the document is wrapped up in some unusual tags, e.g. <DOCUMENT>, that really should not be there. It also overuses the <font> tag, and would be cleaner with CSS. But besides all that, it is convertible with Prince.
kandersteg
Thanks Mike,

We created a batch file to process over 1,000 files. I believe we have the latest version (7.0 beta). For some reason this file hangs for 20 minutes or so for us at the "parse_huge_xml" message. We are using a 3 GHz XP machine with 4 GB of RAM.

I'll try the gui tomorrow.

kandersteg
kandersteg
Mike,

We converted it using the GUI and the output was cut short at exactly 500 pages. We did not use a stylesheet.

Thanks

kandersteg
mikeday
With the XML_PARSE_HUGE error message?
kandersteg
Yes. Also, the right margins run off the pages. I realize that the HTML input is pitiful, so anything that you think would help is appreciated.

Thanks
kandersteg
Mike - my apologies.

We do not get the XML_PARSE_HUGE error when using the GUI. It's just that the conversion terminates at 500 pages and the right margins run off the pages.

Thanks

kandersteg
mikeday
I see the margin problem, but not the problem of terminating at 500 pages; for me it continues to 800+ pages.

The reason the table is too wide is due to table cells like this one:
<TD align=left width=56% nowrap>
 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<FONT size=1 face="TimesNewRomanPSMT,Arial,Helvetica,sans-serif">Total Assets</FONT>&nbsp;
        </TD>

All those no-break space characters combined with the nowrap attribute on the table cell mean that it is going to be far wider than necessary.
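
If preprocessing a copy of the file is an option, one possible workaround is to collapse each run of no-break spaces down to a single one before conversion. The sketch below is only an illustration under that assumption: `collapse_nbsp_runs` is a made-up helper name, and it will also remove indentation that was deliberately built from `&nbsp;` entities.

```python
import re

# A run of two or more "&nbsp;" entities, optionally separated by whitespace.
_NBSP_RUN = re.compile(r"(?:&nbsp;\s*){2,}")


def collapse_nbsp_runs(markup):
    """Replace each run of repeated &nbsp; entities with a single &nbsp;."""
    return _NBSP_RUN.sub("&nbsp;", markup)
```

A single `&nbsp;` on its own is left untouched, so ordinary spacing between words is not affected.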
kandersteg
Thanks Mike

Any ideas on what might cause the 500 page termination?

kandersteg
mikeday
No, I'm having a great deal of trouble reproducing that. It's always ~828 pages when I convert it, on either Windows or Linux. Are there any error messages in the output log other than all the ones relating to unexpected </P> tags?
kandersteg
Mike,

We uninstalled Prince and then downloaded and installed the latest release.

Now the document is perfect when we apply CSS files.

Thanks for your help, and sorry this topic consumed so many calories given that the problem must have been on our end somewhere.

kandersteg