Converting very large html documents.

Jamesoc
Hi guys,
I'm using the Windows version of Prince to try to convert a very large HTML file (40 MB), which consists of just a table with 9 columns and 120000 rows.
I've removed all style sheets and images, so it's just a plain old HTML file.

However, every time I run the conversion it fails and nothing appears in the Output Log.

The file has the following format:

<!DOCTYPE html
  PUBLIC "-//W3C/DTD HTML 4.01 Transitional//EN">
<html lang="">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <title>Report Results - Calls Abuse - Out of Hours</title>
   </head>
   <body>
      <table border="0" cellpadding="0" cellspacing="0">
         <thead>
            <tr>
               <th width="20">&nbsp;</th>
               <th>Phone Number</th>
               <th>User Name</th>
               <th>Dialled Number</th>
               <th>Destination</th>
               <th>Total Duration</th>
               <th>Date</th>
               <th>Time</th>
               <th>Total Spend</th>
            </tr>
         </thead>
         <tbody>
            <tr>
               <td></td>
               <td>004412345566789090</td>
               <td>Someones name</td>
               <td>3495485XXXX</td>
               <td>&nbsp;</td>
               <td>02h 07m 24s</td>
               <td>22/01/2009</td>
               <td>22:36:42</td>
               <td> &pound; 65.15</td>
            </tr>
          </tbody>
      </table>
   </body>
</html>


with the <tr> section repeated about 120000 times.
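For anyone who wants to reproduce this without the original data, a file of the same shape can be generated with a short script. This is only a sketch in Python: the function name `write_test_document` is made up for illustration, and every row simply repeats the sample values shown above rather than real call records.

```python
# Sketch: write an HTML file with one table of identical rows,
# shaped like the sample in this post. All values are placeholders.

HEADER_NAMES = ["Phone Number", "User Name", "Dialled Number", "Destination",
                "Total Duration", "Date", "Time", "Total Spend"]

SAMPLE_CELLS = ["", "004412345566789090", "Someones name", "3495485XXXX",
                "&nbsp;", "02h 07m 24s", "22/01/2009", "22:36:42",
                " &pound; 65.15"]


def write_test_document(path, row_count):
    """Write row_count copies of the sample row into a single table."""
    header = '<th width="20">&nbsp;</th>' + "".join(
        f"<th>{name}</th>" for name in HEADER_NAMES)
    row = "<tr>" + "".join(f"<td>{cell}</td>" for cell in SAMPLE_CELLS) + "</tr>\n"

    with open(path, "w", encoding="utf-8") as out:
        out.write('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n')
        out.write('<html lang="">\n<head>\n')
        out.write('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n')
        out.write("<title>Report Results - Calls Abuse - Out of Hours</title>\n")
        out.write('</head>\n<body>\n<table border="0" cellpadding="0" cellspacing="0">\n')
        out.write(f"<thead>\n<tr>{header}</tr>\n</thead>\n<tbody>\n")
        # Write rows one at a time so the generator itself stays small in memory.
        for _ in range(row_count):
            out.write(row)
        out.write("</tbody>\n</table>\n</body>\n</html>\n")
```

Calling `write_test_document("test.html", 120000)` should give a document with the same structure as the one described here, although the exact file size will differ from the real report.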

Is there a debug version of Prince that I can try this on or am I simply hitting the upper limit of what Prince can process?

Thanks for any help
mikeday
Can you try running Prince from the command-line? Then it should be easier to see any error messages. Also, what version of Windows are you running on, and which version of Prince are you using?
Jamesoc
Hi Mike,

I'm running Prince 6.0r8 on Windows XP Service Pack 3.

The total output from running prince at the command line is:

C:\Program Files\Prince\Engine\bin>prince -v sannedreports.html
prince: loading HTML input: sannedreports.html

C:\Program Files\Prince\Engine\bin>

I then tried the following command to see if it was an issue with any of the Prince options:
C:\Program Files\Prince\Engine\bin>prince -v --input=html --no-xinclude --no-network --no-embed-fonts --no-subset-fonts --no-compress sannedreports.html
prince: loading HTML input: sannedreports.html

Again with no success.
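
Since Prince stops here without printing anything further, checking the process exit code is one way to at least confirm that the run died rather than finished. The following is a minimal sketch in Python, assuming `prince` is on the PATH. The helper names `build_prince_command` and `run_prince` are hypothetical, and the only Prince options used are `-v`, `--input=html` and `-o`.

```python
import subprocess


def build_prince_command(input_path, output_path):
    """Build the argument list for a verbose Prince run on an HTML file."""
    return ["prince", "-v", "--input=html", input_path, "-o", output_path]


def run_prince(input_path, output_path):
    """Run Prince and return (exit_code, log_text).

    Returns (None, message) if the prince executable cannot be started.
    """
    command = build_prince_command(input_path, output_path)
    try:
        # Capture both streams so nothing the process prints is lost.
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        return None, "prince executable not found on PATH"
    return result.returncode, result.stdout + result.stderr
```

A nonzero exit code from `run_prince("sannedreports.html", "sannedreports.pdf")` together with an empty log would point to the process being killed or running out of resources rather than to a problem with the options.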

I'm fairly sure it's a size issue: if I take out half of the table, it runs and converts fine.


Further update at 17:00
I downloaded and tried the V7 beta and got the following message in the log:

GC Warning: Out of Memory! Returning NIL!

So I guess I just need to figure out how to make it use more memory :)


Further update on this:

I've downloaded version 7.0, which was released today, and still get the same out of memory warning.
Jamesoc
Just a little update on this issue:
I've successfully created a PDF of about 1200 pages with in or around 60000 rows in the table, but I can't go any higher.

I keep getting the Out of Memory error.

Is there any fix for this or am I simply hitting the upper limit of what Prince can handle in a table?

Thanks for any help.
mikeday
It sounds like you are hitting the upper limit of what Prince can handle on your machine, at least. How much RAM do you have? :)
Jamesoc
I've got 4 GB but it's a 32-bit XP machine.
I may be able to wrangle some access to a 12 GB Linux box; I'll give it a run there and see how it goes.

I thought it might have been a problem with the number of rows in my tables, so I took out all the tables and tried it with just plain text.
I was able to generate a 120000-line PDF that way.

I then split the table into 10 separate tables in the doc.
I was able to generate the PDF with 1 table, but with all 10 I still run into the memory issue.
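
Since a document containing just one of those tables converts fine, one possible workaround is to write each chunk of rows to its own HTML file and convert the files one at a time. The sketch below is in Python and is only an illustration: `chunk_rows` and `write_chunk_documents` are hypothetical helpers, the rows are assumed to be available already as `<tr>...</tr>` strings, and the resulting PDFs would still have to be merged with a separate tool.

```python
def chunk_rows(rows, chunk_count):
    """Split rows into chunk_count consecutive lists of nearly equal size."""
    size, extra = divmod(len(rows), chunk_count)
    chunks, start = [], 0
    for index in range(chunk_count):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if index < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def write_chunk_documents(header_row, rows, chunk_count, prefix):
    """Write one standalone HTML file per chunk and return the file names."""
    paths = []
    for index, chunk in enumerate(chunk_rows(rows, chunk_count), start=1):
        path = f"{prefix}-{index:02d}.html"
        with open(path, "w", encoding="utf-8") as out:
            out.write('<html>\n<head>\n<meta http-equiv="Content-Type" '
                      'content="text/html; charset=UTF-8">\n</head>\n<body>\n')
            out.write('<table border="0" cellpadding="0" cellspacing="0">\n')
            # Repeat the header row in every file so each PDF is readable alone.
            out.write(f"<thead>{header_row}</thead>\n<tbody>\n")
            out.writelines(row + "\n" for row in chunk)
            out.write("</tbody>\n</table>\n</body>\n</html>\n")
        paths.append(path)
    return paths
```

Each file is then small enough to be converted in its own Prince run, at the cost of page numbering restarting in every PDF.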

I'll just have to get access to a more robust Linux box and see how it goes.

I'll also try to put a copy of the test HTML up somewhere as a zip file, in case anyone with a 64-bit machine and humungous amounts of RAM wants a go with it :)

Edit:
I've loaded up two files onto some webspace I have.
1. broken.zip: a file based on one of the files we are trying to convert.
2. brokenhuge.zip: the same file but with 10 times the number of rows. This is about the absolute maximum we would ever have to produce.

mikeday
Hmm, 4 GB is the most memory any process can address on a 32-bit machine without various mapping tricks, so more RAM probably won't help in this case.

The issue here with the text vs. tables is that elements are currently relatively heavyweight items, carrying around lots of CSS properties and other attribute information. Tables with tens of thousands of cells end up using far more memory than textual documents with mostly paragraphs, where there might be fewer than ten elements per page.
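
To put rough numbers on that for the document described in this thread, the element counts can be worked out directly. This is a back-of-the-envelope sketch in Python: the row and column counts come from the original post, while the idea that the plain-text version used one element per line is an assumption, since the thread does not say how that version was marked up.

```python
ROWS = 120_000    # rows in the table, from the original post
COLUMNS = 9       # cells per row, from the original post


def table_element_count(rows, columns):
    """Each row contributes one tr element plus one td per column."""
    return rows * (1 + columns)


def text_element_count(lines):
    """Assumption: the plain-text version uses one element per line."""
    return lines


table_elements = table_element_count(ROWS, COLUMNS)
text_elements = text_element_count(ROWS)

print(table_elements)                   # 1200000
print(text_elements)                    # 120000
print(table_elements // text_elements)  # 10
```

Under that assumption the table version carries about ten times as many elements as the plain-text version of the same data, each with its own set of properties.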

We took some steps to reduce memory consumption during the 6.0 release cycle, and we have further improvements planned for the future. If you can put the document online that will help us with our ongoing memory profiling work.
Jamesoc
I've linked the two example documents above, have fun :)

I've also got access to a 12 GB Linux box, so I'll let you know how that goes.

Thanks for all your help so far.
mikeday
On a 64-bit Linux machine I was able to successfully process the first document, producing a 7144-page PDF file. It took 3.5 GB of RAM in total, so we'll have to reduce that substantially before we can tackle your second document.
mikeday
We have now released Prince 8.0, which features reduced memory usage for large documents. Retesting your first document, I was able to convert it to PDF on a 32-bit machine using only 1.2 GB of RAM in total, which is great! However, making the document 10 times bigger is probably still out of the question. :)
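
A quick estimate shows why the larger document is still out of reach. The sketch below is in Python and uses only the figures quoted in this thread. It assumes memory use grows linearly with the number of rows, which is a simplification and has not been measured, and it compares two measurements taken on different machines.

```python
EARLIER_GB = 3.5        # first document, earlier measurement on 64-bit Linux
PRINCE_8_GB = 1.2       # first document, Prince 8.0 on a 32-bit machine
SIZE_FACTOR = 10        # second document has 10 times the number of rows
ADDRESS_SPACE_GB = 4.0  # upper bound for a process on a 32-bit machine

# Reduction between the two measurements, as a whole percentage.
reduction_percent = round((EARLIER_GB - PRINCE_8_GB) / EARLIER_GB * 100)

# Assumption: memory scales linearly with the number of rows.
estimated_gb = round(PRINCE_8_GB * SIZE_FACTOR, 1)

print(reduction_percent)                # 66
print(estimated_gb)                     # 12.0
print(estimated_gb > ADDRESS_SPACE_GB)  # True
```

Even with roughly two thirds less memory than before, a linear estimate for the tenfold document lands well above what a 32-bit process can address.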