Forum How do I...?

Bulk PDF creation approach

tchapa
Hi all,

At my company, we have software that generates an impressive number of PDFs for a small company: almost 40k/month.

One of the most used features lets users bulk-export documents for download. For instance: download 1000 documents as PDFs bundled in a zip file.

When I saw Prince I thought: "Wow, we could open 30 Prince processes at ~2 seconds per PDF and get a throughput of ~30 PDFs every 2 seconds. F**k yeah :)"

But my tests showed another reality: the best result was with 10 Prince processes at the same time. Maybe we got hit by too much context switching.

Is there any way to create a Prince process pool and reuse a Prince instance?

What is the best approach to maximize my throughput?

I'm using Node.js to spawn a new Prince process every time a new PDF is requested.

Thanks
btw: Prince is really impressive !
mikeday
On most modern operating systems there is very little overhead in spawning new processes. A quick test of Ubuntu 12.04 running on an average PC shows I can convert 100 trivial HTML documents containing just the letter "a" in under 4 seconds, single-threaded, spawning a new Prince instance each time. So the entire document conversion is taking 0.04 seconds, and the process creation is a tiny part of that.

How long are the documents you are converting? Do they contain any large images, and most importantly do they attempt to retrieve any resources over HTTP?
tchapa
Hi Michael,
The time for one PDF is very affordable; we take ~2 seconds per PDF.
Yes, we are downloading one small image and some CSS files.

time prince test.html -o test.pdf --baseurl https://www.mydomain.com
real	0m2.256s
user	0m0.559s
sys	0m0.113s


You may be wondering why we run so many Prince instances (10-20) at the same time.
I thought that while one Prince is fetching resources over the internet, the CPU would be idle and other Prince instances could execute in the meantime. Right?

But my dual-core machine does not work like my theory: with more than 10 Prince instances at the same time, the average response time degrades badly, to ~5-6 seconds per PDF.

What is the best approach for a bulk operation (more than 100 PDFs)?
In my test, 100 PDFs take ~6 minutes.


Thanks Michael
mikeday
Do the remote images and style sheets often change? If they don't, you can get a huge speed improvement by caching them as local files on the same machine.
tchapa
Yeah, caching the assets on my filesystem is much better.

real	0m0.866s
user	0m0.497s
sys	0m0.080s


My real problem happens when a request is made to create 100 PDFs (bulk operation). When this happens, I am spawning 10 Prince processes at a time.

real	0m5.344s
user	0m0.921s
sys	0m0.129s


Any idea for better throughput ?
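One way to keep a fixed number of Prince processes in flight from Node.js is a small job pool (a sketch; `runPool` is a hypothetical helper, and each job would wrap one Prince spawn):

```javascript
// Run an array of async job functions with at most `limit` in flight.
// Each worker pulls the next job index as soon as it finishes one,
// so the pool stays full until the batch drains.
async function runPool(jobs, limit) {
  const results = new Array(jobs.length);
  let next = 0;
  async function worker() {
    while (next < jobs.length) {
      const i = next++; // safe: JS event loop is single-threaded
      results[i] = await jobs[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, jobs.length) },
    worker
  );
  await Promise.all(workers);
  return results;
}
```

For the bulk case, the jobs would each spawn one Prince process; tuning the limit (10 vs. the core count) is exactly the experiment described above.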

Thanks
mikeday
Would you be able to email me (mikeday@yeslogic.com) a representative sample document + images and style sheets? We can run some profiling here, see if we have any recommendations about boosting speed.
wachunga
Any updates on how this panned out? My team is also doing bulk pdf creation with nodejs and I would love to hear if you were able to increase speed.

Pulse Energy

tom
This is also very important for our system; the speed of bulk PDF creation is critical.
I think that reusing the process would be an easy and useful way to speed things up.
mikeday
Are you currently running multiple Prince processes in parallel? How fast is it running?
sfinktah
Making ridiculously large numbers of non-threaded applications run is something of a specialty for me.

We have a 64-core AMD Bulldozer with 128GB of RAM, and we use taskset to set the affinity of a process to a particular core. (We don't use it for Prince, though.)

In your situation, I would probably go for (if I were setting it up on the Bulldozer) 64 Gearman workers, which would accept jobs and return results. You can spread the workers across multiple machines.

As Mike says, cache locally. I find that my local Squid proxy is always caching things, so much so that I have had to explicitly prefix http_proxy= to every Prince run, or risk getting out-of-date files.

Prince also honours the <base href="http://server.local"> tag, which can be a quick way to change the network location of a document's resources.

Any rules that apply to ordinary web page optimisation, e.g. minimising your CSS and combining it into one file, apply even more so here.

If you need a hand with Gearman, I have written some Bash scripts that implement the basic Gearman protocol and can be used to run jobs and return results. Launching the jobs is even easier; you can use PHP for that (again, I've written my own set of simple TCP libraries, as I have issues with the ones provided).

If (god help me) there is any JavaScript involved in the rendering process, I would first attempt to remove that (pre-process it in Node, or just find another way).

I also generate CSS files for all the fonts I use, but I have a feeling that could slow matters down... Mike would have to answer that.