
35min to process 5k HTMLs in bulk

igorv
Hi there!

I have a strange performance issue with bulk generation. It takes almost 35 minutes to generate one big PDF from 5000 HTML files (~350KB each) in bulk using the --input-list option. During generation it uses almost 14GB of the 20GB of RAM available.

The command I use:
/usr/bin/prince --input=xml --no-subset-fonts --no-compress --css-dpi=72 --style=/tmp/HelveticaNeueLTStd-Roman.otf.style --force-identity-encoding --input-list=/tmp/imput.txt --structured-log=buffered -o output.pdf


Each HTML file uses XInclude to read another HTML file. Removing the XInclude saves 5 minutes, but I really need it, and 30 minutes is still pretty long.
I also tried moving all the fonts and the input.txt file to /dev/shm, but that only saved 3 minutes.

Are there any other options I can try to increase performance?

As I said, my box has 20GB of RAM and an 8-core CPU:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
stepping        : 1
cpu MHz         : 2396.934
cache size      : 4096 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm
bogomips        : 4793.86
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:


Prince version:
Prince 20160109
Copyright 2002-2015 YesLogic Pty. Ltd.
CSO License
mikeday
How big is the final PDF file, i.e. page count and file size?
igorv
It's 10k pages and 835MB in size.
mikeday
That is very big, and Prince is not a streaming processor. (Try opening 5000 tabs simultaneously in a browser, for example!)
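
To put the figures quoted in this thread side by side, here is a rough back-of-envelope calculation. It only rearranges the numbers given above (5000 files, ~350KB each, 35 minutes, almost 14GB of RAM, 10k pages, 835MB) and uses decimal units. The variable names are made up for illustration, and the results are approximate because the inputs are.

```python
# Figures quoted in the thread (all approximate).
num_docs = 5000        # HTML files in the input list
doc_size_kb = 350      # approximate size of each HTML file
total_minutes = 35     # total conversion time
peak_ram_gb = 14       # approximate peak memory use
pages = 10_000         # pages in the final PDF
pdf_size_mb = 835      # size of the final PDF

# Total amount of HTML that has to be held and laid out in one run.
total_input_gb = num_docs * doc_size_kb / 1_000_000

# Average time spent per input document.
seconds_per_doc = total_minutes * 60 / num_docs

# Peak memory relative to the raw input size.
ram_to_input_ratio = peak_ram_gb / total_input_gb

# Average output size per page.
kb_per_page = pdf_size_mb * 1000 / pages

print(f"total input:      {total_input_gb:.2f} GB")
print(f"time per doc:     {seconds_per_doc:.2f} s")
print(f"RAM / input size: {ram_to_input_ratio:.1f}x")
print(f"PDF size / page:  {kb_per_page:.1f} KB")
```

In other words, roughly 1.75GB of markup goes through a single non-streaming run, which averages out to well under half a second per document.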

It might be possible to reduce the memory usage and time slightly by simplifying the input documents, depending on exactly how they are structured. Also make sure they don't refer to any external HTTP resources, as that can significantly slow down conversion.
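
One way to check for external HTTP resources before running the conversion is a coarse text scan of the input files. The sketch below is not part of Prince. The function names are hypothetical, and it assumes the file passed to --input-list contains one path per line. It looks for http:// or https:// URLs in src/href attributes and in CSS url(...) values. Hits on plain hyperlinks (an href on an a element) are only links and can be ignored.

```python
import re
from pathlib import Path

# src="http://..." or href="https://..." with single or double quotes.
ATTR_URL = re.compile(
    r"""\b(?:src|href)\s*=\s*["'](https?://[^"']+)["']""", re.IGNORECASE
)
# CSS url(http://...) with optional quotes around the URL.
CSS_URL = re.compile(r"""url\(\s*["']?(https?://[^"')\s]+)""", re.IGNORECASE)


def find_external_refs(text):
    """Return the http(s) URLs found in src/href attributes and CSS url() values."""
    return ATTR_URL.findall(text) + CSS_URL.findall(text)


def scan_input_list(list_path):
    """Map each listed file that contains external URLs to the URLs found in it.

    Assumes one file path per line in the list file (an assumption about
    the --input-list format); blank lines are skipped.
    """
    results = {}
    for line in Path(list_path).read_text().splitlines():
        name = line.strip()
        if not name:
            continue
        refs = find_external_refs(Path(name).read_text(errors="replace"))
        if refs:
            results[name] = refs
    return results
```

Calling scan_input_list with the path of the list file returns a dictionary of the files worth a closer look. Namespace declarations such as xmlns="http://www.w3.org/1999/xhtml" are not reported, since they are identifiers rather than resources to fetch.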