
How well does Prince scale?

jcmatador1
We have a need to convert tens of thousands of HTML files per hour. Some of the files will be perfect, spec-compliant XHTML, while others will be more real-world HTML that may have issues but can still be rendered properly by most browsers.

So, my question is: is the product reasonably scalable out of the box, and has anyone done any extensive load testing to see how it performs when asked to convert multiple documents per second?

Thank you for your responses.
StoneCypher
Prince has no outside dependencies and no parallel dependencies. It will, by definition, scale linearly with the amount of hardware thrown at it.

How Prince will perform in your situation, which seems exceedingly unlikely, is a matter that essentially nobody will have tested. I have a hard time imagining any situation that would call for converting tens of thousands of web pages to PDFs per hour.

Prince has a free trial. You should be benchmarking; that's the only approach you should ever take in a large-scale setting. Even if someone had tried this already, subtle differences in platform hardware can make enormous differences in the performance of various applications, no doubt Prince included, even in situations where it seems like there should be no effect.
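
For instance, a minimal timing harness in Python, assuming prince is on your PATH (the input filename is a placeholder; substitute a document representative of your real workload):

    # bench_prince.py -- rough timing harness for Prince conversions.
    # "sample.html" is a placeholder input, not a real file in your system.
    import subprocess
    import time

    RUNS = 10
    INPUT = "sample.html"

    times = []
    for i in range(RUNS):
        start = time.perf_counter()
        subprocess.run(["prince", INPUT, "-o", f"out-{i}.pdf"], check=True)
        times.append(time.perf_counter() - start)

    print(f"mean {sum(times) / RUNS:.2f}s  min {min(times):.2f}s  max {max(times):.2f}s")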

You'll want machines with an extremely large cache, and if you have multi-core machines, you'll want to isolate the cores in VEs (virtual environments) unless you're running an OS with native support for single-core execution (Solaris and QNX being the most common examples).
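
On Linux, for example, one cheap way to approximate that isolation is to pin each Prince instance to its own core with taskset; a minimal sketch in Python (core assignments and filenames are placeholders):

    # pin.py -- one Prince instance per core via Linux's taskset.
    # Assumes "taskset" and "prince" are on the PATH.
    import subprocess

    jobs = ["a.html", "b.html", "c.html", "d.html"]   # placeholder inputs
    procs = [
        subprocess.Popen(
            ["taskset", "-c", str(core), "prince", src, "-o", src + ".pdf"]
        )
        for core, src in enumerate(jobs)  # core 0 renders a.html, core 1 b.html, ...
    ]
    for p in procs:
        p.wait()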

The short answer: even if anyone else had been in your situation previously, your mileage /will/ vary. Prince is standalone, and standalone things scale linearly. Therefore, the real answer comes down to "can you afford the hardware to run such a batch?"

I mean, in a very extreme case, consider a block of 100,000 computers, each running one copy of Prince, plus a single computer to dispatch incoming documents to the rendering farm and a single computer or cluster to collect the result flow. In that case, if your average document takes one hour to convert (and, for scale, my dictionary, which is fairly complex in terms of DOM structure and renders to about 380 pages, takes about 45 seconds to build, so the phrase "yeah right" seems appropriate), then your work system should be able to handle roughly 100,000 documents at once.

This isn't a scaling problem. This is a capacity problem. Linear scaling basically means "how deep is your wallet?"

As a more reasonable answer to your question, what you're looking at is the overhead from roughly four applications, which may be spread across multiple machines if desired.

  • The dispatcher. This application would gather incoming HTML documents from whatever your source is - Usenet, RSS feeds, financial storehouses, Martian signals in your teeth, whatever. It would also store these documents or their locations as appropriate, and establish a queue. Finally, it would maintain communication with all workers; when any worker announced its work as complete, the dispatcher would hand it a new piece of work from the queue if one was available, or mark the worker as idle otherwise. (A sketch of the whole flow follows this list.)
  • The preformat worker(s). PrinceXML works on XML, not HTML; people will tell you that HTML is an XML dialect, but it is not. If you are not feeding PrinceXML well-formed XML documents, PrinceXML will balk. Therefore, you need a tool between the dispatcher and Prince that attempts to convert the source document into well-formed XML. Fortunately, this is a well-explored tool locus, and many such tools exist, many of which are both extremely successful and free. This has the further benefit that you can create your own filter if you find nothing that meets your needs. The preformat worker(s) and the conversion worker(s) are sequential and scale one to one; they may effectively be thought of as a single point in the flow process. The important distinction is simply that that point is the result of several processes rather than one, and that PrinceXML accounts for only one of them.
  • The conversion worker(s). This is where PrinceXML comes in, and this is also where the vast bulk of the actual machine work is done. If you really need tens of thousands of datastreams per hour, it is quite likely that you will need multiple workers. While I don't have actual heavy-usage PrinceXML numbers, and while my understanding of your actual load demand is wholly inadequate, I can magic you up an answer.
    I run PrinceXML on a Pentium M 1.6 GHz laptop with 1.25 GB RAM and a 1 MB cache. I use PrinceXML for a variety of things, but the one likely to be contextually appropriate for you is the aforementioned dictionary. The dictionary is 7,600 primary element clusters (one element cluster for each term/definition block). An element cluster is a set of three primary elements with children; counting children, there are 61 elements in one element cluster. The CSS rules for element clusters are complex; the CSS for that part of the dictionary is about 15k. (Really.) Given all that, PrinceXML will convert the full text section of the dictionary, ~365pp, in 44.1 seconds on my machine if nothing else is hitting the disk. Prince also converts a 25-page sample section in 6.75 seconds, a 50-page section in 9.47 seconds, and a 100-page section in 14.9 seconds. Plot those and it suggests that, on my hardware, Prince needs about four seconds to preparse the document for structure and CSS rules and whatever, and is then able to push about 9 1/8 pages per second.
    Now, given that a system like this would be running on heavier server hardware, presumably Xeons with 8 MB caches, Prince would be able to work entirely from cache, which should speed it up by at least an order of magnitude, and probably a lot more, depending on what compiler the author uses and how familiar he is with hardware optimization. Assume another 15% or so for the far higher-performance disks, and drop about 5% for dispatch thrash and cache thrash. *So,* using that enormous mass of made-up and single-case experiential numbers, which is worth about nothing, I'm guessing that a serious server should be able to push in the neighborhood of 105 pages per second, with a few-second (1.5?) startup overhead per document.
    Now, if your 10,000 documents per hour average 20 pages with a first-order deviation of five pages, you're looking at about 220,000 pages per hour plus 15,000 seconds of startup overhead (pages per hour is a much more important metric than documents per hour for estimating performance on a system like Prince). That works out to about 284 machine-minutes of work per hour, which means around five seriously beefy machines could handle it with about a 3% per-machine tolerance for overload (an uptime-sensitive system would probably want at least 1.5x reserve capacity, which suggests 12 or so heavy servers, or an investment of about $45k plus bandwidth). A back-of-envelope sketch of this arithmetic follows below.
  • The output collation system. This is basically just a SAN.
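
To make that flow concrete, here is a minimal sketch of the preformat-plus-convert worker pool in Python. It assumes tidy and prince are on the PATH; the input directory and worker count are placeholders, and a real system would replace the glob with the dispatcher's queue:

    # pipeline.py -- minimal sketch of the dispatch / preformat / convert flow.
    import subprocess
    from multiprocessing import Pool
    from pathlib import Path

    def convert(src: str) -> str:
        """Preformat one HTML file with Tidy, then render it with Prince."""
        xhtml = src + ".xhtml"
        pdf = src + ".pdf"
        # Tidy exits 0 on success and 1 on warnings; 2 means it gave up.
        rc = subprocess.run(
            ["tidy", "-asxhtml", "-utf8", "-q", "-o", xhtml, src]
        ).returncode
        if rc > 1:
            raise RuntimeError("tidy could not repair " + src)
        subprocess.run(["prince", xhtml, "-o", pdf], check=True)
        return pdf

    if __name__ == "__main__":
        queue = [str(p) for p in Path("incoming").glob("*.html")]
        with Pool(processes=4) as pool:          # one worker per isolated core
            for pdf in pool.imap_unordered(convert, queue):
                print("done:", pdf)              # hand off to the collation SAN here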

So, based on that huge pile of total hogwash, I suspect you're looking at about a $52k investment for the hardware to develop a stream system which can handle a 150% overload capacity over the document throughput requirements you provide.
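
For what it's worth, the back-of-envelope arithmetic is easy to redo with your own numbers; here it is as a sketch, where every input is a guess from the post above rather than a measured figure:

    # estimate.py -- redo the capacity estimate with your own numbers.
    docs_per_hour = 10_000
    avg_pages = 20
    pages_per_sec = 105       # guessed server throughput
    startup_sec = 1.5         # guessed per-document startup overhead

    render_sec = docs_per_hour * avg_pages / pages_per_sec
    startup_total_sec = docs_per_hour * startup_sec
    work_minutes = (render_sec + startup_total_sec) / 60

    machines = work_minutes / 60    # machine-hours needed per wall-clock hour
    print(f"{work_minutes:.0f} machine-minutes/hour -> {machines:.1f} machines")
    print(f"with 1.5x reserve on top: {machines * 2.5:.1f} machines")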

Please remember that this is based on estimations from my three-year-old Dell craptop and enormous speculation regarding a supposed 20-page average document length. Your mileage almost certainly will vary. The point is, it's feasible, you can prolly pull it off on $20k plus bandwidth if you're willing to cut it close, and you're just gonna have to try it before you'll know what the real cost is.

John Haugeland is http://fullof.bs/

StoneCypher
I should have mentioned such a tool in the source normalization step (the preformat step above). HTMLTidy is the canonical example. Tidy is free, shockingly flexible and robust, and fast. It's a bit RAM-hungry, but in this situation I suspect that doesn't matter at all.

http://tidy.sourceforge.net/
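
If you want to be strict about what reaches Prince, you can also verify that Tidy's output parses as XML before queuing it; a small sketch (filenames are placeholders, and the -numeric flag keeps named entities out of the output, since strict XML parsers don't know them):

    # normalize.py -- run one file through Tidy and check it is well-formed XML.
    import subprocess
    import xml.etree.ElementTree as ET

    def to_xhtml(src: str, dst: str) -> None:
        rc = subprocess.run(
            ["tidy", "-asxhtml", "-numeric", "-utf8", "-q", "-o", dst, src]
        ).returncode
        if rc > 1:                # 0 = clean, 1 = warnings, 2 = tidy gave up
            raise RuntimeError("tidy could not repair " + src)
        ET.parse(dst)             # raises ParseError if output is not well-formed

    to_xhtml("messy.html", "clean.xhtml")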

John Haugeland is http://fullof.bs/

mikeday
Quick correction: Prince can handle valid HTML as well as XML.
joelmeador
This post pretty much predicted DocRaptor 3 years before we made it. Turns out Prince scales pretty well. :)
mikeday
And thankfully Prince can handle any HTML now! :D