How to render multiple files with Java API
With Prince you can do something like "prince test_*.html -o example.pdf", which batches all the HTML files into one PDF file.
How do you do this with the Java API? I thought of creating one InputStream, but when I try the equivalent with cat I get the following error:
$ cat test_*.html > all.html
$ prince all.html -o test.pdf
prince: all.html:36: error: Extra content at the end of the document
prince: all.html: error: could not load input file
prince: error: no input documents to process
It looks like the extra doctypes are breaking the parse.
If you have separate input files, you can call convertMultiple(). This takes filenames, not input streams.
Do you actually want to create a single output PDF, or one PDF for each input document?
We're experimenting with alternate ways of using PrinceXML to speed up our generation.
Currently, our pipeline is structured to process one HTML file at a time. As you've seen in Amir's other posts, we noticed a performance penalty when calling PrinceXML with one file at a time.
We're using the InputStream version of convert() in the Java API to accomplish this. We're limited to InputStream because we do not want to incur the extra I/O penalty of writing the HTML to disk.
Since we're constrained by our current architecture, we're trying to find ways to not incur the startup penalty in PrinceXML. Is there a way to prestart PrinceXML or keep the process available for additional inputs?
The alternate solution that we're investigating (but will not implement until later in the year) is to batch all the HTML files together and generate a single PDF. This is ultimately what we do prior to delivering our files to the print vendor. It's an architectural change we've been aiming for and Prince enables it for us. However, Prince's API requires supplying multiple file paths to do this, and our files are located in different directories and must be processed in a certain order.
convertMultiple() takes a list of file paths and generates a single PDF, but with a batch size of ~6000 different file paths, I'm uncertain whether we'll run into issues with the length of the command line. We're wondering if it's possible to supply multiple InputStreams to Prince or if we need to concatenate them with something like SequenceInputStream.
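As a side note on the SequenceInputStream idea: it joins streams byte for byte, which is the same thing cat does, so each document's doctype and root element would still appear in the combined stream. The sketch below only demonstrates that behaviour with the standard library; it does not call Prince, and the class name, helper names and sample documents are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SequenceStreamSketch {

    // Joins the given documents into one stream, in list order.
    static InputStream concatenate(List<String> documents) {
        List<InputStream> streams = new ArrayList<>();
        for (String doc : documents) {
            streams.add(new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)));
        }
        return new SequenceInputStream(Collections.enumeration(streams));
    }

    // Reads a stream fully into a string so the result can be inspected.
    static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts non-overlapping occurrences of needle in haystack.
    static int count(String haystack, String needle) {
        int n = 0;
        int i = haystack.indexOf(needle);
        while (i >= 0) {
            n++;
            i = haystack.indexOf(needle, i + needle.length());
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical sample documents standing in for the real HTML inputs.
        List<String> docs = List.of(
            "<!DOCTYPE html><html><body><p>one</p></body></html>\n",
            "<!DOCTYPE html><html><body><p>two</p></body></html>\n");
        String joined = readAll(concatenate(docs));
        System.out.println(joined);
        // Two root elements in one stream is what an XML parser rejects.
        System.out.println("root elements: " + count(joined, "<html>"));
    }
}
```

Because the combined stream still contains two complete documents, it would presumably fail in the same way as the all.html example at the top of the thread, unless the documents are merged under a single root element first.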
Prince has an --input-list option that allows you to specify a file that lists all the input files to process, so this can get around command-line length limitations. It's not exposed in the Java wrapper yet but it could be added. However, if you don't need this until later in the year, then we can discuss it then.
If your current pipeline processes one document at a time via the InputStream convert method then that sounds amenable to a prestart approach, which will require us to make some changes to the Java wrapper.
Thanks, we look forward to the changes.
We have a wrapper around Prince's API to expose additional options such as --css-dpi and the --scanfonts feature. We can amend our wrapper with other features.
Is the --input-list file in any particular format? Or is it a carriage-return-delimited file with absolute paths?
Yes, one filename per line as you describe.
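Since the list file is just one filename per line, generating it from Java for files in different directories and in a fixed order could look roughly like the sketch below. The class name, method names and sample paths are hypothetical, and the "--input-list=FILE" spelling of the argument is an assumption to check against "prince --help"; the sketch only builds the argument list and does not run Prince.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class InputListSketch {

    // Writes one absolute path per line, preserving the order of the list.
    static Path writeInputList(List<Path> orderedInputs, Path listFile) {
        List<String> lines = new ArrayList<>();
        for (Path p : orderedInputs) {
            lines.add(p.toAbsolutePath().toString());
        }
        try {
            return Files.write(listFile, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds the argument vector. The "--input-list=FILE" form is an assumption.
    static List<String> buildCommand(Path listFile, Path outputPdf) {
        return List.of("prince", "--input-list=" + listFile.toAbsolutePath(),
                       "-o", outputPdf.toString());
    }

    public static void main(String[] args) throws IOException {
        Path listFile = Files.createTempFile("prince-inputs", ".txt");
        // Hypothetical inputs from different directories, in the required order.
        List<Path> inputs = List.of(Path.of("batch-b", "cover.html"),
                                    Path.of("batch-a", "body.html"));
        writeInputList(inputs, listFile);
        System.out.print(Files.readString(listFile));
        System.out.println(String.join(" ", buildCommand(listFile, Path.of("example.pdf"))));
    }
}
```

The command line stays the same length no matter how many inputs there are, which is the point of the option for a batch of ~6000 paths.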
Thanks Mike. We are looking forward to the prestart approach.
As Eddy said, we do process one HTML at a time. Long term, I agree we should batch everything. However, because we have a large pipeline right now, making that change would require a lot more time.
Again thanks for all the help and quick response.
Here is an updated Java wrapper with prestart support: prince-java-prestart1.zip
It has two new methods: prestartProcess() and convertWithPrestart(). The intended usage is to call prestartProcess once for each thread that will be using the Prince wrapper, and then just call convertWithPrestart.
In our testing this reduces the time to convert 1000 documents from 26 seconds to 12 seconds, so hopefully this will help with your use case.
I ran a test and we're seeing a reduction from 12 min to 8 min with 10k documents. That's for a tight-loop scenario where the priming has only a small performance impact.
I'm updating our code to take advantage of this and we'll see how it performs when there's more lead time available to take advantage of the priming.
I do have a concern with how this is implemented. Each thread will have at most 2 processes running (one on standby and the other processing). We're configured with 20 threads, so that's 40 processes per pipeline run. We may have several concurrent pipelines running (3-5), which would result in 120-200 Prince processes on a single VM. This may not scale very well and could have noisy-neighbor effects on the other processes on our machines.
We'll work on pushing for batched HTML files at the end of the pipeline to try to overcome this limitation.
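For reference, the process counts quoted above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic, using the numbers from the post (20 threads, 2 processes per thread, 3 to 5 pipelines); the class and method names are made up for illustration.

```java
class ProcessCountSketch {

    // Worst-case number of Prince processes on one machine.
    static int maxProcesses(int threadsPerPipeline, int processesPerThread, int pipelines) {
        return threadsPerPipeline * processesPerThread * pipelines;
    }

    public static void main(String[] args) {
        int threads = 20;   // threads per pipeline run
        int perThread = 2;  // one process working, one on standby
        System.out.println("per pipeline: " + maxProcesses(threads, perThread, 1)); // 40
        System.out.println("3 pipelines:  " + maxProcesses(threads, perThread, 3)); // 120
        System.out.println("5 pipelines:  " + maxProcesses(threads, perThread, 5)); // 200
        // Machine dependent, shown only to compare against the counts above.
        System.out.println("cores here:   " + Runtime.getRuntime().availableProcessors());
    }
}
```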
If you can run multiple Prince processes then that may be more efficient than prestarting, so you could just try raising the number of threads to 30 or 40 instead. It just depends what kind of overhead the Java threads have by comparison.
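One way to experiment with the thread count is to run the conversions through a fixed-size pool and vary its size (20, 30, 40) between test runs. The sketch below assumes a placeholder in place of the real conversion call, so it runs without Prince; the class and method names are hypothetical, and it only shows that the pool size caps how many jobs run at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedConversionSketch {

    // Runs jobCount placeholder conversions on a pool of poolSize threads and
    // returns the highest number of jobs that were ever running at once.
    static int runJobs(int poolSize, int jobCount) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < jobCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    convertPlaceholder();
                    running.decrementAndGet();
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    // Stand-in for one conversion; a real job would call the Prince wrapper here.
    static void convertPlaceholder() {
        try {
            Thread.sleep(2);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + runJobs(4, 40));
    }
}
```

The peak never exceeds the pool size, so the pool size is the single knob to turn when comparing "more threads" against prestarting.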
We think in the real world, there would be more time to process data and hopefully more lead time.
Mike, I suggested this on the other thread, but I never got a response, so I'll ask again. Have you guys thought about doing something like 'fork'? A parent process could be prestarted and then all subsequent calls could be handled by forked processes.
Yes that is possible, it is just a question of exactly how the Java wrapper will communicate with the forked processes. The other approach is to have a multithreaded HTTP server, but that does add a bit of complexity and statefulness.
How close are you to the required performance target now?
We are still a little off for performance.
Using an HTTP server makes sense, but if we want to do that, then that's probably work on our side. We can set up multiple machines with Prince and expose endpoints. That was our backup plan.
Locally, fork makes more sense. I am not a master of the threading model in Mercury, but I assume that, like Python and Ruby, you have access to fork and wait. You can wait and listen for a process signal or something similar and fork a new process.
Here is where we are at now. Eddy tested locally that there was some improvement using the prestart work that you guys provided. However, it was only about 1.5x faster, which means we are still about 4x slower than our legacy solution. Since our test was in a tight loop without actually fetching any data, we believe that the test was not accurate. We plan to run a real example again.
Meanwhile, let us know if you think there is more hope with fork or another solution.
We are experimenting with a long-running process approach, that may be ready for testing next week.
Thanks Mike. That's great to hear. Excited to try it out.
Hi Mike, any update on this?
A little update from us: we did test the prestart functionality in a real production environment, and we didn't see any improvement. Our speculation is that the workload is CPU bound, and with 20 threads x 2 PrinceXML processes we just have too many threads processing data.
Right, that sounds likely. We are still working on an updated control interface that allows running multiple consecutive conversion jobs through one Prince process; hopefully it will be ready next week.
In our initial performance testing, the new persistent process control interface reduces your sample document conversion time from 26ms to 9.5ms, which is a significant improvement. We will do some more polishing and testing and then we can provide you with a Prince build to test. Which operating system are you running Prince on?
That's great news.
Locally we are testing on MacOS, and our production boxes are Red Hat based.
$ cat /etc/redhat-release
Scientific Linux release 6.6 (Carbon)
Is there any documentation you can share soon? We are trying to make sure we can prepare ourselves for this change. If there is anything you can share, that would be really helpful.
More progress today, checking and fixing memory leaks, improving error reporting, and running more tests. Final work required is to polish the revised Java interface and make sure it is sufficiently robust.
The Java interface will not change much, there will just be a method you call once at the beginning to start the Prince process, and a method you call once at the end to terminate the Prince process. So it should not require major changes to your code.
We now have an updated Java wrapper available: prince-java-control1.zip
This also requires the latest Prince build for MacOS X or CentOS 6.
The new Java wrapper has a PrinceControl class, which has start() and end() methods to control the persistent Prince process.
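The start-once, convert-many, end-once pattern can be sketched as below. Note that this does not use the real wrapper: PersistentConverter is a hypothetical stand-in interface, only the start() and end() names come from this thread, and the convert(String) signature is an assumption. A counting fake is used so the sketch runs without Prince installed.

```java
import java.util.List;

class PersistentProcessSketch {

    // Hypothetical stand-in for the wrapper; only start() and end() are named in the thread.
    interface PersistentConverter {
        void start();
        void convert(String html); // assumed signature, for illustration only
        void end();
    }

    // Fake converter that only counts calls.
    static final class CountingConverter implements PersistentConverter {
        int starts;
        int conversions;
        int ends;

        public void start() { starts++; }

        public void convert(String html) {
            if (starts == 0 || ends > 0) {
                throw new IllegalStateException("process not running");
            }
            conversions++;
        }

        public void end() { ends++; }
    }

    // Starts the process once, runs every job through it, and always shuts it down.
    static void convertAll(PersistentConverter converter, List<String> documents) {
        converter.start();
        try {
            for (String doc : documents) {
                converter.convert(doc);
            }
        } finally {
            converter.end(); // runs even if a conversion throws
        }
    }

    public static void main(String[] args) {
        CountingConverter fake = new CountingConverter();
        convertAll(fake, List.of("<p>one</p>", "<p>two</p>", "<p>three</p>"));
        // prints "1 start, 3 conversions, 1 end"
        System.out.println(fake.starts + " start, " + fake.conversions
            + " conversions, " + fake.ends + " end");
    }
}
```

The try/finally is the main design point: the persistent process outlives individual jobs, so a failed conversion should not leave it running.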
This is great news! Thanks Mike. We will try it shortly.
I ran a test with the changes and it's 4.7 times faster than before (without prestarting).
Looking through the source code, I see that process-specific options are passed as JSON but additional options are never passed along.
We're adjusting the DPI using --css-dpi and I'm unclear if I should be sending that as part of the JSON or part of the base command line as --css-dpi.
This option can only be passed once at process start up, so you could change Prince.java and move the part that adds mOptions to cmdline from getJobCommandLine() up to getBaseCommandLine().
The JSON class does not appear to handle arrays properly. If I try to include multiple stylesheets, the array of stylesheets is not properly delimited with a comma.
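For comparison, a correctly delimited JSON array of strings can be produced with the standard library as in the sketch below. This is not the wrapper's JSON class and not the workaround mentioned in the next post; the class name, method names and stylesheet names are made up for illustration.

```java
import java.util.List;
import java.util.StringJoiner;

class JsonArraySketch {

    // Escapes the characters that must not appear raw inside a JSON string.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the four-digit hex escape form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Serializes the values as a JSON array, with a comma between elements.
    static String toJsonArray(List<String> values) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (String v : values) {
            joiner.add(quote(v));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // prints ["base.css","print.css"]
        System.out.println(toJsonArray(List.of("base.css", "print.css")));
    }
}
```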
I have a workaround but the attached Java wrapper has a bug.
Oops, how embarrassing.
Mike, when will the persistent process feature be part of an official release?
It will be available in all latest builds going forward, and in Prince 11, but it will not be backported to Prince 10.