Page break messages

agolos
Hello,
I am evaluating Prince for use in our batch application that converts EPUBs to PDFs. One of the things we need to do is insert application-specific attributes back into the HTML that indicate page numbers matching the pages produced by the PDF rendering engine. Is it possible to get callbacks or info messages that output each page break and the corresponding CFI (Canonical Fragment Identifier for EPUBs) or something similar?
Is there something we can get out of debug mode?
How costly would the development be?

Best regards
Arie Golos
EBSCO Publishing
mikeday
Yes, it should be possible to do this with our box tracking API. I will try and put together an example.
mikeday
Which platform are you running Prince on? We have updated the box tracking API to provide additional information that will help, but it will require building some new Prince packages.
agolos
At the moment I am testing on 64-bit Windows 7; however, the production version will definitely run on some Linux distribution (SUSE or RH).
agolos
I forgot to mention: if it matters, I will be using the Java shell.
mikeday
I have uploaded the latest build for Windows, which includes an updated box tracking API. To use it, you need to enable it and then register a JavaScript function that will be called when Prince has finished generating the PDF file:

Prince.trackBoxes = true;
Prince.addEventListener("complete", checkPages, false);

function checkPages() {
    // this will be called by the oncomplete event
}


The checkPages function can look at the DOM and see which boxes were generated for each element, and which pages they appeared on. For example, to see where all the paragraphs are:
function checkPages() {
    var ps = document.getElementsByTagName("p");

    for (var i = 0; i < ps.length; ++i) {
        var p = ps[i];
        var bs = p.getPrinceBoxes();

        if (bs.length > 0) {
            var box = bs[0];
            Log.warning("para "+p.id+" on page "+box.pageNum);
        }
    }
}

Alternatively, you can iterate over the pages in the PDF, and find the first text on the page and see what element it is from:
function checkPages() {
    for (var i = 0; i < PDF.pages.length; ++i) {
        var page = PDF.pages[i];
        var box = getFirstTextBox(page);

        if (box) {
            Log.warning("first text box on page "+(i+1)+" is for element "+box.element.id);
        }
    }
}

function getFirstTextBox(box)
{
    if (box.type == "TEXT" && box.element) return box;

    for (var i = 0; i < box.children.length; ++i)
    {
        var curr = getFirstTextBox(box.children[i]);

        if (curr) return curr;
    }

    return null;
}

So that is roughly how the API works. As you can see, there are quite a few things you can do with it, but that is a quick introduction to get started. :)

agolos
Thank you Mike.
Is there anything written up about that PDF box model? In particular, will I be able to get an offset into text, when the text of the paragraph overflows from one page to the next? Do you do anything to the original DOM? For example, when I construct a CFI out of a box.element, will it point to the right place in the original HTML?
mikeday
Currently it is not possible to get an offset into the actual text of an element, so I think the best you could do for now is link to the first paragraph that begins on that particular page, and ignore paragraphs that continue on to that page.

The original DOM is not changed, so if you have an ID on the element you can get access to that, or you could walk up the DOM tree to make some kind of path-based identifier.
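
As a minimal sketch of the path-based identifier idea, the function below walks up from an element to the root and records each element's name and its position among same-named siblings. The name elementPath is hypothetical and not part of Prince, and the sketch assumes the standard DOM properties nodeType, nodeName, parentNode and previousSibling are available in the script environment. The result is an XPath-like string, not a CFI, so it would still need to be converted.

```javascript
// Hypothetical helper (not part of Prince): build an XPath-like path
// such as "/html[1]/body[1]/p[2]" for a DOM element.
function elementPath(el) {
    var parts = [];

    // nodeType 1 is an element node; the loop stops at the document node
    while (el && el.nodeType === 1) {
        var index = 1;
        var sib = el.previousSibling;

        // count preceding element siblings that share this element's name
        while (sib) {
            if (sib.nodeType === 1 && sib.nodeName === el.nodeName) ++index;
            sib = sib.previousSibling;
        }

        parts.unshift(el.nodeName.toLowerCase() + "[" + index + "]");
        el = el.parentNode;
    }

    return "/" + parts.join("/");
}
```

Inside checkPages this could be called as elementPath(box.element) for elements that have no ID.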
agolos
Thanks. Let me run some tests now.
agolos
Mike, are you sure that the Windows version you created on Apr 06 contains the PDF box model?
Your code fails with PDF undefined, and getPrinceBoxes() fails as well.

The conversion itself works very well and my converted EPUBs look awesome.
agolos
Sorry, my mistake. I used an older (released) version.
agolos
A problem with the code example above is that a page sometimes starts not with text but with an image, for example. Sometimes there is no text on a page at all. So I decided to find the lowest box on the stack and use its box.element as the beginning of the page. This often yields good results, but I have a book that uses lots of <br> elements inside <p> elements. Unfortunately, a <br> element does not show up as any kind of box, and the last box on the stack has type TEXT and nodeName P. Is it possible to fix that? Or is this my mistake, and <br> elements are Prince boxes of some type?
mikeday
I think in many situations <br> elements will not generate any boxes, they will just force a line break. Most empty spans will not generate boxes either, but if you wrap up some text in a span then it will generate a box.
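
For pages that start with something other than text, a rough variation on getFirstTextBox is to return the first leaf box that is linked to an element, whatever its type. This sketch uses only the children and element properties shown earlier in the thread. The name getFirstLeafBox is hypothetical, and it is an assumption that an image produces a childless box with its element set.

```javascript
// Hypothetical variation on getFirstTextBox: return the first box in
// document order that has no children and is linked to an element.
// Assumption: images and similar content produce such leaf boxes.
function getFirstLeafBox(box) {
    var children = box.children || [];

    // a leaf box qualifies only if it is linked to an element
    if (children.length === 0) {
        return box.element ? box : null;
    }

    // otherwise search the children depth-first, in order
    for (var i = 0; i < children.length; ++i) {
        var found = getFirstLeafBox(children[i]);

        if (found) return found;
    }

    return null;
}
```

It can be swapped in for getFirstTextBox(page) in the page loop. Because <br> elements and empty spans often generate no boxes, they will still not be reported by this approach.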
agolos
We would hate to drop the idea of using Prince and go back to Flying Saucer and start modifying it for HTML5. Can somebody in your company quote a price for developing the functionality that I am looking for in Prince? In other words, for each page break, generate a CFI (or some other DOM-based scheme) that points to the first visible element or text fragment on the page.
mikeday
We will need to investigate the difficulty of this and get back to you.
cnj125
I tried the "getFirstTextBox()" solution and found some problems.

1. If the page break is within a list like <ol> or <ul>, it cannot detect the first element with text, which is an HTMLLIElement.

2. If the page break is within a paragraph, I don't know which word is at the beginning of the next page.

Thanks
mikeday
We are currently working to improve the box tracking API and provide access to the text.
cnj125
Hi mikeday,

Is there any update for the tracking API?

Thanks
mikeday
We have made a number of changes to the box tracking API in the latest builds, including providing access to the text of TEXT boxes via the "text" property (although this currently does not handle right-to-left scripts or Indic scripts very well).
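
As a rough sketch of how the "text" property might be used to find the word that begins a page, the helpers below locate the first TEXT box under a page box and take its first run of non-whitespace characters. The names getFirstText and firstWord are hypothetical, and the sketch assumes that "text" is a plain string.

```javascript
// Hypothetical helper: return the text of the first TEXT box found
// under the given box, or null if there is none.
// Assumption: the "text" property is a plain string.
function getFirstText(box) {
    if (box.type == "TEXT" && box.text) return box.text;

    var children = box.children || [];

    for (var i = 0; i < children.length; ++i) {
        var text = getFirstText(children[i]);

        if (text) return text;
    }

    return null;
}

// Hypothetical helper: first run of non-whitespace characters in a
// string, or null if the string is empty or only whitespace.
function firstWord(text) {
    var match = /\S+/.exec(text || "");

    return match ? match[0] : null;
}
```

In the page loop shown earlier, firstWord(getFirstText(PDF.pages[i])) would then give a candidate first word for each page.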
cnj125
Thanks mikeday,

Is there any documentation or are there more examples of the box tracking API? I can only find this one:
http://www.princexml.com/forum/topic/3516/changebars
mikeday