Images or unicode, but not both

shougun
I am having a problem with creating a PDF. When I set HTML mode on Prince, my Unicode characters get garbled, for example rendering as 'Ã¶' instead of 'ö' as they should. The Prince log shows:

Thu Aug 11 09:47:48 2011: ---- begin
Thu Aug 11 09:47:48 2011: Loading document...
Thu Aug 11 09:47:49 2011: Converting document...
Thu Aug 11 09:47:49 2011: Resolving cross-references...
Thu Aug 11 09:47:50 2011: finished: success
Thu Aug 11 09:47:50 2011: ---- end


I have found that if I set setHTML to false so the page is parsed as XML/XHTML, then my Unicode characters display correctly, but all my images disappear. The Prince log shows:

Thu Aug 11 09:45:31 2011: ---- begin
Thu Aug 11 09:45:31 2011: Loading document...
Thu Aug 11 09:45:32 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%24%7B%2Dx%5Cgeq%20%5Cfrac%7B3%7D%7B2%7D%7D%24%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:32 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%24%7Bx%5Cgeq%20%5Cfrac%7B3%7D%7B2%7D%7D%24%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:32 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%24%7Bx%5Cleq%20%5Cfrac%7B3%7D%7B2%7D%7D%24%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:32 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%28%2Db%20%2B%2Dsqrt%28b%5E2%20%2D%204ac%29%29%2F%287a%29: warning: The requested URL returned error: 400
Thu Aug 11 09:45:32 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%24%7B%20%5Cfrac%7B%2Db%20%5Cpm%20%5Csqrt%7Bb%5E2%2D4ac%7D%7D%7B7a%7D%7D%24%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%5Cmbox%7BThe%20following%20expression%20represents%20the%20value%20of%20which%20variable%20in%20the%20solution%20of%20the%20following%20system%20of%20equations%3F%7D%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%20%2Dx%20%2B%202y%20%2B%207z%20%3D%2013%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%242x%20%2D%20y%20%2D%202z%20%3D%20%2D2%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%203x%20%2B%205y%20%2B%202z%20%3D%20%2D14%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24y%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%5Cmbox%7BIt%20represents%20none%20of%20the%20variables%2E%7D%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24z%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24x%24: warning: The requested URL returned error: 400
Thu Aug 11 09:45:33 2011: Converting document...
Thu Aug 11 09:45:33 2011: Resolving cross-references...
Thu Aug 11 09:45:33 2011: finished: success
Thu Aug 11 09:45:33 2011: ---- end
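With this many warnings it can help to pull the failing URLs out of the log mechanically. The following is a minimal sketch, assuming the log lines keep the exact shape shown above (timestamp, URL, then "warning: The requested URL returned error: NNN"). The function name `failed_requests` is made up for illustration and is not part of Prince.

```python
import re

# Matches lines such as:
#   Thu Aug 11 09:45:33 2011: https://host/path: warning: The requested URL returned error: 400
# The pattern is an assumption based only on the log excerpts in this thread.
WARNING_RE = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}): "
    r"(?P<url>\S+): warning: "
    r"The requested URL returned error: (?P<status>\d{3})$"
)


def failed_requests(log_text):
    """Return (url, status) pairs for every failed-request warning in the log."""
    results = []
    for line in log_text.splitlines():
        match = WARNING_RE.match(line.strip())
        if match:
            results.append((match.group("url"), int(match.group("status"))))
    return results
```

Comparing the returned lists between runs makes it easier to see which images fail each time and which ones fail only intermittently.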


I have also found that if I change the doctype (with setHTML still false), some, but not all, of my images start rendering in the PDF. Which ones render is random, but most of the time it is the first few images in the document that get missed.

The doctype was changed from:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN" "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">

to
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/xhtml-math11-f.dtd">

This results in the following log entries for the exact same input file:

Thu Aug 11 09:42:45 2011: ---- begin
Thu Aug 11 09:42:45 2011: Loading document...
Thu Aug 11 09:42:45 2011: -:2: warning: failed to load external entity "file:///C:/Prince/engine/dtd/mathml2/xhtml-math11-f.dtd"
Thu Aug 11 09:42:46 2011: https://byuisdev.brainhoney.com/Resource/2746858,1E/Assets/FinalNumLine.jpg: warning: The requested URL returned error: 400
Thu Aug 11 09:42:46 2011: https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%24%7B%2Dx%5Cgeq%20%5Cfrac%7B3%7D%7B2%7D%7D%24%24: warning: The requested URL returned error: 400
Thu Aug 11 09:42:47 2011: Converting document...
Thu Aug 11 09:42:47 2011: Resolving cross-references...
Thu Aug 11 09:42:47 2011: finished: success
Thu Aug 11 09:42:47 2011: ---- end


These image URLs should be publicly available and return PNG files.
What am I missing here? I need to have reliable image rendering as well as correct display of Unicode characters.
shougun
I found the solution, which is not Prince related...

We use a ColdFusion service to help render our PDFs... ColdFusion has a server setting called 'Enable Global Script Protection' which replaces certain tag names with 'invalidTag'. This setting was on and our 'meta' tags were getting changed, so the correct Content-Type was not being identified to Prince when in HTML mode.

Disabling this setting for our service removed the issue of Unicode characters not showing correctly.
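For anyone who lands here with the same symptom: the garbling is consistent with UTF-8 bytes being decoded under a single-byte charset once the charset declaration is lost. The sketch below reproduces that effect in Python. It assumes the fallback charset is Latin-1, which the thread does not confirm, and `misdecode` is a hypothetical helper name.

```python
def misdecode(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong charset.

    The default of Latin-1 is an assumption made for illustration; the
    charset Prince actually falls back to is not stated in this thread.
    """
    return text.encode("utf-8").decode(wrong_encoding)


# 'ö' is two bytes in UTF-8 (0xC3 0xB6), so it comes back as two characters.
print(misdecode("ö"))
print(misdecode("Laocoön Group"))
```

If the output of a conversion shows this two-characters-per-accent pattern, a missing or mangled Content-Type declaration is a likely place to look.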
mikeday
Does this fix the images problem as well?
shougun
No, it does not. The issue is minimized because we can use the HTML mode, which is more reliable, but we still get missing images due to 400 errors from time to time.
mikeday
I have not been able to reproduce the image problem here. Can you try converting a test document like this with Prince from the command-line:
<html>
<body>
<img src="https://byuisdev.brainhoney.com/Resource/2746858,1E/Assets/FinalNumLine.jpg"/>
<br/>
<img src="https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24%24%7B%2Dx%5Cgeq%20%5Cfrac%7B3%7D%7B2%7D%7D%24%24"/>
</body>
</html>

Also, the "400" is an HTTP error that should leave some log on the server. Perhaps the URL is getting mangled somehow, making it an invalid request, although I'm not quite sure how that would happen. Which operating system are you running on?
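One way to check whether a URL has been mangled is to decode its query string and compare it with what the page was meant to request. This is a small sketch using only the Python standard library. The helper name `image_query` is an illustrative assumption, and it relies on the `data` parameter visible in the URLs from the log.

```python
from urllib.parse import parse_qs, urlsplit


def image_query(url):
    """Return the percent-decoded `data` query parameter of an image URL."""
    query = urlsplit(url).query
    return parse_qs(query)["data"][0]


url = (
    "https://byuisdev.brainhoney.com/MathML/Image.ashx"
    "?data=%24%24%7B%2Dx%5Cgeq%20%5Cfrac%7B3%7D%7B2%7D%7D%24%24"
)
print(image_query(url))
```

The decoded value for the URLs in the log is a LaTeX-style expression, so a truncated or double-encoded request should stand out when compared with the entry in the server log.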
shougun
Using your example did not present any problems for me either... Some degree of complexity seems to be needed. Here is an example that results in none of the images being downloaded. Note that simple changes to this code cause some, most, or all images to mysteriously appear. A change as simple as removing the div tag around the '<strong><em>Laocoön Group</em></strong>' text will cause all images to render... how is that related?

Note also that while coming up with this script, simply removing a blank line (possibly with tabs) would cause images to appear, or even crash Prince! (Note also that all indentation is done with tabs.)

--- script removed --- see script below for reproducing the issue ---

Here are the command-line arguments used:

Prince output2.htm --log=prince.log -i html


I am using Windows 7 (64-bit).

Since simple changes to white space or unrelated tags can make the images appear, if their URL is getting mangled, it is not happening prior to it being sent to Prince.

shougun
The Windows Event Viewer shows the following for when Prince crashed after I removed a blank line from the code. (Removing more than one line at a time sometimes did not cause the crash.)

Faulting application name: prince.exe, version: 0.0.0.0, time stamp: 0x4b667456
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc000001d
Fault offset: 0x010f16c2
Faulting process id: 0x7d8
Faulting application start time: 0x01cc5b6ed00b0f8f
Faulting application path: C:\Prince\Engine\bin\prince.exe
Faulting module path: unknown
mikeday
Thanks for the test document, I can reproduce the 400 error now with this:
<html>
<head>
<link rel="Stylesheet" href="https://byuisdev.brainhoney.com/resource/2746858,1E/Assets/CSS/BrainHoney.css" />
</head>
<body>

<math xmlns="http://www.w3.org/1998/Math/MathML">
</math>

<img src="https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24y%24"/>

</body>
</html>

Can you confirm that this also gives you the 400 error for the image, and that commenting out the stylesheet link or the empty math element will make it work?

I must admit I'm baffled by why an empty math element would cause an HTTP error from the server. There must be something invalid about the HTTP request, but because it's running over HTTPS this is a bit difficult to track down.
shougun
Yes, that script does cause the problem for me, and yes, removing the stylesheet does allow the image to show, but leaving the stylesheet and removing the empty math tag does not allow the image to show.

So maybe it has to do with the style sheet and its @import of another stylesheet that does not exist, or is an empty file. It may also have to do with how byuisdev.brainhoney.com is responding to the request for the included CourseSpecific.css file (returning a content-type of text/html).

Removing the @import directive from the primary stylesheet allows the image to show with no modifications to your script.

It may be a problem with how Prince is handling @import directives in stylesheets. This is something that we have control over in this case (we are able to modify the primary stylesheet) so we can resolve this issue, but it sounds like something worth looking further into on your side.

Thanks for your help with this.

update-----

I have recreated the problem using my local IIS web server to supply the stylesheet. (The problem did not occur when Prince loaded the stylesheet directly from the file system.) I mimicked the text/html content-type by @importing a .htm file and this caused the error to occur. The issue persisted when I changed the @import to an empty CSS file and, to my surprise, still persisted when the @import file had valid content in it. Does Prince cache responses between executions, or use OS utilities for HTTP requests that may cache the response? Changing the name of the @import file to eliminate the possibility of a caching issue still caused the 400 error for the image, whether or not the imported file was empty or had content.

Because I can't reliably fix the issue by playing with the @import file, the true cause may be totally unrelated.

Removing the mathML tag allowed the image to show again. Adding it back again causes the issue. So there may be something to the combination of using MathML and @import in stylesheets.
mikeday
Were you also using HTTPS on your local server, or just HTTP? At this point I'm suspecting the issue is specific to HTTPS, it would be good to confirm this.
shougun
No, I reproduced the issue without using HTTPS.
mikeday
Okay that's unexpected. I'll see if I can reproduce that here with a similar document over HTTP, that will be easier to debug.
shougun
In order to reproduce the issue, I had to use a domain name for my local machine (localhost did not work at first.)

script:
<html>
	<head>
		<link rel="Stylesheet" href="http://mymachine.byu.edu/BrainHoney.css" />
	</head>
	<body>
		<math xmlns="http://www.w3.org/1998/Math/MathML">
		</math>
		<img src="https://byuisdev.brainhoney.com/MathML/Image.ashx?data=%24y%24" alt="missing image" />
	</body>
</html>


BrainHoney.css:
@import url('CourseSpecific.htm'); /*ensures text/html content-type*/


CourseSpecific.htm -- empty file

After doing that once, I could change CourseSpecific.htm to CourseSpecific.css and that file could be empty or have any content in it and the error would continue.
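For anyone without IIS who wants to try a similar setup, the sketch below serves the two files from a small Python server and forces the text/html content type on the imported file. It is an assumption-laden stand-in rather than the configuration used above: the file names come from this post, while the server, the helper names `start_server` and `fetch_content_type`, and the choice of 127.0.0.1 are illustrative. Note that in the tests above the problem did not appear with localhost at first, so a loopback-only server may not trigger it.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Path -> (Content-Type, body). The file names come from the thread; the
# server itself is an illustrative assumption, not the IIS setup used there.
ROUTES = {
    "/BrainHoney.css": ("text/css", b"@import url('CourseSpecific.htm');\n"),
    "/CourseSpecific.htm": ("text/html", b""),  # empty file served as text/html
}


class ReproHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        route = ROUTES.get(self.path)
        if route is None:
            self.send_error(404)
            return
        content_type, body = route
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), ReproHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_content_type(server, path):
    """Request a path from the running server and return its Content-Type."""
    url = "http://127.0.0.1:%d%s" % (server.server_address[1], path)
    with urllib.request.urlopen(url) as response:
        return response.headers.get("Content-Type")
```

Calling `start_server(8000)` and pointing the stylesheet link in the test document at that port would approximate the setup, with `fetch_content_type` available to confirm the headers before running Prince.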
mikeday
Great, I can recreate that too. Now the question is whether we can replace the final HTTPS link with an HTTP link, or isolate the problem as something specific to libcurl, which Prince uses for HTTP(S) support. Small steps. :)
shougun
new script:

<html>
	<head>
		<link rel="Stylesheet" href="http://localhost/BrainHoney.css" />
	</head>
	<body>
		<math xmlns="http://www.w3.org/1998/Math/MathML">
		</math>
		<img src="http://localhost/image.png" alt="missing image" />
	</body>
</html>


prince log results:

Mon Aug 22 12:19:27 2011: ---- begin
Mon Aug 22 12:19:27 2011: output3.htm:6: error: Tag math invalid
Mon Aug 22 12:19:27 2011: http://localhost/image.png: warning: The requested URL returned error: 400
Mon Aug 22 12:19:27 2011: ---- end


No SSL is used anywhere.

I tried adding additional images to determine if location made a difference... The single image must be after the math tag to produce the 400 error. Adding a different image link before the math tag does not produce a new 400 error for the new image, but does still produce the error for the image after the tag... Adding even more images proved to elicit random behavior, as some images before the math tag started producing the error and some after the tag did not.
mikeday
That's looking very promising, although I can't recreate the issue here with that document. What is the content of the CSS file?
shougun
I am still using the same files that I posted on the 18th. The only difference is that I loaded the image in a browser and saved it to my local web server root as 'image.png'. I believe that I still started with the @import of an HTML file to get the text/html content type returned, and after that, changing it back to a CSS file made no difference. I am also still using the same command-line arguments. Also, the first time I got this to work, the error did not occur when localhost was used... I had to use the DNS name for my computer instead.
mikeday
We appear to have found a solution for this issue, which will be included in the Prince 8.1 release. Please email me (mikeday@yeslogic.com) if you would like to try an updated build of Prince before then.
mikeday
Prince 8.1 is now available, and includes the fix for this issue. Thanks for your patience, it was a tough nut to crack! :)