SVG filenames with special characters

Hi Michael,

We have a problem with SVG image filenames that contain special characters when they are referenced from an HTML document.

All files are downloaded in advance with wget.
The URL of the SVG image contains, for example, '\', which wget encodes as '%5C', so '%5C' becomes part of the downloaded image filename.

Within the HTML document the '%' is encoded as '%25', so the filename in the 'src' attribute contains '%255C' for the original '\'.
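
To make the two encoding steps concrete, here is a minimal Python sketch. It assumes that both steps behave like plain percent-encoding (Python's `urllib.parse.quote` stands in for wget and for whatever writes the HTML), and the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

# Name as it appears in the remote URL (a single backslash).
original = "pic\\test.svg"

# Step 1: wget-style escaping; the file is saved under this name.
on_disk = quote(original, safe="")

# Step 2: the on-disk name is escaped again for the 'src' attribute.
in_html = quote(on_disk, safe="")

print(on_disk)   # pic%5Ctest.svg
print(in_html)   # pic%255Ctest.svg

# Decoding the 'src' value once gives back the on-disk name.
assert unquote(in_html) == on_disk
```

A consumer that decodes the 'src' value exactly once should therefore look for the file pic%5Ctest.svg.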

This all seems correct to me.

But the generated pdf document does not contain the image.

file system:

html document index.html:
<img src="pic%255Ctest.svg" />

If I rename the file removing all special characters and change the 'src' attribute within the html document accordingly, everything works fine.

I found out some strange things trying to solve the problem:

1. Starting Prince with --debug reports no error: prince: debug: loaded resource: pic%5Ctest.svg

2. With the special character '/' (%2F) there is no problem!

3. With PNG images there are also no problems with special characters.

Thanks in advance!
Which version of Prince are you using? URL processing has changed in recent alpha versions.
I tried with Prince 9.0 rev 4 and rev 5.

Could you provide alpha version RPMs for SUSE (SLES11, 64-bit)?
Alternatively for openSUSE 11 (64-bit), which also seems to work here.
Yes we can do this, but they probably will not be ready until later next week.
OK, but is URL processing now different for PNG and SVG? That could be a reason for the problem.
Actually I am having trouble reproducing this problem with Prince 9. I have created a file called "test.html" that contains this:
<img src="pic%255Ctest.png" />

The image filename on the filesystem is pic%5Ctest.png, and it works fine when I run "prince test.html". Is this similar to your experience?
As I wrote, the problem appears only with svg files.
Aside from that I used Prince in the same way.
Ah I see, trying it with an SVG file I can reproduce the problem with Prince 9, but the problem is fixed in the latest alpha version.

Another problem I noticed in this context:
The SVG file requires the '.svg' extension; otherwise the warning 'Unknown image format' is printed and the SVG is missing from the PDF.
Again: with PNG files the filename extension is not needed!

Is this also fixed in the alpha version?
At the moment Prince does not try to guess whether a file is SVG or not. If it is an HTTP URL, then it will check the Content-Type header returned by the server, and if it is a local file it will check whether it has an .svg extension.

For other image types, it will try loading the file in different ways, because so many image files are misidentified, e.g. PNG images saved with a .jpeg extension.
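
For illustration only, a content-based SVG check could look roughly like the Python sketch below. This is not how Prince works internally; `looks_like_svg` is a hypothetical helper that simply tests whether the data parses as XML and its root element is `svg`.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

SVG_NS = "http://www.w3.org/2000/svg"

def looks_like_svg(data: bytes) -> bool:
    """Return True if the data is XML whose root element is <svg>."""
    try:
        # Only the first "start" event is needed: it is the root element.
        for _event, elem in ET.iterparse(BytesIO(data), events=("start",)):
            return elem.tag in (f"{{{SVG_NS}}}svg", "svg")
    except ET.ParseError:
        # Not well-formed XML (for example a PNG or JPEG file).
        return False
    return False

print(looks_like_svg(b'<svg xmlns="http://www.w3.org/2000/svg"/>'))
print(looks_like_svg(b"\x89PNG\r\n\x1a\n"))
```

A check like this does not depend on the file name, so it would also cover files saved without an extension.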

Are you using a different extension for your SVG images?
Because we download all image URLs with wget in advance, the (CGI) URLs become the filenames.

Local filename e.g.:

The CGI program exports the SVG from a database, which is why our SVG filenames currently have no extension. Changing that would take considerable effort.
Perhaps you could consider letting Prince also do an SVG check. That would be very helpful for us!

Edited by Stephan

Any opinion on that?
Yes I think it makes sense, just thinking about the best way to do it. :)
So can we expect the SVG check in Prince 10, together with the fix for our other problem?

Approximately when will it be released?

Hopefully we will have an updated build with the SVG check within a week, unless we are interrupted by unforeseen circumstances.
Please remember to provide RPMs for SUSE (SLES11, 64-bit).
Alternatively for openSUSE 11 (64-bit), which also seems to work for us.
New builds are now available for 64-bit OpenSUSE 11 which include the SVG loading change. Please let me know how it goes. :D
Handling of filenames with special characters (%5C) and the SVG format test for files without an extension both work fine. :-)

BUT: Character '?' in filenames is not accepted anymore. :-(

Prince seems to cut filenames at that character.

 <img  src="lbdoc?dbname=caedb&amp;server=&amp;dbpath=%252Fhome%252Fengin%252FENGIN&amp;image=P%252FSamplesProject%255CPD%252FSystem%252FLogic%252FAlarm_u_Trend%255CV3.24.3%255CDOC%255CImages%252Ftest%257Cimage%257Csvg&amp;lang=049" />

Prince reports error message:
prince: lbdoc: warning: can't open input file: No such file or directory

If I change '?' to '_', everything works perfectly.

Hopefully this will be a small correction.

I attached a tgz with all files.

Just try
prince index.html 

to reproduce the bug.

  1. docdir.tgz (73.5 kB)

Edited by Stephan

I think this is not a bug, actually. Browsers, at least, have the same issue unless you escape the ? in the URL as %3F.
But originally it is the separator between the URL path and the query string, which is never escaped.
The resulting filename comes from the download with wget.
We had no problems with Prince 9 concerning that.
Yes, in the original URL it is fine. But as a local file, I think it needs to be escaped, and Prince 9 was wrong.
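
As a rough illustration of why the name gets cut, this Python sketch runs a shortened, made-up file name through a standard URL parser (`urllib.parse`, not Prince itself) and then through percent-encoding:

```python
from urllib.parse import urlsplit, quote

# Shortened, hypothetical file name as wget would store it on disk.
name = "lbdoc?dbname=caedb&lang=049"

# Used unescaped as a URL, everything after '?' is treated as the query.
parts = urlsplit(name)
print(parts.path)    # lbdoc
print(parts.query)   # dbname=caedb&lang=049

# Percent-encoding the whole file name keeps it in the path component.
escaped = quote(name, safe="")
print(escaped)                  # lbdoc%3Fdbname%3Dcaedb%26lang%3D049
print(urlsplit(escaped).path)   # lbdoc%3Fdbname%3Dcaedb%26lang%3D049
```

The unescaped form leaves only "lbdoc" as the path, which matches the "can't open input file" warning for lbdoc.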

Some people work around this limitation of wget using the --restrict-file-names option, as described here.

Edited by mikeday

'wget --restrict-file-names=unix' does not escape '?' as it is an allowed character.

You don't want me to use wget --restrict-file-names=windows in our Linux environment?
(This replaces '?' with '@')

I think Prince 9 was right.

Yes, I was suggesting --restrict-file-names=windows, to avoid the ? character. The other approach would be to rewrite the image URLs inside your HTML document. You could use JavaScript regular expressions to do this.
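
The rewriting approach could be sketched as follows. The post suggests JavaScript; this sketch uses Python instead, applied to the HTML before it is passed to Prince. `escape_local_src` is a hypothetical helper, and it assumes that every `img` `src` without a scheme refers to a local file.

```python
import re

# Matches the src attribute of an <img> tag (double-quoted values only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)

def escape_local_src(html: str) -> str:
    """Escape '?' as %3F in local <img src="..."> values."""
    def fix(match: re.Match) -> str:
        src = match.group(2)
        if "://" in src:
            # Looks like a remote URL: keep its query string intact.
            return match.group(0)
        # Only '?' is replaced; '#' is left alone because it also
        # appears in numeric character references such as &#38;.
        return match.group(1) + src.replace("?", "%3F") + match.group(3)
    return IMG_SRC.sub(fix, html)

html = '<img src="lbdoc?dbname=caedb&amp;lang=049" />'
print(escape_local_src(html))
# <img src="lbdoc%3Fdbname=caedb&amp;lang=049" />
```

A regular expression is enough for simple, generated markup like this; for arbitrary HTML a real parser would be the safer choice.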

The new Prince behaviour for URL parsing is more correct according to the specification, and also consistent with what web browsers do, so it is generally not possible to interpret the unescaped ? character without breaking other behaviour.
Could you then give us a list of the characters which are not allowed and must be escaped or replaced?

Does the "--restrict-file-names=windows" option ensure that Prince accepts the file name?

Prince follows the same rules for decoding URLs as browsers, so special characters like ? & # and so on need to be escaped.

I think using --restrict-file-names=windows should be sufficient, yes. Although really it would be helpful if wget had more convenient mechanisms for rewriting links to local files.