SVG filenames with special characters

Stephan
Hi Michael,

We have a problem with SVG images in an HTML document when the image filenames contain special characters.

All files are downloaded in advance with wget.
The URL of the SVG image contains, for example, '\', which wget encodes as '%5C', so '%5C' becomes part of the downloaded image filename.

Within the HTML document the '%' is in turn encoded as '%25', so the filename in the 'src' attribute contains '%255C' for the original '\'.

This all seems correct to me.

But the generated PDF document does not contain the image.



file system:
pic%5Ctest.svg



html document index.html:
<img src="pic%255Ctest.svg" />


If I rename the file removing all special characters and change the 'src' attribute within the html document accordingly, everything works fine.
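
The encoding chain described above can be checked with a short sketch. It uses Python's standard urllib.parse purely for illustration; it is not part of wget or Prince, and the assumption that both encoding steps behave like quote() is mine.

```python
from urllib.parse import quote, unquote

# Name as it appears in the original URL (contains a backslash).
original = "pic\\test.svg"

# Step 1: the backslash is percent-encoded, giving the name stored on disk.
on_disk = quote(original, safe="")

# Step 2: the '%' of that name is encoded again for the src attribute.
in_html = quote(on_disk, safe="")

print(on_disk)  # pic%5Ctest.svg
print(in_html)  # pic%255Ctest.svg

# Decoding the src value exactly once must give the on-disk name again.
assert unquote(in_html) == on_disk
```

Decoding the 'src' value once yields the name on the file system, which is why the markup itself looks correct.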


I found out some strange things trying to solve the problem:

1. Starting Prince with --debug reports no error: prince: debug: loaded resource: pic%5Ctest.svg

2. With the special character '/' (%2F) there is no problem!

3. With PNG images there are no problems with special characters either.

Thanks in advance!
mikeday
Which version of Prince are you using? URL processing has changed in recent alpha versions.
Stephan
I tried with Prince 9.0 rev 4 and rev 5.

Stephan
Could you provide alpha version RPMs for SUSE (SLES11, 64-bit)?
Alternatively for openSUSE 11 (64-bit), which also seems to work here.
mikeday
Yes we can do this, but they probably will not be ready until later next week.
Stephan
OK, but is URL processing now different for PNG and SVG, and could that be the reason for the problem?
mikeday
Actually I am having trouble reproducing this problem with Prince 9. I have created a file called "test.html" that contains this:
<img src="pic%255Ctest.png" />

The image filename on the filesystem is pic%5Ctest.png, and it works fine when I run "prince test.html". Is this similar to your experience?
Stephan
As I wrote, the problem appears only with svg files.
Aside from that I used Prince in the same way.
mikeday
Ah I see, trying it with an SVG file I can reproduce the problem with Prince 9, but the problem is fixed in the latest alpha version.
Stephan
Excellent!

Another problem I noticed in this context:
The SVG file requires the '.svg' extension; otherwise the warning 'Unknown image format' is printed and the SVG is missing from the PDF.
Again: with PNG files the filename extension is not needed!

Is this also fixed in the alpha version?
mikeday
At the moment Prince does not try to guess whether a file is SVG or not. If it is an HTTP URL, it will check the Content-Type header returned by the server, and if it is a local file it will check whether it has an .svg extension.

For other image types, it will try loading the file in different ways, because so many image files are misidentified, e.g. PNG images saved with a .jpeg extension.

Are you using a different extension for your SVG images?
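
For illustration only, checking the content instead of the extension could look like the sketch below. This is not how Prince detects formats; the function name sniff_image_type and the decision to test only the PNG signature and the XML root element are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
SVG_NAMESPACE = "http://www.w3.org/2000/svg"

def sniff_image_type(data: bytes) -> str:
    """Guess 'png', 'svg' or 'unknown' from the file content alone."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    try:
        # Only the first start tag (the root element) is inspected.
        for _event, element in ET.iterparse(io.BytesIO(data), events=("start",)):
            if element.tag in ("svg", "{%s}svg" % SVG_NAMESPACE):
                return "svg"
            break
    except ET.ParseError:
        # Not well-formed XML, so it cannot be an SVG document.
        pass
    return "unknown"

print(sniff_image_type(b'<svg xmlns="http://www.w3.org/2000/svg"/>'))  # svg
```

A check like this ignores the filename entirely, so it also covers files that have no extension at all.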
Stephan
Because we download all image URLs with wget in advance, the local filenames are the (CGI) URLs themselves.

Local filename e.g.:
lbdoc?dbname=caedb&server=&dbpath=%2Fhome%2Fengin%2Fpicture

The CGI program exports the SVG from a database, which is why our SVG filenames currently have no extension. Changing that would take considerable effort.
Perhaps you could consider letting Prince perform an SVG check as well. That would be very helpful for us!
Stephan
Any opinion on that?
mikeday
Yes I think it makes sense, just thinking about the best way to do it. :)
Stephan
So can we expect the SVG check in Prince 10, together with the fix for our other problem?

Approximately when will it be released?

mikeday
Hopefully we will have an updated build with the SVG check within a week, unless we are interrupted by unforeseen circumstances.
Stephan
Thanks,
please remember to provide RPMs for SUSE (SLES11, 64-bit).
Alternatively for openSUSE 11 (64-bit), which also seems to work for us.
mikeday
New builds are now available for 64-bit OpenSUSE 11 which include the SVG loading change. Please let me know how it goes. :D
Stephan
Handling of filenames with special characters (%5C) and the SVG format test for files without an extension both work fine. :-)

BUT: Character '?' in filenames is not accepted anymore. :-(

Prince seems to cut filenames at that character.

With
 <img  src="lbdoc?dbname=caedb&amp;server=&amp;dbpath=%252Fhome%252Fengin%252FENGIN&amp;image=P%252FSamplesProject%255CPD%252FSystem%252FLogic%252FAlarm_u_Trend%255CV3.24.3%255CDOC%255CImages%252Ftest%257Cimage%257Csvg&amp;lang=049" />

=>
Prince reports error message:
prince: lbdoc: warning: can't open input file: No such file or directory

If I change '?' to '_', everything works perfectly.

This will be a small correction hopefully.

I attached a tgz with all files.

Just try
prince index.html 

to reproduce the bug.

Attachment: docdir.tgz (73.5 kB)
mikeday
I think this is not actually a bug. Browsers have the same issue, at least, unless you escape the ? in the URL as %3F.
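
A small sketch shows why the name is cut at the '?'. It uses Python's urllib.parse as a stand-in for a standards-following URL parser; it is not Prince's code, and the shortened 'src' value is only an example.

```python
from urllib.parse import urlsplit, unquote

# Shortened version of the src value from the report above.
src = "lbdoc?dbname=caedb&server=&dbpath=%252Fhome"

parts = urlsplit(src)
print(parts.path)   # lbdoc
print(parts.query)  # dbname=caedb&server=&dbpath=%252Fhome

# With the '?' escaped as %3F, the whole string is treated as the path,
# and decoding it once gives the real filename on disk.
escaped = src.replace("?", "%3F", 1)
local_name = unquote(urlsplit(escaped).path)
print(local_name)   # lbdoc?dbname=caedb&server=&dbpath=%2Fhome
```

Only the path part is looked up on the file system, which matches the "lbdoc: can't open input file" warning.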
Stephan
But originally it is the separator between the URL's path and query string components, and that separator is never escaped.
The resulting filename comes from the download with wget.
We had no problems with Prince 9 concerning that.
mikeday
Yes, in the original URL it is fine. But as a local file, I think it needs to be escaped, and Prince 9 was wrong.

Some people work around this limitation of wget using the --restrict-file-names option, as described here.
Stephan
'wget --restrict-file-names=unix' does not escape '?' as it is an allowed character.

You don't want me to use wget --restrict-file-names=windows in our Linux environment, do you?
(This replaces '?' with '@')

I think Prince 9 was right.



mikeday
Yes, I was suggesting --restrict-file-names=windows, to avoid the ? character. The other approach would be to rewrite the image URLs inside your HTML document. You could use JavaScript regular expressions to do this.

The new Prince behaviour for URL parsing is more correct according to the specification, and also consistent with what web browsers do, so it is generally not possible to interpret the unescaped ? character without breaking other behaviour.
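
As a rough sketch of the rewriting approach, here is the same idea in Python rather than JavaScript, run as a preprocessing step before Prince. The function name escape_local_src is made up for the example, only double-quoted 'src' attributes of img tags are handled, and only '?' is escaped; other characters may need the same treatment.

```python
import re

# Matches <img ... src="..."> and captures the attribute value.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)
# Absolute URLs (scheme:...) and //host/... references are left untouched.
REMOTE = re.compile(r'(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//)')

def escape_local_src(html: str) -> str:
    """Escape '?' as %3F in relative img src values (local files)."""
    def fix(match):
        prefix, src, suffix = match.groups()
        if REMOTE.match(src):
            return match.group(0)  # real URL: '?' is a genuine query separator
        return prefix + src.replace("?", "%3F") + suffix
    return IMG_SRC.sub(fix, html)

print(escape_local_src('<img  src="lbdoc?dbname=caedb&amp;server=" />'))
# <img  src="lbdoc%3Fdbname=caedb&amp;server=" />
```

The files on disk keep their names as downloaded by wget; only the references in the HTML change.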
Stephan
Could you then give us a list of the characters that are not allowed and must be escaped or replaced?

Does the "--restrict-file-names=windows" option ensure that Prince accepts the file name?

mikeday
Prince follows the same rules for decoding URLs as browsers, so special characters like ? & # and so on need to be escaped.

I think using --restrict-file-names=windows should be sufficient, yes. Although really it would be helpful if wget had more convenient mechanisms for rewriting links to local files.