Question #178783 “several chm2pdf issues solved” : Questions : chm2pdf package : Ubuntu

Reto Knaak (reto-knaak) said on 2011-11-14:

#1

I see in the script that something is done to avoid duplicates in the list of image URLs (in class ImageCatcher), but nothing similar is done for class PageLister. I added this to the PageLister class:
                # Avoid duplicates in the list of URLs.
                if not self.pages.count('/'+value):
                    self.pages.append('/'+value)
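
As a standalone illustration of the same check, here is a minimal sketch. The helper name add_page and the sample file names are made up for the example; this is not the actual PageLister code, only the duplicate test from the snippet above pulled out into a function.

```python
def add_page(pages, value):
    """Append '/' + value to pages unless that URL is already listed."""
    url = '/' + value
    # Same effect as "if not pages.count(url)", written as a membership test.
    if url not in pages:
        pages.append(url)
    return pages


pages = []
for value in ['intro.html', 'usage.html', 'intro.html']:
    add_page(pages, value)
print(pages)
```

The list keeps its original order, which probably matters here because the page order ends up in the PDF.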

In my application some links do not work in the PDF because the links contain upper/lower case errors. As CHM is "Windows stuff" this doesn't matter there, but "here" it does! So how about making the first-pass matching case-insensitive by adding the (?i) modifier to the regular expression?
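
To show what the (?i) modifier changes, here is a small sketch. The pattern and the sample HTML are made up for the example and are not the expression chm2pdf uses in its first pass.

```python
import re

html = '<A HREF="Page.HTML">x</A> <a href="other.html">y</a>'

# Without the modifier only the lowercase tag matches.
sensitive = re.findall(r'<a href="([^"]*)"', html)

# With (?i) at the very start of the pattern both spellings match.
insensitive = re.findall(r'(?i)<a href="([^"]*)"', html)

print(sensitive)
print(insensitive)
```

Note that recent Python versions only accept a global (?i) at the very start of the pattern.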

I have a CHM file with images, and some of them are not generated in the PDF. The reason is (again) that paths and names are not case-sensitive on Windows, but they are on Linux. So a single upper/lower case mismatch somewhere in the CHM is enough: the CHM will display correctly on Windows, but you can't convert it completely with chm2pdf.
The curious part is that in my case the paths of the missing images were written correctly, but the images were in the same subdirectory as images from other pages, and on one of those other pages the subdirectory was written in lower case. So the page where images are missing in the PDF is not necessarily the page with the misspelled upper/lower case; it can be any other page. Probably what counts is how the path is spelled the first time it is encountered when generating the CHM source file...
Does anyone have ideas on how this could be solved automagically in chm2pdf?
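
One possible direction, purely as a sketch: index the extracted files by their lowercased path and look every reference up in that index. The function names build_case_map and resolve_reference are hypothetical, nothing like this exists in chm2pdf as far as I know, and the example assumes the CHM content has already been extracted to a work directory.

```python
import os
import tempfile


def build_case_map(root):
    """Map lowercased relative paths under root to their on-disk spelling."""
    case_map = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            rel = os.path.relpath(os.path.join(dirpath, filename), root)
            rel = rel.replace(os.sep, '/')
            # Keep the first spelling seen if two names differ only in case.
            case_map.setdefault(rel.lower(), rel)
    return case_map


def resolve_reference(case_map, reference):
    """Return the on-disk spelling for a reference, or None if nothing matches."""
    key = reference.replace('\\', '/').lstrip('/').lower()
    return case_map.get(key)


# Tiny demonstration with a throwaway directory standing in for the work directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'Images'))
    open(os.path.join(root, 'Images', 'Logo.PNG'), 'w').close()
    case_map = build_case_map(root)
    resolved = resolve_reference(case_map, 'images/logo.png')
print(resolved)
```

The image and link URLs found in the pages could then be rewritten to the resolved spelling before htmldoc is called.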

I think this one is too greedy because it matches everything up to the next #, even if that # is outside the link!
            # Replace links of the form "somefile.html#894" with "somefile0206.html"
            # ...
            page = re.sub('(?i)<a href="([^#]*)#[^"]*"', '<a href="\\1"', page)
How about this? Is this the way to go? At least it seems to work!
            # Inside [...] the characters (, | and ) are literal, so [^#"] is enough.
            page = re.sub('(?i)<a href="([^#"]*)#[^"]*"', '<a href="\\1"', page)

I modified the same place even more, because in my file I have links like <a href="#X1">X1</a>, and nothing of such a link would be left. In my opinion it is better to leave the link intact in this case, as it points inside the same file!
            page = re.sub('(?i)<a href="([^#"]+)#[^"]*"', '<a href="\\1"', page)
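
To make the difference visible, here is a small self-contained comparison. The sample page and the names GREEDY, BOUNDED and strip_fragments are made up for the example; BOUNDED stops the href match at the closing quote and requires at least one character before the #.

```python
import re

# Pattern from the script: [^#]* may run past the closing quote of the href.
GREEDY = r'(?i)<a href="([^#]*)#[^"]*"'
# Bounded variant: stops at the closing quote and needs a file name before #.
BOUNDED = r'(?i)<a href="([^#"]+)#[^"]*"'


def strip_fragments(page, pattern=BOUNDED):
    """Drop the #fragment part of links that point into another file."""
    return re.sub(pattern, r'<a href="\1"', page)


page = ('<a href="a.html">A</a> see #3 in "notes" '
        '<a href="b.html#894">B</a> <a href="#X1">X1</a>')

greedy_result = strip_fragments(page, GREEDY)
bounded_result = strip_fragments(page, BOUNDED)
print(greedy_result)
print(bounded_result)
```

With the greedy pattern the text "#3 in " disappears and the same-page link loses its target, while the bounded pattern only touches the b.html link.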

My CHM file contained some JavaScript, but chm2pdf makes no effort to delete JavaScript (some other unwanted stuff is deleted before everything is passed to the htmldoc part). I am no regex expert, so the following may not be a good solution, but at least in my case one ERR011 is gone!
# Delete javascript (<script type='text/javascript'>...</script>)
page=re.sub('(?i)<script type=("|\')text/javascript("|\')(.*?)>(.*?)</script>','', page, flags=re.DOTALL|re.MULTILINE)
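
A slightly more general variant, again only as a sketch: it removes every script element regardless of its type attribute, which is broader than the expression above. The names SCRIPT_RE and strip_scripts and the sample page are made up for the example, and the pattern assumes that no attribute value contains a > character.

```python
import re

# DOTALL lets .*? cross line breaks; IGNORECASE also covers <SCRIPT>.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(page):
    """Remove all <script>...</script> blocks from an HTML string."""
    return SCRIPT_RE.sub('', page)


page = ('<p>before</p>\n'
        '<SCRIPT type="text/javascript">\nvar a = 1;\n</SCRIPT>\n'
        '<script>b()</script><p>after</p>')
cleaned = strip_scripts(page)
print(repr(cleaned))
```

re.MULTILINE is not needed here because the pattern contains no ^ or $ anchors.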

Reto Knaak (reto-knaak) said on 2011-11-14:

#2

I am not able to explain the last ERR011 I get:

ERR011: Unable to parse HTML element on line 49!
PAGES: 124
BYTES: 3015375
Something wrong happened when launching htmldoc.
exit value: 256
Check if output exists or if it is good.

In the final PDF the last half of the very last page is missing (but I see nothing wrong in the source file, nor in the work directory).

I found out that with --verbose --verbositylevel high I can easily re-run the htmldoc call.
When I do so, there is no ERR011 and I get the complete file!
PAGES: 125
BYTES: 3018552

Reto Knaak (reto-knaak) said on 2011-11-26:

#3

I have now been able to fix my last problem as well!

See https://bugs.launchpad.net/ubuntu/+source/chm2pdf/+bug/896692
