Friday, December 18, 2009

Batch converting HTML files to PDF using OS X and 'convert' utility

At work we are converting content to a new website. As a part of that conversion, some older content will be archived on the new site in the form of PDF documents.

I needed something to convert HTML documents to PDF. We own Adobe's software and there is an option to convert entire web URLs to a single PDF, but that's not what we needed.
I could convert single URLs using the same Adobe software, but that wasn't an optimal solution either. My boss, through a Google search, had found a utility on the Mac called 'convert' which does this.

I automated this through a Bash script call, but we still had problems because the pages were truncated. I went to the directory where the "convert" application was, and found it links to 'cupsfilter' in /usr/bin.

By figuring out what cupsfilter does, I was able to determine the parameters necessary to make the PDFs landscape, and use the page size of A4, which was enough to have it work properly.

The great thing about this is that if we had used the "Save As" feature to save each page as a PDF, it would have taken hundreds of hours. Using a Bash shell loop, it took about an hour to convert three directories of HTML files with 150+ files per directory. Even though I used 'convert', I suspect you could do the same thing by using cupsfilter directly on any UNIX variant that has CUPS installed.

The key parameter was "landscape", but when using "convert" it wasn't obvious how to specify the parameters correctly. Through the cupsfilter man page I found out what I needed: in cupsfilter options are passed with "-o", but in "convert" they are passed with "-a". The media format options were "Letter", "Legal" and "A4", and A4 worked best. Letter was a little too small and ended up truncating some of our documents.

Here's my Bash Shell Command that walked through the current
directory finding HTML files with the extension HTM, and for
the output file name used SED (Stream EDitor) to convert the
HTM in the filename to the output file type of PDF.

for name in *.htm ; do /System/Library/Printers/Libraries/convert -f "$name" -o "`echo "$name" | sed s/htm/pdf/`" -a landscape -a scaling=75 -a media=A4; done
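
A slightly more defensive variant of the same loop, offered as a sketch rather than as the command that was actually run: it iterates over the *.htm glob directly, quotes every expansion so file names containing spaces survive, and strips only a trailing ".htm", so a name such as htmlguide.htm is not mangled. The function names pdf_name and convert_dir and the CONVERT override variable are made up for illustration; the convert path and the -a options are the ones used in the command above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: derive the PDF name by stripping only a trailing ".htm".
pdf_name() {
  printf '%s\n' "${1%.htm}.pdf"
}

# Hypothetical wrapper around the loop. CONVERT can be overridden
# (for example CONVERT=echo) to preview the commands without running them.
convert_dir() {
  local converter="${CONVERT:-/System/Library/Printers/Libraries/convert}"
  local name
  for name in *.htm; do
    [ -e "$name" ] || continue   # the glob matched nothing
    "$converter" -f "$name" -o "$(pdf_name "$name")" \
      -a landscape -a scaling=75 -a media=A4
  done
}
```

Running "CONVERT=echo convert_dir" inside a directory of .htm files prints each command line instead of converting anything; running "convert_dir" on its own performs the conversion.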

I have to point out that if I were still using a PC I could probably have done this with Cygwin, but the articles we found on Google indicated that the people who did this used convert, and I don't know that I would have figured out to use cupsfilter instead, which is what Mac OS X linked to.
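
For anyone who wants to try cupsfilter directly, a rough sketch follows. It assumes a cupsfilter that takes -m for the output MIME type and -o for job options and writes the converted document to standard output, and a CUPS installation whose filters can actually handle HTML input, which may not hold outside Mac OS X. The function name html_to_pdf and the CUPSFILTER override variable are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert one HTML file ($1) to a PDF file ($2).
# cupsfilter is assumed to write its result to standard output,
# so the output is redirected into the target file.
html_to_pdf() {
  local filter="${CUPSFILTER:-cupsfilter}"
  "$filter" -m application/pdf -o landscape -o media=A4 "$1" > "$2"
}
```

A loop such as for name in *.htm; do html_to_pdf "$name" "${name%.htm}.pdf"; done would then cover a whole directory.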


Paul Hankin said...

${a/htm/pdf} is better than the backquote section using sed.

Paul Hankin said...

I mean of course ${name/htm/pdf}

DenverJuggler said...

Awesome - thanks Paul.

I know I don't always do everything the most efficient way in UN*X, but it's great that there are always multiple working solutions.

Should I also add the period (does it need to be escaped?) for the separator between the file name and type, and a dollar sign for the end of the string, so it only matches on the file type?

DenverJuggler said...

I did an Ignite! talk on this at our Denver Java Users Group. Here's the YouTube if you're interested:

polocanada said...

Great. I didn't know Adobe Acrobat could actually copy a whole site offline as a PDF booklet (like SiteSucker, plus the added benefit of one PDF).
That's fantastic and useful. Thanks for pointing this out.

Mehdi Bougrine said...

I couldn't get the scaling=75 parameter to work. Whatever value I use, the output is always the same. Have I missed something?

Anonymous said...

I could not make the scaling=[int] parameter work either.

Landscape seems to work, but not scaling.

Anonymous said...

Batch converting HTML files to PDF is not such an easy task now.
Check out