Friday, December 18, 2009

Batch converting HTML files to PDF using OS X and 'convert' utility

At work we are converting content to a new website. As a part of that conversion, some older content will be archived on the new site in the form of PDF documents.

I needed something to convert HTML documents to PDF. We own Adobe's software, which has an option to convert an entire website to a single PDF, but that's not what we needed.
I could convert single URLs using the same Adobe software, but that wasn't an optimal solution either. My boss, through a Google search, found a utility on the Mac called 'convert' that does this.

I automated this with a Bash script, but we still had problems because the pages were truncated. I went to the directory where the "convert" application was, and found that it links to 'cupsfilter' in /usr/bin.

By figuring out what cupsfilter does, I was able to determine the parameters necessary to make the PDFs landscape, and use the page size of A4, which was enough to have it work properly.

The great thing about this is that if we had used the "Save As" feature to save each page as a PDF, it would have taken hundreds of hours. Using a Bash loop, it took an hour to convert three directories of HTML files with 150+ files per directory. Even though I used 'convert', I suspect you could do the same thing by using cupsfilter directly on any UNIX variant.

The key parameter was "landscape", but when using "convert" it wasn't obvious how to specify the parameters correctly. Through the cupsfilter man page I found out what I needed: cupsfilter takes options with "-o", but "convert" takes them with "-a". The media format options were "Letter", "Legal" and "A4", and A4 worked best. Letter was a little too small and ended up truncating some of our documents.
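To make the "-o" versus "-a" difference concrete, here is a rough sketch of what calling cupsfilter directly might look like. I did not run it this way myself, so treat it as an assumption: the function name html_to_pdf and the CUPSFILTER variable are made up for illustration, the scaling=75 value is simply carried over from my 'convert' command, and whether cupsfilter can handle HTML at all depends on the filters installed on your system.

```shell
#!/bin/bash
# Hypothetical wrapper: convert one HTML file to PDF by calling cupsfilter directly.
# Assumes an HTML-to-PDF filter chain is available on this system.
# CUPSFILTER is an illustrative override so the command can be dry-run.
html_to_pdf() {
    local src="$1" dst="$2"
    # -m sets the destination MIME type; each -o passes one job option.
    # cupsfilter writes the converted document to standard output.
    "${CUPSFILTER:-cupsfilter}" -m application/pdf \
        -o landscape -o media=A4 -o scaling=75 \
        "$src" > "$dst"
}
```

Calling html_to_pdf page.htm page.pdf would then correspond to 'convert' with -f page.htm -o page.pdf and the same three options given as -a.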

Here's my Bash shell command. It walks through the current
directory finding HTML files with the extension .htm and, for
the output file name, uses sed (Stream EDitor) to change the
.htm in the filename to the output file type, .pdf.

for name in *.htm; do /System/Library/Printers/Libraries/convert -f "$name" -o "$(echo "$name" | sed 's/\.htm$/.pdf/')" -a landscape -a scaling=75 -a media=A4; done
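The one-liner above can also be written as a small script, which is easier to reuse across directories. This is only a sketch: the function names pdf_name and convert_all and the CONVERT variable are hypothetical, and it swaps sed for Bash parameter expansion so that only a trailing ".htm" is replaced, even when "htm" appears elsewhere in the file name.

```shell
#!/bin/bash
# Hypothetical helper: derive the output name by replacing only a trailing ".htm".
pdf_name() {
    printf '%s\n' "${1%.htm}.pdf"
}

# CONVERT defaults to the path used in this post; override it (e.g. with echo) to dry-run.
CONVERT="${CONVERT:-/System/Library/Printers/Libraries/convert}"

# Convert every .htm file in the current directory with the options from the post.
convert_all() {
    local name
    for name in *.htm; do
        [ -e "$name" ] || continue   # no matches: the glob is left as a literal string
        "$CONVERT" -f "$name" -o "$(pdf_name "$name")" \
            -a landscape -a scaling=75 -a media=A4 \
            || echo "failed: $name" >&2
    done
}
```

Running convert_all inside each directory does the conversion. Setting CONVERT=echo first prints the arguments that would be passed instead of running anything, which is a cheap way to check the file names before a long batch.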

I have to point out that if I were still using a PC, I could probably have done this with Cygwin. But the articles we found on Google indicate that the people who did this used convert, and I don't know that I would have figured out to use cupsfilter instead, which is what convert links to on Mac OS X.