PragMactic OS-Xer: Batch converting HTML files to PDF using OS X and 'convert' utility

Friday, December 18, 2009

Batch converting HTML files to PDF using OS X and 'convert' utility

At work we are converting content to a new website. As a part of that conversion, some older content will be archived on the new site in the form of PDF documents.

I needed something to convert HTML documents to PDF. We own Adobe's software, which has an option to convert an entire website into a single PDF, but that's not what we needed.
I could convert single URLs using the same Adobe software, but that wasn't an optimal solution either. My boss, through a Google search, had found a utility on the Mac called 'convert' which does this.

I automated this with a Bash script, but we still had problems because the pages were truncated. I went to the directory where the 'convert' application lives and found that it links to 'cupsfilter' in /usr/bin.
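
A quick way to check where a utility like this points is to inspect the symlink. The sketch below assumes the path used in this post exists on your system (it may not on other OS X releases); show_link_target is a helper name made up for illustration, not something that ships with OS X.

```shell
#!/bin/bash
# show_link_target (hypothetical helper): report where a path points
# if it is a symlink, and say so plainly if it is not.
show_link_target() {
    local path="$1"
    if [ -L "$path" ]; then
        # readlink prints the target stored in the symlink
        printf '%s -> %s\n' "$path" "$(readlink "$path")"
    elif [ -e "$path" ]; then
        printf '%s is not a symlink\n' "$path"
    else
        printf '%s does not exist\n' "$path"
    fi
}

# Path taken from this post; adjust it if your system differs.
show_link_target /System/Library/Printers/Libraries/convert
```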

By reading up on what cupsfilter does, I was able to determine the parameters needed to make the PDFs landscape and to use the A4 page size, which was enough to make it work properly.

The great thing about this is that if we had used the "Save As" feature to save each page to a PDF, it would have taken hundreds of hours. Using a Bash shell, it took an hour to convert three directories of HTML files with 150+ files per directory. Even though I used 'convert', I suspect you could do the same thing by using cupsfilter directly on any UNIX variant that has CUPS installed.

The key parameter was "landscape", but when using 'convert' it wasn't obvious how to specify the parameters correctly. Through the cupsfilter man pages I found what I needed: cupsfilter takes these settings with the "-o" option, while 'convert' takes them with "-a". The media options were "Letter", "Legal" and "A4", and A4 worked best. Letter was a little too small and ended up truncating some of our documents.
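
For comparison, here is roughly how the same options might look when calling cupsfilter directly. This is a sketch based on the cupsfilter man page, where "-o" passes an option and "-m" sets the output MIME type, with the result written to standard output. I have not verified that it produces the same PDFs as 'convert', and whether an HTML filter is available depends on the system. The html_to_pdf wrapper, the DRY_RUN switch and the file names are made up for illustration.

```shell
#!/bin/bash
# html_to_pdf (hypothetical wrapper): convert one HTML file to PDF by
# calling cupsfilter directly. cupsfilter writes to stdout, so the
# output is redirected to the target file.
# Set DRY_RUN=1 to print the command instead of running it.
html_to_pdf() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "cupsfilter -m application/pdf -o landscape -o media=A4 $1 > $2"
    else
        cupsfilter -m application/pdf -o landscape -o media=A4 "$1" > "$2"
    fi
}

# Placeholder file names; this only prints the command.
DRY_RUN=1 html_to_pdf page.htm page.pdf
```

Dropping DRY_RUN=1 runs the conversion for real, assuming cupsfilter is on your PATH.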

Here's the Bash command I used. It walks through the current
directory finding HTML files with the .htm extension and, for
each output file name, uses sed (Stream EDitor) to change the
htm in the file name to pdf.

for name in *.htm; do /System/Library/Printers/Libraries/convert -f "$name" -o "$(echo "$name" | sed 's/\.htm$/.pdf/')" -a landscape -a scaling=75 -a media=A4; done
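
An alternative sketch of the same loop, written out as a small script, avoids both ls and sed by using shell parameter expansion. It assumes the same 'convert' path and flags as the command above; the CONVERT variable and the pdf_name helper are names made up for illustration. Because ${1%.htm} strips only a trailing .htm, a file such as htmlnotes.htm becomes htmlnotes.pdf, whereas an unanchored s/htm/pdf/ would turn it into pdflnotes.htm.

```shell
#!/bin/bash
# Path from the post; override by setting CONVERT in the environment.
CONVERT="${CONVERT:-/System/Library/Printers/Libraries/convert}"

# pdf_name (hypothetical helper): swap a trailing .htm for .pdf.
pdf_name() {
    printf '%s.pdf\n' "${1%.htm}"
}

for name in *.htm; do
    # If no .htm files exist, the glob stays literal; skip it.
    [ -e "$name" ] || continue
    # Quoting keeps file names with spaces in one piece.
    "$CONVERT" -f "$name" -o "$(pdf_name "$name")" \
        -a landscape -a scaling=75 -a media=A4
done
```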

I should point out that if I were still using a PC, I could probably have done this with Cygwin. But the articles we found on Google indicate that the people who did this used 'convert', and I don't know that I would have figured out to use cupsfilter instead, which is what the Mac OS X utility links to.

11 comments:

Paul Hankin said...: ${a/htm/pdf} is better than the backquote section using sed.; December 18, 2009 at 12:24 PM
Paul Hankin said...: I mean of course ${name/htm/pdf}; December 18, 2009 at 12:25 PM
DenverJuggler said...: Awesome - thanks Paul.

I know I don't always do everything the most efficient way in UN*X but it's great there's always multiple working solutions.

Should I also add the period (need to be escaped?) reflecting the separator between the file name and type and a dollar sign to reflect the end of the string so it only matches on the file type?; December 18, 2009 at 2:04 PM
DenverJuggler said...: I did an Ignite! talk on this at our Denver Java Users Group. Here's the YouTube if you're interested: http://www.youtube.com/watch?v=FxSEesg9XXk; May 16, 2011 at 10:12 AM
polocanada said...: Great. I didn't know Adobe Acrobat can actually copy the whole site offline as a PDF booklet (like SiteSucker, plus the added benefit of one PDF).
That's fantastic and useful. Thanks for pointing this out.; August 5, 2012 at 4:53 PM
Unknown said...: I couldn't get the scaling=75 parameter to work. Whatever value I use, the output is always the same. Have I missed something?; May 11, 2013 at 3:12 PM
Anonymous said...: I could not make the scaling=[int] parameter work either.

Landscape seems to work, but not scaling.; September 2, 2015 at 1:30 AM
