Thursday, August 23, 2018

Fetching multiple files from an internet site as a batch job


Sometimes one encounters a website that displays a book or manuscript page by page, as individual jpeg files.  But what you need for your research is a single PDF of the item, so that you can move about in it easily and consult it offline.

There are several quick ways of getting these images as a batch job: here's one.   

  • First you have to identify the URL of one of the images.  I use Firefox, so I
    • first bring up a page that displays the first folio of the MS;
    • then press Ctrl+I to get the "Page Info" window (or use the Firefox menu Tools/Page Info);
    • then select "Media" on the top line of the Info window;
    • then scroll down to the graphics file of the whole page, right-click it, and copy the URL (Ctrl+C).

      You now have a URL that looks like this:

      http://awebsite.net/uploads/manuscripts/miscellaneous/sometext/001.jpg

There may be a more direct way of getting this URL, but this is good enough for me.
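
Before launching the batch job it is worth a ten-second check that the URL pattern really works.  One way (just a sketch, using the example URL above) is to ask curl for the headers of the first image with the -I option; nothing is actually downloaded:

$ curl -I http://awebsite.net/uploads/manuscripts/miscellaneous/sometext/001.jpg

If the reply begins with something like "HTTP/1.1 200 OK", the address is good and the batch fetch below should work.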

The next bit is the nice bit.  Drop to the command line and use the utility "curl".  Here's the syntax ($ is my command prompt):

$ curl -O "http://awebsite.net/uploads/manuscripts/miscellaneous/sometext/[001-268].jpg"
Hit "enter" and several hundred jpeg files will be transferred to your directory.  It takes a couple of minutes, depending on your bandwidth.  The bit in square brackets, "[001-268]", is curl's syntax for "please fetch 001.jpg, 002.jpg, ... 267.jpg, and 268.jpg".  Curl is one of the few tools that can fetch a long run of different files with a single, simple command.
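
If you would rather keep the images in their own folder, or give them different names, curl can do that too.  Here is one variant (just a sketch; the folder name "sometext" is my own example): the "#1" in the -o option is replaced by whatever the square-bracket range matched, and --create-dirs makes the folder if it doesn't already exist:

$ curl --create-dirs -o "sometext/#1.jpg" "http://awebsite.net/uploads/manuscripts/miscellaneous/sometext/[001-268].jpg"

(I put the URL in quotation marks so that the shell doesn't try to interpret the square brackets itself; some shells, zsh for instance, will otherwise complain.)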

To convert them to a single PDF, I use ImageMagick:

$ convert *.jpg Hayanaratna.pdf
and wait for ten seconds.
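
Two small notes.  The shell expands *.jpg in alphabetical order, and because the downloaded files are zero-padded (001.jpg, 002.jpg, ...) that is also the correct page order.  And on newer ImageMagick installations (version 7), the command is called "magick" rather than "convert", so the equivalent would be:

$ magick *.jpg Hayanaratna.pdf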

(I was taught about curl by Patrick McAllister - thanks Patrick!)

A quite different approach is to use wget to fetch a whole website in a single gulp.   That's what I use for GRETIL, for example, so that I have the whole archive on my hard drive.
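
I won't go into wget here in any detail, but the shape of the command is roughly this (the URL is only a placeholder; substitute the site you want to mirror).  The --mirror option turns on recursive, time-stamped downloading, --no-parent stops wget from climbing above the starting directory, --convert-links rewrites the links so that the local copy can be browsed offline, and --wait=1 pauses for a second between requests, to be polite to the server:

$ wget --mirror --no-parent --convert-links --wait=1 http://example.org/archive/

The result is a local directory tree mirroring the site, which you can then browse offline at leisure.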