Tuesday, April 17, 2018

Improving PDFs

I sometimes  do some processing on PDFs if I think they are important, or I want to read them more conveniently.  I was trying to explain my techniques to my students, recently, and I realized that I use a mixture of tools that are not at all obvious or easy to explain to someone not familiar with Unix.
So I'm going to write down here what I do, so that at least the information is available in one place.  I assume a general knowledge of Linux and an ability to work with command-line commands.

If I receive a PDF that is a scanned book, with 1 PDF page = one book opening, I want to chop it up so that 1 PDF page = 1 book page.
  • make a working directory
  • use pdftk to unpack the PDF into one file per page:
    > pdftk foobar.pdf burst
  • I now have a directory full of one-page PDFs.  Nice.
  • convert them into jpegs using pdf2jpegs, a shell script that I wrote that contains this text:
    #!/bin/bash
    # convert a directory full of pdfs into jpegs
    for i in *.pdf; do pdftoppm -jpeg -r 400 "$i" >"$i.jpg"; done
  • I now have a directory full of jpegs, one jpeg per page.
  • Start the utility scan-tailor and use it to
    • separate left and right pages into separate files
    • straighten the pages
    • select the text area of each page
    • create a margin around the text
    • finally, write out the resulting new pages
     
  • I now have a directory (../out) full of TIFF files, one page per file, smart.
  •  Combine the TIFFs into a single PDF using my shell script tiffs2pdf:
    #!/bin/bash
    # Create a PDF file from all the .tiff files in the current directory.
    # The argument provides the name of the output file (with .pdf appended).
    echo "Created a PDF from a directory full of .tif files"
    echo "Single argument - the filename of the output PDF (no .pdf extension)"
    tiffcp *.tif "/tmp/${1}.tiff"
    tiff2pdf "/tmp/${1}.tiff" > "${1}.pdf"
    echo "Created ${1}.pdf"
    rm "/tmp/${1}.tiff"
    echo "Removed temporary file /tmp/${1}.tiff"

    # thanks to http://ubuntuforums.org/showthread.php?t=155628
     
  • I now have a nice PDF that has one smart page per PDF page. 
  • If I want it OCRed, then I usually use Adobe Acrobat, a commercial program. But if I'm uploading to Archive.org, that isn't necessary because Archive.org does the OCR work using Abbyy.


That's all, folks!