Cikitsā: Improving PDFs

Tuesday, April 17, 2018

Improving PDFs

I sometimes process PDFs that I think are important or that I want to read more conveniently. I was recently trying to explain my techniques to my students, and I realized that I use a mixture of tools that are not at all obvious or easy to explain to someone unfamiliar with Unix.
So I'm writing down what I do here, so that at least the information is available in one place. I assume a general knowledge of Linux and an ability to work at the command line.

If I receive a PDF that is a scanned book, with 1 PDF page = one book opening, I want to chop it up so that 1 PDF page = 1 book page.

Make a working directory.
Use pdftk to unpack the PDF into one file per page:
> pdftk foobar.pdf burst
I now have a directory full of one-page PDFs. Nice.
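The two steps above can be wrapped into one small function. This is only a sketch: the function name burst_pdf and the default directory name "work" are my own placeholders, and it assumes pdftk is on the PATH.

```shell
#!/bin/bash
# Sketch: burst a PDF into one-page files inside a fresh working directory.
# burst_pdf and the default "work" directory are hypothetical names.
burst_pdf() (
    src=$1
    workdir=${2:-work}
    if [ ! -f "$src" ]; then
        echo "burst_pdf: no such file: $src" >&2
        exit 1
    fi
    if ! command -v pdftk >/dev/null 2>&1; then
        echo "burst_pdf: pdftk is not installed" >&2
        exit 1
    fi
    # Resolve the input to an absolute path before changing directory.
    src_abs=$(cd "$(dirname "$src")" && pwd)/$(basename "$src")
    mkdir -p "$workdir" && cd "$workdir" || exit 1
    # By default pdftk names the burst pages pg_0001.pdf, pg_0002.pdf, ...
    pdftk "$src_abs" burst
)
```

Calling `burst_pdf foobar.pdf` should leave the one-page PDFs in ./work. The function body runs in a subshell, so your own shell stays in the directory where you started.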
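Convert them into jpegs using pdf2jpegs, a shell script that I wrote that contains this text:
#!/bin/bash
# Convert a directory full of one-page PDFs into JPEGs at 400 dpi.
# -singlefile makes pdftoppm write exactly NAME.jpg, with no page-number suffix.
for i in *.pdf; do
    pdftoppm -jpeg -r 400 -singlefile "$i" "${i%.pdf}"
done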
I now have a directory full of jpegs, one jpeg per page.
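Before moving on, it can be worth confirming that the conversion really produced one usable jpeg per PDF, since a failed conversion can leave an empty file behind. A minimal sketch follows. The helper name check_page_images is my own invention, and it only counts files, so it works whatever the jpegs are called.

```shell
#!/bin/bash
# Sketch: confirm that there is one non-empty JPEG per one-page PDF.
# check_page_images is a hypothetical helper name.
check_page_images() (
    cd "${1:-.}" || exit 1
    pdfs=0
    jpgs=0
    # Count the PDFs that exist in the directory.
    for f in *.pdf; do [ -e "$f" ] && pdfs=$((pdfs + 1)); done
    # Count only JPEGs that exist and are not empty.
    for f in *.jpg; do [ -s "$f" ] && jpgs=$((jpgs + 1)); done
    echo "$pdfs PDFs, $jpgs JPEGs"
    # Succeed only when the two counts agree.
    [ "$pdfs" -eq "$jpgs" ]
)
```

Run `check_page_images` in the working directory (or pass the directory as an argument). It prints both counts and returns a non-zero status if they differ.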
Start the utility scan-tailor and use it to
- separate left and right pages into separate files
- straighten the pages
- select the text area of each page
- create a margin around the text
- finally, write out the resulting new pages
I now have a directory (../out) full of TIFF files, one page per file, smart.
Combine the TIFFs into a single PDF using my shell script tiffs2pdf:
#!/bin/bash
# Create a PDF file from all the .tif files in the current directory.
# The single argument is the name of the output file, without the .pdf extension.
if [ -z "$1" ]; then
    echo "Usage: tiffs2pdf OUTPUT-NAME (no .pdf extension)" >&2
    exit 1
fi
echo "Creating a PDF from a directory full of .tif files"
tiffcp *.tif "/tmp/${1}.tiff" || exit 1
tiff2pdf "/tmp/${1}.tiff" > "${1}.pdf"
echo "Created ${1}.pdf"
rm "/tmp/${1}.tiff"
echo "Removed temporary file /tmp/${1}.tiff"

# thanks to http://ubuntuforums.org/showthread.php?t=155628
I now have a nice PDF that has one smart page per PDF page.
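As a final sanity check, the page count of the new PDF can be compared with the number of TIFFs that went into it. The sketch below assumes pdfinfo (from the same poppler tools as pdftoppm) is installed. The names pages_from_pdfinfo and check_pdf_pages are hypothetical helpers, not part of my scripts above.

```shell
#!/bin/bash
# Sketch: compare the page count of the new PDF with the number of TIFFs.
# pages_from_pdfinfo and check_pdf_pages are hypothetical helper names.

# Read pdfinfo output on stdin and print the value of its "Pages:" line.
pages_from_pdfinfo() {
    awk '$1 == "Pages:" { print $2; exit }'
}

# Usage: check_pdf_pages output.pdf
# Run it in the directory that holds the .tif files.
check_pdf_pages() (
    tiffs=0
    for f in *.tif; do [ -e "$f" ] && tiffs=$((tiffs + 1)); done
    pages=$(pdfinfo "$1" | pages_from_pdfinfo)
    echo "$tiffs TIFFs, ${pages:-0} PDF pages"
    # Succeed only when every TIFF became exactly one PDF page.
    [ "$tiffs" -eq "${pages:-0}" ]
)
```

For example, `check_pdf_pages foobar.pdf` prints both counts and returns a non-zero status if a page went missing along the way.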
If I want it OCRed, I usually use Adobe Acrobat, a commercial program. But if I'm uploading to Archive.org, that isn't necessary, because Archive.org does the OCR work itself using ABBYY.

That's all, folks!