I'm going to write down here what I do, so that at least the information is available in one place. I assume a general knowledge of Linux and some comfort with the command line.
If I receive a PDF of a scanned book in which one PDF page = one book opening (two facing pages), I want to chop it up so that one PDF page = one book page.
- make a working directory
- use pdftk to unpack the PDF into one file per page:
> pdftk foobar.pdf burst
- I now have a directory full of one-page PDFs. Nice.
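The first two steps can be wrapped up in one small function. This is only a sketch: the names burst_pdf and workdir_for and the "-work" directory suffix are made up for illustration, and it assumes pdftk is installed. Left to itself, pdftk names the burst pages pg_0001.pdf, pg_0002.pdf and so on; the output pattern below keeps those names but puts the files in the working directory.

```shell
#!/bin/bash
# Sketch: burst a scanned PDF into one-page PDFs inside a working directory.
# burst_pdf, workdir_for and the "-work" suffix are hypothetical names.

# Derive a working-directory name from the PDF name: foobar.pdf -> foobar-work
workdir_for() {
    printf '%s-work\n' "$(basename "$1" .pdf)"
}

burst_pdf() {
    local src="$1"
    local workdir="${2:-$(workdir_for "$src")}"
    # Refuse to continue (and create nothing) if the input PDF is missing.
    [ -f "$src" ] || { echo "No such file: $src" >&2; return 1; }
    mkdir -p "$workdir" || return 1
    # The printf-style pattern tells pdftk where to write each page.
    pdftk "$src" burst output "$workdir/pg_%04d.pdf"
}
```

After sourcing the file, `burst_pdf foobar.pdf` should leave the one-page PDFs in foobar-work. pdftk also writes a small report called doc_data.txt, which can be ignored here.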
- convert them into jpegs using pdf2jpegs, a shell script that I wrote that contains this text:
#!/bin/bash
# convert a directory full of pdfs into jpegs
for i in *.pdf; do pdftoppm -jpeg -r 400 "$i" >"$i.jpg"; done
- I now have a directory full of jpegs, one jpeg per page.
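The script above names its output after the whole input filename, so pg_0001.pdf becomes pg_0001.pdf.jpg. If that bothers you, a variant along these lines gives pg_0001.jpg instead. It is a sketch, not a replacement for the script I actually use: jpeg_root and pdfs_to_jpegs are names invented here, and it relies on pdftoppm's -singlefile option, which writes exactly one image named after the output root.

```shell
#!/bin/bash
# Sketch: convert every one-page PDF in the current directory to a JPEG
# without the doubled ".pdf.jpg" extension.
# jpeg_root and pdfs_to_jpegs are hypothetical names.

# Strip the .pdf suffix: pg_0001.pdf -> pg_0001
jpeg_root() {
    printf '%s\n' "${1%.pdf}"
}

pdfs_to_jpegs() {
    local dpi="${1:-400}"   # resolution in DPI, 400 as in the original script
    local i
    for i in *.pdf; do
        # If there are no PDFs the glob stays literal; skip it.
        [ -e "$i" ] || continue
        # -singlefile makes pdftoppm write <root>.jpg with no page number added.
        pdftoppm -jpeg -r "$dpi" -singlefile "$i" "$(jpeg_root "$i")"
    done
}
```

Call `pdfs_to_jpegs` for 400 DPI, or `pdfs_to_jpegs 300` for smaller files.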
- Start the utility scan-tailor and use it to:
  - separate left and right pages into separate files
  - straighten the pages
  - select the text area of each page
  - create a margin around the text
  - finally, write out the resulting new pages
- I now have a directory (../out) full of TIFF files, one page per file, smart.
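Before going further it can be worth checking that the split did what was expected. The sketch below assumes every scanned opening really holds two pages, so there should be twice as many TIFFs as JPEGs; that won't hold if the scan includes single pages such as covers. count_files and check_split are hypothetical names, and the two directories are passed in as arguments rather than assumed.

```shell
#!/bin/bash
# Sketch: compare the number of input JPEGs with the number of output TIFFs.
# count_files and check_split are hypothetical names.

# count_files DIR EXT: number of regular files directly in DIR ending in .EXT
count_files() {
    find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# check_split JPEG_DIR TIFF_DIR: succeeds if there are exactly two TIFFs per JPEG
check_split() {
    local jpegs tiffs
    jpegs=$(count_files "$1" jpg)
    tiffs=$(count_files "$2" tif)
    echo "$jpegs openings in, $tiffs pages out"
    # Assumption: each opening was split into a left and a right page.
    [ "$tiffs" -eq $((jpegs * 2)) ]
}
```

Run it as `check_split <jpeg directory> <tiff directory>`. A non-zero exit status just means the counts are worth a second look in scan-tailor.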
- Combine the TIFFs into a single PDF using my shell script tiffs2pdf:
#!/bin/bash
# Create a PDF file from all the .tif files in the current directory.
# The single argument is the name of the output file, without the .pdf extension.
if [ $# -ne 1 ]; then
    echo "Creates a PDF from a directory full of .tif files"
    echo "Single argument - the filename of the output PDF (no .pdf extension)"
    exit 1
fi
# Merge the one-page TIFFs into a temporary multi-page TIFF, then convert it.
tiffcp *.tif "/tmp/${1}.tiff"
tiff2pdf "/tmp/${1}.tiff" > "${1}.pdf"
echo "Created ${1}.pdf"
rm "/tmp/${1}.tiff"
echo "Removed temporary file /tmp/${1}.tiff"
# thanks to http://ubuntuforums.org/showthread.php?t=155628
- I now have a nice PDF that has one smart page per PDF page.
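A quick way to confirm that nothing got lost is to compare the page count of the new PDF with the number of TIFFs that went into it. This sketch assumes pdfinfo is available (it comes from the same poppler tools as pdftoppm); pages_from_pdfinfo and pdf_matches_tiffs are names invented for the example.

```shell
#!/bin/bash
# Sketch: check that the finished PDF has one page per TIFF file.
# pages_from_pdfinfo and pdf_matches_tiffs are hypothetical names.

# Read pdfinfo output on stdin and print the number from the "Pages:" line.
pages_from_pdfinfo() {
    awk '/^Pages:/ { print $2 }'
}

# pdf_matches_tiffs PDF [TIFF_DIR]: succeeds if the page count equals the
# number of .tif files directly inside TIFF_DIR (default: current directory).
pdf_matches_tiffs() {
    local pdf="$1" tiffdir="${2:-.}" pages tiffs
    pages=$(pdfinfo "$pdf" | pages_from_pdfinfo)
    tiffs=$(find "$tiffdir" -maxdepth 1 -type f -name '*.tif' | wc -l | tr -d ' ')
    echo "$pdf: $pages pages, $tiffs TIFF files"
    [ "$pages" = "$tiffs" ]
}
```

For example, `pdf_matches_tiffs foobar.pdf` run from the directory holding the TIFFs prints both counts and exits non-zero if they differ.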
- If I want it OCRed, then I usually use Adobe Acrobat, a commercial program. But if I'm uploading to Archive.org, that isn't necessary because Archive.org does the OCR work using Abbyy.
That's all, folks!