Tuesday, April 17, 2018

Improving PDFs

I sometimes do some processing on PDFs if I think they are important, or if I want to read them more conveniently.  I was recently trying to explain my techniques to my students, and I realized that I use a mixture of tools that are not at all obvious or easy to explain to someone unfamiliar with Unix.
So I'm going to write down here what I do, so that at least the information is available in one place.  I assume a general knowledge of Linux and an ability to work with command-line commands.

If I receive a PDF that is a scanned book, with 1 PDF page = one book opening, I want to chop it up so that 1 PDF page = 1 book page.
  • make a working directory
  • use pdftk to unpack the PDF into one file per page:
    > pdftk foobar.pdf burst
  • I now have a directory full of one-page PDFs.  Nice.
  • convert them into JPEGs using pdf2jpegs, a shell script that I wrote that contains this text:
    #!/bin/sh
    # Convert a directory full of single-page PDFs into JPEGs.
    # With no output-prefix argument, pdftoppm writes to stdout,
    # which works here because each PDF has exactly one page.
    for i in *.pdf; do pdftoppm -jpeg -r 400 "$i" > "${i%.pdf}.jpg"; done
  • I now have a directory full of jpegs, one jpeg per page.
  • Start the utility ScanTailor (command scantailor) and use it to
    • separate left and right pages into separate files
    • straighten the pages
    • select the text area of each page
    • create a margin around the text
    • finally, write out the resulting new pages
  • I now have a directory (../out) full of TIFF files, one page per file, smart.
  • Combine the TIFFs into a single PDF using my shell script tiffs2pdf:
    #!/bin/sh
    # Create a PDF file from all the .tif files in the current directory.
    # The single argument gives the name of the output file (".pdf" is appended).
    if [ $# -ne 1 ]; then
        echo "Usage: tiffs2pdf output-name (without the .pdf extension)"
        exit 1
    fi
    tiffcp *.tif "/tmp/${1}.tiff"
    tiff2pdf "/tmp/${1}.tiff" > "${1}.pdf"
    echo "Created ${1}.pdf"
    rm "/tmp/${1}.tiff"
    echo "Removed temporary file /tmp/${1}.tiff"

    # thanks to http://ubuntuforums.org/showthread.php?t=155628
  • I now have a nice PDF that has one smart page per PDF page. 
  • If I want it OCRed, then I usually use Adobe Acrobat, a commercial program. But if I'm uploading to Archive.org, that isn't necessary because Archive.org does the OCR work using Abbyy.
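One trap worth flagging in the tiffs2pdf step: tiffcp *.tif takes the files in the shell's lexical glob order, so if the page numbers in the filenames are not zero-padded, page-10.tif sorts before page-2.tif and the pages come out scrambled.  pdftk's burst output is already padded (pg_0001.pdf and so on), but if your TIFFs are not, a little renaming sketch like this fixes the order.  The page-N.tif naming pattern here is hypothetical; adjust it to your own filenames:

```shell
#!/bin/sh
# Demo: zero-pad numeric page filenames so that the shell's lexical
# glob order (which tiffcp relies on) matches the true page order.
cd "$(mktemp -d)"                         # sandbox directory for the demo
touch page-1.tif page-2.tif page-10.tif   # unpadded names: 10 sorts before 2!
for f in page-*.tif; do
    n=${f#page-}; n=${n%.tif}             # extract the page number
    case $n in
        ''|*[!0-9]*|0*) continue ;;       # skip non-numbers / already-padded
    esac
    mv -- "$f" "$(printf 'page-%04d.tif' "$n")"
done
ls page-*.tif   # prints page-0001.tif, page-0002.tif, page-0010.tif (one per line)
```

After a rename like this, tiffcp *.tif picks the pages up in the right order.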

That's all, folks!

Sunday, February 04, 2018

My personal protest withdrawal from USA academic conferences

I regret to say that I am cancelling my visit to this year's USA conferences.  Several USA conferences are among my favourite academic gatherings, and I will sorely miss the occasions.  I have made this difficult decision for political and personal reasons.

Like many, I am deeply dismayed by the statements and policies of current USA governance.  The statements and decisions that have been issued from the USA government over the last year, including the disgracefully vulgar, racist statements just last month, have been increasingly repellent.

I have been torn about whether to travel to the USA to work with and support all my liberal, educated, humanitarian friends there, or whether to take a moral stand and treat the USA as a pariah State.  I still do not feel certain about this matter.  Perhaps it is better to go to the USA and support my embattled friends and colleagues?  Last year, after soul-searching, I went to the American Oriental Society conference in LA.  But after a year of thinking about these issues, I have decided that I will take a different stand at this time, and stay away.

My thinking on these issues was nudged forward decisively by a recent report that I heard on the BBC World Service from an Indian lady journalist, who described the surprise, compulsory biometric facial scanning that she was subjected to at Dulles on attempting to exit from the USA.  She was told that it was mandatory for non-USA citizens and that she could be detained in the USA if she did not comply.  The USA's Department of Homeland Security has recently introduced this biometric face-scanning technology at many airports for departing passengers (see here).  The technology has been shown to be deeply flawed technically, morally and legally (see the NY Times report of Dec 21, and the Georgetown University Law School report).  Amongst other profound problems, the software over-targets women and people with dark skin colour, producing false positives in 4% of these cases.  As the Georgetown report finds,

Innocent people may be pulled from the line at the boarding gate and subjected to manual fingerprinting at higher rates as a result of their complexion or gender.
The technology being compulsorily applied to all non-USA citizens is demeaning, invasive and violates an individual's right to privacy.
USA Immigration has already assumed the right to require all social media passwords, to review up to five years of past social media postings, and to copy all data from mobile phones or laptops [1, 2, etc. etc.].  This again violates individual privacy rights.  It also requires individuals to violate the terms of their contracts with service providers, who require that login details not be shared.  In the case of university staff such as myself, it also violates the University's terms of use for my laptop, which contains or may refer to private information concerning my students.  I am required by the University of Alberta to keep my laptop and phone encrypted, and I may not share the data with anyone outside the University.

Canada recently agreed, under Bill C-23, that USA immigration authorities can arrest people while they are still on Canadian soil, as they go through the USA preclearance process at Canadian ports (1, 2, 3, etc. etc.).

I am a British Citizen and a European Citizen living as a resident in Canada.  I have no criminal record, nor any specific reason to expect difficulty entering  or leaving the USA.  While I am ashamed by the need to assert it, I am a white, Caucasian, male university professor.  From the point of view of USA governance, I am almost as good as a Norwegian.  But I am acutely aware that many good people in the USA, or entering and leaving the USA, including now my Indian friends, do not have the same automatic advantages of skin colour, gender or citizenship.  Families are being split up, people are being forced to travel to war-torn countries or otherwise being denied the American promise of safety symbolized since 1875 by the Statue of Liberty.  All international travellers are routinely being subjected to threatening, invasive and demeaning procedures.

For all these reasons, I have decided that I wish to protest against USA governance and USA immigration policies and procedures.  I do this both in solidarity with my friends, and also as a matter of personal choice, because I do not wish to expose myself personally to the USA immigration service's demeaning and threatening procedures.

I am extremely uncertain about this decision. Perhaps all the above reasons should rather drive me to visit my colleagues in the USA and offer them friendship and collegiality in difficult times. I don't know what the right answer is. Last year, I had many of the same misgivings, but I decided to visit the USA. On this occasion I'm taking the opposite decision, and staying away.

I apologize to my friends in the USA and I look forward to happier times.  I only hope my protest contributes in a small way to encouraging good people in the USA to vote wisely and to lobby their representatives energetically.

Saturday, August 19, 2017

Mini edition environment for LaTeX

When writing an article or book using the LaTeX document preparation system, Indologists sometimes want to have a śloka or two printed on the page with some text-critical notes.  A famous example of this kind of layout is Wilhelm Rau's 1977 edition of the Vākyapadīya, which edits the whole text this way, each verse having its own mini-critical-edition format.

Here is a simple way to get this kind of layout in LaTeX.  I create a new environment called "miniedition":

    \newenvironment{miniedition}%
      {\quote\itshape
       % width of the quote environment:
       \addtolength{\textwidth}{-\rightmargin}%
       % no rule above the critical notes inside the minipage:
       \renewcommand{\footnoterule}{}%
       \minipage{\textwidth}}%
      {\endminipage\endquote}

This puts a minipage environment inside a quote environment, switches on italics and switches off the footnote rule.  It's pretty simple.  The clever bit is done by the minipage environment itself, which sets its footnotes inside its own box rather than at the bottom of the page.  The footnote marks are lowercase letters, to avoid confusion with footnotes outside the environment.
Here's how you would use it, and the result:

\begin{miniedition}
pāraṃparyād \emph{ṛte ’pi}\footnote{N: \emph{upataṃ}?} svayam 
\emph{anubhavanād}\footnote{My conjecture; both manuscripts are one syllable 
short. K: 
\emph{anubhavad}; N: \emph{anubhavād}.} granthajārthasya samyak\\
pūrṇābdīyaṃ phalaṃ sadgrahagaṇitavidāṃ \emph{aṃhrireṇoḥ}\footnote{N: 
with identical meaning.} \emph{prasādāt}\footnote{N: \emph{prasādaḥ}.}||
\end{miniedition}

Output (with added text before and after):
The miniedition text is indented left and right, and followed immediately by the critical notes.  The footnotes at the bottom of the page are a separate series.  In both the main body text and the miniedition, you just use \footnote{} for your notes; LaTeX does the right thing by itself according to context. 
The miniedition environment does not break across pages; it is meant for short fragments of text, one or two ślokas.
This example is taken from Gansten 2017.

Monday, July 24, 2017

Biblatex, citations and bibliography sorting

"I want to sort in-text citations by year, but bibliography by name."  So begins one of the questions on StackExchange.  That's just what I want too.

I want to put multiple references in a \cite{} command without caring about their sequence, and have them automatically print in year-order.  Then, I want my bibliography to be ordinary, i.e., printed in author-date order.

The discussion at the above site is tricky, but the answer by moewe works.

In a nutshell, here's what you actually do:
Hello world!\footcites{ref1,ref3,ref0,ref4}

This will print your footnoted citations in ascending order of year, and your bibliography in ascending order of author.
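For the record, here is a minimal sketch of the kind of setup moewe's answer describes, as far as I understand it.  It assumes a reasonably recent biblatex, in which sortcites follows the sorting of the active reference context; mybibliography.bib and the ref0, ref1, ... keys are placeholders:

```latex
\usepackage[style=authoryear, sorting=nyt, sortcites=true]{biblatex}
\addbibresource{mybibliography.bib}
% ...
% citations made inside this refcontext are sorted year-name-title (ynt):
\begin{refcontext}[sorting=ynt]
Hello world!\footcites{ref1,ref3,ref0,ref4}
\end{refcontext}
% ...
\printbibliography % still printed in name-year-title (nyt) order
```

The point of the refcontext is that it changes the sorting only for the citations inside it, while the bibliography keeps the global nyt order.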

Sunday, June 18, 2017

Expanded Devanāgarī font comparison

In 2012 I posted a comparison of some Devanāgarī fonts that were around at the time.

Here's an update, with some more fonts and more concise TeX code:


% set up a font, print its name, and typeset the test text:
\newcommand{\FontTrial}[1]{%
    % switch the main font, with the Roman-to-Devanāgarī input mapping:
    \setmainfont[Mapping=RomDev]{#1}%
    % print the font name with English shaping, then the test text:
    {\addfontfeatures{Language=English}#1} \TestText}

\newcommand{\TestText}{ = शक्ति, kārtsnyam ṣaṭtriṃśad;
{\addfontfeatures{Language=Hindi} Hindī =
        शक्ति कार्त्स्न्यम्}\par}


\FontTrial{Sanskrit 2003}

\setmainfont[FakeStretch=1.08,Mapping=RomDev]{Sanskrit 2003}
\newfontfamily\eng[FakeStretch=1.08,Language=English]{Sanskrit 2003}
{\eng Sanskrit 2003+} \TestText

\FontTrial{Murty Hindi}
\FontTrial{Murty Sanskrit}
% ... etcetera
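For completeness: the snippet above belongs in a document with a preamble roughly along these lines.  This is a sketch, not my exact file; it assumes XeLaTeX or LuaLaTeX, the fontspec package, and that the RomDev transliteration mapping and the named fonts are installed on your system:

```latex
% compile with xelatex or lualatex
\documentclass{article}
\usepackage{fontspec}
% RomDev maps romanized (IAST) input to Devanāgarī:
\setmainfont[Mapping=RomDev]{Sanskrit 2003}
% \eng for Latin-script labels:
\newfontfamily\eng[Language=English]{Sanskrit 2003}
\begin{document}
% ... the \FontTrial and \TestText definitions and trials go here ...
\end{document}
```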



Lessons learned:
  • Only Sanskrit 2003, Murty Sanskrit, Chandas, Uttara, Siddhanta, and Shobhika do a full conjunct consonant in ṣaṭtriṃśad.  The others fake it with a virāma.
  • Akshar Unicode's "prasanna" has a lazy horizontal conjunct.
  • Free Sans and Free Serif are the only fonts that distinguish Sanskrit and Hindi (see kārtsnyam).
  • Nakula, Sahadeva, Murty Hindi, Shobhika, Annapurna, Akshar Unicode, Kalimati, and Santipur do a lazy, horizontal conjunct consonant in kārtsnyam.
  • There's a special issue affecting FreeSans and FreeSerif, which I described in a post in 2012.  The publicly distributed version of the fonts fails to form some important conjunct consonants, like त्रि and प्र, correctly.  Unfortunately this issue has not changed in the intervening five years.  The examples shown here use a fresh compilation of the fonts, based on downloading and compiling the development version at the Savannah repository (June 2017).  (Here's a link to my compiled fonts.)  This Savannah development version works better for Devanagari, but has problems elsewhere, according to their maintainer Stevan White.

Thursday, June 15, 2017

Preserve the Mess

Many years ago, I attended a Digital Humanities conference, Toronto 1989 I think it was, and heard a paper by Jocelyn Small about using digital tools to manage large datasets.  She was talking about images, but her ideas applied to any data.

One of her key slogans was, "Preserve the Mess."  This approach is now completely normalized by Google search, Google Mail, etc., and we all take it for granted.  But it's worth remembering that this was a major conceptual breakthrough.

Before this approach, everyone thought that the way to find stuff was to use subject indexes.  And subject indexing is expensive, difficult, subjective and structurally imperfect.  What subject headings would you use for the Mahābhārata, for example?  I think most people would agree that it is difficult, if not impossible, to arrive at a simple statement of the subject matter of the Mbh that is actually worth having.  Of course, we can all play nothing-buttery, "the Mbh is nothing but a family quarrel," but that's not a serious approach to the problem.  If we pervade the epic with our keywords and subject index terms, we are trying to make the text more accurate than it is, and our exercise is culture-bound and subjective.

"Preserving the mess" means that we leave the data alone.  Rather, we put the intelligence and power into our tools for accessing the data.  We use fuzzy-matching, pattern recognition, machine learning, but all applied to the raw data which is not itself manipulated or changed.

A published version of Small's ideas appeared in 1991:

As she says, p. 52,
Thus Principle Number One is Aristotelian: "Do not make your datum more accurate than it is."  This principle may be rephrased as, "Preserve the Mess."