Friday, June 01, 2012

Crowdsourcing manuscript transcription

The TEI world is discussing, amongst many things, the crowdsourcing of MS transcription.  This idea seems to hold great promise for the Indian case.  After all, we've got crowds, right?  As always, the issue is quality control.

But just for a moment, imagine the scenario of an open, public, collaborative website where anybody can bring up an image of a Sanskrit manuscript and write a transcription in an adjacent window.  A transcription that - like a Wikipedia article - would be open for others to improve or annotate, that would rely on crowdsourced cognitive surplus for contribution and gradual quality improvement.  It would be under a history/version control system, so everything would be trackable.  Contributors would earn trust points, as in eBay's feedback score.
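To make the idea a little more concrete, here is a minimal Python sketch of the two mechanisms just described: a transcription that keeps every revision, and a running trust score per contributor. It is only an illustration under my own assumptions. The names `Revision`, `Transcription` and `TrustLedger`, the image identifier, the contributor names and the sample Velthuis-style text are all hypothetical, and the +1/-1 scoring is loosely modelled on a feedback score rather than on any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    """One saved state of a transcription (hypothetical structure)."""
    author: str
    text: str


@dataclass
class Transcription:
    """Transcription of one manuscript image, with every revision kept."""
    image_id: str
    revisions: list[Revision] = field(default_factory=list)

    def edit(self, author: str, text: str) -> None:
        # Append instead of overwriting, so earlier readings stay trackable.
        self.revisions.append(Revision(author, text))

    @property
    def current(self) -> str:
        # The latest revision is the visible text; empty if nobody has started.
        return self.revisions[-1].text if self.revisions else ""


@dataclass
class TrustLedger:
    """Running trust score per contributor (assumed +1 / -1 ratings)."""
    points: dict[str, int] = field(default_factory=dict)

    def rate(self, author: str, positive: bool) -> None:
        self.points[author] = self.points.get(author, 0) + (1 if positive else -1)


# Hypothetical usage: two contributors work on the same folio.
page = Transcription("ms-001-folio-1r")
page.edit("asha", "dharmaksetre kuruksetre")
page.edit("ravi", "dharmak.setre kuruk.setre")  # second contributor corrects the reading

ledger = TrustLedger()
ledger.rate("ravi", positive=True)  # another reader approves the correction
```

Because edits are appended rather than overwritten, a mistaken "improvement" can always be compared with, or rolled back to, an earlier revision, which is where the quality control would come from.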

Ben Brumfield has created an extremely useful survey of MS transcription tools here.

His own FromThePage service looks simple to use and very attractive for a proof-of-concept pilot project.

For example, the Transcribe Bentham project has developed this way of working:

[video]

See also the other video about markup, on their "getting started" page.

All the exciting work on MSS and editions today is happening in connection with the TEI framework, and is based on transcribed MSS with TEI encoding: Juxta, the Versioning Machine, etc.  We need to start thinking about creating a public, high-quality corpus of transcribed MSS.  Such a corpus would be the basis for many future projects.

See also:

  • Clay Shirky's Cognitive Surplus: Creativity and Generosity in a Connected Age (2011), and a TED talk on the same subject.
  • Transcribe Bentham project at UCL.


  1. Thanks Dominik. I finally had a look at the "Transcribe Bentham" project. It is indeed impressive. But it deals with a language accessible to many, and with manuscripts which somehow resemble texts we are used to (our grandparents' letters, for instance), and that is a trait which "our" projects would not share.
    Further, the "Education" page shows that part of the resources come from students (forced to cooperate?). Maybe a suggestion could be to insert a small wiki-manuscript-project as part of the program of each of the classes we teach?

    1. I take your points, Elisa, but I think that this can be approached pragmatically. Let's try it out, and if it succeeds, great, and if not, एवमस्तु||

      I've written to Ben Brumfield about hosting a few Sanskrit images on his site. He says his software depends on a version of Ruby that doesn't support the full Unicode character set. So we're wondering whether we could go ahead with just a Velthuis-style input. He also suggested just taking his software and installing it on one's own computer. I might do that at Still thinking about it.

      What I think is the hardest thing for us all to grasp is the power of scale. If ten or even fifty people were interested in contributing to this sort of project, we would be in one world. But if a thousand or more got interested, we would be in a completely different world. And there are easily a thousand people in India who know enough to transcribe text. It's a question of whether they would want to, or have access to a networked PC.


  2. It is a great post.