[Press]: Medianama - Wikipedians Digitizing Out-Of-Copyright Texts In Eight Indian Languages

Noopur Raval
Dear all,

Here's an article by Medianama on Wikisource in Indic languages. There are
a couple of minor misses in the article, but it does refer to two important
aspects of Wikisource:

a) It is a door through which many have entered our projects and
communities (i.e., they start with Wikisource, and indeed Wiktionary,
because it's relatively easier to contribute to, and then they move on to
contribute to other projects too, such as Wikipedia) - especially in Indic
b) The initiative run by the Malayalam community (written about in the
article) to encourage school children to contribute to Wikisource is
something that could be of interest to many other communities. If anyone
wants any help to start conversations with schools in their states, Nitika
and I would be happy to help out. Please reach out to us at [hidden email]
or [hidden email]

*Wikipedians Digitizing Out-Of-Copyright Texts In Eight Indian
Languages* *-Nikhil

In what is a painstaking process, Wikipedians are digitizing Indian
language, out-of-copyright texts online, trying to address the comparative
paucity of Indic language texts online. Wikisource is a repository of
documents and archived material that serves as a reference source for
Wikipedia, and a means of improving access to information sources. Of the
64 languages Wikisource is available in, 8 are Indian:
Tamil (stats <http://stats.wikimedia.org/wikisource/EN/SummaryTA.htm>),
Malayalam (stats <http://stats.wikimedia.org/wikisource/EN/SummaryML.htm>),
Telugu (stats <http://stats.wikimedia.org/wikisource/EN/SummaryTE.htm>),
Kannada (stats <http://stats.wikimedia.org/wikisource/EN/SummaryKN.htm>),
Sanskrit (stats <http://stats.wikimedia.org/wikisource/EN/SummarySA.htm>),
Marathi (stats <http://stats.wikimedia.org/wikisource/EN/SummaryMR.htm>),
Bengali (stats <http://stats.wikimedia.org/wikisource/EN/SummaryBN.htm>) and
Gujarati (stats <http://stats.wikimedia.org/wikisource/EN/SummaryGU.htm>). What’s
particularly notable about this digitization is that the texts are being
typed out by volunteers on their own time, one word at a time.*

*How It Began*

*Users were adding bhajans of Mirabai to Wikipedia, but according to
Wikipedia’s policies, recipes, poems and song lyrics belong to Wikibooks or
Wikisource, Noopur Raval, Communications Consultant (India Program) at the
Wikimedia Foundation told MediaNama. One user raised this issue, and
following discussions, it was decided to create a Wikisource for Gujarati.
The first text to be digitized, though, was Rachnatmak Karyakram, a book by
Mahatma Gandhi. The project, involving the digitization of 60 pages, took
six volunteers a week. This was followed by another project, the
digitization of Gandhi’s autobiography, with a group of 13 people typing
out the book over a month.*

*Identification & Prioritization Of Texts For Digitization*

*Selection of text for digitization is entirely community driven: they
decide what is important. Editors put up a notice for the project, and user
participation is sought. For example, the Gujarati Wikisource editors chose
a text by Mahatma Gandhi. The community has an intensive process for
checking if a book is out of copyright, for example by using the publication
date, and there are mailing lists which discuss when books go out of copyright.
“It’s not as if there is a shortage of texts that are out of copyright,”
Hisham Mundol, Consultant (India Program) at the Wikimedia Foundation said,
adding that “The kind of projects that the community is undertaking (at
present) involves iconic books, where you know the author and the

*Overcoming Technological Challenges*

*Mundol points out that the process of digitization is brutal, compounded
by the fact that there is no reasonably functional OCR (Optical Character
Recognition) in Indic languages. Texts are thus manually typed out,
followed by a phase of correction and proofreading. In comparison, English
texts can be scanned and uploaded and OCR’ed. The lack of tools points
towards an issue which Wikipedia faces with Indic languages. “If a
MediaWiki tool comes to an English language project, the possibility of
implementing it, the kind of people using it, all of that happens very
quickly, because most of this is written English. It takes time to localize
it. For a bug to be filed for a local language project takes a lot more
time. That gap makes for a lot of difference: how many people (use it), how
easily is the work done, the kind of ease, at every step you need people
who know the language to work with people who know the technology,” says

*Still, the situation with Indic language fonts has improved over the past
year, according to Mundol: “The font input problem is no longer the burning
issue. There’s been an increase in the volumes on Indic language scripts,
emails, mobiles. We’re seeing a doubling of readership of our Indic
languages.” One reason for the increase, according to Raval, has been the
implementation of a multiple input tool called Narayam, integrating both
Inscript and transliteration.*

*Reducing Entry Barrier & Involving Schools*

*The Wikisource project is really small in India right now, but it plays an
important role: “It allows people to enter the Wikimedia world of projects
in a much easier manner than editing a Wikipedia article. Wikisource is
much more accessible,” Mundol says.*

*In Kerala, community members involved schools in the process of
digitizing Ramchandra Vilasam. “As a part of the 7th or 8th standard, the
school curriculum
encourages typing in Malayalam. So the community members work with the
teacher, and instead of 40 students typing out the same two pages that they
would have done in a class assignment, they split a book between them, and
each types out a separate page. It’s great because if everyone gave in the
same page, it would go to the recycle bin quite promptly,” Mundol said,
adding that “We are looking at involving more schools, and discussions are
on with Malayalam schools and colleges.”*

*The Culture Of Knowledge & The Importance Of Community For Wikimedia*

*“It’s interesting to see how a culture develops, not just editors and
technology, but the whole interaction that builds up the identity of a
community,” Raval says. “When Gujarati Wikisource or Marathi Wikisource
comes to your mind, you’re actually thinking of a bunch of people you don’t
necessarily know, and their attitudes towards knowledge, and why they would
go out of their way, spend hours, just to make sure that the knowledge that
they think is important in a language should survive and be digitized, and
they’ll go through the pain to make it available.” While each project has
an individual taking responsibility as a project manager, and a group gets
created around each project, since it is volunteer work, whenever someone
has exams or has other work, someone else compensates.*

*The focus on fostering an involved community often determines Wikimedia’s
approach: “The temptation could be to take a bunch of the 4 million
articles on the English Wikipedia, and run it through a translator. Very
quickly, you can build a huge content base (in Indic languages), but it
does nothing for that community. We’ve seen that it not only does no good,
it does a great deal of harm because they no longer feel that this is
actually their project. It’s about ‘I wrote this paragraph’, or ‘I
contributed to this article’,” Mundol says. “Anything that Wikimedia does,
we encourage the participation of an individual member as much much more
important than anything else because community members edit, contribute,
and no technology solution can get you that.”*

Noopur Raval
Wikikn-l mailing list
[hidden email]