Open Library, Wikisource, and cleaning and translating OCR of Classics

Open Library, Wikisource, and cleaning and translating OCR of Classics

metasj
Lars,

I think we agree on what needs to happen.  The only thing I am not
sure of is where you would like to see the work take place.   I have
raised versions of this issue with the Open Library list, which I copy
again here (along with the people I know who work on that fine project
- hello, Peter and Rebecca).  This is why I listed it below as a good
group to collaborate with.

However, the project I have in mind for OCR cleaning and translation needs to
 - accept public comments and annotation about the substance or use of
a work (the wiki covering Open Library's millions of metadata entries
is very low-traffic and used mainly to address metadata issues in its
records)
 - handle OCR as editable content, or translations of same
 - provide a universal ID for a work, with which comments and
translations can be associated (see
https://blueprints.launchpad.net/openlibrary/+spec/global-work-ids)
 - handle citations, with the possibility of developing something like
WikiCite (a sketch of one such record follows below)
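
As a purely illustrative sketch of what one entry in such a system
might hold -- the field names are invented, not an existing Open
Library or Wikisource schema:

# Illustrative sketch only: one record in a hypothetical wiki of work
# metadata, covering the four requirements above.  Field names are
# invented; they do not reflect any actual OL or WS schema.
work_record = {
    "work_id": "work:12345",               # universal ID for the work
    "title": "Olympian and Pythian Odes",
    "annotations": [                       # public comments on substance/use
        {"user": "example", "text": "Useful critical apparatus."},
    ],
    "ocr_text": "el.wikisource page (editable OCR / transcription)",
    "translations": {"en": "en.wikisource page"},
    "citations": {"cites": [], "cited_by": []},   # WikiCite-style links
}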

Let's take a practical example.  A classics professor I know (Greg
Crane, copied here) has scans of primary source materials, some with
approximate or hand-polished OCR, waiting to be uploaded and converted
into a useful online resource for editors, translators, and
classicists around the world.

Where should he and his students post that material?

Wherever they end up, the primary article about each article would
surely link out to the OL and WS pages for each work (where one
exists).


> (Plus you would have to motivate why a copy of OpenLibrary should
> go into the English Wikisource and not the German or French one.)

I don't understand what you mean -- English source materials and
metadata go on en:ws, German on de:ws, &c.  How is this different from
what happens today?

SJ


On Mon, Aug 3, 2009 at 1:18 PM, Lars Aronsson<[hidden email]> wrote:

> Samuel Klein wrote (in two messages):
>
>> >> *A wiki for book metadata, with an entry for every published
>> >> work, statistics about its use and siblings, and discussion
>> >> about its usefulness as a citation (a collaboration with
>> >> OpenLibrary, merging WikiCite ideas)
>
>> I could see this happening on Wikisource.
>
> Why could you not see this happening within the existing
> OpenLibrary? Is there anything wrong with that project? It sounds
> to me as you would just copy (fork) all their book data, but for
> what gain?
>
> (Plus you would have to motivate why a copy of OpenLibrary should
> go into the English Wikisource and not the German or French one.)
>
>
> --
>  Lars Aronsson ([hidden email])
>  Aronsson Datateknik - http://aronsson.se
>


Re: [Wikisource-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Federico Leva (Nemo)
Samuel Klein, 11/08/2009 07:00:
> Let's take a practical example.  A classics professor I know (Greg
> Crane, copied here) has scans of primary source materials, some with
> approximate or hand-polished OCR, waiting to be uploaded and converted
> into a useful online resource for editors, translators, and
> classicists around the world.
>
> Where should he and his students post that material?

Slovene Wikisource did something similar:
http://meta.wikimedia.org/wiki/Slovene_student_projects_in_Wikipedia_and_Wikisource#Wikisource

Nemo


Re: [ol-discuss] Open Library, Wikisource, and cleaning and translating OCR of Classics

John Mark Vandenberg
On Tue, Aug 11, 2009 at 3:00 PM, Samuel Klein<[hidden email]> wrote:
>...
> Let's take a practical example.  A classics professor I know (Greg
> Crane, copied here) has scans of primary source materials, some with
> approximate or hand-polished OCR, waiting to be uploaded and converted
> into a useful online resource for editors, translators, and
> classicists around the world.
>
> Where should he and his students post that material?

I am a bit confused.  Are these texts currently hosted at the Perseus
Digital Library?

If so, they are already a useful online resource. ;-)

If they would like to see these primary sources pushed into the
Wikimedia community, they would need to upload the images (or DjVu)
onto Commons, and the text onto Wikisource where the distributed
proofreading software resides.

We can work with them to import a few texts in order to demonstrate
our technology and preferred methods, and then they can decide whether
they are happy with this technology, the community, and the potential
for translations and commentary.

I made a start on creating a Perseus-to-Wikisource importer about a year ago...!

Or they can upload the DjVu to the Internet Archive... or a similar
repository... and see where it goes from there.

> Wherever they end up, the primary article about each article would
> surely link out to the OL and WS pages for each work (where one
> exists).

Wikisource has been adding OCLC numbers to pages, and adding links to
archive.org when the djvu files came from there (these links contain
an archive.org identifier).  There are also links to LibraryThing and
Open Library; we have very few rules ;-)

--
John Vandenberg


Re: [ol-discuss] Open Library, Wikisource, and cleaning and translating OCR of Classics

Magnus Manske
On Tue, Aug 11, 2009 at 7:32 AM, John Vandenberg<[hidden email]> wrote:

>...
> If they would like to see these primary sources pushed into the
> Wikimedia community, they would need to upload the images (or DjVu)
> onto Commons, and the text onto Wikisource where the distributed
> proofreading software resides.

I see CC-NC...

http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3atext%3a2003.02.0004

Too bad.

Magnus


Re: [ol-discuss] Open Library, Wikisource, and cleaning and translating OCR of Classics

John Mark Vandenberg
On Tue, Aug 11, 2009 at 6:21 PM, Magnus
Manske<[hidden email]> wrote:
> I see CC-NC...
>
> http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3atext%3a2003.02.0004
>
> Too bad.

Well, they can't copyright what is in the PD.

There is little about the XML in TEI format that can be called
"creative", and any non-factual markup can be easily stripped out.

I remember now ... it was in March/April 2008 that I was looking at
this, for the Pindar odes, and a djvu with pagescans is on
archive.org.

http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.04.0101
http://www.archive.org/details/olympianpythiano00pinduoft

The Perseus etext doesn't appear to include the 125 pages which have
the complete Greek texts.
(btw, here is our unverified original source:
http://el.wikisource.org/wiki/Ολυμπιόνικοι )

However the commentary is all there, with pagination in the TEI so it
is easy to marry the text with the images.

(warning: 850kb xml file, followed by medium res. image)
http://www.perseus.tufts.edu/hopper/xmlchunk?doc=Perseus%3Atext%3A1999.04.0101%3Atext%3Dcomm%3Abook%3DO.
http://www.archive.org/stream/olympianpythiano00pinduoft#page/124/mode/2up
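
A hedged sketch of that marrying step, assuming the TEI marks page
breaks with <pb n="..."/> milestones and reusing the stream URL
pattern above; the local filename and the printed-page-to-scan-leaf
offset are assumptions that would need hand calibration:

# Sketch: pair TEI page-break milestones with archive.org page images.
# "commentary.xml" and LEAF_OFFSET are placeholders to calibrate by
# hand against the actual scan.
import xml.etree.ElementTree as ET

ITEM = "olympianpythiano00pinduoft"
LEAF_OFFSET = 0

for elem in ET.parse("commentary.xml").iter():
    if elem.tag.split("}")[-1] != "pb":   # tolerate a TEI namespace
        continue
    n = elem.get("n", "")
    if n.isdigit():
        leaf = int(n) + LEAF_OFFSET
        print("p. %s -> http://www.archive.org/stream/%s#page/%d/mode/2up"
              % (n, ITEM, leaf))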

--
John Vandenberg


Re: [ol-discuss] Open Library, Wikisource, and cleaning and translating OCR of Classics

metasj
Onion sourcing.  That would be a nice improvement on simple cite styles.

On Tue, Aug 11, 2009 at 12:10 PM, Gregory Crane<[hidden email]> wrote:

> There are various layers to this onion. The key element is that books and
> pages are artifacts in many cases. What we really want are the logical
> structures that splatter across pages.

And across and around works...

> First, we have added a bunch of content -- esp. editions of Greek and Latin
> sources -- to the Internet Archive holdings and we are cataloguing editions
> that are in the overall collection, regardless of who put them there. This goes
> well beyond the standard book catalogue records -- we are interested in the
> content not in books per se. Thus, we may add hundreds of records for a

Is there a way to deep link to a specific page-image from one of these
works without removing it from the Internet Archive?

> We would like to have useable etexts from all of these editions -- many of
> which are not yet in our collections. Many of these are in Greek and need a
> lot of work because the OCR is not very good.

So bad OCR for them exists, but no usable etexts?

> To use canonical texts, you need book/chapter/verse markup and you need
> FRBR-like citations ... deep annotations... syntactic analyses, word sense,
> co-reference...

These are nice features, but perhaps you can develop a clean etext
first, and overlay this metadata in parallel or later on.


> My question is what environments can support contributions at various
> levels. Clearly, proofreading OCR output is standard enough.
>
> If you want to get a sense of what operations need ultimately to be
> supported, you could skim
> http://digitalhumanities.org/dhq/vol/3/1/000035.html.

That's a good question.  What environments currently support OCR
proofreading and translation, and direct links to page-images of the
original source?  This is doable, with no special software or tools,
via wikisource (in multiple languages, with interlanguage links and
crude paragraph alignment) and commons (for page images).  The pages
could also be stored in other repositories such as the Archive, as
long as there is an easy way to link out to them or transclude
thumbnails.  [maybe an InstaCommons plugin for the Internet Archive?]

That's quite an interesting monograph you link to.  I see six main
sets of features/operations described there.  Each of them deserves a
mention in Wikimedia's strategic planning.  Aside from language
analysis, each has significant value for all of the Projects, not just
wikisource.


OCR TOOLS
 *  OCR optimization: statistical data, page layout hints
 *  Capturing page layout logical structures

CROSS REFERENCING
 *  Quote, source, and plagiarism identification.
 *  Named entity identification (automatic for some entities?  hints)
 *  Automatic linking (of urls, abbrv. citations, &c), markup projection

TEXT ALIGNMENT
 *  Canonical text services (chapter/verse equivalents)
 *  Version analysis between versions (see the sketch after this list).
 *  Translation alignment

TRANSLATION SUPPORT
 *  Automated translation (seed translations, hints for humans)
 *  Translation dictionaries (on mouseover?)

CROSS-LANGUAGE SEARCHING
 *  Cross-referencing across translations
 *  Quote identification across translations

LANGUAGE ANALYSIS
 *  Word analysis: word sense discovery, morphology.
 *  Sentence analysis: syntactic, metrical (poetry)
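
The simplest of these is easy to make concrete. A minimal sketch of
version analysis between two transcriptions, using only the Python
standard library (the sample lines are invented):

# Minimal sketch of "version analysis between versions": word-align two
# transcriptions of a passage and report where they differ.
# The sample strings are invented for illustration.
import difflib

a = "Sing goddess the wrath of Achilles son of Peleus".split()
b = "Sing O goddess the anger of Achilles son of Peleus".split()

for op, a1, a2, b1, b2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
    if op != "equal":
        print(op, a[a1:a2], "->", b[b1:b2])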



> Greg
>
> John Vandenberg wrote:
>>...


Re: [Wikisource-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Lars Aronsson
Samuel Klein wrote:

> I think we agree on what needs to happen.  The only thing I am
> not sure of is where you would like to see the work take place.

I'm not so sure we agree.  I think we're talking about two
different things.

This thread started out with a discussion of why it is so hard to
start new projects within the Wikimedia Foundation.  My stance is
that projects like OpenStreetMap.org and OpenLibrary.org are doing
fine as they are, and there is no need to duplicate their effort
within the WMF.  The example you gave was this:

> >> >> *A wiki for book metadata, with an entry for every published
> >> >> work, statistics about its use and siblings, and discussion
> >> >> about its usefulness as a citation (a collaboration with
> >> >> OpenLibrary, merging WikiCite ideas)

To me, that sounds exactly like what OpenLibrary already does (or
could be doing in the near term), so why even set up a new project
that would collaborate with it?  Later you added:

> >> I could see this happening on Wikisource.

That's when I asked why this couldn't be done inside OpenLibrary.  

I added:

> > (Plus you would have to motivate why a copy of OpenLibrary should
> > go into the English Wikisource and not the German or French one.)

You replied:

> I don't understand what you mean -- English source materials and
> metadata go on en:ws, German on de:ws, &c.  How is this different from
> what happens today?

I was talking about the metadata for all books ever published,
including the Swedish translations of Mark Twain's works, which
are part of Mark Twain's bibliography, of the translator's
bibliography, of American literature, and of Swedish language
literature.  In OpenLibrary all of these are contained in one
project.  In Wikisource, they are split in one section for English
and another section for Swedish.  That division makes sense for
the contents of the book, but not for the book metadata.

Now you write:

> However, the project I have in mind for OCR cleaning and
> translation needs to

That is a change of subject. That sounds just like what Wikisource
(or PGDP.net) is about.  OCR cleaning is one thing, but it is an
entirely different thing to set up "a wiki for book metadata, with
an entry for every published work".  So which of these two project
ideas are we talking about?

Every book ever published means more than 10 million records.  
(It probably means more than 100 million records.) OCR cleaning
attracts hundreds or a few thousand volunteers, which is
sufficient to take on thousands of books, but not millions.

Google scanned millions of books already, but I haven't heard of
any plans for cleaning all that OCR text.

> Let's take a practical example.  A classics professor I know
> (Greg Crane, copied here) has scans of primary source materials,
> some with approximate or hand-polished OCR, waiting to be
> uploaded and converted into a useful online resource for
> editors, translators, and classicists around the world.
>
> Where should he and his students post that material?

On Wikisource.  What's stopping them?



--
  Lars Aronsson ([hidden email])
  Aronsson Datateknik - http://aronsson.se


Re: [ol-discuss] [Wikisource-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

metasj
On Tue, Aug 11, 2009 at 9:16 PM, Lars Aronsson<[hidden email]> wrote:

>> Let's take a practical example.  A classics professor I know
>> (Greg Crane, copied here) has scans of primary source materials,
>> some with approximate or hand-polished OCR, waiting to be
>> uploaded and converted into a useful online resource for
>> editors, translators, and classicists around the world.
>>
>> Where should he and his students post that material?
>
> On Wikisource.  What's stopping them?

Greg: does Wikisource seem like the right place to post and revise OCR
to you?  If not, where?  If so, what's stopping you?



> I'm not so sure we agree.  I think we're talking about two
> different things.
>
> This thread started out with a discussion of why it is so hard to
> start new projects within the Wikimedia Foundation.  My stance is
> that projects like OpenStreetMap.org and OpenLibrary.org are doing
> fine as they are, and there is no need to duplicate their effort
> within the WMF.  The example you gave was this:

I agree that there's no point in duplicating existing functionality.
The best solution is probably for OL to include this explicitly in
their scope and add the necessary functionality.   I suggested this on
the OL mailing list in March.
   http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html

>> >> >> *A wiki for book metadata, with an entry for every published
>> >> >> work, statistics about its use and siblings, and discussion
>> >> >> about its usefulness as a citation (a collaboration with
>> >> >> OpenLibrary, merging WikiCite ideas)
>
> To me, that sounds exactly like what OpenLibrary already does (or
> could be doing in the near term), so why even set up a new project
> that would collaborate with it?  Later you added:

However, this is not what OL or its wiki do now.  And OL is not run by
its community; the community helps support the work of a centrally
directed group.  So there is only so much I feel I can contribute to
the project by making suggestions.  The wiki built into the fabric of
OL is intentionally not used for general discussion.



> I was talking about the metadata for all books ever published,
> including the Swedish translations of Mark Twain's works, which
> are part of Mark Twain's bibliography, of the translator's
> bibliography, of American literature, and of Swedish language
> literature.  In OpenLibrary all of these are contained in one
> project.  In Wikisource, they are split in one section for English
> and another section for Swedish.  That division makes sense for
> the contents of the book, but not for the book metadata.

This is a problem that Wikisource needs to address, regardless of
where the OpenLibrary metadata goes.  It is similar to the Wiktionary
problem of wanting some content - the array of translations of a
single definition - to exist in one place and be transcluded in each
language.


> Now you write:
>
>> However, the project I have in mind for OCR cleaning and
>> translation needs to
>
> That is a change of subject. That sounds just like what Wikisource
> (or PGDP.net) is about.  OCR cleaning is one thing, but it is an
> entirely different thing to set up "a wiki for book metadata, with
> an entry for every published work".  So which of these two project
> ideas are we talking about?

They are closely related.

There needs to be a global authority file for works -- a [set of]
universal identifier[s] for a given work in order for wikisource (as
it currently stands) to link the German translation of the English
transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex...
to its metadata entry [or entries].

I would prefer for this authority file to be wiki-like, as the
Wikipedia authority file is, so that it supports renames, merges, and
splits with version history and minimal overhead; hence I wish to see
a wiki for this sort of metadata.

Currently OL does not quite provide this authority file, but it could.
I do not know how easily.
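
A sketch of the wiki-like behaviour meant here -- purely illustrative,
not OL's actual data model: superseded identifiers keep resolving
because renames and merges leave redirects behind, and every operation
is logged:

# Illustrative sketch: an authority file whose IDs survive renames and
# merges, because the superseded ID becomes a redirect and every
# change is kept in an append-only log.  Not OL's actual data model.
authority = {}   # work_id -> record dict, or ("redirect", target_id)
history = []     # append-only change log

def resolve(work_id):
    """Follow redirect chains left behind by renames and merges."""
    entry = authority[work_id]
    while isinstance(entry, tuple) and entry[0] == "redirect":
        work_id = entry[1]
        entry = authority[work_id]
    return work_id, entry

def merge(loser_id, winner_id):
    """Merge two records; the losing ID keeps resolving via redirect."""
    history.append(("merge", loser_id, winner_id))
    authority[loser_id] = ("redirect", winner_id)

authority["work:a"] = {"title": "Rotterdam Codex (1572)"}
authority["work:b"] = {"title": "Codex, Rotterdam"}   # duplicate entry
merge("work:b", "work:a")
assert resolve("work:b")[0] == "work:a"   # the old ID still resolves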


> Every book ever published means more than 10 million records.
> (It probably means more than 100 million records.) OCR cleaning
> attracts hundreds or a few thousand volunteers, which is
> sufficient to take on thousands of books, but not millions.

Focusing efforts on notable works with verifiable OCR, and using the
sorts of helper tools that Greg's paper describes, I do not doubt that
we could effectively clean and publish OCR for all primary sources
that are actively used and referenced in scholarship today (and more
besides).  Though 'we' here is the world - certainly more than a few
thousand volunteers have at least one book they would like to polish.
Most of them are not currently Wikimedia contributors, that much is
certain -- we don't provide any tools to make this work convenient or
rewarding.


> Google scanned millions of books already, but I haven't heard of
> any plans for cleaning all that OCR text.

Well, Google does not believe in distributed human effort.  (This came
up in a recent Knol thread as well.)  I'm not sure that is the best
comparison.


SJ


Open Library, Wikisource, and cleaning and translating OCR of Classics

Yann Forget
Hello,

This discussion is very interesting. I would like to make a summary, so
that we can go further.

1. A database of all books ever published is one of the things still missing.
2. This needs massive collaboration by thousands of volunteers, so a
wiki might be appropriate, however...
3. The data needs a structured web site, not a plain wiki like MediaWiki.
4. A big part of this data is already available, but scattered across
various databases, in various languages, with various protocols, etc. So
a big part of the work needs as much database-management knowledge as
librarian knowledge.
5. What is most missing in these existing databases (IMO) is information
about translations: nowhere is there a general database of translated
works, at least not in English or French. It is very difficult to find
out whether a translation exists for a given work. Wikisource has some
of this information in interwiki links between work and author pages,
but only for a (very) small number of works and authors (see the sketch
after this list).
6. It would be best not to duplicate work in several places.
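
For point 5, one starting point already exists: the standard MediaWiki
API exposes those interwiki links as "langlinks". A sketch -- the page
title is only an example, and error handling is omitted:

# Sketch: list which languages hold a linked version of a work, via the
# MediaWiki API (action=query, prop=langlinks).  The page title is only
# an example; error handling is omitted.
import json
import urllib.parse
import urllib.request

def translations_of(title, site="en.wikisource.org"):
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    })
    req = urllib.request.Request(
        "https://%s/w/api.php?%s" % (site, params),
        headers={"User-Agent": "translation-survey-sketch/0.1"},  # courtesy UA
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for page in data["query"]["pages"].values():
        return [(ll["lang"], ll["*"]) for ll in page.get("langlinks", [])]
    return []

print(translations_of("The Raven"))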

Personally I don't find OL very practical. Maybe I am too used to
MediaWiki. ;oD

We still need to create something attractive to contributors and
readers alike.

Yann

Samuel Klein wrote:

>...

--
http://www.non-violence.org/ | Collaborative site on non-violence
http://www.forget-me.net/ | Alternatives on the Net
http://fr.wikisource.org/ | The free library
http://wikilivres.info | Free documents


Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

David Goodman
Yann & Sam

The problem is extraordinarily complex. A database of all "books"
(and other media) ever published is beyond the joint capabilities of
everyone interested. There are intermediate entities between "books"
and "works", and important subordinate entities, such as "article",
"chapter", and those like "poem" which could be at any of several
levels.  This is not a job for amateurs, unless they are prepared to
first learn the actual standards of bibliographic description for
different types of material, and to at least recognize the
inter-relationships and the many undefined areas. At research
libraries, one allows a few years of training for a newcomer with just
an MLS degree to work with a small subset of this. I have thirty years
of experience in related areas of librarianship, and I know only
enough to be aware of the problems.
For an introduction to the current state of this, see
http://www.rdaonline.org/constituencyreview/Phase1Chp17_11_2_08.pdf.

The difficulty of merging the many thousands of partially correct and
incorrect sources of available data typically requires the manual
resolution of each of the tens of millions of instances.

OL rather than Wikimedia has the advantage that more of the people
there understand the problems.

David Goodman, Ph.D, M.L.S.
http://en.wikipedia.org/wiki/User_talk:DGG



On Wed, Aug 12, 2009 at 1:15 PM, Yann Forget<[hidden email]> wrote:

>...


Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

metasj
DGG, I appreciate your points.  Would we be so motivated by this
thread if it weren't a complex problem?

The fact that all of this is quite new, and that there are so many
unknowns and gray areas, actually makes me consider it more likely
that a body of wikimedians, experienced with their own form of
large-scale authority-file coordination, is in a position to say
something meaningful about how to achieve something similar for tens
of millions of metadata records.

> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.

In some areas that is certainly so.  In others, Wikimedia communities
have useful recent experience.  I hope that those who understand these
problems on both sides recognize the importance of sharing what they
know openly -- and showing others how to understand them as well.  We
will not succeed as a global community if we say that this class of
problems can only be solved by the limited group of people with an MLS
and a few years of focused training.  (How would you name the sort of
training you mean here, btw?)

SJ


On Thu, Aug 13, 2009 at 12:57 AM, David Goodman<[hidden email]> wrote:

> Yann & Sam
>
> The problem is extraordinarily   complex. A database of all "books"
> (and other media) ever published is beyond the joint  capabilities of
> everyone interested. There are intermediate entities between "books"
> and "works", and important subordinate entities, such as "article" ,
> "chapter" , and those like "poem" which could be at any of several
> levels.  This is not a job for amateurs, unless they are prepared to
> first learn the actual standards of bibliographic description for
> different types of material, and to at least recognize the
> inter-relationships, and the many undefined areas. At research
> libraries, one allows a few years of training for a newcomer with just
> a MLS degree to work with a small subset of this. I have thirty years
> of experience in related areas of librarianship, and I know only
> enough to be aware of the problems.
> For an introduction to the current state of this, see
> http://www.rdaonline.org/constituencyreview/Phase1Chp17_11_2_08.pdf.
>
> The difficulty of merging the many thousands of partial correct and
> incorrect sources of available data typically requires the manual
> resolution of each of the tens of millions of instances.
>
> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.
>
> David Goodman, Ph.D, M.L.S.
> http://en.wikipedia.org/wiki/User_talk:DGG
>
>
>
> On Wed, Aug 12, 2009 at 1:15 PM, c<[hidden email]> wrote:
>> Hello,
>>
>> This discussion is very interesting. I would like to make a summary, so
>> that we can go further.
>>
>> 1. A database of all books ever published is one of the thing still missing.
>> 2. This needs massive collaboration by thousands of volunteers, so a
>> wiki might be appropriate, however...
>> 3. The data needs a structured web site, not a plain wiki like Mediawiki.
>> 4. A big part of this data is already available, but scattered on
>> various databases, in various languages, with various protocols, etc. So
>> a big part of work needs as much database management knowledge as
>> librarian knowledge.
>> 5. What most missing in these existing databases (IMO) is information
>> about translations: nowhere there are a general database of translated
>> works, at least not in English and French. It is very difficult to find
>> if a translation exists for a given work. Wikisource has some of this
>> information with interwiki links between work and author pages, but for
>> a (very) small number of works and authors.
>> 6. It would be best not to duplicate work on several places.
>>
>> Personally I don't find OL very practical. May be I am too much used too
>> Mediawiki. ;oD
>>
>> We still need to create something, attractive to contributors and
>> readers alike.
>>
>> Yann
>>
>> Samuel Klein wrote:
>>>> This thread started out with a discussion of why it is so hard to
>>>> start new projects within the Wikimedia Foundation.  My stance is
>>>> that projects like OpenStreetMap.org and OpenLibrary.org are doing
>>>> fine as they are, and there is no need to duplicate their effort
>>>> within the WMF.  The example you gave was this:
>>>
>>> I agree that there's no point in duplicating existing functionality.
>>> The best solution is probably for OL to include this explicitly in
>>> their scope and add the necessary functionality.   I suggested this on
>>> the OL mailing list in March.
>>>    http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html
>>>
>>>>>>>>> *A wiki for book metadata, with an entry for every published
>>>>>>>>> work, statistics about its use and siblings, and discussion
>>>>>>>>> about its usefulness as a citation (a collaboration with
>>>>>>>>> OpenLibrary, merging WikiCite ideas)
>>>> To me, that sounds exactly as what OpenLibrary already does (or
>>>> could be doing in the near time), so why even set up a new project
>>>> that would collaborate with it?  Later you added:
>>>
>>> However, this is not what OL or its wiki do now.  And OL is not run by
>>> its community, the community helps support the work of a centrally
>>> directed group.  So there is only so much I feel I can contribute to
>>> the project by making suggestions.  The wiki built into the fiber of
>>> OL is intentionally not used for general discussion.
>>>
>>>> I was talking about the metadata for all books ever published,
>>>> including the Swedish translations of Mark Twain's works, which
>>>> are part of Mark Twain's bibliography, of the translator's
>>>> bibliography, of American literature, and of Swedish language
>>>> literature.  In OpenLibrary all of these are contained in one
>>>> project.  In Wikisource, they are split in one section for English
>>>> and another section for Swedish.  That division makes sense for
>>>> the contents of the book, but not for the book metadata.
>>>
>>> This is a problem that Wikisource needs to address, regardless of
>>> where the OpenLibrary metadata goes.  It is similar to the Wiktionary
>>> problem of wanting some content - the array of translations of a
>>> single definition - to exist in one place and be transcluded in each
>>> language.
>>>
>>>> Now you write:
>>>>
>>>>> However, the project I have in mind for OCR cleaning and
>>>>> translation needs to
>>>> That is a change of subject. That sounds just like what Wikisource
>>>> (or PGDP.net) is about.  OCR cleaning is one thing, but it is an
>>>> entirely different thing to set up "a wiki for book metadata, with
>>>> an entry for every published work".  So which of these two project
>>>> ideas are we talking about?
>>>
>>> They are closely related.
>>>
>>> There needs to be a global authority file for works -- a [set of]
>>> universal identifier[s] for a given work in order for wikisource (as
>>> it currently stands) to link the German translation of the English
>>> transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex...
>>> to its metadata entry [or entries].
>>>
>>> I would prefer for this authority file to be wiki-like, as the
>>> Wikipedia authority file is, so that it supports renames, merges, and
>>> splits with version history and minimal overhead; hence I wish to see
>>> a wiki for this sort of metadata.
>>>
>>> Currently OL does not quite provide this authority file, but it could.
>>>  I do not know how easily.
>>>
>>>> Every book ever published means more than 10 million records.
>>>> (It probably means more than 100 million records.) OCR cleaning
>>>> attracts hundreds or a few thousand volunteers, which is
>>>> sufficient to take on thousands of books, but not millions.
>>>
>>> Focusing efforts on notable works with verifiable OCR, and using the
>>> sorts of helper tools that Greg's paper describes, I do not doubt that
>>> we could effectively clean and publish OCR for all primary sources
>>> that are actively used and referenced in scholarship today (and more
>>> besides).  Though 'we' here is the world - certainly more than a few
>>> thousand volunteers have at least one book they would like to polish.
>>> Most of them are not currently Wikimedia contributors, that much is
>>> certain -- we don't provide any tools to make this work convenient or
>>> rewarding.
>>>
>>>> Google scanned millions of books already, but I haven't heard of
>>>> any plans for cleaning all that OCR text.
>>>
>>> Well, Google does not believe in distributed human effort.  (This came
>>> up in a recent Knol thread as well.)  I'm not sure that is the best
>>> comparison.
>>>
>>> SJ
>>
>> --
>> http://www.non-violence.org/ | Site collaboratif sur la non-violence
>> http://www.forget-me.net/ | Alternatives sur le Net
>> http://fr.wikisource.org/ | Bibliothèque libre
>> http://wikilivres.info | Documents libres
>>
>> _______________________________________________
>> foundation-l mailing list
>> [hidden email]
>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
>>
>
> _______________________________________________
> foundation-l mailing list
> [hidden email]
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
>


Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

David Goodman
The training is typically an apprenticeship under the senior
cataloging librarians.

David Goodman, Ph.D, M.L.S.
http://en.wikipedia.org/wiki/User_talk:DGG



On Thu, Aug 13, 2009 at 1:48 AM, Samuel Klein<[hidden email]> wrote:

>...

_______________________________________________
foundation-l mailing list
[hidden email]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Reply | Threaded
Open this post in threaded view
|

Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

Pavlo Shevelo
> The training is typically an apprenticeship under the senior...

To my regret, training and apprenticeship do not fit the "everyone
can..." and "be bold!" set of Wikimedia slogans and mottos.
As for me, I would stand behind (vote for) training and apprenticeship.


On Sat, Aug 15, 2009 at 12:23 AM, David Goodman<[hidden email]> wrote:

> The training is typically an apprenticeship under the senior
> cataloging librarians.
>
> David Goodman, Ph.D, M.L.S.
> http://en.wikipedia.org/wiki/User_talk:DGG
>
>
>
> On Thu, Aug 13, 2009 at 1:48 AM, Samuel Klein<[hidden email]> wrote:
>> DGG, I appreciate your points.  Would we be so motivated by this
>> thread if it weren't a complex problem?
>>
>> The fact that all of this is quite new, and that there are so many
>> unknowns and gray areas, actually makes me consider it more likely
>> that a body of Wikimedians, experienced with their own form of
>> large-scale authority file coordination, is in a position to say
>> something meaningful about how to achieve something similar for tens
>> of millions of metadata records.
>>
>>> OL rather than Wikimedia has the advantage that more of the people
>>> there understand the problems.
>>
>> In some areas that is certainly so.  In others, Wikimedia communities
>> have useful recent experience.  I hope that those who understand these
>> problems on both sides recognize the importance of sharing what they
>> know openly -- and showing others how to understand them as well.  We
>> will not succeed as a global community if we say that this class of
>> problems can only be solved by the limited group of people with an MLS
>> and a few years of focused training.  (How would you name the sort of
>> training you mean here, btw?)
>>
>> SJ
>>
>>


Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

David Goodman
Exactly. That is why Wikipedia is an inappropriate place for this
project. It lacks sufficient stability. I think Wikipedia should go on
being what it is, an almost completely open place, and projects which
need disciplined long-term expertise should be organized separately.
Wikipedia is a wonderful place to do many things, but not all.

David Goodman, Ph.D, M.L.S.
http://en.wikipedia.org/wiki/User_talk:DGG



On Fri, Aug 14, 2009 at 7:42 PM, Pavlo Shevelo<[hidden email]> wrote:

>> The training is typically an apprenticeship under the senior...
>
> To my regret, training and apprenticeship do not fit the "everyone
> can..." and "be bold!" set of Wikimedia slogans and mottos.
> As for me, I would stand behind (vote for) training and apprenticeship.
>
>
> On Sat, Aug 15, 2009 at 12:23 AM, David Goodman<[hidden email]> wrote:
>> The training is typically an apprenticeship under the senior
>> cataloging librarians.
>>
>> David Goodman, Ph.D, M.L.S.
>> http://en.wikipedia.org/wiki/User_talk:DGG
>>
>>
>>
>> On Thu, Aug 13, 2009 at 1:48 AM, Samuel Klein<[hidden email]> wrote:
>>> DGG, I appreciate your points.  Would we be so motivated by this
>>> thread if it weren't a complex problem?
>>>
>>> The fact that all of this is quite new, and that there are so many
>>> unknowns and gray areas, actually makes me consider it more likely
>>> that a body of Wikimedians, experienced with their own form of
>>> large-scale authority file coordination, is in a position to say
>>> something meaningful about how to achieve something similar for tens
>>> of millions of metadata records.
>>>
>>>> OL rather than Wikimedia has the advantage that more of the people
>>>> there understand the problems.
>>>
>>> In some areas that is certainly so.  In others, Wikimedia communities
>>> have useful recent experience.  I hope that those who understand these
>>> problems on both sides recognize the importance of sharing what they
>>> know openly -- and showing others how to understand them as well.  We
>>> will not succeed as a global community if we say that this class of
>>> problems can only be solved by the limited group of people with an MLS
>>> and a few years of focused training.  (How would you name the sort of
>>> training you mean here, btw?)
>>>
>>> SJ
>>>
>>>

Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

John Mark Vandenberg
(top-posting unravelled)
On Sat, Aug 15, 2009 at 1:12 PM, David Goodman<[hidden email]> wrote:

> On Sat, Aug 15, 2009 at 9:42 AM, Pavlo Shevelo<[hidden email]> wrote:
>> On Sat, Aug 15, 2009 at 12:23 AM, David Goodman<[hidden email]> wrote:
>>> The training is typically an apprenticeship under the senior
>>> cataloging librarians.
>>
>> To my regret, training and apprenticeship do not fit the "everyone
>> can..." and "be bold!" set of Wikimedia slogans and mottos.
>> As for me, I would stand behind (vote for) training and apprenticeship.
>
> Exactly. That is why Wikipedia is an inappropriate place for this
> project. It lacks sufficient stability. I think Wikipedia should go on
> being what it is, an almost completely open place, and projects which
> need disciplined long-term expertise should be organized separately.
> Wikipedia is a wonderful place to do many things, but not all.

The good news is that the broader Wikimedia community is not all like
English Wikipedia, where "be bold" is often interpreted as demanding
that the worst of anarchy be present in every situation. ;-)

Commons and Wikisource are able to build a sensible metadata layer
around their collections using plain wiki text.  We also have a project
designed to add structure to this metadata.

http://meta.wikimedia.org/wiki/Wikicat

Either way, Wikisource and Commons will likely figure out a way to
have Dublin Core and MODS records for their collections in the next few
years.
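
To make that concrete, here is a minimal sketch of emitting a Dublin
Core record from plain metadata fields.  The example values are
invented, and I am not claiming this is how Wikicat will do it:

  import xml.etree.ElementTree as ET

  DC_NS = "http://purl.org/dc/elements/1.1/"
  ET.register_namespace("dc", DC_NS)

  def dublin_core_record(fields):
      """Serialize a dict of metadata fields as a Dublin Core XML record."""
      root = ET.Element("record")
      for name, value in fields.items():
          ET.SubElement(root, "{%s}%s" % (DC_NS, name)).text = value
      return ET.tostring(root, encoding="unicode")

  # Invented example values, just to show the shape of the record.
  print(dublin_core_record({
      "title": "De Bello Gallico",
      "creator": "Gaius Iulius Caesar",
      "language": "la",
      "type": "Text",
  }))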

--
John Vandenberg


Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

David Goodman
Yes, I think they are quite capable of doing it, and should take the
primary responsibility.  What I think they are not capable of is
extending it to every published book in the world.


David Goodman, Ph.D, M.L.S.
http://en.wikipedia.org/wiki/User_talk:DGG



On Sat, Aug 15, 2009 at 9:21 PM, John Vandenberg<[hidden email]> wrote:

> (top-posting unravelled)
> On Sat, Aug 15, 2009 at 1:12 PM, David Goodman<[hidden email]> wrote:
>> On Sat, Aug 15, 2009 at 9:42 AM, Pavlo Shevelo<[hidden email]> wrote:
>>> On Sat, Aug 15, 2009 at 12:23 AM, David Goodman<[hidden email]> wrote:
>>>> The training is typically an apprenticeship under the senior
>>>> cataloging librarians.
>>>
>>> To my regret, training and apprenticeship do not fit the "everyone
>>> can..." and "be bold!" set of Wikimedia slogans and mottos.
>>> As for me, I would stand behind (vote for) training and apprenticeship.
>>
>> Exactly. That is why Wikipedia is an inappropriate place for this
>> project. It lacks sufficient stability. I think Wikipedia should go on
>> being what it is, an almost completely open place, and projects which
>> need disciplined long-term expertise should be organized separately.
>> Wikipedia is a wonderful place to do many things, but not all.
>
> The good news is that the broader Wikimedia community is not all like
> English Wikipedia, where "be bold" is often interpreted as demanding
> that the worst of anarchy be present in every situation. ;-)
>
> Commons and Wikisource are able to build a sensible metadata layer
> around their collections using plain wiki text.  We also have a project
> designed to add structure to this metadata.
>
> http://meta.wikimedia.org/wiki/Wikicat
>
> Either way, Wikisource and Commons will likely figure out a way to
> have Dublin Core and MODS records for their collections in the next few
> years.
>
> --
> John Vandenberg

Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

Lars Aronsson
In reply to this post by Yann Forget-2
Yann Forget wrote:

> This discussion is very interesting. I would like to make a summary, so
> that we can go further.
>
> 1. A database of all books ever published is one of the things
>    still missing.

No, no, no, this is *not* missing. This is exactly the scope of
OpenLibrary. Just as Wikipedia is not yet a complete encyclopedia,
or OpenStreetMap is not yet a complete map of the world, some
books are still missing from OpenLibrary's database, but it is a
project aiming to compile a database of every book ever published.

> Personally I don't find OL very practical. Maybe I am too
> used to MediaWiki. ;oD

And therefore, you would not try to improve OpenLibrary, but
rather start an entirely new project based on MediaWiki?  I'm
afraid that this ("not invented here") is a common sentiment, and
a major reason that we will get nowhere.


--
  Lars Aronsson ([hidden email])
  Aronsson Datateknik - http://aronsson.se


Re: Open Library, Wikisource, and cleaning and translating OCR of Classics

Ray Saintonge
In reply to this post by David Goodman
David Goodman wrote:
> The problem is extraordinarily complex. A database of all "books"
> (and other media) ever published is beyond the joint capabilities of
> everyone interested. There are intermediate entities between "books"
> and "works", and important subordinate entities, such as "article",
> "chapter", and those like "poem" which could be at any of several
> levels.

I've already been in raging arguments at Wikisource about the meaning of
"work".  The general tendency there has been to treat "work" as
equivalent to a book or set of related books.  This is highly
problematic for periodicals, encyclopedias and dictionaries.

I do agree that the problem is complex, but there is resistance on
the part of many to accepting standards that have been developed over a
long period of time. Before the Category: namespace was made a part of
Wikipedia there was considerable antipathy to adopting any kind of
established category system.  Muddling through from square one was the
preferred option.

> This is not a job for amateurs, unless they are prepared to
> first learn the actual standards of bibliographic description for
> different types of material, and to at least recognize the
> inter-relationships, and the many undefined areas. At research
> libraries, one allows a few years of training for a newcomer with just
> an MLS degree to work with a small subset of this. I have thirty years
> of experience in related areas of librarianship, and I know only
> enough to be aware of the problems.
>  

This does not bode well!  The big factor in Wiki participation and
success is amateur involvement and crowdsourcing.  What are the PhDs
doing to bridge the gap?  What efforts are being made to at least bring
the most significant points to the level of the general contributor?
Saying that it takes several years to bring an MLS up to speed is not
good enough.  Knowledge needs to be brought to the level where it is
most useful.  When I went to school, typing was not introduced as a
subject until the 10th grade; my son learned keyboarding in the first grade.

Our wiki projects also have a superfluity of people with an IT
background who do not do a very good job of bringing information to
where it belongs, and end up creating a mind-boggling assortment of
templates of questionable value.  In theory they are trying to bring
standardization and simplicity to the projects, but just as often
produce a simplistic and premature narrowing of the way knowledge is
organized.



> The difficulty of merging the many thousands of partially correct and
> incorrect sources of available data typically requires the manual
> resolution of each of the tens of millions of instances.
>  

Yes, of course.  There is no magic software that will do it all.  Humans
need to retain the right to decide the limits of technology.
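
As a crude illustration of that division of labour (invented records,
and not anyone's actual matching algorithm), software can rank likely
duplicates while a person makes every merge decision:

  from difflib import SequenceMatcher

  def normalize(s):
      """Lowercase, drop punctuation, sort words: 'Iliad, The' ~ 'The Iliad'."""
      s = "".join(c for c in s.lower() if c.isalnum() or c.isspace())
      return " ".join(sorted(s.split()))

  def merge_candidates(records, threshold=0.85):
      """Yield record pairs similar enough to deserve a human look.

      Nothing is merged automatically; the output is a review queue.
      """
      for i, a in enumerate(records):
          for b in records[i + 1:]:
              score = SequenceMatcher(
                  None,
                  normalize(a["title"] + " " + a["author"]),
                  normalize(b["title"] + " " + b["author"])).ratio()
              if score >= threshold:
                  yield score, a, b

  catalog = [  # invented entries, for illustration only
      {"title": "The Iliad", "author": "Homer"},
      {"title": "Iliad, The", "author": "HOMER"},
      {"title": "Moby-Dick", "author": "Herman Melville"},
  ]
  for score, a, b in merge_candidates(catalog):
      print("%.2f: %r vs %r" % (score, a["title"], b["title"]))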

> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.

The librarians have their work cut out for them.  They can help to build
a system for the future, or they can let everyone muddle their way into
a fuck-up.

Ec


Re: [Wikisource-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Yann Forget-2
In reply to this post by Lars Aronsson
Hello,

Lars Aronsson wrote:

> Yann Forget wrote:
>
>> This discussion is very interesting. I would like to make a summary, so
>> that we can go further.
>>
>> 1. A database of all books ever published is one of the things
>> still missing.
>
> No, no, no, this is *not* missing. This is exactly the scope of
> OpenLibrary. Just as Wikipedia is not yet a complete encyclopedia,
> or OpenStreetMap is not yet a complete map of the world, some
> books are still missing from OpenLibrary's database, but it is a
> project aiming to compile a database of every book ever published.

At least Wikipedia can say that it has the most complete encyclopedia,
and OpenStreetMap the most complete free maps that ever existed. AFAIK
OpenLibrary is very far from having anything comprehensive, though I am
curious to see the figures. As I already said, the first steps would be
to import existing databases, and Wikimedians are very good at this job.

>> Personally I don't find OL very practical. Maybe I am too
>> used to MediaWiki. ;oD
>
> And therefore, you would not try to improve OpenLibrary, but
> rather start an entirely new project based on MediaWiki?  I'm
> afraid that this ("not invented here") is a common sentiment, and
> a major reason that we will get nowhere.

You are wrong here. I was delighted to see a project like OL, and I
inserted a few books and authors, but I have not been convinced. On
books and authors, Wikimedia projects already have much more data than
OL, and a lot of basic functionalities are not available: tagging two
entries as identical (redirects), multilingualism, links between related
entries (interwiki), etc.
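
To show how basic the first of these is (a toy sketch which says
nothing about OL's internals, with invented entry names), tagging two
entries as identical is a few lines of code; the hard part is the
community doing the tagging:

  # Toy union-find: tag catalog entries as identical and look up
  # the canonical one.
  parent = {}

  def find(entry):
      """Return the canonical form of an entry, compressing paths."""
      parent.setdefault(entry, entry)
      if parent[entry] != entry:
          parent[entry] = find(parent[entry])
      return parent[entry]

  def tag_identical(a, b):
      """Declare two entries the same (in effect, a redirect)."""
      parent[find(a)] = find(b)

  tag_identical("Lev Tolstoy", "Leo Tolstoy")
  tag_identical("Tolstoï, Léon", "Lev Tolstoy")
  print(find("Tolstoï, Léon"))   # -> Leo Tolstoy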

I don't really care who would host this "Universal Library", as long as
it is freely available with a powerful search engine, and no restriction
on reuse. What I say is that MediaWiki is really much better than
anything else for any massive online cooperative work. The most
important point for such a project is building a community. OpenLibrary
has certainly done a good job, but I don't see _a community_. The tools
and the social environment available on Wikimedia projects are missing.
I believe the social environment is a consequence both of the software
and the leadership. Once the community exists it may be self-sustaining
if other conditions are met. OL lacks software as good as MediaWiki and a
leader like Jimbo.

Yann
--
http://www.non-violence.org/ | Collaborative site on non-violence
http://www.forget-me.net/ | Alternatives on the Net
http://fr.wikisource.org/ | Free library
http://wikilivres.info | Free documents


Re: [Wikisource-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Lars Aronsson
Yann Forget wrote:

> As I already said, the first steps would be to import existing
> databases, and Wikimedians are very good at this job.

Do you have a bibliographic database (library catalog) of French
literature that you can upload?  How many records?  Convincing
libraries to donate copies of their catalogs has been a bottleneck
for OpenLibrary.
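
The mechanical side of such an upload is the easy part.  A rough
sketch, with made-up column names since every catalog's schema
differs, of normalizing a donated dump:

  import csv, io, json

  # Map a donated catalog dump onto uniform field names.  The source
  # columns are invented; real catalogs all differ, which is why the
  # donation, not the conversion code, is the bottleneck.
  dump = io.StringIO(
      "titre;auteur;annee\n"
      "Les Misérables;Victor Hugo;1862\n"
      "Germinal;Émile Zola;1885\n"
  )
  FIELD_MAP = {"titre": "title", "auteur": "author", "annee": "year"}

  records = [{FIELD_MAP[col]: val.strip() for col, val in row.items()}
             for row in csv.DictReader(dump, delimiter=";")]

  print(json.dumps(records, ensure_ascii=False, indent=2))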


--
  Lars Aronsson ([hidden email])
  Aronsson Datateknik - http://aronsson.se
