English language dominationism is striking again


Wikisource and reCAPTCHA

Michael Peel
(Renaming the subject as we've changed topic)

On 23 Jun 2010, at 21:31, Mariano Cecowski wrote:

> --- El mié 23-jun-10, Michael Peel <[hidden email]> escribió:
>
>> I always think that not using reCaptcha is a shame, as it's
>> a nice way to get people to proofread text in a reasonably
>> efficient way. It would be really nice if someone could
>> create something similar that proofreads OCR'd text from
>> Wikisource... <hint, hint>.
>
> And how do you decide that what was entered is wrong or right?
>
> Better take a look at Project Gutenberg's Distributed Proofreaders[1].
>
> Cheers,
> MarianoC.-
>
> [1] http://pgdp.net

My understanding is that original text within the reCAPTCHA is shown to several different people; if they agree then the word is counted as correct. Looking at the Wikipedia article, it's a little more complex than that:
http://en.wikipedia.org/wiki/ReCAPTCHA
There's a reason why there are two words to solve during a reCAPTCHA.

What Distributed Proofreaders can do, Wikisource can do - but in a Wiki environment. If you haven't checked out the proofreading features that Wikisource now has, I would encourage you to give them a go, e.g. at:
http://en.wikisource.org/wiki/Page:Frederic_Shoberl_-_Persia.djvu/92

Mike
_______________________________________________
foundation-l mailing list
[hidden email]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Re: English language dominationism is striking again

David Gerard-2
In reply to this post by Mariano Cecowski
On 23 June 2010 21:31, Mariano Cecowski <[hidden email]> wrote:
> --- El mié 23-jun-10, Michael Peel <[hidden email]> escribió:

>> I always think that not using reCaptcha is a shame, as it's
>> a nice way to get people to proofread text in a reasonably
>> efficient way. It would be really nice if someone could
>> create something similar that proofreads OCR'd text from
>> Wikisource... <hint, hint>.

> And how do you decide that what was entered is wrong or right?


It turns out that having several randomly-selected people check a
given recaptcha is very accurate indeed.

http://recaptcha.net/learnmore.html

"But if a computer can't read such a CAPTCHA, how does the system know
the correct answer to the puzzle? Here's how: Each new word that
cannot be read correctly by OCR is given to a user in conjunction with
another word for which the answer is already known. The user is then
asked to read both words. If they solve the one for which the answer
is known, the system assumes their answer is correct for the new one.
The system then gives the new image to a number of other people to
determine, with higher confidence, whether the original answer was
correct."
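The scheme quoted above can be sketched in a few lines. (All names and the vote threshold here are my own assumptions for illustration; this is not reCAPTCHA's actual code.)

```python
from collections import defaultdict, Counter

class RecaptchaSketch:
    """Toy model of the quoted scheme: each challenge pairs a control
    word (answer known) with an OCR-failed word (answer unknown)."""

    def __init__(self, min_votes=3):
        self.min_votes = min_votes            # assumed agreement threshold
        self.votes = defaultdict(Counter)     # unknown word -> answer tallies
        self.solved = {}                      # unknown word -> accepted answer

    def submit(self, control_answer, control_expected, unknown_id, unknown_answer):
        # A user passes only by solving the control word.
        if control_answer.strip().lower() != control_expected.strip().lower():
            return False
        # Their reading of the unknown word is recorded as one vote;
        # once enough votes agree, the word counts as correctly read.
        self.votes[unknown_id][unknown_answer.strip().lower()] += 1
        answer, count = self.votes[unknown_id].most_common(1)[0]
        if count >= self.min_votes:
            self.solved[unknown_id] = answer
        return True
```

In this toy model, three agreeing votes promote a word to "solved"; the real system uses more elaborate confidence weighting, as the quoted text notes.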

Your question is similar to "But if anyone can edit Wikipedia, how do
you know what's entered will be accurate?"


- d.


Re: Wikisource and reCAPTCHA

metasj
In reply to this post by Michael Peel
I love those proofreading features, and the new default layout for a
book's pages and TOC.  Wikisource is becoming AWESOME.

Do we have PGDP contributors who can weigh in on how similar the
processes are?  Is there a way for us to actually merge workflows with
them?

Prof. Greg Crane of The Perseus Project @ Tufts is looking to upload a
few score classical manuscripts, and perhaps eventually their whole
corpus, into Wikisource -- but they want better multilingual
proofreading and annotation tools (which they are also considering
developing; hear, hear!). All of this work needs a bit more
visibility.

SJ

On Wed, Jun 23, 2010 at 4:39 PM, Michael Peel <[hidden email]> wrote:

> (Renaming the subject as we've changed topic)
>
> [earlier reCAPTCHA discussion snipped]
>
> What Distributed Proofreaders can do, Wikisource can do - but in a Wiki environment. If you haven't checked out the proofreading features that Wikisource now has, I would encourage you to give them a go, e.g. at:
> http://en.wikisource.org/wiki/Page:Frederic_Shoberl_-_Persia.djvu/92
>
> Mike



--
Samuel Klein          identi.ca:sj           w:user:sj


Re: Wikisource and reCAPTCHA

James Forrester-5
On 24 June 2010 15:37, Samuel Klein <[hidden email]> wrote:
> I love those proofreading features, and the new default layout for a
> book's pages and TOC.  Wikisource is becoming AWESOME.

Ahem. Even more awesome, you mean. :-)

> Do we have PGDP contributors who can weigh in on how similar the
> processes are?  Is there a way for us to actually merge workflows with
> them?

Disclaimer - my PGDP account dates from 2004, but I only get involved
in fits every couple of years. This should be seen mostly as an
"outsider's" viewpoint. :-)

IME, PGDP's processes are /seriously/ heavy-weight, burning lots of
worker time on 2nd or even 3rd-level passes, and multiple tiers of
work (Proofreading, Formatting, and all the special management levels
for people running projects). The pyramid of processes has grown so
great that they have seemed to crash in on themselves - there's a huge
dearth of people at the "higher" levels (you need to qualify at the
lower levels before the system will let you contribute to the
activities at the end). It's generally quite "unwiki".

I think Wikisource's model is a great deal more lightweight than
PGDP's - and that we really don't want to push Wikisource down that
route. :-) Unfortunately I think that this means linking the two up
might prove challenging - and there's also a danger that people may
jump ship, damaging PGDP still further and making them upset with us.

J.
--
James D. Forrester
[hidden email] | [hidden email]
[[Wikipedia:User:Jdforrester|James F.]]


Re: Wikisource and PGDP

Birgitte_sb



--- On Thu, 6/24/10, James Forrester <[hidden email]> wrote:


> IME, PGDP's processes are /seriously/ heavy-weight, burning lots of
> worker time on 2nd or even 3rd-level passes, and multiple tiers of
> work (Proofreading, Formatting, and all the special management levels
> for people running projects). The pyramid of processes has grown so
> great that they have seemed to crash in on themselves - there's a huge
> dearth of people at the "higher" levels (you need to qualify at the
> lower levels before the system will let you contribute to the
> activities at the end). It's generally quite "unwiki".
>
> I think Wikisource's model is a great deal more lightweight than
> PGDP's - and that we really don't want to push Wikisource down that
> route. :-) Unfortunately I think that this means linking the two up
> might prove challenging - and there's also a danger that people may
> jump ship, damaging PGDP still further and making them upset with us.

I definitely wouldn't want to see Wikisource move to a more heavyweight structure.  Right now it is easy for someone completely unfamiliar with the nuts and bolts of setting up a text to show up at the Proofread of the Month, validate a single page, and then have nothing further to do with the text.  Seldom do you even need to deal with formatting when you are validating an already-proofread page.  I think it is important to keep this very simple.  I would really encourage anyone who has never participated to try it out [1].

Of course, we don't really have any push to focus on a "finished" release like PGDP must have.  And this eventualism has the usual results even as it keeps the structure lightweight.

Linking up with PGDP texts is mostly avoided at en.WS because it is so often impossible to match their texts with a specific edition, which we need in order to attach scanned images.  It has become easier to just start from scratch with a file we can more easily put through the Proofread Page extension. Their more rigid structure makes edition verification after release unnecessary for them, but it is very important for us since our structure is so open.  It is difficult to see how we might help one another given such basic incompatibilities in structure.

Birgitte SB


[1]http://en.wikisource.org/wiki/Index:Frederic_Shoberl_-_Persia.djvu
Click on any yellow highlighted number.  Validate the wikitext against the image.  Edit the page to make changes (if necessary) and to move the radio button to validated.


     



Re: Wikisource and reCAPTCHA

metasj
In reply to this post by James Forrester-5
On Thu, Jun 24, 2010 at 11:16 AM, James Forrester <[hidden email]> wrote:
> On 24 June 2010 15:37, Samuel Klein <[hidden email]> wrote:
>> I love those proofreading features, and the new default layout for a
>> book's pages and TOC.  Wikisource is becoming AWESOME.
>
> Ahem. Even more awesome, you mean. :-)

It used to be just lowercase awesome... THINGS HAVE CHANGED.  >:-)

> Disclaimer - my PGDP account dates from 2004, but I only get involved
> in fits every couple of years.

Could you ask some of the wiki-savvy continuously active proofreaders
to join this discussion for a little while?  I like the work PGDP
does, and bet we can find a way to support and amplify it.

SJ


Re: Wikisource and reCAPTCHA

Andre Engels
In reply to this post by metasj
On Thu, Jun 24, 2010 at 4:37 PM, Samuel Klein <[hidden email]> wrote:
> I love those proofreading features, and the new default layout for a
> book's pages and TOC.  Wikisource is becoming AWESOME.
>
>> Do we have PGDP contributors who can weigh in on how similar the
> processes are?  Is there a way for us to actually merge workflows with
> them?

I am quite active on PGDP, but not on Wikisource, so I can describe
how things work there, but not how similar it is to Wikisource.

Typical of the PGDP workflow are an emphasis on quality over
quantity (exemplified in running not 1 or 2 but 3 rounds of human
checking of the OCR result - correctness in copying is well above
99.99% for most books) and work being done in page-sized chunks rather
than whole books, chapters, paragraphs, sentences, words or whatever
else one could think of.

There are a number of people involved, although people can and often do
fill several roles for one book.

First, there is the Content Provider (CP).

He or she first contacts Project Gutenberg to get a clearance. This is
basically a statement from PG that they believe the work is out of
copyright. In general, US copyright is what is taken into account for
this, although there are also servers in other countries (Canada and
Australia as far as I know), which publish some material that is out
of copyright in those countries even if it is not in the US. Such
works do not go through PGDP, but may go through its sister projects
DPCanada or DPEurope.

Next, the CP will scan the book, or harvest the scans from the web,
and run OCR on them. They will usually also write a description of the
book for the proofreaders, so those can see whether they are
interested. The scans and the OCR are uploaded to the PGDP servers,
and the project is handed over to the Project Manager (PM) (although
in most cases CP and PM are the same person).

The Project Manager is responsible for the project in the next stages.
This means:
* specifying the rules and guidelines that are to be followed when
proofreading the book, at least where those differ from the
standard guidelines
* answering questions from proofreaders
* keeping the good and bad word lists up to date. These are used in
wordcheck (a kind of spellchecker) to decide which words it
considers correct or incorrect

The project then goes through a number of rounds. The standard number
is 5 rounds, of which 3 are proofreading and 2 are formatting, but it
is possible for the PM to make a request to skip one or more rounds or
go through a round twice.

In the first three, proofreading, rounds, a proofreader requests one
page at a time, compares the OCR output (or the previous proofreader's
output) with the scan, and changes the text to correspond to the scan.
In the first round (P1) everyone can do this; the second round (P2) is
only accessible to those who have been at the site some time and done
a certain number of pages (21 days and 300 pages, if I recall
correctly); for the third round (P3) one has to qualify. For
qualification, one's P2 pages are checked (using the subsequent edits
of P3). The norm is that one should not leave more than one error per
five pages.

After the three (or two or four) rounds of proofing, the foofing
(formatting) rounds are gone through. In these, again a proofreader
(now called a formatter) requests and edits one page at a time, but
where the proofreaders dealt with copying the text as precisely as
possible, the formatter deals with all other aspects of the work.
They denote when text is italic, bold or otherwise in a special
format, which texts are chapter headers, how tables are laid out,
etcetera. Here there are two rounds, although the second one can be
skipped or a round duplicated, like before. The first formatting round
(F1) has the same entrance restrictions as P2, F2 has a qualification
system comparable to P3.

After this, the PM hands the book on to the Post-Processor (PP).
Again, this is often the same person, but not always. In some
cases the PP has already been appointed; in others the book will sit
in a pool until picked up by a willing PP. The PP does all that is
needed to get from the F2 output to something that can be put on
Project Gutenberg: they recombine the pages into one work, move stuff
around where needed, change the formatters' mark-up into something
that's more appropriate for reading, in most cases generate an HTML
version, etcetera.

A PP who has already post-processed several books well can then send
the result straight to PG. In other cases, the book will go to the PPV
(Post-Processing Verifier), an experienced PP, who checks the PP's
work and gives them hints on what should be improved, or makes those
improvements themselves.

Finally, if the PP or PPV sends the book to PG, there is a whitewasher
who checks the book once again; however, that is outside the scope of
this (already too long) description, because it belongs to PG's
process rather than PGDP's.

To stop the rounds from overcrowding with books, there are queues for
each round, containing books that are ready to enter the round, but
have not yet done so. To keep some variety, there are different queues
by language and/or subject type. A problem with this has been that the
later rounds, having less manpower because of the higher standards
required, could not keep up with P1 and F1. There has been work to do
something about it, and the P2 queues have been brought down to decent
size, but in P3 and F2 books can literally sit in the queues for
years, and PP still is a bottleneck as well.
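As a rough summary, the entrance rules described above can be sketched like this. (The thresholds are the ones quoted in the text; the function names and everything else are a hypothetical simplification, not PGDP's actual software.)

```python
# Sketch of the PGDP round structure described above.  The thresholds
# (21 days, 300 pages, one error per five pages) come from the text;
# the rest is an illustrative simplification.

STANDARD_ROUNDS = ["P1", "P2", "P3", "F1", "F2"]  # then PP and PPV

def may_work_in(round_name, days_on_site=0, pages_done=0,
                qualified_p3=False, qualified_f2=False):
    """Entrance rules: P1 is open to everyone; P2 and F1 need time on
    site and a page count; P3 and F2 need an explicit qualification."""
    if round_name == "P1":
        return True
    if round_name in ("P2", "F1"):
        return days_on_site >= 21 and pages_done >= 300
    if round_name == "P3":
        return qualified_p3
    if round_name == "F2":
        return qualified_f2
    return False  # PP and PPV are assigned roles, not open rounds

def p3_qualified(errors_left, pages_checked):
    """The quoted norm: no more than one error per five pages."""
    return pages_checked > 0 and errors_left / pages_checked <= 1 / 5
```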


--
André Engels, [hidden email]


Re: Wikisource and reCAPTCHA

Sam Klein
Andre, this is a great summary -- I've linked to it from the english
ws Scriptorium.

Do you see opportunities for the two projects to coordinate their
wofklows better?

SJ

On Thu, Jun 24, 2010 at 11:13 PM, Andre Engels <[hidden email]> wrote:

> I am quite active on PGDP, but not on Wikisource, so I can describe
> how things work there, but not how similar it is to Wikisource.
>
> [detailed description of the PGDP workflow snipped]



--
Samuel Klein          identi.ca:sj           w:user:sj


Re: Wikisource and reCAPTCHA

metasj
On Wed, Jun 30, 2010 at 5:49 AM, Samuel Klein <[hidden email]> wrote:
> Andre, this is a great summary -- I've linked to it from the english
> ws Scriptorium.
>
> Do you see opportunities for the two projects to coordinate their
> wofklows better?
  ^^^^^^^
Clearly this email needed 1 more round of human checking.

SJ

>
> On Thu, Jun 24, 2010 at 11:13 PM, Andre Engels <[hidden email]> wrote:
>>
>> Typical of the PGDP workflow are an emphasis on quality over
>> quantity (exemplified in running not 1 or 2 but 3 rounds of human
>> checking of the OCR result - correctness in copying is well above
>> 99.99% for most books) and work being done in page-sized chunks rather


Re: Wikisource and reCAPTCHA

John Mark Vandenberg
In reply to this post by Sam Klein
On Wed, Jun 30, 2010 at 7:49 PM, Samuel Klein <[hidden email]> wrote:
> Andre, this is a great summary -- I've linked to it from the english
> ws Scriptorium.
>
> Do you see opportunities for the two projects to coordinate their
> wofklows better?

I don't understand your use of 'coordinate' in this context.

Wikisource has a very lax workflow (it's a wiki): it publishes the
scans & text immediately, regardless of verification status, OCR
quality, or even vandalism.  However, Wikisource keeps the images
and the text unified from day 0 to eternity.

PGDP has a very strict and arduous workflow; big projects end up stuck
in the rounds (the remaining EB projects are a great example), and
they are not published until they make it out of the rounds.  The
result is quality; however, only the text is sent downstream.

Wikisource and PGDP don't interoperate.  We *could*, but when I looked
at importing a PGDP project into Wikisource, I put it in the too hard
basket.

Wikisource is trying to become a credible competitor to PGDP.  However,
this isn't a zero-sum game.  If the Wikisource project succeeds in
demonstrating that the wiki way is a viable approach, the result is
different people choosing to work in different workflows/projects, and
more reliable etexts being produced.

--
John Vandenberg


Re: Wikisource and reCAPTCHA

Samuel Klein-4
On Wed, Jun 30, 2010 at 6:13 AM, John Vandenberg <[hidden email]> wrote:
> irrespective of whether it is verified, OCR
> quality, or if it is vandalism.  However, wikisource keeps the images
> and the text unified from day 0 to eternity.

Some works become verified, and reach high OCR quality.

> PGDP has a very strict and arduous workflow...  The
> result is quality, however only the text is sent downstream.

Why not send images and text downstream?

> Wikisource and PGDP don't interoperate.  We *could*, but when I looked
> at importing a PGDP project into Wikisource, I put it in the too hard basket.

That's what I mean by 'coordinate'.  "hard" here seems like a one-time
hardship followed by a permanent useful coordination.

> Wikisource is trying to become a credible competitor to PGDP.

Perhaps we have competing interfaces / workflows.  But I expect we
would be glad to share 99.99%-verified high-quality
texts-unified-with-images if it were easy for both projects to
identify that combination of quality and comprehensive data... and
would be glad to share metadata so that a WS editor could quickly
check to see if there's a PGDP effort covering an edition of the text
she is proofing; and vice-versa.

I want us to get better, faster, less held up by the idea of
coordinating with other projects, because there are much larger
projects out there worthy of coordinating with.  The annotators who
work on the Perseus Project come to mind... but that's truly a harder
problem than this one.

> If the Wikisource projects succeeds in
> demonstrating the wiki way is a viable approach, the result is
> different people choosing to work in different workflows/projects, and
> more reliable etexts being produced.

Absolutely.

SJ


Re: Wikisource and reCAPTCHA

John Mark Vandenberg
On Wed, Jun 30, 2010 at 8:42 PM, Samuel J Klein <[hidden email]> wrote:

> On Wed, Jun 30, 2010 at 6:13 AM, John Vandenberg <[hidden email]> wrote:
>> irrespective of whether it is verified, OCR
>> quality, or if it is vandalism.  However, wikisource keeps the images
>> and the text unified from day 0 to eternity.
>
> Some works become verified, and reach high OCR quality.
>
>> PGDP has a very strict and arduous workflow...  The
>> result is quality, however only the text is sent downstream.
>
> Why not send images and text downstream?

Good question! ;-)
Storage is one issue.
It would be interesting to estimate the storage requirements of
Wikisource if we had produced the PGDP etexts.

>> Wikisource and PGDP don't interoperate.  We *could*, but when I looked
>> at importing a PGDP project into Wikisource, I put it in the too hard basket.
>
> That's what I mean by 'coordinate'.  "hard" here seems like a one-time
> hardship followed by a permanent useful coordination.

They don't have an 'export' function, and I doubt they are going to
build one so that they can interoperate with us.

My 'import' function was a scraper; not something that can be used in
a large scale without their permission.

In the end, it is simpler to avoid starting WS projects that would
duplicate unfinished PGDP projects.  There are plenty of works that
have not been transcribed yet ;-)

>> Wikisource is trying to become a credible competitor to PGDP.
>
> Perhaps we have competing interfaces / workflows.

This is like saying that Wikipedia and Britannica have competing
interfaces / workflows.

The Wikisource workflow is a *symptom* of it being a "wiki", with all
that entails.  There is a lot more than merely the workflow which
distinguishes the two projects.

> .. but I expect we
> would be glad to share 99.99%-verified high-quality
> texts-unified-with-images if it were easy for both projects to
> identify that combination of quality and comprehensive data.

Good luck with that.

PGDP publishes etexts via PG.

If PGDP gives images+text to Wikisource for projects that are stuck in
their rounds, it becomes published online immediately at whatever
stage it is at - it's a wiki.  That is at odds with the objective of
PGDP, unless they are completely abandoning the project.

It is more likely that PGDP will release images+text at the same time
they publish the etext to PG.
The best way for PGDP to do this is to produce a djvu with images and
verified text, and then upload it to archive.org so everyone benefits.

> and
> would be glad to share metadata so that a WS editor could quickly
> check to see if there's a PGDP effort covering an edition of the text
> she is proofing; and vice-versa.

IIRC, obtaining the list of ongoing PGDP projects requires a PGDP
account, but anyone can create an account.

The WS project list is in Google. ;-)

--
John Vandenberg


Re: Wikisource and reCAPTCHA

Andre Engels
In reply to this post by Samuel Klein-4
On Wed, Jun 30, 2010 at 12:42 PM, Samuel J Klein <[hidden email]> wrote:

>> PGDP has a very strict and arduous workflow...  The
>> result is quality, however only the text is sent downstream.
>
> Why not send images and text downstream?

Because PGDP produces for Project Gutenberg, which publishes text and
html versions, not scans.

> Perhaps we have competing interfaces / workflows.  but I expect we
> would be glad to share 99.99%-verified high-quality
> texts-unified-with-images if it were easy for both projects to
> identify that combination of quality and comprehensive data... and
> would be glad to share metadata so that a WS editor could quickly
> check to see if there's a PGDP effort covering an edition of the text
> she is proofing; and vice-versa.

For the PGDP side, it's possible to check at PGDP itself (one will
need to get a login for that, but it's as free and unencumbered as on
Wikimedia), but there is also a useful superset at
http://www.dprice48.freeserve.co.uk/GutIP.html (warning! I'm talking
of a 7 megabyte HTML file here). This lists, sorted by author (books
by more than one author are listed multiple times), all books that
have a clearance for Project Gutenberg.

For cooperation, one idea could be to get the PGDP material either
after the P3 stage or after the F2 stage. As long as a project is
still active, it isn't hard at all to get both the text and the scan
pages.


--
André Engels, [hidden email]


Re: Wikisource and reCAPTCHA

Andre Engels
In reply to this post by John Mark Vandenberg
On Wed, Jun 30, 2010 at 1:24 PM, John Vandenberg <[hidden email]> wrote:

> Good question! ;-)
> Storage is one issue.
> It would be interesting to estimate the storage requirements of
> Wikisource if we had produced the PGDP etexts.

I think it is the main reason; however, a back-of-the-envelope
calculation (20,000 books, 300 pages per book, 100 kB per page; the
first is quite a good estimate, the other two could be off by a factor
of 2) tells me that the total storage requirement would be measured in
hundreds of gigabytes - which means that one or two state-of-the-art
hard disks should be enough to contain it.
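The arithmetic can be sketched in a few lines of Python; the input figures are the rough guesses from the message above, not measured values:

```python
# Back-of-the-envelope storage estimate for hosting PGDP-style scan
# pages on Wikisource. All three inputs are rough guesses; the page
# count and per-page size could each be off by a factor of 2.

books = 20_000            # completed etexts (reasonably good estimate)
pages_per_book = 300      # guess
bytes_per_page = 100_000  # ~100 kB per compressed scan page; guess

total_bytes = books * pages_per_book * bytes_per_page
total_gb = total_bytes / 1e9

print(f"Estimated total: {total_gb:.0f} GB")  # prints: Estimated total: 600 GB
```

At roughly 600 GB the conclusion holds - a couple of commodity hard disks would cover the raw scans - though this ignores thumbnails, redundancy, and database overhead.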

> They don't have an 'export' function, and I doubt they are going to
> build one so that they can interoperate with us.
>
> My 'import' function was a scraper; not something that can be used in
> a large scale without their permission.

On the other hand, if you _do_ get permission, there might well be a
more elegant FTP-based method.

> The wikisource workflow is a *symptom* of it being a "wiki", with all
> that entails.  There is a lot more than merely the workflow which
> distinguishes the two projects.

Certainly. I think the deeper-lying difference is one of attitude,
which, as you write, is for WS a symptom of being a wiki. As a wiki, WS
follows attitudes/principles like "make it easy for people to
contribute", "publish early, publish often", and "let people do what
they want, as long as it's a step forward, however small". PGDP, on the
other hand, derives its attitudes/principles from a wish to create
high-quality end products. As such it follows "check and double-check",
"limit the number of projects we work on", "quality control", and
"division of tasks".


--
André Engels, [hidden email]


Wikisource and reCAPTCHA

Andrea Zanni-2
In reply to this post by Samuel Klein-4
> Perhaps we have competing interfaces / workflows.  but I expect we
> would be glad to share 99.99%-verified high-quality
> texts-unified-with-images if it were easy for both projects to
> identify that combination of quality and comprehensive data... and
> would be glad to share metadata so that a WS editor could quickly
> check to see if there's a PGDP effort covering an edition of the text
> she is proofing; and vice-versa.

As John was saying, right now there's plenty of stuff to be transcribed and
proofread, so duplication of effort is not such a big risk ;-)

The issue of metadata is nonetheless serious, because it's one of the most
important flaws of Wikisource: not applying standards (e.g. Dublin Core) and
not having proper tools for exporting, importing, and harvesting metadata
still makes us look like amateurs, at least to "real" digital libraries
(which focus mainly on metadata, and sometimes provide either texts or
images; it is really rare to have both).

> I want us to get better, faster, less held up by the idea of
> coordinating with other projects, because there are much larger
> projects out there worthy of coordinating with.  The annotators who
> work on the Perseus Project come to mind... but that's truly a harder
> problem than this one.

The Perseus Project is an *amazing* project, but I regard them as far ahead
of us. The PP is actually a Virtual Research Environment, with tools for
scholars and researchers to study texts (concordances and similar things).

It happens that I have just finished my Master's thesis on collaborative
digital libraries for scholars (in the Italian context), and the outcome is
quite clear: researchers do want collaborative tools in digital libraries,
but wiki systems are too simple and (right now) too naive to really help
scholars in their work (and there are a lot of other issues I'm not going to
explain here).

I would love to have PP people involved in collaboration with Wikisource; I
just don't know if this is possible.

> > If the Wikisource projects succeeds in
> > demonstrating the wiki way is a viable approach, the result is
> > different people choosing to work in different workflows/projects, and
> > more reliable etexts being produced.

It is interesting because a project similar to PGDP (it is Italian and
started in 1993, emulating the glorious PG, just with Italian texts) is,
right now, moving to a wiki.
Although its scale is far smaller, Wikipedia and Wikisource showed them a
system which tends to eliminate bottlenecks, and for them this is becoming
crucial.
Luckily, the relationship with the Italian Wikisource is really good, and
they'll probably share an office with Wikimedia Italy in October.
The interesting fact is that the offices will be within a library ;-), so I
really expect a collaboration there.


Just one more thing: why hasn't this awesome thread been linked from
wikisource-l? That would probably have been the best place to discuss it.

My regards,
Aubrey  






Re: Wikisource and reCAPTCHA

Samuel Klein-4
Hello Aubrey,

On Thu, Jul 15, 2010 at 7:26 PM, Aubrey <[hidden email]> wrote:

> The issue of metadata is nontheless serious, because it's one of the most
> important flaws of Wikisource: not applying standards (i.e Dublin Core) and not
> having a proper tools for export/import and harvest metadata

Both good points. Are there proposals on wikisource to address these
two points in a way that's friendly to wikisource contributors?

>> I want us to get better, faster, less held up by the idea of
>> coordinating with other projects, because there are much larger
>> projects out there worthy of coordinating with.  The annotators who
>> work on the Perseus Project come to mind... but that's truly a harder
>> problem than this one.
>
> The Perseus project is an *amazing* project, but I regard them far more ahead
> than us. The PP is actually a Virtual Research Environment, with tools for
> scholars and researcher for studying texts, (concordances and similar stuff).
<
> I would love to have PP people involved in collaboration with Wikisource, just
> don't know if this is possible.

Yes, PP is ahead of us in some ways.  But in other ways they have run
into bottlenecks and multilingual issues that a wiki environment can
resolve.

I believe that Prof. Greg Crane of the Perseus Project (cc:ed here) is
interested in starting to collaborate with Wikisource, even while
pursuing ideas about developing a larger framework for wiki-style
annotations and editions.

While it may be hard in the short term, in the long term that's what I
think we all want Wikisource to become.

> It is interesting because a project similar to PGDP (it is Italian and started
> in 1993, emulating the glorious PG, just with Italian texts) is, right now,
> moving to a wiki. Although the scale is way smaller, Wikipedia and Wikisource
> showed them a system which tends to eliminate bottlenecks, and for them this is
> becoming crucial.
<
> Luckily, the relationships with the Italian Wikisource are really good, and
> they'll probably share an office with Wikimedia Italy, in October.
> The interesting fact is that the offices will be within a library ;-), so I
> really expect a collaboration there.

Wow.  This is all great to hear -- can you include a link to the
project?  I'd like to blog about it.

Warmly,
SJ


Re: [Wikisource-l] Wikisource and reCAPTCHA

Federico Leva (Nemo)
Samuel J Klein, 16/07/2010 21:49:
> On Thu, Jul 15, 2010 at 7:26 PM, Aubrey <[hidden email]> wrote:
>> Luckily, the relationships with the Italian Wikisource are really good, and
>> they'll probably share an office with Wikimedia Italy, in October.
>> The interesting fact is that the offices will be within a library ;-), so I
>> really expect a collaboration there.
>
> Wow.  This is all great to hear -- can you include a link to the
> project?  I'd like to blog about it.

There is some info in the newly published Wikimedia News no. 29 (the WMI
report for January-July 2010):
http://www.wikimedia.it/index.php/Wikimedia_news/numero_29/en#Wikimedia_Roma

Nemo
