Re: [cultural-partners] [Wikisource-l] ABBYY Finereader 11 on Toolserver: do we like it?

Lars Aronsson
On 11/28/2011 10:23 PM, Alex Brollo wrote:
> [...] FineReader 11 [...] produces a complete djvu file [...] Text
> layer hasn't full range of details, it's organized into two levels
> (page and line), while OCR engine on  IA servers produces a very rich
> "tree" (page, column, region, paragraph, line and word).

Has anybody designed a web interface that shows the scanned
image and the zones or regions of the Djvu text layer? It would
look similar to image annotation on Commons,
http://commons.wikimedia.org/wiki/Commons:Image_annotations

For a Djvu file uploaded to Commons, could you automatically
generate image annotations for the various text columns and
illustrations? Does image annotation handle multi-page
document formats such as PDF and Djvu?
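
One detail any such overlay would have to handle is the coordinate system. As far as I know, DjVu text zones are given as (xmin, ymin, xmax, ymax) measured from the bottom-left corner of the page, while rectangles drawn over an image in a browser are usually positioned from the top-left. A minimal sketch of the conversion, in which the function name and the returned dictionary layout are hypothetical and do not reproduce the Commons annotation syntax:

```python
def djvu_zone_to_top_left_rect(bbox, page_height):
    """Convert a DjVu text-zone box to a top-left-origin rectangle.

    bbox is (xmin, ymin, xmax, ymax) in pixels with the origin at the
    bottom-left of the page (assumption: the usual DjVu convention).
    page_height is the page height in the same pixel units.
    """
    xmin, ymin, xmax, ymax = bbox
    return {
        "x": xmin,                # left edge is unchanged
        "y": page_height - ymax,  # distance from the top of the page
        "w": xmax - xmin,
        "h": ymax - ymin,
    }


# A line zone near the top of a 3300-pixel-high page (made-up numbers).
rect = djvu_zone_to_top_left_rect((300, 3000, 2200, 3060), page_height=3300)
print(rect)  # {'x': 300, 'y': 240, 'w': 1900, 'h': 60}
```

The same function would apply to column or illustration zones, provided the OCR output contains them.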

(Shouldn't image annotations and timed text be the same thing?)


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se



_______________________________________________
Commons-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/commons-l

Alex Brollo
2011/11/29 Lars Aronsson <[hidden email]>
> On 11/28/2011 10:23 PM, Alex Brollo wrote:
>> [...] FineReader 11 [...] produces a complete djvu file [...] Text
>> layer hasn't full range of details, it's organized into two levels
>> (page and line), while OCR engine on  IA servers produces a very rich
>> "tree" (page, column, region, paragraph, line and word).
>
> Has anybody designed a web interface that shows the scanned
> image and the zones or regions of the Djvu text layer? It would
> look similar to image annotation on Commons,
> http://commons.wikimedia.org/wiki/Commons:Image_annotations
>
> For a Djvu file uploaded to Commons, could you automatically
> generate image annotations for the various text columns and
> illustrations? Does image annotation handle multi-page
> document formats such as PDF and Djvu?

Thanks for the interesting questions. I'm exploring the djvu text layer, metadata, and other information wrapped into a djvu file as deeply as I can, and my feeling is that djvu support is very primitive. Perhaps the first needed step is conversion from "bundled" to "indirect" format; djvu files on the web are great precisely because single pages can be shared on the web with their complete content.

I'll take a look at Image annotations. I don't know anything about them, although I have tested the ImageMap extension as a proofreading tool; take a look here: http://it.wikisource.org/wiki/Pagina:Vettura_a_vapore_del_signor_Dietz.djvu/1

Presently I'm building a python DjvuDsed "object" containing all the information about the whole text layer, the annotations, and the other data of a djvu file, and I'm adding methods and attributes to this formidable object one by one. I'll keep your ideas in mind as I go on.
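
To illustrate the kind of structure involved, here is a minimal sketch that reads a text layer into a tree. It is not the DjvuDsed object itself. It assumes the layer has been dumped in the parenthesized form that djvused prints (zone type, four coordinates, then either a quoted string or nested zones); the names Zone, parse_zone and parse_text_layer are hypothetical, and string escapes are handled only in a simplified way (octal escapes are not decoded).

```python
import re
from dataclasses import dataclass, field

# Tokens: parentheses, double-quoted strings (with backslash escapes), bare atoms.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


@dataclass
class Zone:
    kind: str                  # page, column, region, para, line, word, ...
    bbox: tuple                # (xmin, ymin, xmax, ymax)
    text: str = ""             # only set on zones that carry a string
    children: list = field(default_factory=list)

    def depth(self):
        """Number of nesting levels below and including this zone."""
        return 1 + max((c.depth() for c in self.children), default=0)

    def walk(self):
        """Yield this zone and all nested zones, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def parse_zone(tokens, pos=0):
    """Parse one '(' kind x y x y ... ')' group; return (Zone, next position)."""
    assert tokens[pos] == "("
    kind = tokens[pos + 1]
    bbox = tuple(int(t) for t in tokens[pos + 2:pos + 6])
    zone = Zone(kind, bbox)
    pos += 6
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse_zone(tokens, pos)
            zone.children.append(child)
        else:
            # Simplified unescaping: drop the backslash, keep the next character.
            zone.text = re.sub(r"\\(.)", r"\1", tokens[pos][1:-1])
            pos += 1
    return zone, pos + 1


def parse_text_layer(dump):
    zone, _ = parse_zone(TOKEN.findall(dump))
    return zone


# Made-up two-level sample (page and line only).
SAMPLE = '''(page 0 0 2550 3300
 (line 300 3000 2200 3060 "a text layer with")
 (line 300 2920 2100 2980 "page and line zones only"))'''

page = parse_text_layer(SAMPLE)
print(page.depth())  # prints 2
print([z.text for z in page.walk() if z.kind == "line"])
```

A richer layer with column, region, paragraph and word zones would parse the same way and simply report a greater depth.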

Alex


Eugene Zelenko
Hi!

This is a reply to the original mail on Wikisource-l (but I'm not subscribed to it).

I tried to push a similar project on Russian Wikisource (see
http://ru.wikisource.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D1%82%D0%B5%D0%BA%D0%B0:%D0%9F%D1%80%D0%BE%D0%B5%D0%BA%D1%82:ABBYY),
but it looks like Wikimedia Russia has higher priorities.

I see the following potential benefits for ABBYY: beta-testing, publicity,
spell-checking improvements, tax deductions.

FineReader is definitely valuable for less widespread languages or
variants (like old Russian orthography).

I don't think that remote OCR will work well in all cases. Sometimes
it's necessary to modify the page options. FineReader
sometimes fails to find the boundaries of prose text correctly. User
intervention will also be necessary if the text contains several languages.

ABBYY has its own online OCR service, http://finereader.abbyyonline.com. I
think that from their point of view it's much better to offer it instead
of standalone software, because they can actually check whether the software
is used as intended, as well as keep accounts for tax deductions.

There are some open source OCR projects:
* OpenOCR (http://en.openocr.org) - former FineReader competitor.
Support looks poor, but it can OCR some languages (I tried modern
Russian), likely the major European ones too.
* Tesseract (http://code.google.com/p/tesseract-ocr) - I can't comment
on it, but it looks like Google has used it.

Eugene.


Lars Aronsson
On 11/30/2011 09:55 PM, Eugene Zelenko wrote:
> ABBYY has own online OCR service http://finereader.abbyyonline.com

This is very interesting: OCR as a cloud service. I didn't know they
were doing this. They charge EUR 7 per 200 pages, or US$ 0.05
per page, which I guess can be (almost) reasonable for the
Wikimedia Foundation to pay. I sometimes feel bad because I have
OCRed so many tens of thousands of pages with a single EUR 129
license of Finereader. Here, EUR 129 would buy us 3700 pages.

All languages of Wikisource together are proofreading slightly
less than 900 pages/day, for which OCR would cost EUR 32/day
or US$ 43/day. With good OCR, proofreading is more fun, and
these numbers may increase. But then again, we wouldn't need
the service for all pages, as some books already have OCR.
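
The figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes pages can be bought at the flat rate of EUR 7 per 200 pages for any quantity, which the price list may not actually allow, and the function names are made up for the example.

```python
# Quoted price: EUR 7 buys 200 pages. Amounts are kept in euro cents
# so that the page count is computed with exact integers.
PACKAGE_PRICE_CENTS = 700
PACKAGE_PAGES = 200


def pages_for_budget(budget_cents):
    """Whole pages that a budget buys at the flat per-page rate."""
    return budget_cents * PACKAGE_PAGES // PACKAGE_PRICE_CENTS


def cost_cents(pages):
    """Cost in cents of a number of pages at the flat per-page rate."""
    return pages * PACKAGE_PRICE_CENTS / PACKAGE_PAGES


license_equivalent = pages_for_budget(12900)  # one EUR 129 license
daily_cost_eur = cost_cents(900) / 100        # about 900 pages/day

print(license_equivalent)  # 3685, i.e. roughly 3700 pages
print(daily_cost_eur)      # 31.5, i.e. roughly EUR 32 per day
```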

The most interesting feature of a cloud-based OCR service would be
the ability to accumulate improvements in font training (?) and
dictionaries from a large number of users over time. With
Wikisource, they can of course get direct access to the page
after proofreading.

So, is the service any good? They even promise to do Fraktur
(blackletter). Does it work well?


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se




Lars Aronsson
On 12/01/2011 04:08 AM, Lars Aronsson wrote:
> On 11/30/2011 09:55 PM, Eugene Zelenko wrote:
>> ABBYY has own online OCR service http://finereader.abbyyonline.com
>
> So, is the service any good? They even promise to do Fraktur
> (blackletter). Does it work well?

After having tried it, I'm less enthusiastic. The web user interface
only lets you upload images and download OCR text. There is no way
to adjust segments / zones or to train the OCR engine. Only
40 languages are supported, and there is no way to indicate
special dictionaries for old spelling. Blackletter is only supported
for German and Latvian. The upload button is based on Flash,
and didn't quite work in Firefox on Linux, but it worked in Opera.

It worked OK for a modern (not blackletter) Norwegian text from
the 1930s. An advantage is that you can start as low as 50 pages
for EUR 3.50. Double that price and you get 200 pages. For advanced
jobs, I still recommend buying the Professional edition, but some
users might find the online version useful.


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se


