Book scans from Tuebingen Digital Library to Wikimedia Commons

Book scans from Tuebingen Digital Library to Wikimedia Commons

Shiju Alex
Hello,

Tuebingen University <https://uni-tuebingen.de/en/university.html> (with
support from the German Research Foundation) recently ran a project titled
the *Gundert Legacy project* to digitize close to 137,000 pages from *850
public domain books*.

All these public domain books are in the South Indian languages *Malayalam,
Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books are in Malayalam, 187
in Kannada, 25 in Tamil, and 4 in Telugu and Tulu.

A separate sub-project was also run as part of this project to convert 136
Malayalam titles to Malayalam Unicode. Close to *25,700* pages were
converted to Unicode. The Unicode conversion was run only for Malayalam;
for the other languages, the books were only scanned.

The project is now complete, and its results are available in
the Hermann Gundert Portal https://www.gundert-portal.de/?language=en which
was released on Nov 20. A news report is available here:
<https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>

To view the books in each language, you can navigate through the various
links in the portal. For example, the Malayalam books are available here:
https://www.gundert-portal.de/?page=malayalam

Now we need to upload these scans to Wikimedia Commons and the Unicode text
to Malayalam Wikisource (25,700 Unicode-converted pages).

The first priority is the scans that have been converted to Unicode. Is it
possible to write a script to migrate the scans from the Tuebingen Digital
Library to Wikimedia Commons? (I can share the exact details of the books
converted to Unicode if needed.)

The digitized files are large, ranging from 100 MB to 1.5 GB depending on
the number of pages in each book, so managing them manually is going to be
a big challenge.
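
Files of this size would likely need to be sent in pieces rather than in a
single request; the MediaWiki upload API supports chunked uploads for this
purpose. As a rough sketch only, the helper below (a hypothetical name,
`chunk_ranges`) computes the byte ranges such a script would step through.
The 50 MB chunk size is an assumed value for illustration, not something
taken from the portal or from Commons policy, and the actual download and
upload calls are left out.

```python
from typing import Iterator, Tuple

MB = 1024 * 1024


def chunk_ranges(total_size: int, chunk_size: int = 50 * MB) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that together cover total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    offset = 0
    while offset < total_size:
        # The final chunk may be shorter than chunk_size.
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length


# A 1.5 GB scan (the upper bound mentioned above), split into assumed 50 MB pieces.
ranges = list(chunk_ranges(1536 * MB))
print(len(ranges), ranges[0], ranges[-1])
```

Each (offset, length) pair would correspond to one chunk request; the real
size limits should be checked against the current Commons upload
documentation before relying on any particular value.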

Can someone help with this?

Shiju Alex
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Book scans from Tuebingen Digital Library to Wikimedia Commons

Andre Klapper-2
Hi,

Great! Some questions below for better understanding what's wanted:

On Sun, 2018-12-02 at 15:22 +0530, Shiju Alex wrote:

> Recently Tuebingen University
> <https://uni-tuebingen.de/en/university.html> (with
> the support from German Research Foundation) ran a project titled *Gundert
> Legacy project* to digitize close to 137,000 pages from *850 public domain
> books*.
>
> All these public domain books are in the South Indian languages *Malayalam,
> Kannada, Tamil, Tulu, and Telugu*. In this 293 books are in Malayalam, 187
> in Kannada, 25 in Tamil, 4 in Telugu and Tulu.
>
> Also there was  a separate sub-project which was run as part of this
> project to convert 136 titles in Malayalam to Malayalam Unicode. The number
> of pages that were converted to Unicode is close to *25,700* pages .The
> Unicode conversion project was ran only for Malayalam. For the other
> languages it is just the scanning of books

What does "converted to Unicode" mean? Converted from what exactly? Do
you maybe mean "converted via OCR (optical character recognition) from
images in file formats (JPG, PNG, images in a PDF) which don't allow
marking text, to a file format which allows marking text in those files"?

> The project is complete now and the results of the project is available in
> the Hermman Gundert Portal https://www.gundert-portal.de/?language=en which
> was released on Nov 20. A news report is available here.
> <https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>
>
> To view the books in each language you can navigate through the various
> links in the portal. For example, malayalam books are available here:
> https://www.gundert-portal.de/?page=malayalam
>
> Now we need to upload these scans to Wikimedia Commons and Unicode text to
> Malayalam Wikisource (25,700 Unicode converted pages)
>
> The first priority is for the scans that are converted to Unicode. Is it
> possible to write a script to migrate the scans from Tuebingen Digital
> library to Wikimedia Commons? (I can share the exact details of books
> converted to Unicode if needed)

What would you want the script to do exactly? Pull the files from the
Tuebingen Digital Library and then mass-upload these files to Commons?
Run OCR (identifying letters in pure images and converting those letters
to text which could be marked and copied)? Something else?

To convert image files available on Wikimedia Commons to recognized
text, see https://tools.wmflabs.org/ws-google-ocr/ for example. There
is also https://phabricator.wikimedia.org/T120788 for more info/tools.

> All the digitized files are heavy and the size ranges from 100 MB to 1.5 GB
> depending on the number of pages in the books. So manually managing this is
> going to be a big challenge.
>
> Can some one help with this?

Cheers,
andre
--
Andre Klapper | Bugwrangler / Developer Advocate
https://blogs.gnome.org/aklapper/




Re: Book scans from Tuebingen Digital Library to Wikimedia Commons

Shiju Alex
Hi

Here are the answers

> What does "converted to Unicode" mean? Converted from what exactly? Do
> you maybe mean "converted via OCR (Optical character recognition) from
> images in file formats (JPG, PNG, images in a PDF) which don't allow
> marking text to a file format which allows marking text in those files?


There is no good OCR for languages like Malayalam, so each scanned image was
manually typed and proofread. For example, see the 7th page of this book
<http://idb.ub.uni-tuebingen.de/opendigi/CiXIV125b#p=7&tab=transcript>. You
can see the scan image on the right and the transcribed text for that page
on the left in the *Transcript* tab. This was done for 136 books, which
total close to 25,700 pages.

> What would you want the script to do exactly? Pull the files from the
> Tuebingen Digital Library and then mass-upload these files to Commons?


Yes, this is what is required. We will handle the Unicode migration separately.
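
To make the requested behaviour concrete, here is a minimal sketch of how
such a script might order its work, assuming a manifest of the books is
available. The `Book` fields, the identifiers, and the sizes are all made up
for illustration; only the rule that transcribed books go first comes from
the request itself. Sorting smaller files first within each group is an
additional assumption, chosen so that problems surface early.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Book:
    book_id: str          # hypothetical identifier, not a real catalogue ID
    size_mb: int          # size of the scan file in megabytes
    has_transcript: bool  # True if the book was transcribed to Unicode


def upload_order(books: List[Book]) -> List[Book]:
    """Return transcribed books first; within each group, smallest file first."""
    return sorted(books, key=lambda b: (not b.has_transcript, b.size_mb))


# Made-up entries for illustration only.
books = [
    Book("book-a", 1500, False),
    Book("book-b", 400, True),
    Book("book-c", 100, False),
    Book("book-d", 120, True),
]
order = [b.book_id for b in upload_order(books)]
print(order)
```

The fetch and upload steps themselves are deliberately omitted, since they
depend on how the library exposes its files.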


Shiju Alex

Re: Book scans from Tuebingen Digital Library to Wikimedia Commons

Ryan Kaldari-2
>There is no good OCR for languages like Malayalam.

Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
API (which is usable from a Wikisource gadget
<https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
will do OCR on Tamil. I can't vouch for these being "good", but they do
exist.


Re: Book scans from Tuebingen Digital Library to Wikimedia Commons

Shiju Alex
>
> Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
> API (which is usable from a Wikisource gadget
> <https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
> will do OCR on Tamil. I can't vouch for these being "good", but they do
> exist.


The request in this post is not to create OCR for any language or script,
but to migrate certain public domain book scans from the Tuebingen Digital
Library to Wikimedia Commons.

There is also a second task: migrating the *already proofread Unicode text*
to Wikisource. But before the Unicode migration can start, the scans need to
be on Commons.

I am making this request only because of the huge number of pages that we
need to handle. If it were just a few hundred pages, volunteers would have
done it manually.


Shiju


Re: Book scans from Tuebingen Digital Library to Wikimedia Commons

bawolff
Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading
? I think the folks at Commons are more likely to be able to give you
the help you need than wikitech-l would be.

--
Brian


Re: Book scans from Tuebingen Digital Library to Wikimedia Commons

Shiju Alex
> Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading
> ? I think the folks at commons are more likely to be able to give you
> the help you need than wikitech-l would be.


Thank you. I was not aware of this option. Let me try it.

Shiju Alex



On Mon, Dec 3, 2018 at 1:55 PM bawolff <[hidden email]> wrote:

> Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading
> ? I think the folks at commons are more likely to be able to give you
> the help you need than wikitech-l would be.
>
> --
> Brian
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Book scans from Tuebingen Digital Library to Wikimedia Commons

Shrinivasan T
We used this script
https://github.com/tshrinivasan/tools-for-wiki/tree/master/pdf-upload-commons

to upload some 2,000 public domain Tamil books to Commons.

Explore batch uploading to Commons first.
If it does not suit your needs, I can help customize this script.

Regards,
T. Shrinivasan
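
For readers planning such a migration, here is a minimal sketch of two preparation steps a batch-upload script would likely need. It is not the code from the repository linked above. The function names, the 50 MiB chunk size, the choice of the {{Information}} template fields, and the default license tag are all assumptions for illustration; the right license tag has to be checked per book. Splitting into chunks matters because the scans range from 100 MB to 1.5 GB, and the MediaWiki upload API accepts a large file as a series of chunks sent with increasing offsets.

```python
from typing import List, Tuple


def plan_chunks(file_size: int, chunk_size: int = 50 * 1024 * 1024) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that together cover file_size bytes.

    chunk_size defaults to 50 MiB, an assumed value, not a documented limit.
    """
    if file_size <= 0 or chunk_size <= 0:
        raise ValueError("file_size and chunk_size must be positive")
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def build_description(title: str, author: str, source_url: str,
                      language: str, license_tag: str = "PD-old") -> str:
    """Build wikitext for a Commons file description page.

    license_tag is a placeholder (assumption); verify it for each book.
    """
    lines = [
        "{{Information",
        "|description={{" + language + "|1=" + title + "}}",
        "|source=" + source_url,
        "|author=" + author,
        "}}",
        "",
        "{{" + license_tag + "}}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical book record; the real metadata would come from the portal.
    text = build_description(
        title="Example title",
        author="Example author",
        source_url="https://www.gundert-portal.de/?page=malayalam",
        language="ml",
    )
    print(text)
    print(plan_chunks(250, 100))
```

Each (offset, length) pair tells the uploader which byte range to read from the local file and send next, so a failed transfer can resume from the last completed offset instead of restarting the whole book.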
