wikt2dict - a tool for extracting translations from Wiktionaries

wikt2dict - a tool for extracting translations from Wiktionaries

Judit, Ács
Hi All,

I created a tool to extract translations from different editions of
Wiktionary. Right now it supports 39 different Wiktionaries. It only
extracts translations and ignores the rest.

Supported Wiktionaries:
Azerbaijani, Bulgarian, Catalan, Czech, Danish, Greek, English, Esperanto,
Spanish, Estonian, Basque, Finnish, French, Galician, Hebrew, Croatian,
Hungarian, Indonesian, Italian, Georgian, Latin, Lithuanian, Malagasy,
Dutch, Norwegian, Occitan, Polish, Portuguese, Romanian, Russian, Slovak,
Slovenian, Serbian, Swedish, Swahili, Turkish, Ukrainian, Vietnamese and
Chinese.

Adding a new Wiktionary is done via a configuration file.
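
To illustrate the idea of config-driven extraction, here is a minimal sketch in Python. It does not reproduce wikt2dict's actual configuration format: the `CONFIG` dictionary, the `translation_re` key and the `extract_translations` function are hypothetical names made up for this example. The only thing it takes from Wiktionary itself is that English entries mark translations with `{{t|lang|word}}` or `{{t+|lang|word}}` templates.

```python
import re

# Hypothetical per-edition settings; NOT wikt2dict's real configuration format.
CONFIG = {
    "en": {
        # Matches {{t|de|Hund}} and {{t+|de|Hund|m}}, capturing language code and word.
        "translation_re": r"\{\{t\+?\|([^|}]+)\|([^|}]+)",
    },
}


def extract_translations(edition, title, wikitext):
    """Return (edition, title, target_language, target_word) tuples found in wikitext."""
    pattern = re.compile(CONFIG[edition]["translation_re"])
    return [(edition, title, lang, word) for lang, word in pattern.findall(wikitext)]


# Made-up fragment in the style of an English Wiktionary translation table.
sample = "* German: {{t+|de|Hund|m}}\n* Hungarian: {{t|hu|kutya}}"
for row in extract_translations("en", "dog", sample):
    print(row)
```

Under this assumption, supporting another edition would mean adding one more entry whose pattern matches that edition's translation markup, without touching the extraction code.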

Right now the beta version is available for download at:
https://github.com/juditacs/wikt2dict

Documentation is in progress; until then, the README should be enough to get
started.

Please test it and send me your feedback and bug reports.

Thanks,
Judit Ács
_______________________________________________
Wiktionary-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l

Re: wikt2dict - a tool for extracting translations from Wiktionaries

mathieu lovato stumpf guntz
Great,

Do you plan to add more functions, like generating miscellaneous output
(ebook versions, "printable" PDF, etc.) from a dump? The main problem
is probably converting all the templates…

On 2013-07-12 13:19, Judit wrote:

> Hi All,
>
> I created a tool to extract translations from different editions of
> Wiktionary. Right now it supports 39 different Wiktionaries. It only
> extracts translations and ignores the rest.

--
Association Culture-Libre
http://www.culture-libre.org/

Re: wikt2dict - a tool for extracting translations from Wiktionaries

Judit, Ács
Hi,

I don't plan to generate different output formats, as the dictionaries
themselves are better suited to automated use than to use as a normal
dictionary. It sounds interesting, though, and I may do it in the future.

Since the first version, I have added a triangulation function that tries
to build new translation pairs from the ones extracted from the
Wiktionaries. It works reasonably well (85%+ correct in manual tests on a
few language pairs) and yields many results. I plan to improve these
methods further.
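
The triangulation idea can be sketched as follows. This is a simplified illustration, not the code in wikt2dict: it assumes translation pairs are available as (pivot word, translation) tuples, and the function name `triangulate` and the sample words are made up for the example. Two words in different languages are proposed as a new pair whenever they share a translation in a third, pivot language.

```python
from collections import defaultdict


def triangulate(pivot_to_a, pivot_to_b):
    """Propose (a, b) pairs for words that share a translation in the pivot language.

    Both arguments are iterables of (pivot_word, translation) tuples.
    Returns a dict mapping each candidate (a, b) pair to the set of pivot
    words that connect them.
    """
    # Index the second dictionary by pivot word for fast lookup.
    b_by_pivot = defaultdict(set)
    for pivot, b in pivot_to_b:
        b_by_pivot[pivot].add(b)

    # Every a and b that hang off the same pivot word become a candidate pair.
    candidates = defaultdict(set)
    for pivot, a in pivot_to_a:
        for b in b_by_pivot.get(pivot, ()):
            candidates[(a, b)].add(pivot)
    return dict(candidates)


# Made-up sample data: English as the pivot between German and Hungarian.
en_de = [("dog", "Hund"), ("bank", "Bank"), ("bank", "Ufer")]
en_hu = [("dog", "kutya"), ("bank", "bank"), ("bank", "part")]

for (de, hu), pivots in sorted(triangulate(en_de, en_hu).items()):
    print(de, hu, sorted(pivots))
```

An ambiguous pivot word such as "bank" also produces wrong candidates (for example Ufer/bank), which is why triangulated pairs need checking. The number of distinct pivot words supporting a pair is one possible filtering signal; whether wikt2dict uses it is not stated here.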

BTW, the data is available on request (e.g. by sending me an email).

Judit


2013/7/12 Mathieu Stumpf <[hidden email]>

> Great,
>
> Do you plan to add more functions, like generating miscellaneous output
> (ebook versions, "printable" PDF, etc.) from a dump? The main problem is
> probably converting all the templates…

Re: wikt2dict - a tool for extracting translations from Wiktionaries

Dimitris Kontokostas
For those who are not aware of DBpedia Wiktionary [1]: it also supports
translations (among much other lexical information), e.g.
http://wiktionary.dbpedia.org/page/german-English-Adjective-2en

It's a little harder to fully configure a new language, but you get a lot
more out of it. For now we support en, de, el, fr and ru, and we will
happily accept contributions for other languages.

Best,
Dimitris

[1] http://wiktionary.dbpedia.org/


On Fri, Jul 12, 2013 at 3:22 PM, Judit, Ács <[hidden email]> wrote:

> Hi,
>
> I don't plan to generate different output formats, as the dictionaries
> themselves are better suited to automated use than to use as a normal
> dictionary. It sounds interesting, though, and I may do it in the future.

--
Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org
Homepage: http://aksw.org/DimitrisKontokostas

Re: wikt2dict - a tool for extracting translations from Wiktionaries

Judit, Ács
Hi,

I added support for the German Wiktionary; it is available in the newest
version. There is a quick test script that should get you 300k+
translations from the German Wiktionary in less than 15 minutes.

The dictionaries in 50 languages built using wikt2dict and other resources
(parallel and comparable corpora) are available here:
http://hlt.sztaki.hu/resources/index.html
Please let me know if you find parsing errors.

I understand that DBpedia Wiktionary does a lot more than wikt2dict, and I
do not plan to compete with it. However, adding 35+ Wiktionaries to it would
have been nearly impossible for me. This is a quick (and dirty) way to
extract the translations.

Cheers,
Judit



2013/7/12 Judit, Ács <[hidden email]>

> Hi All,
>
> I created a tool to extract translations from different editions of
> Wiktionary. Right now it supports 39 different Wiktionaries. It only
> extracts translations and ignores the rest.