contributing automatically built dictionaries

contributing automatically built dictionaries

Judit, Ács
Dear Wiktionary Community,

We have been working on a triangulation method to expand existing
dictionaries in many languages. We were able to parse translations from 40
Wiktionary editions and using these as seed dictionaries (approx. 3.6M
translation pairs), we created an additional 16M pairs in 50 languages. It
is possible to extend the number of languages.

While the automatically generated dictionary is not 100% correct, with
appropriate filtering more than 90% of the pairs can be made correct.
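
For readers unfamiliar with triangulation, the idea can be sketched as follows. This is a minimal illustration and not the wikt2dict implementation: the function name `triangulate`, the `(language, word)` tuple format, and the `min_pivots` filter are all assumptions made for this example. Two words become a candidate pair when they share a translation (a pivot), and the filter keeps only pairs supported by several independent pivots.

```python
from collections import defaultdict


def triangulate(pairs, min_pivots=2):
    """Propose new translation pairs from seed pairs (hypothetical sketch).

    pairs: iterable of ((lang, word), (lang, word)) seed translations.
    Returns {frozenset({a, b}): pivot_count} for candidates that are
    supported by at least `min_pivots` distinct pivot words.
    """
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    support = defaultdict(int)
    for pivot, linked in list(neighbours.items()):
        linked = sorted(linked)
        for i, a in enumerate(linked):
            for b in linked[i + 1:]:
                # Skip same-language pairs and pairs already in the seed data.
                if a[0] == b[0] or b in neighbours[a]:
                    continue
                support[frozenset((a, b))] += 1

    # Filtering step: a pair seen through a single pivot is often an
    # artefact of polysemy, so require agreement between several pivots.
    return {pair: n for pair, n in support.items() if n >= min_pivots}
```

Raising `min_pivots` trades coverage for accuracy; the threshold of 2 here is arbitrary and only shows where such a filter would sit.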

One version of the parsed Wiktionaries and the generated pairs can be found
here: https://www.dropbox.com/sh/r95tdr52o5rzzrw/a54Y66YGOJ
We used dumps from August to create these.
The software used to build dictionaries:
https://github.com/juditacs/wikt2dict

Do you think there is a way to contribute this dictionary back to
Wiktionary?

Best,
Judit Ács
_______________________________________________
Wiktionary-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l

Re: contributing automatically built dictionaries

Gerard Meijssen-3
Hoi,
I think it makes more sense to contribute it to the upcoming Wiktionary
effort on Wikidata.
Thanks,
     GerardM



Re: contributing automatically built dictionaries

Lars Aronsson
On 10/09/2013 08:16 AM, Gerard Meijssen wrote:
> I think it makes more sense to contribute it to the upcoming Wiktionary
> effort on Wikidata.

I strongly disagree. That 'effort' is still science fiction,
and suggesting that we wait for it is just procrastination.

> On 8 October 2013 12:21, Judit, Ács <[hidden email]> wrote:
>> Do you think there is a way to contribute this dictionary back to
>> Wiktionary?

Wiktionary is already half-full of bot-generated
articles, and adding more does no harm. However, you
need to address each user community on its own.


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se




Re: contributing automatically built dictionaries

Federico Leva (Nemo)
Judit, Ács, 08/10/2013 12:21:
> Do you think there is a way to contribute this dictionary back to
> Wiktionary?

Sure! You could first of all upload the dataset with a free license
somewhere, for instance archive.org. Actually, it's probably better if
you choose CC-0 as "license"; otherwise, being EU-based, you might end up
adding database rights, which would be a nightmare. (Or CC-0 for your
work + CC-BY-SA for any copyrightable text from Wiktionary, if there is
any.)

Then, you can build upon one of our Web API clients to contribute it
directly to Wiktionary: https://www.mediawiki.org/wiki/API:Client_code
I say "you" because you are the ones who know your own dataset best.
You need local consensus of course, so you could proceed this way:
1) determine which Wiktionary editions have the biggest overlap with your
entries (i.e. which would require the fewest page creations; adding to
existing pages is less controversial than adding new ones);
2) propose to those editions, or wait for the most interested to ask
you, and get a local green light (ideally from a not-so-huge edition to
start with);
3) run a bot of your own on that edition and identify the kind and
amount of work needed;
4) share the code and information from (3) to let others continue on
other editions.
Of course someone else could do 1-3 too, but it would be a
disproportionate effort for them compared to you; peer review of the
code at (3) should also help make the coding of the bot a shared effort.
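
Step (1) above could be approached roughly as in the sketch below. It assumes the standard MediaWiki `action=query` endpoint with its default JSON format, in which pages that do not exist carry a `missing` key. The helper names `existence_query_url` and `split_existing` are hypothetical, and the sample response in the usage note is made up for illustration rather than fetched from a live wiki.

```python
from urllib.parse import urlencode


def existence_query_url(edition, titles):
    """Build an API URL asking one Wiktionary edition which titles exist.

    edition: language code of the edition, e.g. "hu" (assumed URL scheme).
    titles:  a small batch of page titles; the API limits how many titles
             a single request may carry, so large lists must be chunked.
    """
    params = {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),
    }
    return f"https://{edition}.wiktionary.org/w/api.php?{urlencode(params)}"


def split_existing(api_response):
    """Split the titles in a decoded query response into (existing, missing)."""
    existing, missing = [], []
    for page in api_response["query"]["pages"].values():
        # In the default JSON format a non-existent page has a "missing" key.
        (missing if "missing" in page else existing).append(page["title"])
    return sorted(existing), sorted(missing)
```

Given a decoded response such as `{"query": {"pages": {"-1": {"title": "macska", "missing": ""}, "1234": {"pageid": 1234, "title": "kutya"}}}}`, `split_existing` separates the entries that would extend an existing page from those that would need a new one. Title normalisation and invalid titles are not handled here.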

Nemo


Re: contributing automatically built dictionaries

Judit, Ács
Thanks for the very helpful answers.

I will look at the possibilities for uploading (and licensing) the data
sets.

Meanwhile, I have another question. Currently I parse nothing beyond the
words or expressions themselves, so gender and other language-specific
information is ignored, even though it may appear in the translation
tables. This is probably a serious problem for large Wiktionaries (e.g. I
doubt that enwiktionary would accept French nouns without their gender).
Adding this functionality would be very tedious, and probably impossible
for languages I can't even read. Should I try it anyway, or can the data
be useful without it?
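
To make the gender question concrete, here is a minimal sketch of what keeping that information might look like for one edition. It assumes the English Wiktionary convention of translation templates such as `{{t+|fr|maison|f}}`, where the positional parameters after the word hold gender or number codes; other editions use different templates, which is exactly the per-language effort described above. The function name `parse_translation` and the returned dictionary layout are hypothetical.

```python
import re

# Matches {{t|...}}, {{t+|...}} and {{t-|...}}; other template variants
# are deliberately not covered in this sketch.
T_TEMPLATE = re.compile(r"\{\{t[+-]?\|([^{}]*)\}\}")


def parse_translation(wikitext_line):
    """Extract language, word and gender codes from one translation template.

    Returns None when the line holds no recognised template.
    """
    match = T_TEMPLATE.search(wikitext_line)
    if match is None:
        return None
    parts = match.group(1).split("|")
    # Named parameters (e.g. tr=..., alt=...) are skipped; only the
    # positional ones are interpreted.
    positional = [p for p in parts if "=" not in p]
    if len(positional) < 2:
        return None
    lang, word, *genders = positional
    return {"lang": lang, "word": word, "genders": genders}
```

The gender codes are passed through untouched, so no knowledge of the target language is needed to preserve them; deciding whether they are correct is a separate problem.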


2013/10/9 Federico Leva (Nemo) <[hidden email]>

> Judit, Ács, 08/10/2013 12:21:
>> Do you think there is a way to contribute this dictionary back to
>> Wiktionary?
>
> Sure! You could first of all upload the dataset with a free license
> somewhere, for instance archive.org. Actually, it's probably better if
> you choose CC-0 as "license", otherwise – being EU-based – you could add
> database rights which would be a nightmare. (Or CC-0 for your work +
> CC-BY-SA for any copyrightable text from Wiktionary, if there is any.)