thesis: automatically building a multilingual thesaurus from wikipedia


thesis: automatically building a multilingual thesaurus from wikipedia

Luca de Alfaro-4

This looks very interesting!
Is this a thesaurus that can be used for translation of words across languages?
Is there some way to quickly have a demo or view the data?
I browsed some files, and I see entries of the kind:

:xf5bfa ww:displayLabel "de:Feliner_Diabetes_mellitus" .
:xf5bfa ww:type wwct:OTHER .
:xf5bfa rdf:type skos:Concept .
:xf5bfa skos:inScheme <http://brightbyte.de/vocab/wikiword/dataset/*/animals:thesaurus

which tells me that Diabetes Mellitus of a feline is a concept... I was interested in the animal thesaurus as a way to translate animal names across languages... there are a lot of files, and I don't know if I am looking at the right ones.  Perhaps if you pointed us to the most interesting / understandable datasets, it would be very useful.

I am sorry if the above remarks seem superficial; I cannot read German well enough to read dissertations in it...

Best, Luca.

On Fri, May 30, 2008 at 2:54 AM, Daniel Kinzler <[hidden email]> wrote:
My diploma thesis about a system to automatically build a multilingual thesaurus
from wikipedia, "WikiWord", is finally done. I handed it in yesterday. My
research will hopefully help to make Wikipedia more accessible for automatic
processing, especially for applications in natural language processing, machine
translation and information retrieval. What this could mean for Wikipedia is:
better search and conceptual navigation, tools for suggesting categories, and more.

Here's the thesis (in German, I'm afraid): <http://brightbyte.de/DA/WikiWord.pdf>

 Daniel Kinzler, "Automatischer Aufbau eines multilingualen Thesaurus durch
 Extraktion semantischer und lexikalischer Relationen aus der Wikipedia",
 Diplomarbeit an der Abteilung für Automatische Sprachverarbeitung, Institut
 für Informatik, Universität Leipzig, 2008.

For the curious, http://brightbyte.de/DA/ also contains source code and data.
See <http://brightbyte.de/page/WikiWord> for more information.

Some more data is for now available at
<http://aspra27.informatik.uni-leipzig.de/~dkinzler/rdfdumps/>. This includes
full SKOS dumps for en, de, fr, nl, and no, covering about six million concepts.

The thesis ended up being rather large... a 220-page thesis and 30k lines of
code. I'm planning to write a research paper in English soon, which will give an
overview of WikiWord and what it can be used for.

The thesis is licensed under the GFDL; WikiWord is GPL software. All data taken
or derived from Wikipedia is GFDL.


Enjoy,
Daniel

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l





Re: thesis: automatically building a multilingual thesaurus from wikipedia

Daniel Kinzler
Luca de Alfaro wrote:
>
> This looks very interesting!
> Is this a thesaurus that can be used for translation of words across
> languages?

Yes, in the sense that it (potentially) contains labels in different languages
for the same concept.

> Is there some way to quickly have a demo or view the data?

Sadly, no. I started to implement a web based query interface, but there was no
time to finish it while working on the thesis. Maybe I'll get it up one day.

On the other hand: if you find a decent viewer/explorer for SKOS (I didn't), you
should be able to explore the contents without problems. It's a standard RDF
vocabulary.

> I browsed some files, and I see entries of the kind:
>
> :xf5bfa ww:displayLabel "de:Feliner_Diabetes_mellitus" .
> :xf5bfa ww:type wwct:OTHER .
> :xf5bfa rdf:type skos:Concept .
> :xf5bfa skos:inScheme
> <http://brightbyte.de/vocab/wikiword/dataset/*/animals:thesaurus
>
> which tells me that Diabetes Mellitus of a feline is a concept... I was
> interested in the animal thesaurus as a way to translate animal names
> across languages... there are a lot of files, and I don't know if I am
> looking at the right ones.  Perhaps if you pointed us to the most
> interesting / understandable datasets, it would be very useful.

The animals:thesaurus dataset, as found in
<http://brightbyte.de/DA/rdfdumps/animals_thesaurus.*.n3.bz2>, *should* contain
what you are looking for, namely different names for the same animal in
different languages. However, due to the way the sample was taken, the overlap
of pages analyzed from the different wikis is not as good as it should be, and
the English Wikipedia is missing entirely from this dataset. This is because
the categories dealing with domesticated animals appear to be structured in very
different ways in the different Wikipedias, which makes it a bit hard to find a
working example of a trans-language concept in that dataset. One example would
be x4c4b45, the entry for domestic cattle, providing information for German and
French (English is, as I said, missing from that dataset).
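
Since cross-language concepts are sparse in that dataset, one way to hunt for
them would be to scan a dump for ww:displayLabel statements that name more than
one language. The Python sketch below is only an illustration: it assumes the
dump is bzip2-compressed UTF-8 text, that each ww:displayLabel statement sits on
a single line, and that its value has the form "lang:Title|lang:Title". None of
this has been checked against the actual files. The function name and the sample
data are made up for the example.

```python
import bz2
import os
import re
import tempfile

# One ww:displayLabel statement per line is ASSUMED here; this is a simple
# line-based pattern, not a real N3 parser.
DISPLAY_LABEL = re.compile(r'^(:\w+)\s+ww:displayLabel\s+"([^"]*)"\s*\.\s*$')


def multilingual_concepts(path, min_languages=2):
    """Yield (concept id, {language: title}) for concepts labelled in
    at least min_languages languages."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            m = DISPLAY_LABEL.match(line)
            if not m:
                continue  # not a displayLabel statement
            labels = dict(
                part.split(":", 1)
                for part in m.group(2).split("|")
                if ":" in part
            )
            if len(labels) >= min_languages:
                yield m.group(1), labels


# Made-up sample data, only to exercise the function; NOT taken from the dump.
sample = (
    ':x000001 ww:displayLabel "de:Beispiel|fr:Exemple" .\n'
    ':x000002 ww:displayLabel "de:Nur_Deutsch" .\n'
    ':x000001 rdf:type skos:Concept .\n'
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.n3.bz2")
    with bz2.open(path, mode="wt", encoding="utf-8") as out:
        out.write(sample)
    found = list(multilingual_concepts(path))

print(found)
```

Reading the file line by line keeps memory use flat, which matters for the
larger dumps.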


A better example for seeing this WORK is probably colors:thesaurus as found in
<http://brightbyte.de/DA/rdfdumps/color_thesaurus.ww.n3.bz2> (or, if you want
plain SKOS, <http://brightbyte.de/DA/rdfdumps/color_thesaurus.skos.n3.bz2>).
Here's an excerpt for the color green (xa7d8c5):

:xa7d8c5 ww:displayLabel
"de:Grün|en:Green|fr:Vert|nl:Groen_(kleur)|no:Grønn|simple:Green" .
:xa7d8c5 ww:type wwct:OTHER .
:xa7d8c5 rdf:type skos:Concept .
...
:xa7d8c5 skos:definition "Grün ist jener Farbreiz der wahrgenommenen wird, wenn
Licht mit einer spektralen Verteilung ins Auge fällt bei dem das Maximum im
Wellenlängenintervall zwischen 520 und 565 nm liegt."@de .
:xa7d8c5 skos:altLabel "Blassgrün"@de .
:xa7d8c5 skos:altLabel "Dunkelgrün"@de .
:xa7d8c5 skos:altLabel "Grün"@de .
:xa7d8c5 skos:altLabel "Grüne"@de .
:xa7d8c5 skos:altLabel "Grünliche"@de .
...
:xa7d8c5 skos:definition "Green is a color, the perception of which is evoked by
light having a spectrum dominated by energy with a wavelength of roughly
520–570 nm."@en .
:xa7d8c5 skos:altLabel "Avacado"@en .
:xa7d8c5 skos:altLabel "Avocado"@en .
:xa7d8c5 skos:altLabel "Dark green"@en .
:xa7d8c5 skos:altLabel "Dark pastel green"@en .
:xa7d8c5 skos:altLabel "Dark spring green"@en .
:xa7d8c5 skos:altLabel "GREEN"@en .
:xa7d8c5 skos:altLabel "Green"@en .
:xa7d8c5 skos:altLabel "Green (HTML/CSS green)"@en .
:xa7d8c5 skos:altLabel "Greenness"@en .
...
:xa7d8c5 skos:definition "Le vert est une couleur complémentaire correspondant à
la lumière qui a une longueur d'onde comprise entre 490 et 570 nm."@fr .
:xa7d8c5 skos:altLabel "Couleur vert"@fr .
:xa7d8c5 skos:altLabel "Vert"@fr .
:xa7d8c5 skos:altLabel "Verte"@fr .
:xa7d8c5 skos:altLabel "Viridis"@fr .
:xa7d8c5 skos:altLabel "green"@fr .
:xa7d8c5 skos:altLabel "vert"@fr .
:xa7d8c5 skos:altLabel "verte"@fr .
...
:xa7d8c5 skos:definition "Groen is een secundaire kleur bij de subtractieve
kleurmenging."@nl .
:xa7d8c5 skos:altLabel "Groen"@nl .
:xa7d8c5 skos:altLabel "groen"@nl .
:xa7d8c5 skos:altLabel "groenachtige"@nl .
:xa7d8c5 skos:altLabel "groenblauw"@nl .
:xa7d8c5 skos:definition "Grønn er en farge som inngår i fargespekteret."@no .
:xa7d8c5 skos:altLabel "Grønn"@no .
:xa7d8c5 skos:altLabel "green"@no .
:xa7d8c5 skos:altLabel "grønn"@no .
:xa7d8c5 skos:altLabel "grønne"@no .
:xa7d8c5 skos:altLabel "grønt"@no .
:xa7d8c5 skos:definition "Green is one of the colors of the rainbow."@simple .
:xa7d8c5 skos:altLabel "Green"@simple .
:xa7d8c5 skos:altLabel "green"@simple .
:xa7d8c5 skos:altLabel "greenish"@simple .

I hope this gives an impression of the labels and glosses available in different
languages. In addition to this, there are the relations "broader"/"narrower",
"similar" and "related" for navigating the structure, as well as cross-links to
the respective Wikipedia pages, etc.
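
As a rough illustration of how the excerpt above could be used for lookups, the
following Python sketch splits the ww:displayLabel value into per-language
labels and groups skos:altLabel literals by language tag. It is a line-based
regular-expression pass over a few statements copied from the excerpt (each
joined onto one line), not a real N3 parser, and the function names are made up
for this example; they are not part of WikiWord.

```python
import re
from collections import defaultdict

# A few statements copied from the excerpt above, one statement per line.
EXCERPT = '''\
:xa7d8c5 ww:displayLabel "de:Grün|en:Green|fr:Vert|nl:Groen_(kleur)|no:Grønn|simple:Green" .
:xa7d8c5 skos:altLabel "Dunkelgrün"@de .
:xa7d8c5 skos:altLabel "Grün"@de .
:xa7d8c5 skos:altLabel "Dark green"@en .
:xa7d8c5 skos:altLabel "Green"@en .
:xa7d8c5 skos:altLabel "Vert"@fr .
:xa7d8c5 skos:altLabel "groen"@nl .
:xa7d8c5 skos:altLabel "grønn"@no .
'''

# subject, predicate, quoted literal, optional @language tag, final dot.
STATEMENT = re.compile(r'^(:\w+)\s+([\w:]+)\s+"([^"]*)"(?:@(\w+))?\s*\.$')


def parse_display_label(value):
    """Split 'de:Grün|en:Green|...' into {'de': 'Grün', 'en': 'Green', ...}."""
    labels = {}
    for part in value.split("|"):
        lang, _, title = part.partition(":")
        # Underscores stand in for spaces in wiki page titles.
        labels[lang] = title.replace("_", " ")
    return labels


def collect(text):
    """Return (display labels per concept, altLabels per concept and language)."""
    display = {}
    alt = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        m = STATEMENT.match(line.strip())
        if not m:
            continue  # skip anything this simple pattern does not cover
        subject, predicate, literal, lang = m.groups()
        if predicate == "ww:displayLabel":
            display[subject] = parse_display_label(literal)
        elif predicate == "skos:altLabel" and lang:
            alt[subject][lang].append(literal)
    return display, alt


display, alt = collect(EXCERPT)
green = display[":xa7d8c5"]
print(green["de"], "->", green["fr"])
print(sorted(alt[":xa7d8c5"]["en"]))
```

Looking up green["de"] and green["fr"] for the same concept id is the
word-for-word translation case Luca asked about; the altLabel lists give the
synonyms and inflected forms per language.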

x89548b (First-order logic) from the dataset logic:thesaurus may also be a good
example.


Regards,
Daniel


Re: thesis: automatically building a multilingual thesaurus from wikipedia

alain_desilets
I was able to convert the PDF file to .txt. Not very readable, but should be good enough to allow me to gist the content through Google translate.

But in order to do that, it would be useful if I posted the .txt file somewhere on the web.

Do you mind if I do that?

Alain


Re: thesis: automatically building a multilingual thesaurus from wikipedia

Daniel Kinzler
Desilets, Alain wrote:
> I was able to convert the PDF file to .txt. Not very readable, but should be good enough to allow me to gist the content through Google translate.
>
> But in order to do that, it would be useful if I posted the .txt file somewhere on the web.
>
> Do you mind if I do that?
>
> Alain

Go right ahead, it's GFDL :) Post the link for the benefit of others too.
Hm... I guess we should be careful about one point: please make sure this is
clearly credited to me. I only handed it in yesterday. Someone is going to check
if I stole the text from somewhere. So when it does show up, it had better have
my name on it :)

Anyway, I'll try to provide a readable HTML version soon, and an English
translation of some selected chapters.

-- Daniel


Re: thesis: automatically building a multilingual thesaurus from wikipedia

alain_desilets
> Go right ahead, it's GFDL :) Post the link for the benefit of others
> too.
> Hm... I guess we should be careful about one point: please make sure
> this is clearly credited to me. I only handed it in yesterday. Someone
> is going to check if I stole the text from somewhere. So when it does
> show up, it had better have my name on it :)

I posted it here:

http://www.wiki-translation.com/tiki-index.php?page=DanielKinzlerThesis&bl=n

You can see the English translation here:

http://translate.google.com/translate?u=http%3A%2F%2Fwww.wiki-translation.com%2Ftiki-index.php%3Fpage%3DDanielKinzlerThesis%26bl%3Dn&hl=en&ie=UTF8&sl=de&tl=en

BTW: While you are there, you might want to snoop around the
wiki-translation site. It's a community of people interested in
massively collaborative translation and terminology.

Alain


Re: thesis: automatically building a multilingual thesaurus from wikipedia

alain_desilets
> I posted it here:
>
> http://www.wiki-translation.com/tiki-index.php?page=DanielKinzlerThesis&bl=n
>
> You can see the English translation here:
>
> http://translate.google.com/translate?u=http%3A%2F%2Fwww.wiki-translation.com%2Ftiki-index.php%3Fpage%3DDanielKinzlerThesis%26bl%3Dn&hl=en&ie=UTF8&sl=de&tl=en

Hum... If I go to the above translation link, only the first bit is
actually translated. I guess Google gives up after a while and leaves
the rest in German.

If you can split it into separate HTML pages, it would make it easier
for people to read it with Google translate.
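
One way the splitting could be done, sketched in Python: cut the plain text at
blank-line paragraph boundaries into pages below some size limit. The limit
used here is arbitrary, since the point at which Google Translate stops is not
known, and the function name is made up for this example.

```python
def split_into_pages(text, max_chars=20000):
    """Split text into pages of at most max_chars, breaking only between
    paragraphs (blank lines).

    A single paragraph longer than max_chars is kept whole on its own page.
    """
    pages, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        # Count 2 extra characters for the blank-line separator when the
        # paragraph is appended to a page that already has content.
        extra = len(paragraph) + (2 if current else 0)
        if current and size + extra > max_chars:
            pages.append("\n\n".join(current))
            current, size = [], 0
            extra = len(paragraph)
        current.append(paragraph)
        size += extra
    if current:
        pages.append("\n\n".join(current))
    return pages


# Small demonstration with ten artificial paragraphs of 49 characters each.
demo = "\n\n".join(f"Absatz {i} " + "x" * 40 for i in range(10))
pages = split_into_pages(demo, max_chars=120)
print(len(pages), [len(p) for p in pages])
```

Each page could then be wrapped in a minimal HTML file and translated
separately; joining the pages with blank lines gives back the original text.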

Alain


Re: thesis: automatically building a multilingual thesaurus from wikipedia

Daniel Kinzler
> Hum... If I go to the above translation link, only the first bit is
> actually translated. I guess Google gives up after a while and leaves
> the rest in German.
>
> If you can split it into separate HTML pages, it would make it easier
> for people to read it with Google translate.

I guess before spending a day messing with bad converters, I should rather spend
that day translating the important bits myself :)

-- Daniel


Re: thesis: automatically building a multilingual thesaurus from wikipedia

alain_desilets
Well, I'm almost done splitting the TXT file into chapters using an
EMACS macro. I'll post that.

Alain

> -----Original Message-----
> From: [hidden email] [mailto:wiki-
> [hidden email]] On Behalf Of Daniel Kinzler
> Sent: May 30, 2008 5:10 PM
> To: Research into Wikimedia content and communities
> Subject: Re: [Wiki-research-l] thesis: automatically building a
> multilingual thesaurus from wikipedia
>
> > Hum... If I go to the above translation link, only the first bit is
> > actually translated. I guess Google gives up after a while and leaves
> > the rest in German.
> >
> > If you can split it into separate HTML pages, it would make it easier
> > for people to read it with Google translate.
>
> I guess before spending a day messing with bad converters, I should
> rather spend that day translating the important bits myself :)
>
> -- Daniel


Re: thesis: automatically building a multilingual thesaurus from wikipedia

Daniel Kinzler
Desilets, Alain wrote:
> Well, I'm almost done splitting the TXT file into chapters using an
> EMACS macro. I'll post that.

Cool! And I got started with translating. I guess I can have it ready
tomorrow... or the day after that.

I'm off to bed now.

Thanks to everyone for all the comments
-- Daniel
