integrating wiktionary

integrating wiktionary

Cedric De Vroey
Hi guys,

I'd like to integrate and cache Wiktionary in an application I'm developing,
but I have a problem: how can I retrieve content from Wiktionary in a
structured format (XML, CSV, JSON, ...), or translate it into one? I have
already looked into the Special:Export page, but that didn't really help
because the actual content is still all in one field. Are there any known
best practices for doing this?

Thanks!
Cedric
_______________________________________________
Wiktionary-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
Re: integrating wiktionary

Andrew Krizhanovsky
You can try the structured format based on data extracted from the
English Wiktionary and the Russian Wiktionary here:
http://code.google.com/p/wikokit/

Best regards,
Andrew Krizhanovsky.

Re: integrating wiktionary

Jonas Brekle
Hi Cedric,

we are currently working on an extractor for Wiktionary that reuses and
extends the DBpedia framework [1]. In my opinion there is no best practice
yet; it's a work in progress, and it's not trivial. If you want an
extractor for many languages, a straightforward regex approach will fail
at the second or third language you want to include, because of
heterogeneous syntax and modeling. So we are trying to build a declarative
parser that interprets a rather complex config file containing little
"templates", which define which elements of a Wiktionary page should be
interpreted and how they should be processed. So far we have made configs
for German and English. The data we extract is the "entry layout" (the
language, etymology and part of speech: everything that's in the "outline"
boxes), and for each of these "contexts" we extract the definition
sentence. Of course we aim to extract many more properties later.
You can check out our current state from our Mercurial repository:
        hg clone http://dbpedia.hg.sourceforge.net/hgroot/dbpedia/extraction_framework dbpedia
        cd dbpedia
        hg update wiktionary
        cd core
        mvn install
        cd ../dump
        mvn install
        cd ../wiktionary
        mkdir wiktionaryDump
Copy the enwiktionary-???-pages-articles.xml file from [2] into that new
folder. The language of the dump should be the one set in config.xml, and
a matching config-[language].xml needs to exist. Then run:
        mvn scala:run

The extraction outputs RDF data (N-Triples format, which can easily be
transformed to XML). Unfortunately there is no comprehensive
documentation yet, and we are pre-beta. But we would like to get feedback
and/or requirements. In a few days I will send an official announcement
to our mailing list [3], containing more details and dump files for en
and de.
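
To illustrate the "transformed to XML" step, here is a minimal sketch, not
part of the extractor. It assumes a simplified N-Triples line (IRI subject
and predicate, IRI or plain literal object, no blank nodes, no escaped
quotes) and an ad-hoc <triple> element rather than RDF/XML. The object name
NTriplesToXml and the example.org IRIs are hypothetical.

```scala
object NTriplesToXml {
  // Simplified N-Triples line (an assumption of this sketch):
  //   <subject> <predicate> <objectIri> .
  //   <subject> <predicate> "literal"@lang .   (or ^^<datatype>, or neither)
  private val Triple =
    """^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s*\.\s*$""".r

  // Escape the characters that are unsafe in XML attributes and text.
  private def esc(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;")

  // Returns None for comments, blank lines and anything this sketch cannot parse.
  def toXml(line: String): Option[String] = line.trim match {
    case Triple(s, p, iri, lit) =>
      // Exactly one of iri/lit is non-null, depending on which alternative matched.
      val obj = Option(iri)
        .map(i => s"""<object iri="${esc(i)}"/>""")
        .getOrElse(s"<object>${esc(lit)}</object>")
      Some(s"""<triple subject="${esc(s)}" predicate="${esc(p)}">$obj</triple>""")
    case _ => None
  }
}

object NTriplesDemo extends App {
  // Hypothetical example triple, not taken from the real dump files.
  val line = "<http://example.org/word/house> <http://example.org/vocab#definition> \"A building\"@en ."
  NTriplesToXml.toXml(line).foreach(println)
}
```

The language tag and datatype are dropped here to keep the sketch short; a
real conversion would more likely go through an RDF library.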

But depending on your use case, this could be over-complicated for your
needs, and you would depend on us... If you only need one language,
another idea we had (but have not implemented) could be practicable:
regex-replace everything that has a special semantic meaning with XML
nodes. Then apply a set of rules that order this flat sequence of nodes
hierarchically. Finally, you can iterate over the XML tree and extract
what you want using XPath or an XML API.
An example (in pseudo-Scala, not compiling):
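        val page = """==English==
        ===Noun===
        * Something that is...
        ===Verb===
        * to be..."""
now we apply some regexes like
        // (?m) makes ^ and $ match at line breaks; [^=]+ keeps the pattern
        // from matching level-3 headings; group(1) is the title
        var pageXMLFlat = new Regex("(?m)^ *==([^=]+)== *$").replaceAllIn(page, m =>
          "<section level='2' title='" + m.group(1) + "' />")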
        ...
and we get
        <section level="2" title="English" />
        <section level="3" title="Noun" />
        <indent/><text content="Something that is..."/><linebreak/>
        <section level="3" title="Verb" />
        <indent/><text content="to be..."/><linebreak/>
then we try to bring in some hierarchy heuristic
        val nodes = XML.fromString(pageXMLFlat)
        val stack = new Stack(nodes)
        while(stack.size > 0){
          val n = stack.pop
          val sub = stack.takeWhile(o => n.level != o.level) // well, you get the idea
          n addChildren sub
        }
and we get something like
<section level="2" title="English">
  <section level="3" title="Noun">
    <line><indent/><text content="Something that is..."/></line>
  </section>
  <section level="3" title="Verb">
    <line><indent/><text content="to be..."/></line>
  </section>
</section>
As long as the structure of the page is stable (within one language it
mostly is), you can work with this XML. Depending on how deep you go
with the replacements (even replacing commas etc., e.g. for the list of
synonyms), you could get a pretty detailed representation of the page.
We would also be interested in that.
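
For reference, the flatten-then-nest idea above can be sketched in Scala
that actually compiles. This is only one possible reading of the
pseudo-code: it skips the intermediate flat XML string and nests a list of
nodes directly, handles only headings and plain lines, and the names Node,
Section, Line and WikiOutline are hypothetical.

```scala
// Hypothetical node types for this sketch (not part of any library).
sealed trait Node
final case class Section(level: Int, title: String, children: List[Node]) extends Node
final case class Line(text: String) extends Node

object WikiOutline {
  // A heading line such as "==English==" or "===Noun===". The backreference \1
  // requires the same number of '=' on both sides; that count is the level.
  private val Heading = """^(={2,6})\s*(.+?)\s*\1$""".r

  // Step 1: turn the page into a flat sequence of nodes.
  def tokenize(page: String): List[Node] =
    page.split("\n").toList.map(_.trim).filter(_.nonEmpty).map {
      case Heading(eq, title) => Section(eq.length, title, Nil)
      case other              => Line(other.stripPrefix("*").trim)
    }

  // Step 2: collect everything that belongs under a heading of parentLevel.
  // Returns the nested children and the nodes that were not consumed.
  private def nest(nodes: List[Node], parentLevel: Int): (List[Node], List[Node]) =
    nodes match {
      case Section(level, title, _) :: tail if level > parentLevel =>
        val (kids, afterKids) = nest(tail, level)
        val (siblings, rest)  = nest(afterKids, parentLevel)
        (Section(level, title, kids) :: siblings, rest)
      case (line: Line) :: tail =>
        val (siblings, rest) = nest(tail, parentLevel)
        (line :: siblings, rest)
      case rest =>
        (Nil, rest) // end of input, or a heading at the same or a shallower level
    }

  def toTree(nodes: List[Node]): List[Node] = nest(nodes, 0)._1

  private def esc(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;")

  // Step 3: render the tree in the XML shape used in the example above.
  def render(node: Node, indent: String = ""): String = node match {
    case Section(level, title, kids) =>
      val open  = s"""$indent<section level="$level" title="${esc(title)}">"""
      val close = s"$indent</section>"
      (open :: kids.map(render(_, indent + "  ")) ::: List(close)).mkString("\n")
    case Line(text) =>
      s"""$indent<line><indent/><text content="${esc(text)}"/></line>"""
  }
}

object OutlineDemo extends App {
  val page = "==English==\n===Noun===\n* Something that is...\n===Verb===\n* to be..."
  WikiOutline.toTree(WikiOutline.tokenize(page)).foreach(n => println(WikiOutline.render(n)))
}
```

Nesting by comparing heading levels avoids per-language rules at this
stage; anything language-specific (which section titles count as parts of
speech, how synonyms are listed) would still need its own rules on top.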

Regards,
Jonas

[1] http://dbpedia.org/About
[2] http://dumps.wikimedia.org/backup-index.html
[3] [hidden email]

Re: integrating wiktionary

Jonas Brekle
Hi Cedric,

I forgot to mention two existing tools (duh):

Russian and English:
http://code.google.com/p/wikokit/

German and English, well researched, available for research purposes:
http://www.ukp.tu-darmstadt.de/software/jwktl/

Regards,
Jonas
