Extracting word list and brief definitions from Wiktionary


Extracting word list and brief definitions from Wiktionary

kellyterryjones
I want a list of all English words + a brief definition of each [1].

I tried downloading enwiktionary-latest-pages-articles.xml.bz2, but
this is way too much: it includes foreign words, word roots/origins,
and a lot more that I don't need.

How can I extract just a word list w/ definitions from wiktionary?

I know about mthes and scowl, but wiktionary supersedes these, yes?

[1] I realize this isn't well-defined: I'll settle for an approximation

--
We're just a Bunch Of Regular Guys, a collective group that's trying
to understand and assimilate technology. We feel that resistance to
new ideas and technology is unwise and ultimately futile.

_______________________________________________
Wiktionary-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l

Re: Extracting word list and brief definitions from Wiktionary

Dennis During
Your best bet is probably to ask in the Grease Pit at Wiktionary. Someone
had a similar request there recently, I think, and seemed to get some help.
This list is rarely used.

On Fri, Oct 16, 2009 at 6:58 PM, Kelly Jones <[hidden email]> wrote:

> I want a list of all English words + a brief definition of each [1].
>
> I tried downloading enwiktionary-latest-pages-articles.xml.bz2, but
> this is way too much: it includes foreign words, word roots/origins,
> and a lot more that I don't need.
>
> How can I extract just a word list w/ definitions from wiktionary?
>
> I know about mthes and scowl, but wiktionary supersedes these, yes?
>
> [1] I realize this isn't well-defined: I'll settle for an approximation



--
Dennis C. During

Cynolatry is tolerant so long as the dog is not denied an equal divinity
with the deities of other faiths. - Ambrose Bierce

http://en.wiktionary.org/wiki/cynolatry

Re: Extracting word list and brief definitions from Wiktionary

Lars Aronsson
In reply to this post by kellyterryjones
Kelly Jones wrote:

> How can I extract just a word list w/ definitions from wiktionary?

A very simple Perl script for extracting information from the
Wikimedia XML dumps is found on
http://meta.wikimedia.org/wiki/User:LA2/Extraktor

If you know Perl, you can modify this script to filter out the
articles and sections you want, and output them separately.


--
  Lars Aronsson ([hidden email])
  Aronsson Datateknik - http://aronsson.se


Re: Extracting word list and brief definitions from Wiktionary

Andrew Dunbar
In reply to this post by Dennis During
2009/10/16 Dennis During <[hidden email]>

> Your best bet is likely to be to go to the Grease Pit at Wiktionary. Someone
> had a similar request recently, I think and seemed to get some help. This
> list is rarely used.
>
> On Fri, Oct 16, 2009 at 6:58 PM, Kelly Jones <[hidden email]> wrote:
>
> > I want a list of all English words + a brief definition of each [1].
> >
> > I tried downloading enwiktionary-latest-pages-articles.xml.bz2, but
> > this is way too much: it includes foreign words, word roots/origins,
> > and a lot more that I don't need.
> >
> > How can I extract just a word list w/ definitions from wiktionary?
> >
> > I know about mthes and scowl, but wiktionary supersedes these, yes?
> >
> > [1] I realize this isn't well-defined: I'll settle for an approximation
>

There is no single, direct way to download just the English words and their
definitions.

Wiktionary uses the same software as Wikipedia, which is designed for
encyclopedias, where each article needs only one big blob of text. A
dictionary entry has a structure that the software does not support, so
instead we represent that structure inside a big blob of text.

Now, it is possible to extract useful content from these blobs of text.

The English Wiktionary is divided into sections and subsections with
more-or-less standard formats. It is possible to write a program which
parses this format, but because it is not totally standard, some things are
easier to parse than others.

The list of English words is the easiest thing to extract, because you just
need to find every page in the article namespace (not talk pages, etc.) which
contains the heading
==English==
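
As a rough illustration of that first step, here is a minimal Python sketch
(not a tool anyone on this list has published). It streams through a dump and
yields the titles of main-namespace pages whose text contains an ==English==
heading. It assumes the dump marks each page's namespace with an <ns> element,
as newer export formats do; for dumps without it you would have to test the
title for prefixes such as "Talk:" instead. The function names are made up
for this example.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches a level-2 "==English==" heading on a line of its own.
ENGLISH_HEADING = re.compile(r"^==\s*English\s*==\s*$", re.MULTILINE)


def local_name(tag):
    """Strip the XML namespace: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def english_titles(xml_stream):
    """Yield titles of main-namespace pages that have an ==English== section."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Assumption: pages without an <ns> element are treated as namespace 0.
        title, ns, text = None, "0", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "text":
                text = child.text or ""
        if title and ns == "0" and ENGLISH_HEADING.search(text):
            yield title
        elem.clear()  # drop the page's contents so memory use stays modest


def english_titles_from_dump(path):
    """Convenience wrapper for a compressed .xml.bz2 dump file."""
    with bz2.open(path, "rb") as stream:
        yield from english_titles(stream)
```

Calling english_titles_from_dump("enwiktionary-latest-pages-articles.xml.bz2")
would then give the raw title list, still including all the variant and
inflected forms discussed next.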

But you might need to ask yourself what you mean by "word", because
Wiktionary also contains many forms of the same word, including spelling
variations and compounding variations such as "treeline" vs "tree-line" vs
"tree line". Not only this, but it includes a great many inflected forms such
as "word" vs "words"; "look" vs "looks" vs "looked" vs "looking"; and "fast"
vs "faster" vs "fastest". It is not always easy to filter these out. Worse,
Wiktionary also includes "common misspellings" such as "alot", which are also
tricky to filter out.
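
One possible heuristic, sketched in Python below, is to skip an entry when
every one of its definition lines is nothing more than a "form of" template.
This is only an illustration: the template names in the list are assumptions
chosen for the example, the templates actually used on Wiktionary vary and
change over time, and many inflected forms and misspellings will not be
caught this way.

```python
import re

# Assumption: a definition consisting solely of one of these templates marks
# an inflected form or a misspelling. The names are illustrative only.
FORM_OF_TEMPLATES = (
    "plural of", "past of", "present participle of",
    "third-person singular of", "comparative of", "superlative of",
    "misspelling of",
)

# A whole definition that is just "{{<template>|...}}", optionally with a
# trailing full stop.
FORM_OF_LINE = re.compile(
    r"^\{\{\s*(?:%s)\s*\|[^{}]*\}\}\.?$"
    % "|".join(map(re.escape, FORM_OF_TEMPLATES))
)


def definition_lines(section_text):
    """Return the text of each '#' definition line.

    Lines starting with '#:' (examples), '#*' (quotations) or '##'
    (sub-senses) are skipped.
    """
    return [
        line[1:].strip()
        for line in section_text.splitlines()
        if line.startswith("#") and not line.startswith(("#:", "#*", "##"))
    ]


def is_only_form_of(section_text):
    """True if the section has definitions and all are bare form-of templates."""
    defs = definition_lines(section_text)
    return bool(defs) and all(FORM_OF_LINE.match(d) for d in defs)
```

An entry such as "looks", which can have a real sense alongside its form-of
lines, would be kept by this test, which is probably what you want.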

Words can have many definitions. There are both "homonyms" and "senses".
Homonyms are words of different origins which share a spelling, such as
"sewer" (one who sews) vs "sewer" (a waste drainage pipe). Senses are
different meanings of the same word with the same origin, such as "chicken"
(domestic fowl) vs "chicken" (coward).

So you have to decide what "a brief definition" of each means for you. Do
you want just the first definition for each entry, ignoring all the others,
do you want all definitions lumped together, or do you want all definitions
grouped by homonym?

All that being said, I think it is a very fair expectation that the English
Wiktionary should make such lists available on a regular basis, much like the
Wikimedia Foundation makes the raw dump files available. Please feel free to
continue this discussion here on the mailing list or in the "Grease Pit" on
the English Wiktionary.

Andrew Dunbar (hippietrail)






--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

Re: Extracting word list and brief definitions from Wiktionary

Andrew Dunbar
In reply to this post by Lars Aronsson
2009/10/17 Lars Aronsson <[hidden email]>

> Kelly Jones wrote:
>
> > How can I extract just a word list w/ definitions from wiktionary?
>
> A very simple Perl script for extracting information from the
> Wikimedia XML dumps is found on
> http://meta.wikimedia.org/wiki/User:LA2/Extraktor
>
> If you know Perl, you can modify this script to filter out the
> articles and sections you want, and output them separately.
>

There are some Perl tools I've made in the Toolserver Subversion repository,
but since I'm the only one using them so far they may be tricky for others to
use: https://fisheye.toolserver.org/browse/enwikt/wiktdump/

wiktsplitnames.pl
<https://fisheye.toolserver.org/browse/enwikt/wiktdump/wiktsplitnames.pl>
will split an English Wiktionary dump file into word lists or mini dump files
for each language, at the most simplistic level.
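
For readers who don't use Perl, the general idea of splitting titles by
language can be sketched in a few lines of Python. This is not a port of
wiktsplitnames.pl and makes no claim about how that script works; it simply
assumes that every level-2 heading on a page names a language, and that you
already have (title, wikitext) pairs from the dump. The function name is
hypothetical.

```python
import re
from collections import defaultdict

# A level-2 heading such as "==English==" (but not "===Noun===").
LANGUAGE_HEADING = re.compile(r"^==([^=].*?)==\s*$", re.MULTILINE)


def split_titles_by_language(pages):
    """Map each language name to the titles of pages with that section.

    `pages` is any iterable of (title, wikitext) pairs.
    """
    by_language = defaultdict(list)
    for title, text in pages:
        for match in LANGUAGE_HEADING.finditer(text):
            by_language[match.group(1).strip()].append(title)
    return dict(by_language)
```

Writing each list to its own file would give one word list per language, at
the same simplistic level: no filtering of inflected forms or misspellings.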

I've also created a feature request on bugzilla: "Regularly publish updated
word lists and definition lists"
https://bugzilla.wikimedia.org/show_bug.cgi?id=21164

Andrew Dunbar (hippietrail)





--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net