Re: [Wikitech-l] Listing missing words of wiktionaries


Re: [Wikitech-l] Listing missing words of wiktionaries

Lars Aronsson
On 07/23/2013 11:23 AM, Mathieu Stumpf wrote:
> Here is what I would like to do: generate reports which give, for a
> given language, a list of words that are used on the web, with a
> number estimating their occurrences, but which are not in a given
> wiktionary.
>
> How would you recommend implementing that within the Wikimedia
> infrastructure?

Some years back, I undertook to add entries for
Swedish words in the English Wiktionary. You can
follow my diary at http://en.wiktionary.org/wiki/User:LA2

Among the things I did was to extract a list of all
Swedish words that already had entries. The best
way was to use CatScan to list entries in categories
for Swedish words. Even if there is a page called
"men", this doesn't mean the Swedish word "men"
has an entry, because it could be the English word
"men" that is on that page.

Then I extracted all words from some known texts,
e.g. novels, the Bible, government reports, and the
Swedish Wikipedia, counting the number of
occurrences of each word. Handling case is
a bit tricky. There should not be an entry for
lower-case "stockholm", so you can't just convert
everything to lower case. But if a sentence begins
with a capital letter, that word should not get
a capitalized entry. Another tricky issue is
abbreviations, which should keep the period,
for example "i.e." rather than "i" and "e". But
the period that ends a sentence should be removed.
When splitting a text into words, I decided to keep
all periods and initial capital letters, even if this
leads to some false words.
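
A rough sketch of that tokenization policy (the exact regular
expression is my assumption, not the original script): keep the
original case, keep periods inside and at the end of tokens, and
count the resulting word forms.

    import re
    from collections import Counter

    # Letters, optionally joined by internal periods, apostrophes or
    # hyphens, with an optional trailing period kept, so "i.e." and
    # "d.v.s." survive while commas and quotes are dropped.
    TOKEN = re.compile(r"[^\W\d_]+(?:[.'-][^\W\d_]+)*\.?")

    def count_words(text):
        """Return a Counter of word forms, preserving case and periods."""
        return Counter(TOKEN.findall(text))

    freq = count_words("Men i Stockholm, d.v.s. huvudstaden, regnar det.")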

When you have word frequency statistics for a text,
and a list of existing entries from Wiktionary, you
can compute the coverage, and I wrote a little
script for this. I found that the English Wiktionary
already had Swedish entries covering 72% of the words
in the Bible, and when I started to add entries for the
most common of the missing words, I was able to increase
this to 87% in just a single month (September 2010).
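
A minimal sketch of that coverage calculation, assuming freq is a
Counter of word forms and entries is the set of existing page titles
(as in the sketches above); the lookup tolerates the kept periods and
initial capitals.

    def has_entry(word, entries):
        """Check a word form against entries, tolerating kept periods/capitals."""
        candidates = {word, word.rstrip("."), word[:1].lower() + word[1:]}
        return any(c in entries for c in candidates)

    def coverage(freq, entries):
        """Share of running words (tokens) whose word form has an entry."""
        total = sum(freq.values())
        covered = sum(n for word, n in freq.items() if has_entry(word, entries))
        return covered / total if total else 0.0

    # print("%.1f%%" % (100 * coverage(freq, swedish_entries)))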

Many of the common words that were missing when
I started were adverbs such as "thereof", "herein",
which occur frequently in any text but are not very
exciting to write entries about. This statistics-based
approach gave me a reason to add those entries.

It is interesting to contrast a given text to a given
dictionary in this way. The Swedish entries in the
English Wiktionary are a different dictionary from the
Swedish entries in the German or Danish Wiktionary.
The kinds of words found in the Bible are different
from those found in Wikipedia or in legal texts.
There is not a single, universal text corpus that we
can aim to cover. Google has released its n-gram
dataset. I'm not sure if it covers Swedish, but even
if it does, it must differ from the corpus frequencies
published by the Swedish Academy.

It is relatively easy to extract a list of existing entries
from Wiktionary. But preparing a given text corpus
for frequency and coverage analysis takes considerably
more work.


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se




Re: [Wikitech-l] Listing missing words of wiktionaries

Amgine-3
The request is to create a web-based text corpus[1] from which to derive
frequencies and then compare with existing wiktionaries. Not a light
undertaking, but one which has been proposed and implemented previously
(e.g. Connel's Gutenberg project[2]).

Generically speaking, someone would need to determine the appropriate
size of the corpus sample, its temporal currency, and the method of
creating and maintaining it. This isn't easy to do, and having no
strictures results in unwieldy and mostly irrelevant products like
Google's n-grams[3]. (On the other hand, if someone can figure out how
to filter the n-grams usefully, it would mean we don't have to build
our own.)
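
One possible way to do that filtering, sketched under the assumption
that we work from the version-2 1-gram files, whose rows (as
documented) are tab-separated ngram, year, match_count, volume_count.
The file name and year cut-off below are just examples.

    import gzip
    from collections import Counter

    def aggregate_1grams(path, min_year=1980):
        """Sum match_count per word form over a year range, skipping tagged rows."""
        counts = Counter()
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                ngram, year, match_count, _volumes = line.rstrip("\n").split("\t")
                if "_" in ngram:   # drop POS-tagged variants such as "men_NOUN"
                    continue
                if int(year) >= min_year:
                    counts[ngram] += int(match_count)
        return counts

    # counts = aggregate_1grams("googlebooks-eng-all-1gram-20120701-m.gz")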

Amgine

[1] https://en.wikipedia.org/wiki/Linguistic_corpus
[2] https://en.wiktionary.org/wiki/User:Connel_MacKenzie/Gutenberg
[3] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html



Re: [Wikitech-l] Listing missing words of wiktionaries

Lars Aronsson
On 07/26/2013 08:26 PM, Amgine wrote:
> Google's n-grams[3] (on the other hand, if someone can figure out how to
> filter n-grams usefully it would mean we don't have to build our own.)

Exactly. And nothing stops us from going both ways,
comparing the results and letting the best frequency list win.
If it were a good idea to arrive at the one true list,
then linguists would have done so long ago.

Since the 1960s, Gothenburg University has collected word
frequencies for Swedish based on newspaper text,
where the text is copyrighted but the frequency lists
are made openly available,
http://spraakbanken.gu.se/pub/statistik/

I'm sure you can find similar resources for many other
languages.

What WMF could do is to compile its own frequency lists
based on Wikipedia and Wikisource, and publish them
at regular intervals (annually?) along with XML dumps.
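
As a back-of-the-envelope sketch of what that compilation could look
like (the dump file name is an example, and the regex clean-up is only
a crude stand-in for proper wikitext stripping):

    import bz2
    import re
    from collections import Counter
    from xml.etree.ElementTree import iterparse

    MARKUP = re.compile(r"\{\{.*?\}\}|\[\[|\]\]|<[^>]+>|=+|'''?", re.DOTALL)
    WORD = re.compile(r"[^\W\d_]+")

    def dump_frequencies(path):
        """Count word forms in the <text> elements of a pages-articles dump."""
        counts = Counter()
        with bz2.open(path, "rb") as f:
            for _event, elem in iterparse(f):
                if elem.tag.endswith("}text") and elem.text:
                    counts.update(WORD.findall(MARKUP.sub(" ", elem.text)))
                elem.clear()  # release each element's text as we go
        return counts

    # counts = dump_frequencies("svwiki-latest-pages-articles.xml.bz2")
    # print(counts.most_common(20))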


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se




Re: [Wikitech-l] Listing missing words of wiktionaries

mathieu lovato stumpf guntz
In reply to this post by Amgine-3
On 2013-07-26 20:26, Amgine wrote:

> The request is to create a web-based text corpus[1] from which to
> derive frequencies and then compare with existing wiktionaries. Not a
> light undertaking, but one which has been proposed and implemented
> previously (e.g. Connel's Gutenberg project[2]).
>
> Generically speaking, someone would need to determine the appropriate
> size of the corpus sample, its temporal currency, and the method of
> creating and maintaining it. This isn't easy to do, and having no
> strictures results in unwieldy and mostly irrelevant products like
> Google's n-grams[3]. (On the other hand, if someone can figure out
> how to filter the n-grams usefully, it would mean we don't have to
> build our own.)

Actually, I think it would be interesting to have a trend history of
word usage over centuries (current trends would also be interesting,
but probably harder to implement). Wikisource could be used to
achieve that.

>
> Amgine
>
> [1] https://en.wikipedia.org/wiki/Linguistic_corpus
> [2] https://en.wiktionary.org/wiki/User:Connel_MacKenzie/Gutenberg
> [3] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>

--
Association Culture-Libre
http://www.culture-libre.org/


Re: [Wikitech-l] Listing missing words of wiktionaries

Geert Lap
DEAR SENDERS OF WIKIMEDIA ETC.
PLEASE STOP SENDING ME REQUESTS OR QUESTIONS ABOUT PARTICIPATING IN WIKIPEDIA
DELETE MY E-MAIL ADDRESS
I AM NOT INTERESTED!




Re: [Wikitech-l] Listing missing words of wiktionaries

Amgine-3
In reply to this post by mathieu lovato stumpf guntz
On 30/07/13 08:15, Mathieu Stumpf wrote:
> Actually, I think it would be interesting to have a trend history of
> word usage over centuries (current trends would also be interesting,
> but probably harder to implement). Wikisource could be used to
> achieve that.


Not really. Or, more fairly, the available texts are probably not a
valid sample, though they could be used as an informal guideline.

    "Full documentation: The sobering examples of the research
    experiences of Timberlake and Ruppenhofer (mentiolned above) show
    that even 100,000,000 words is at least an order of magnitude too
    small to capture phenomena that, though of low frequency, are in the
    competence of ordinary native speakers. That would represent at
    least 20,000 recorded hours, and it is too low by an order of
    magnitude."[1]


Of course this is referencing spoken language, which in most cases
differs significantly from written language, but a running-word corpus
of 100,000,000 words seems a useful target, with samples weighted
between transcripts, periodicals, and texts from a delimited time and
region, along with a lemmatized corpus of 6,000-10,000.

Amgine


[1] http://emeld.org/school/classroom/text/lexicon-size.html

Re: [Wikitech-l] Listing missing words of wiktionaries

Lars Aronsson
On 07/30/2013 07:17 PM, Amgine wrote:
> Of course this is referencing spoken language, which in most cases
> differs significantly from written language, but a running-word corpus
> of 100,000,000 words seems a useful target, with samples weighted
> between transcripts, periodicals, and texts from a delimited time and
> region, along with a lemmatized corpus of 6,000-10,000.

If you want to compare one year or decade to the next,
you need a similar sample from both years. One way
to get this is to narrow down to a corpus of just one
journal or newspaper. Wikisource can do this with
Popular Science Monthly,
https://en.wikisource.org/wiki/PSM

You'll get popular science and only that for every year.
You won't have romantic poetry for one year, and
theological texts for the next year. You can spot trends
in the use of words like engine/motor or steam/electricity,
just because that is what this journal is about, and
you get the same number of issues and pages each year.

Some assembly required: Most volumes of PSM are
not complete yet. Lots of proofreading remains.
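
A small sketch of that kind of comparison, assuming we have already
built one word Counter per volume year (psm_by_year is hypothetical,
and fetching the texts from Wikisource is left out): report each
tracked word's rate per million running words, so that years of
different sizes stay comparable.

    def trend(yearly_counts, words):
        """yearly_counts: {year: Counter of word forms} -> per-million rates."""
        rates = {}
        for year, counts in sorted(yearly_counts.items()):
            total = sum(counts.values()) or 1
            rates[year] = {w: 1_000_000 * counts[w] / total for w in words}
        return rates

    # for year, row in sorted(trend(psm_by_year, ["engine", "motor"]).items()):
    #     print(year, row)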


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se




Re: [Wikitech-l] Listing missing words of wiktionaries

Federico Leva (Nemo)
In reply to this post by Lars Aronsson
Relatedly:
* an old effort to list missing entries for en.wikt:
<http://thread.gmane.org/gmane.org.wikimedia.wiktionary/784>,
* a recent Bugzilla report asking for lists of search queries which gave
no results/no title matches, to identify requested entries:
<https://bugzilla.wikimedia.org/show_bug.cgi?id=56830>.

Nemo

_______________________________________________
Wiktionary-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l