Historic stats


Historic stats

Lars Aronsson
In Wiktionary, every site/language documents words from every language,
as I am sure you know. A typical wiki page, e.g. "war", contains information
about the English noun as well as the German verb.

Through categories, we also know how many entries there are. How many
English lemmas, how many English nouns, how many German verbs.

But if I want to plot a graph of the growth over time of English nouns
and German verbs, it is a pity that this is not available anywhere.
But it would be possible to generate such data from the history
dump, by finding out when the page "war" was created and when its
English and German sections were created. In SQL terms, it would be
for each combination of page and section (heading), find the earliest
date when that section was present in that page. But a practical
implementation would of course solve that as a single-pass filter,
reading the decompressed dump from bunzip2 on stdout.

So has anybody already written a program that reads through the
XML dump of articles and their history, and generates statistics
of this kind?


--
   Lars Aronsson ([hidden email])
   Linköping, Sweden



_______________________________________________
Wiktionary-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
Re: Historic stats

Christian Meyer
Hi Lars,

https://dkpro.github.io/dkpro-jwktl/ might be a good starting point for you. It does not cover every step of your use case out of the box, but it could save you a lot of implementation time compared to starting from scratch. The software is written in Java.

Best,
Christian
 

-----Original Message-----
From: Wiktionary-l [mailto:[hidden email]] On Behalf Of Lars Aronsson
Sent: Friday, September 08, 2017 8:56 PM
To: Wikimedia developers
Cc: Wiktionary
Subject: [Wiktionary-l] Historic stats
