> Making the script aware of namespace names would be quite easy.
Yes, it is more a matter of priority than of feasibility.
I already use localized namespace names in wikistats, obviously.
Without those the dumps can't be interpreted.
Each (full) xml archive dump starts with a list of localized namespace names.
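For illustration, reading that namespace list could look roughly like this. This is a minimal sketch, assuming a bz2-compressed dump and the usual <siteinfo> header; the versioned XML namespace and file name are assumptions, not taken from this thread.

```python
# Sketch: read the localized namespace names from the <siteinfo> header
# of a MediaWiki XML dump. Assumptions: the dump is bz2-compressed and
# uses the standard <namespaces>/<namespace key="..."> layout.
import bz2
import xml.etree.ElementTree as ET

def read_namespaces(dump_path):
    """Return {numeric id: localized name} from the dump's header."""
    namespaces = {}
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace prefix
            if tag == "namespace":
                # The main namespace (key="0") has an empty name.
                namespaces[int(elem.get("key"))] = elem.text or ""
            elif tag == "namespaces":
                break  # header done; no need to parse the articles
    return namespaces
```

Stopping at the end of the `<namespaces>` element keeps this cheap even on a multi-gigabyte dump, since nothing past the header is parsed.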
I also parse PHP files for localizations of reserved words like #REDIRECT,
parse other PHP files for translations of language names,
and extract many more language name translations from wp:en interwiki links.
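The #REDIRECT scraping step could be sketched as below. The `'redirect' => array( 0, ... )` layout is an assumption based on MediaWiki message files of that era; as noted in this thread, the PHP format can change, so a real parser needs safeguards.

```python
# Sketch: pull the localized #REDIRECT keywords out of a MediaWiki
# MessagesXx.php file. The magic-words array layout is an assumption.
import re

REDIRECT_RE = re.compile(r"'redirect'\s*=>\s*array\s*\(([^)]*)\)",
                         re.IGNORECASE)

def redirect_aliases(php_source):
    """Return the localized redirect keywords, e.g. ['#REDIRECT', ...]."""
    m = REDIRECT_RE.search(php_source)
    if not m:
        return []
    items = [s.strip().strip("'\"") for s in m.group(1).split(",")]
    # The first array element is a case-sensitivity flag, not an alias.
    return [s for s in items[1:] if s]
```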
But every such action takes time, needs safeguards (files can be moved or be
temporarily inaccessible, and formats change: maybe not in xml, but in php
for sure) and requires occasional attention for maintenance.
So for a housekeeping job that almost no-one seemed to care about,
I just chose to keep it simple (this particular optimization can always be added later).
If we find a better place to store them than the wikistats server, we might
be able to store them unfiltered, but still condensed into one daily file,
which speeds up downloads, or maybe repackaged into a monthly file per wiki.
It was the first time I uploaded something to IA, and since this
was neither sound nor movies, it was filed under "opensource books".
Even though I have a 100 Mbit/s connection, the FTP upload only
got 2.5 Mbit/s (317 kB/s) and the entire upload took 12 hours.
Even though the pagecounts files (each covering one hour) are
compressed, each one contains the same dictionary (article titles),
and I think the total could be compressed more efficiently
(without losing any information) if they were unpacked and
organized differently. I don't really have the time and energy to do that myself.
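One way the reorganization could work is sketched below: instead of 24 hourly files that each repeat the full title dictionary, emit one daily file with a single line per (project, title) followed by the hourly counts. The "project title count bytes" line format matches the published pagecounts files, but the merged output format is my own invention, and the byte totals are dropped for brevity.

```python
# Sketch: merge a day's hourly pagecounts .gz files into one file with
# one line per (project, title) and a comma-separated hourly series.
# Assumption: input lines look like "project title count bytes".
import gzip
from collections import defaultdict

def merge_day(hourly_paths, out_path):
    counts = defaultdict(lambda: [0] * len(hourly_paths))
    for hour, path in enumerate(hourly_paths):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) >= 3:
                    project, title, n = parts[0], parts[1], parts[2]
                    counts[(project, title)][hour] += int(n)
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for (project, title), per_hour in sorted(counts.items()):
            out.write("%s %s %s\n" % (project, title,
                                      ",".join(map(str, per_hour))))
```

Because each title then appears once per day instead of up to 24 times, the compressor no longer has to encode the dictionary repeatedly, which is the saving suggested above.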
Now I would feel less frustrated if these were removed from my disk.
Should I continue to do this for the files for 2008, one batch per
month? Or do you have any better ideas?
2009/9/18 Erik Zachte <[hidden email]>:
> I think it is extremely important to keep these files for later analysis by
> historians and others.
> Mathias Schindler also keeps an archive, or at least did till April (Berlin).
> He even bought a dedicated external drive for it.
Right now, I have a single copy of all the files from December 2007 to
April 2009 on a single hard drive. I haven't done any integrity checks
beyond some initial tests. The dataset has some gaps from periods when the
service that produced the files was not working: in some cases there is
just an empty .gz file, in other cases no file was produced at all.
In my spare time, I will try to copy the files from May until now onto this
hard drive until it is full.
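A first integrity pass of the kind described above could look like this: flag empty or truncated .gz files and record a SHA-1 per file so later copies can be verified against the originals. The manifest idea and function names are my own sketch, not an existing tool.

```python
# Sketch: verify one pagecounts .gz file. Hash the raw bytes for later
# comparison, then decompress fully to catch empty or truncated archives.
import gzip
import hashlib
import os

def check_file(path):
    """Return (sha1_hex, ok) for one pagecounts .gz file."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    ok = os.path.getsize(path) > 0
    if ok:
        try:
            with gzip.open(path, "rb") as g:
                while g.read(1 << 20):  # decompress to detect truncation
                    pass
        except (OSError, EOFError):
            ok = False
    return sha1.hexdigest(), ok
```

Writing the hex digests to a manifest file alongside the archive would let anyone re-run the same check after a copy or an upload.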
The situation is rather uncomfortable for me since I am in no way able
to guarantee the integrity and safety of these files for a longer time
frame. While I might continue downloading and "storing" the files, I
would be extremely happy to hear that the full and unabridged set of
files is available a) to anyone b) for an indefinite time span c) free
of charge d) with some backup and data integrity check in place.
Speaking of wish lists, a web-accessible service for working with the data
would be nice. We know for sure that journalists, and hopefully other groups
too, like the data, the numbers and the resulting shiny graphs.