Archive of visitor stats


[Toolserver-l] Archive of visitor stats

erikzachte
> Making the script aware of namespace names would be quite easy.

Yes, it is more a matter of priority than feasibility.

I already use localized namespace names in wikistats, obviously; without
them the dumps can't be interpreted. Each XML (full) archive dump starts
with a list of localized namespace names.
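As a sketch of reading that header: the snippet below pulls the localized namespace names out of the <siteinfo> block at the top of a dump and stops before reading the rest of the file. The function name and the .bz2 packaging are assumptions, not part of anyone's actual tooling.

```python
import bz2
import xml.etree.ElementTree as ET

def read_namespaces(dump_path):
    """Read the localized namespace names from the <siteinfo> header
    of a MediaWiki XML dump (here assumed to be bzip2-compressed)."""
    namespaces = {}
    with bz2.open(dump_path, "rb") as f:
        # iterparse lets us stop right after the header instead of
        # streaming through the whole multi-gigabyte dump
        for event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace URI
            if tag == "namespace":
                key = int(elem.get("key"))
                namespaces[key] = elem.text or ""  # key 0 (main) has no name
            elif tag == "namespaces":
                break  # header finished; ignore the rest of the dump
    return namespaces
```

On a German dump, for example, this would map key 1 to "Diskussion" rather than "Talk".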

I also parse PHP files for localized reserved words like #REDIRECT,
parse other PHP files for language name translations,
and extract many more language name translations from wp:en interwiki links
via the API.

But every such step takes time, needs safeguards (files can be moved or be
temporarily inaccessible; formats change, maybe not in XML, but in PHP for
sure), and requires occasional attention for maintenance.

So for a housekeeping job that almost no one seemed to care about at the
time, I chose to keep it simple (this particular optimization can always be
retrofitted).

If we find a better place to store them than the wikistats server, we might
be able to keep them unfiltered but still condensed into one daily file, as
this speeds up processing greatly, or perhaps repackaged into a monthly file
per wiki.

Erik Zachte




_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Toolserver-l] Archive of visitor stats

Lars Aronsson
In reply to this post by Lars Aronsson
Earlier, I wrote:

> Are visitor stats (as produced by Domas) safely archived
> somewhere...?

As an experiment, I uploaded the files for December 2007 to the
Internet Archive,
http://www.archive.org/details/wikipedia_visitor_stats_200712

It was the first time I had uploaded anything to IA, and since this was
neither sound nor movies, it was filed under "opensource books".
Even though I have a 100 Mbit/s connection, the FTP upload only ran at
2.5 Mbit/s (317 kB/s), and the entire upload took 12 hours.

Even though the pagecounts files (each covering one hour) are compressed,
each one contains the same dictionary (article titles), and I think the
total could be compressed more efficiently (without loss of any
information) if they were unpacked and organized differently. I don't
really have the time and energy to investigate this.
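The reorganization described here can be sketched roughly as follows: merge the 24 hourly files of one day into a single table keyed by (project, title), so each title is stored once per day instead of once per hour, which also matches the "one daily file" condensation mentioned earlier in the thread. The 'project title count bytes' line format matches the pagecounts files; the function names and output layout are hypothetical.

```python
import gzip
from collections import defaultdict

def merge_day(hourly_paths):
    """Merge hourly pagecounts files into one table keyed by
    (project, title): a list of per-hour view counts per title.
    Assumed input line format: 'project title count bytes'."""
    counts = defaultdict(lambda: [0] * len(hourly_paths))
    for hour, path in enumerate(hourly_paths):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split()
                if len(parts) < 3:
                    continue  # skip malformed lines
                project, title, views = parts[0], parts[1], int(parts[2])
                counts[(project, title)][hour] += views
    return counts

def write_day(counts, out_path):
    """Write one line per title: 'project title h0 h1 ... h23'.
    Each title now appears once per day instead of once per hour,
    which should compress noticeably better."""
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for (project, title), hours in sorted(counts.items()):
            f.write(f"{project} {title} {' '.join(map(str, hours))}\n")
```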

Now I would feel less frustrated if these were removed from my disk.

Should I continue to do this for the files for 2008, one batch per
month? Or do you have any better ideas?


--
  Lars Aronsson ([hidden email])
  Aronsson Datateknik - http://aronsson.se


Re: [Toolserver-l] Archive of visitor stats

Mathias Schindler-4
In reply to this post by erikzachte
2009/9/18 Erik Zachte <[hidden email]>:
> I think it is extremely important to keep these files for later analysis by
> historians and others.
>
> Mathias Schindler also keeps an archive, or at least did till April (Berlin
> conference).
> He even bought a dedicated external drive for it.

Right now, I have a single copy of all the files from December 2007 to
April 2009 on a single hard drive. I haven't done any integrity checks
beyond some initial tests. The dataset has some missing spots from when the
service producing the files was not working: in some cases there is just an
empty .gz file, in others no file was produced at all.
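A minimal integrity check of the kind alluded to here might look like the sketch below: it flags hours that are missing, that decompress to zero bytes, or that fail to decompress at all. The 'pagecounts-YYYYMMDD-HH0000.gz' naming is an assumption about the local archive layout, and the function name is hypothetical.

```python
import gzip
import os

def check_day(directory, day):
    """Report missing, empty, or unreadable hourly pagecounts files
    for one day (a datetime.date or datetime.datetime).
    Assumes files are named 'pagecounts-YYYYMMDD-HH0000.gz'."""
    problems = []
    for hour in range(24):
        name = f"pagecounts-{day:%Y%m%d}-{hour:02d}0000.gz"
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            problems.append((name, "missing"))
            continue
        try:
            with gzip.open(path, "rb") as f:
                if not f.read(1):  # valid gzip, but zero decompressed bytes
                    problems.append((name, "empty"))
        except OSError:
            problems.append((name, "corrupt"))  # truncated or not gzip
    return problems
```

Run over a month of directories, this would give exactly the kind of missing-spot inventory described above, and checksums could be recorded in the same pass.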

In my spare time, I will try to copy the files from May until now onto this
hard drive until it is full.

The situation is rather uncomfortable for me, since I am in no way able
to guarantee the integrity and safety of these files over a longer time
frame. While I might continue downloading and "storing" the files, I
would be extremely happy to hear that the full and unabridged set of
files is available a) to anyone, b) for an indefinite time span, c) free
of charge, and d) with some backup and data-integrity checks in place.

Speaking of wish lists, a web-accessible service to work with the data
would be nice. We know for sure that journalists, and hopefully other
audiences too, like the data, the numbers, and the resulting shiny graphs.

Mathias
