Archive of visitor stats

Archive of visitor stats

Lars Aronsson

Are visitor stats (as produced by Domas) safely archived
somewhere, for example on the toolserver, where development
projects can easily access them for analysis?  I have made my own
copies of the files (I guess my plan was to use them, but this
hasn't started yet), but now I'm running out of disk and I
urgently need to clear some space on that server.

I just deleted September 2009 (last 2 weeks) and that freed 9 GB.

The oldest I have is pagecounts-20071209-180000.gz


--
  Lars Aronsson ([hidden email])
  Aronsson Datateknik - http://aronsson.se

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Toolserver-l] Archive of visitor stats

erikzachte
I think it is extremely important to keep these files for later analysis by
historians and others.

Mathias Schindler also keeps an archive, or at least did until April (Berlin
conference).
He even bought a dedicated external drive for it.

I collect files daily and merge 24 hourly files into one daily file.
That saves a lot of disk space and makes processing faster.
Titles with fewer than 10 requests per day are discarded, which also saves a
lot.
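
The merge-and-discard step described above could be sketched roughly as follows. This is a minimal illustration, not Erik's actual script: the hourly line format "project title count bytes" is an assumption that the thread does not state, and merge_hourly and MIN_DAILY_REQUESTS are hypothetical names.

```python
from collections import defaultdict

MIN_DAILY_REQUESTS = 10  # threshold mentioned in the message above


def merge_hourly(hourly_lines):
    """Merge 24 iterables of hourly lines into per-title lists of 24 counts.

    Assumed line format (not stated in the thread): "project title count bytes".
    Titles whose daily total is below MIN_DAILY_REQUESTS are dropped.
    """
    merged = defaultdict(lambda: [0] * 24)
    for hour, lines in enumerate(hourly_lines):
        for line in lines:
            parts = line.split()
            if len(parts) < 3 or not parts[2].isdigit():
                continue  # skip malformed lines
            project, title, count = parts[0], parts[1], int(parts[2])
            merged[(project, title)][hour] += count
    return {key: hours for key, hours in merged.items()
            if sum(hours) >= MIN_DAILY_REQUESTS}
```

The function expects exactly 24 inputs, one per hour of the day, and keeps the per-hour breakdown so that it can be compacted afterwards.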

For the remainder, instead of 24 comma-separated values I use a 'sparse
array', as follows:

B2D15G2 means 2 views in the 2nd hour (0100-0200), 15 in the 4th, 2 in the 7th.
The string starts with the total for the whole day
(redundant, but it eases processing for some purposes),
so the full string is actually 19B2D15G2.

Example:
de Berlie_Doherty 9L2O1Q1R2T3
de Berliet 20E2F1K1M1N2O3P3Q4R2X1
de Berliet_GBC_8_KT 17B1E1J3M2N1O1P1Q1R2S1T1U1V1
de Berlin 8488A116B56C32D56E21F43G98H172I316J531K636L675M601N533O524P508Q510R576S426T492U530V508W328X200
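
To make the encoding concrete, here is a small sketch of an encoder and decoder for the sparse format as described above (letters A to X for the 24 hours, prefixed by the daily total). It is written from the description and examples in this message only; encode_day and decode_day are hypothetical names, not functions from Erik's scripts.

```python
import re
import string

HOUR_LETTERS = string.ascii_uppercase[:24]  # 'A' = 1st hour (0000-0100) ... 'X' = 24th


def encode_day(hourly_counts):
    """Encode 24 hourly counts as '<total><letter><count>...', skipping zero hours."""
    if len(hourly_counts) != 24:
        raise ValueError("expected 24 hourly counts")
    body = "".join(
        f"{HOUR_LETTERS[hour]}{count}"
        for hour, count in enumerate(hourly_counts)
        if count > 0
    )
    return f"{sum(hourly_counts)}{body}"


def decode_day(encoded):
    """Return (total, list of 24 hourly counts) for an encoded string."""
    match = re.fullmatch(r"(\d+)((?:[A-X]\d+)*)", encoded)
    if match is None:
        raise ValueError(f"not a valid sparse day string: {encoded!r}")
    counts = [0] * 24
    for letter, value in re.findall(r"([A-X])(\d+)", match.group(2)):
        counts[HOUR_LETTERS.index(letter)] = int(value)
    return int(match.group(1)), counts
```

For example, decode_day("19B2D15G2") gives a total of 19 with counts 2, 15 and 2 at list indices 1, 3 and 6, and encode_day on that list gives back "19B2D15G2". Because the total is stored redundantly, a reader can also use it as a consistency check against the sum of the hourly values.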

I have files from August 2008. Roughly 3 GB per month now.
And yes, a more permanent, fail-safe and more accessible storage location
would be great.

Erik Zachte


> -----Original Message-----
> From: Frédéric Schütz [mailto:[hidden email]]
> Sent: Thursday, September 17, 2009 22:34
> To: [hidden email]
> Cc: [hidden email]; Erik Zachte
> Subject: Re: [Toolserver-l] Archive of visitor stats
>
> Lars Aronsson wrote:
>
> > Are visitor stats (as produced by Domas) safely archived
> > somewhere, for example on the toolserver, where development
> > projects can easily access them for analysis?  I have made my own
> > copies of the files (I guess my plan was to use them, but this
> > hasn't started yet), but now I'm running out of disk and I
> > urgently need to clear some space on that server.
> >
> > I just deleted September 2009 (last 2 weeks) and that freed 9 GB.
> >
> > The oldest I have is pagecounts-20071209-180000.gz
>
> As Platonides mentioned, they are in /mnt/user-store/stats on the
> toolserver; however, I would not call that "safely archived": one of my
> cron jobs just copies them from Domas's server, and that's it.
>
> At the moment, there should be everything starting from 1 January 2009
> (although part of it disappeared at some point, but I managed to
> recover it).
>
> However, this is definitely not a sustainable solution in the long
> run: the files currently take 335 GB (out of 1.5 TB of total space).
>
> Erik Zachte stores archives of visitor stats in a better format,
> aggregating some of the older data and storing several days of data in
> one file. I started looking into these files earlier this year,
> planning to spend some time playing with this data. One of my ideas was
> to replicate the statistical data that is on the WMF stats server
> somewhere on the toolserver -- and do it "officially" and not just by
> copying files using a personal cron job. Unfortunately, "real life"
> took over and I did not manage to continue this (and still can't).
> However, if there is any interest in improving the situation, I'd be
> glad to look into it as soon as I can.
>
> I'm cc'ing Erik, who may have more to say.
>
> Cheers,
>
> Frédéric




Re: [Toolserver-l] Archive of visitor stats

Robert Rohde
2009/9/17 Erik Zachte <[hidden email]>:

> I think it is extremely important to keep these files for later analysis by
> historians and others.
>
> Mathias Schindler also keeps an archive, or at least did until April (Berlin
> conference).
> He even bought a dedicated external drive for it.
>
> I collect files daily and merge 24 hourly files into one daily file.
> That saves a lot of disk space and makes processing faster.
> Titles with fewer than 10 requests per day are discarded, which also saves a
> lot.

Careful, a recent analysis I did suggested that 15% of all page
requests for articles on Wikipedia are for topics requested less than
once per hour.  There are a very large number of pages that rarely see
hits, but collectively the traffic to such topics is important.  You
could end up biasing certain kinds of analysis if you always exclude
the rarely visited pages.

-Robert Rohde
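
The bias described here can be estimated for any set of counts by computing what share of all requests goes to titles below a cut-off. The sketch below is only an illustration under that reading; it is not the unpublished analysis itself, and tail_share is a hypothetical helper name.

```python
def tail_share(counts_by_title, threshold):
    """Fraction of all requests that go to titles with fewer than `threshold` requests."""
    total = sum(counts_by_title.values())
    if total == 0:
        return 0.0
    tail = sum(count for count in counts_by_title.values() if count < threshold)
    return tail / total
```

Running this on a full day of merged counts before and after a discard threshold is applied would show how much traffic the threshold removes from later analysis.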


Re: [Toolserver-l] Archive of visitor stats

Steve Bennett-8
2009/9/18 Robert Rohde <[hidden email]>:
> Careful, a recent analysis I did suggested that 15% of all page
> requests for articles on Wikipedia are for topics requested less than
> once per hour.  There are a very large number of pages that rarely see
> hits, but collectively the traffic to such topics is important.  You
> could end up biasing certain kinds of analysis if you always exclude
> the rarely visited pages.

Is there a link to that analysis? It would be interesting to see which
are the least requested articles, for example.

Steve


Re: [Toolserver-l] Archive of visitor stats

Robert Rohde
On Thu, Sep 17, 2009 at 6:24 PM, Steve Bennett <[hidden email]> wrote:

> 2009/9/18 Robert Rohde <[hidden email]>:
>> Careful, a recent analysis I did suggested that 15% of all page
>> requests for articles on Wikipedia are for topics requested less than
>> once per hour.  There are a very large number of pages that rarely see
>> hits, but collectively the traffic to such topics is important.  You
>> could end up biasing certain kinds of analysis if you always exclude
>> the rarely visited pages.
>
> Is there a link to that analysis? It would be interesting to see which
> are the least requested articles, for example.

That particular result is unpublished.  I could make you a list of
infrequently viewed articles, but it would be quite long.

-Robert Rohde


Re: [Toolserver-l] Archive of visitor stats

Steve Bennett-8
On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde <[hidden email]> wrote:
> That particular result is unpublished.  I could make you a list of
> infrequently viewed articles, but it would be quite long.

Could you make a list of the 100 least viewed? Or are there a large
number which are essentially equal?

Steve


Re: [Toolserver-l] Archive of visitor stats

Brian J Mingus
On Thu, Sep 17, 2009 at 9:25 PM, Steve Bennett <[hidden email]> wrote:

> On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde <[hidden email]> wrote:
> > That particular result is unpublished.  I could make you a list of
> > infrequently viewed articles, but it would be quite long.
>
> Could you make a list of the 100 least viewed? Or are there a large
> number which are essentially equal?
>
> Steve
>

There is a strong correlation between start/stub quality articles and the
number of times they are viewed.

Re: [Toolserver-l] Archive of visitor stats

Brian J Mingus
On Thu, Sep 17, 2009 at 9:28 PM, Brian <[hidden email]> wrote:

>
>
> On Thu, Sep 17, 2009 at 9:25 PM, Steve Bennett <[hidden email]>wrote:
>
>> On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde <[hidden email]> wrote:
>> > That particular result is unpublished.  I could make you a list of
>> > infrequently viewed articles, but it would be quite long.
>>
>> Could you make a list of the 100 least viewed? Or are there a large
>> number which are essentially equal?
>>
>> Steve
>>
>
> There is a strong correlation between start/stub quality articles and the
> number of times they are viewed.
>

Further correlated with the number of edits.

Re: [Toolserver-l] Archive of visitor stats

Steve Bennett-8
In reply to this post by Brian J Mingus
On Fri, Sep 18, 2009 at 1:28 PM, Brian <[hidden email]> wrote:
> There is a strong correlation between start/stub quality articles and the
> number of times they are viewed.

Ah, ok. What about a list of exceptions to that: articles over 1000
characters, that have been around more than a year, and still receive
less than a hit a day or something. I'm asking because perhaps
something like this could help inform WP:NOT, WP:N etc.

Steve


Re: [Toolserver-l] Archive of visitor stats

Robert Rohde
In reply to this post by Steve Bennett-8
On Thu, Sep 17, 2009 at 8:25 PM, Steve Bennett <[hidden email]> wrote:
> On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde <[hidden email]> wrote:
>> That particular result is unpublished.  I could make you a list of
>> infrequently viewed articles, but it would be quite long.
>
> Could you make a list of the 100 least viewed? Or are there a large
> number which are essentially equal?

My sample consisted of collating 30 non-consecutive hours of data on
enwiki traffic where each hour was randomly chosen from any point
during the last 8 months.  This was filtered to only include page
titles that were valid mainspace pages.

In those 30 hours, there are 1.36 million valid article titles that
are viewed exactly once [1].

Examples include:

129342_Ependes
1421_in_literature
Antiprotonic_helium
Antonella_Mularoni
Madhusoodhanan_Nair
Blue_Murder_(play)
Ozonotherapy
Veronika_Krausas
Verret,_New_Brunswick
Bare_Truth_(Nat_album)

As you can see, these are obscure topics, but they are not necessarily
crazy topics.  If I were to repeat it with a longer baseline (say 1000
hours rather than 30) I suspect you might get more interesting
information on the tail, but right now probably the best I can say is
that a cumulatively significant amount of traffic goes to relatively
obscure pages.

-Robert Rohde

[1] Note: Because the traffic data is based on URL request strings, and
some URL strings map to the same page, e.g. Blue_Ocean and
Blue%20Ocean, the number of valid article titles is not necessarily
the same as the number of distinct pages.  For practical reasons my
analysis was based on the URL strings, and so probably overcounts the
number of distinct articles involved, and to a degree overstates the
fraction of traffic to obscure pages.
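
One partial remedy for the overcounting described in the footnote is to normalize request strings before counting. The following is a minimal sketch, assuming that percent-decoding and replacing spaces with underscores is enough to merge variants such as Blue_Ocean and Blue%20Ocean; it does not resolve redirects or case differences, and normalize_title and count_by_page are hypothetical names, not part of the analysis described above.

```python
from collections import Counter
from urllib.parse import unquote


def normalize_title(raw):
    """Percent-decode a requested title and use underscores for spaces."""
    return unquote(raw).replace(" ", "_")


def count_by_page(requests):
    """Sum view counts over (raw_title, count) pairs after normalization."""
    totals = Counter()
    for raw, count in requests:
        totals[normalize_title(raw)] += count
    return totals
```

With the footnote's example, count_by_page([("Blue_Ocean", 1), ("Blue%20Ocean", 1)]) yields a single entry, Blue_Ocean, with a count of 2, so that page would no longer appear among the titles viewed exactly once.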


Re: [Toolserver-l] Archive of visitor stats

Brian J Mingus
On Thu, Sep 17, 2009 at 9:55 PM, Robert Rohde <[hidden email]> wrote:

> On Thu, Sep 17, 2009 at 8:25 PM, Steve Bennett <[hidden email]>
> wrote:
> > On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde <[hidden email]>
> wrote:
> >> That particular result is unpublished.  I could make you a list of
> >> infrequently viewed articles, but it would be quite long.
> >
> > Could you make a list of the 100 least viewed? Or are there a large
> > number which are essentially equal?
>
> My sample consisted of collating 30 non-consecutive hours of data on
> enwiki traffic where each hour was randomly chosen from any point
> during the last 8 months.  This was filtered to only include page
> titles that were valid mainspace pages.
>
> In those 30 hours, there are 1.36 million valid article titles that
> are viewed exactly once [1].
>
> Examples include:
>
> 129342_Ependes
> 1421_in_literature
> Antiprotonic_helium
> Antonella_Mularoni
> Madhusoodhanan_Nair
> Blue_Murder_(play)
> Ozonotherapy
> Veronika_Krausas
> Verret,_New_Brunswick
> Bare_Truth_(Nat_album)
>
> As you can see, these are obscure topics, but they are not necessarily
> crazy topics.  If I were to repeat it with a longer baseline (say 1000
> hours rather than 30) I suspect you might get more interesting
> information on the tail, but right now probably the best I can say is
> that a cumulatively significant amount of traffic goes to relatively
> obscure pages.
>
> -Robert Rohde
>
> [1] Note: Because the traffic data is based on URL request strings, and
> some URL strings map to the same page, e.g. Blue_Ocean and
> Blue%20Ocean, the number of valid article titles is not necessarily
> the same as the number of distinct pages.  For practical reasons my
> analysis was based on the URL strings, and so probably overcounts the
> number of distinct articles involved, and to a degree overstates the
> fraction of traffic to obscure pages.


How sure are you that they were viewed by a person and not a bot?

Re: [Toolserver-l] Archive of visitor stats

erikzachte
In reply to this post by Robert Rohde
Sure, info gets lost. And the Long Tail is meaningful for some research no
doubt.
But my resources are finite.

Actually I do store some all-inclusive counts in the compacted 24 hr file:

# Lines starting with an at sign (@) show totals per 'namespace' (including omitted counts for low-traffic articles)
# Since valid namespace strings are not known in the compression script, any string followed by a colon (:) counts as a possible namespace string
# Please reconcile with real namespace name strings later
# 'namespaces' with count < 5 are combined in 'Other' (on larger wikis these are surely false positives)

@ aa.z Category 9
@ aa.z File 20
@ aa.z Image 9
@ aa.z MediaWiki 20
@ aa.z NamespaceArticles 163
@ aa.z Special 97
@ aa.z Talk 17
@ aa.z User 35
@ aa.z Wikipedia 16
@ aa.z -other- 11
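
The per-'namespace' totals above follow a simple rule: whatever precedes the first colon is treated as a possible namespace, and rare prefixes are folded together. A rough sketch of that rule, written from the comment lines above and not taken from the actual compression script, might look like this; namespace_totals is a hypothetical name, and the labels NamespaceArticles and -other- are copied from the sample output.

```python
from collections import Counter

MIN_NAMESPACE_COUNT = 5  # smaller 'namespaces' are folded into '-other-'


def namespace_totals(title_counts):
    """Total requests per candidate namespace for (title, count) pairs."""
    totals = Counter()
    for title, count in title_counts:
        prefix, sep, _rest = title.partition(":")
        # A title without a colon is counted as a plain article.
        totals[prefix if sep else "NamespaceArticles"] += count
    folded = Counter()
    for name, count in totals.items():
        keep = count >= MIN_NAMESPACE_COUNT or name == "NamespaceArticles"
        folded[name if keep else "-other-"] += count
    return folded
```

Article titles that happen to contain a colon, such as 2001:A_Space_Odyssey, are the false positives the comment lines warn about; under this rule they end up under their own prefix or, when rare, in -other-.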

Erik Zachte



> -----Original Message-----
> From: [hidden email] [mailto:wikitech-l-
> [hidden email]] On Behalf Of Robert Rohde
> Sent: Friday, September 18, 2009 02:33
> To: Wikimedia developers
> Cc: Mathias Schindler; Frédéric Schütz; toolserver-
> [hidden email]
> Subject: Re: [Wikitech-l] [Toolserver-l] Archive of visitor stats
>
> 2009/9/17 Erik Zachte <[hidden email]>:
> > I think it is extremely important to keep these files for later
> > analysis by historians and others.
> >
> > Mathias Schindler also keeps an archive, or at least did until April
> > (Berlin conference).
> > He even bought a dedicated external drive for it.
> >
> > I collect files daily and merge 24 hourly files into one daily file.
> > That saves a lot of disk space and makes processing faster.
> > Titles with fewer than 10 requests per day are discarded, which also
> > saves a lot.
>
> Careful, a recent analysis I did suggested that 15% of all page
> requests for articles on Wikipedia are for topics requested less than
> once per hour.  There are a very large number of pages that rarely see
> hits, but collectively the traffic to such topics is important.  You
> could end up biasing certain kinds of analysis if you always exclude
> the rarely visited pages.
>
> -Robert Rohde




Re: [Toolserver-l] Archive of visitor stats

Robert Rohde
In reply to this post by Brian J Mingus
On Thu, Sep 17, 2009 at 8:58 PM, Brian <[hidden email]> wrote:

> On Thu, Sep 17, 2009 at 9:55 PM, Robert Rohde <[hidden email]> wrote:
>
>> On Thu, Sep 17, 2009 at 8:25 PM, Steve Bennett <[hidden email]>
>> wrote:
>> > On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde <[hidden email]>
>> wrote:
>> >> That particular result is unpublished.  I could make you a list of
>> >> infrequently viewed articles, but it would be quite long.
>> >
>> > Could you make a list of the 100 least viewed? Or are there a large
>> > number which are essentially equal?
>>
>> My sample consisted of collating 30 non-consecutive hours of data on
>> enwiki traffic where each hour was randomly chosen from any point
>> during the last 8 months.  This was filtered to only include page
>> titles that were valid mainspace pages.
>>
>> In those 30 hours, there are 1.36 million valid article titles that
>> are viewed exactly once [1].
>>
>> Examples include:
>>
>> 129342_Ependes
>> 1421_in_literature
>> Antiprotonic_helium
>> Antonella_Mularoni
>> Madhusoodhanan_Nair
>> Blue_Murder_(play)
>> Ozonotherapy
>> Veronika_Krausas
>> Verret,_New_Brunswick
>> Bare_Truth_(Nat_album)
>>
>> As you can see, these are obscure topics, but they are not necessarily
>> crazy topics.  If I were to repeat it with a longer baseline (say 1000
>> hours rather than 30) I suspect you might get more interesting
>> information on the tail, but right now probably the best I can say is
>> that a cumulatively significant amount of traffic goes to relatively
>> obscure pages.
>>
>> -Robert Rohde
>>
>> [1] Note: Because the traffic data is based on URL request strings, and
>> some URL strings map to the same page, e.g. Blue_Ocean and
>> Blue%20Ocean, the number of valid article titles is not necessarily
>> the same as the number of distinct pages.  For practical reasons my
>> analysis was based on the URL strings, and so probably overcounts the
>> number of distinct articles involved, and to a degree overstates the
>> fraction of traffic to obscure pages.
>
>
> How sure are you that they were viewed by a person and not a bot?

There is no differentiation between people and bots.  (Some of these
things are why it is an unpublished analysis.  ;-)  I was actually
using traffic data for a totally different purpose, but decided to
look at things like obscure pages while I was at it.)

-Robert Rohde


Re: [Toolserver-l] Archive of visitor stats

Steve Bennett-8
In reply to this post by Robert Rohde
Great, thank you. Even that is enough to begin to draw some conclusions.

> 129342_Ependes

Lol, the stereotypical asteroid article. Well, that's one more hit
than I would expect it to get.

> 1421_in_literature

My eye was drawn to [[1421 in literature]], but that has always been a
redirect, so perhaps the one hit was the person creating it. :)

> Antiprotonic_helium

Looks like a decent article! But it was orphaned...so I linked to it
from Antiproton.

> Antonella_Mularoni

Excellent article - pity no traffic.

> Madhusoodhanan_Nair

Redirect to borderline vanity

> Blue_Murder_(play)

A redirect

> Ozonotherapy

Redirect to fringe science

> Veronika_Krausas

Ok article, but pretty obscure subject.

> Verret,_New_Brunswick

Substub.

> Bare_Truth_(Nat_album)

A crappy article about what sounds like an even crappier album. With
offensive album art to boot.


Hmm, what conclusion to draw from all this? Most of those articles
were redirects or crappy articles - Antonella_Mularoni was the only
real exception.

Steve


Re: [Toolserver-l] Archive of visitor stats

Brian J Mingus
In reply to this post by Robert Rohde
On Thu, Sep 17, 2009 at 10:18 PM, Robert Rohde <[hidden email]> wrote:

> On Thu, Sep 17, 2009 at 8:58 PM, Brian <[hidden email]> wrote:
> > On Thu, Sep 17, 2009 at 9:55 PM, Robert Rohde <[hidden email]> wrote:
> >
> >> On Thu, Sep 17, 2009 at 8:25 PM, Steve Bennett <[hidden email]>
> >> wrote:
> >> > On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde <[hidden email]>
> >> wrote:
> >> >> That particular result is unpublished.  I could make you a list of
> >> >> infrequently viewed articles, but it would be quite long.
> >> >
> >> > Could you make a list of the 100 least viewed? Or are there a large
> >> > number which are essentially equal?
> >>
> >> My sample consisted of collating 30 non-consecutive hours of data on
> >> enwiki traffic where each hour was randomly chosen from any point
> >> during the last 8 months.  This was filtered to only include page
> >> titles that were valid mainspace pages.
> >>
> >> In those 30 hours, there are 1.36 million valid article titles that
> >> are viewed exactly once [1].
> >>
> >> Examples include:
> >>
> >> 129342_Ependes
> >> 1421_in_literature
> >> Antiprotonic_helium
> >> Antonella_Mularoni
> >> Madhusoodhanan_Nair
> >> Blue_Murder_(play)
> >> Ozonotherapy
> >> Veronika_Krausas
> >> Verret,_New_Brunswick
> >> Bare_Truth_(Nat_album)
> >>
> >> As you can see, these are obscure topics, but they are not necessarily
> >> crazy topics.  If I were to repeat it with a longer baseline (say 1000
> >> hours rather than 30) I suspect you might get more interesting
> >> information on the tail, but right now probably the best I can say is
> >> that a cumulatively significant amount of traffic goes to relatively
> >> obscure pages.
> >>
> >> -Robert Rohde
> >>
> >> [1] Note: Because the traffic data is based on URL request strings, and
> >> some URL strings map to the same page, e.g. Blue_Ocean and
> >> Blue%20Ocean, the number of valid article titles is not necessarily
> >> the same as the number of distinct pages.  For practical reasons my
> >> analysis was based on the URL strings, and so probably overcounts the
> >> number of distinct articles involved, and to a degree overstates the
> >> fraction of traffic to obscure pages.
> >
> >
> > How sure are you that they were viewed by a person and not a bot?
>
> There is no differentiation between people and bots.  (Some of these
> things are why it is an unpublished analysis.  ;-)  I was actually
> using traffic data for a totally different purpose, but decided to
> look at things like obscure pages while I was at it.)
>
> -Robert Rohde
>
>
Oh I see.  It would be reassuring to know that there were a million or so
articles not viewed at all?

Re: [Toolserver-l] Archive of visitor stats

Dmitriy Sintsov
In reply to this post by Steve Bennett-8
* Steve Bennett <[hidden email]> [Fri, 18 Sep 2009 14:20:53 +1000]:
> > 1421_in_literature
>
> My eye was drawn to [[1421 in literature]], but that has always been a
> redirect, so perhaps the one hit was the person creating it. :)
>
It redirects to the 15th-century literature article, which mentions no
books written in 1421. Lots of other years of the same century, but
no 1421. And look at the infobox table in the top-right corner: there
seem to be "every year in literature" links or redirects. Probably
created by some bot.
Dmitriy


Re: [Toolserver-l] Archive of visitor stats

Nikola Smolenski
In reply to this post by Steve Bennett-8
On Friday 18 September 2009 06:20:53, Steve Bennett wrote:
> Hmm, what conclusion to draw from all this? Most of those articles
> were redirects or crappy articles - Antonella_Mularoni was the only
> real exception.

I find Antiprotonic helium to be a very interesting and sufficiently
informative article - certainly not crappy.


Re: [Toolserver-l] Archive of visitor stats

Platonides
In reply to this post by erikzachte
Erik Zachte wrote:

> Sure, info gets lost. And the Long Tail is meaningful for some research no
> doubt.
> But my resources are finite.
>
> Actually I do store some all-inclusive counts in the compacted 24 hr file:
>
> # Lines starting with an at sign (@) show totals per 'namespace' (including omitted counts for low-traffic articles)
> # Since valid namespace strings are not known in the compression script, any string followed by a colon (:) counts as a possible namespace string
> # Please reconcile with real namespace name strings later
> # 'namespaces' with count < 5 are combined in 'Other' (on larger wikis these are surely false positives)

Making the script aware of namespace names would be quite easy.



Re: [Toolserver-l] Archive of visitor stats

Robert Rohde
On Fri, Sep 18, 2009 at 10:02 AM, Platonides <[hidden email]> wrote:

> Erik Zachte wrote:
>> Sure, info gets lost. And the Long Tail is meaningful for some research no
>> doubt.
>> But my resources are finite.
>>
>> Actually I do store some all-inclusive counts in the compacted 24 hr file:
>>
>> # Lines starting with an at sign (@) show totals per 'namespace' (including omitted counts for low-traffic articles)
>> # Since valid namespace strings are not known in the compression script, any string followed by a colon (:) counts as a possible namespace string
>> # Please reconcile with real namespace name strings later
>> # 'namespaces' with count < 5 are combined in 'Other' (on larger wikis these are surely false positives)
>
> Making the script aware of namespace names would be quite easy.

For English this is obviously true, but Erik writes scripts intended
to be language-agnostic and to work with all WMF projects.  While it is
certainly possible to teach the script about namespaces in the general
sense, it would take quite a bit of effort to call up the local namespace
names and all legitimate variants for every different project/language
in turn.

-Robert Rohde
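
As an illustration of what "calling up the local namespace names" could involve, the sketch below builds a MediaWiki API siteinfo query for one wiki and collects the local names, canonical names and aliases from a response. It is a hedged example, not something proposed in the thread: the response layout (names under the "*" key, as in the API's older JSON format) is an assumption, and siteinfo_url and namespace_names are hypothetical helper names. No network request is made here; fetching the URL for each project is left out.

```python
import json
from urllib.parse import urlencode


def siteinfo_url(host):
    """Build a siteinfo query URL for one wiki, e.g. host='sv.wikipedia.org'."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases",
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


def namespace_names(siteinfo_json):
    """Collect local, canonical and alias namespace names from a response body."""
    query = json.loads(siteinfo_json)["query"]
    names = set()
    for ns in query.get("namespaces", {}).values():
        # Assumed layout: local name under "*", English name under "canonical".
        names.update(n for n in (ns.get("*"), ns.get("canonical")) if n)
    for alias in query.get("namespacealiases", []):
        if alias.get("*"):
            names.add(alias["*"])
    return names
```

Names returned this way contain spaces rather than underscores, so they would need the same normalization as the request strings before being compared with title prefixes, and the lookup would have to be repeated for every project and language.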


Re: [Toolserver-l] Archive of visitor stats

Lars Aronsson
In reply to this post by Steve Bennett-8
Steve Bennett wrote:

> Is there a link to that analysis? It would be interesting to see
> which are the least requested articles, for example.

I don't have that, but you can visit
http://stats.grok.se/en/200909/Mineral_County,_Montana to find out
that this article was viewed 368 times during August 2009,
whereas http://stats.grok.se/en/200908/Tabaning_Sita_Forest_Park
was viewed only 29 times.

On sv.wikipedia there is a "gadget" for adding a "tab" to each
article, a tab that links to this "stats" website.

In word frequency analysis, the expected case is that half of the
different words in any text are used only once, a quarter is used
only twice, an eighth is used 3 or 4 times, etc.  There are
different names for such models: Zipf's law, power law
distribution, long tail, and so on. More often than not, such
terms are used without fully understanding the math behind them.
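
Whether a given text actually follows this pattern is easy to check empirically. The sketch below computes, for any text, what share of its distinct words occurs exactly once, twice, and so on; it makes no claim about what the shares should be, and frequency_spectrum is a hypothetical name with a deliberately crude tokenizer.

```python
import re
from collections import Counter


def frequency_spectrum(text):
    """Map n to the share of distinct words that occur exactly n times in text."""
    words = re.findall(r"\w+", text.lower())
    word_counts = Counter(words)
    spectrum = Counter(word_counts.values())
    distinct = len(word_counts)
    return {n: k / distinct for n, k in sorted(spectrum.items())}
```

The same function applied to page titles and their view counts, instead of words, would give the corresponding picture for article traffic.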

This is a little different from the case of Wikipedia articles,
where some articles are perhaps never viewed.  But we should
expect that a large number of articles are viewed very seldom.

So if you ask which articles are least requested, you should
probably expect a list of 1.5 million articles (of the 3 million
in the English Wikipedia).  It's similar to asking which words are
least frequently used.  With time, we will add another 3 million
articles about things that are even less interesting, and a few
thousand articles on more interesting topics.

It's a different case if you ask the question for a limited set of
articles, which you already know something about, for example
those about the 56 counties in Montana, which should all be
equally boring, or where interest should perhaps be proportional
to the population. Which are more or less requested?  Is something
wrong with some of those articles?


--
  Lars Aronsson ([hidden email])
  Aronsson Datateknik - http://aronsson.se
