Article hit rates -- research at the University of Minnesota


Reid Priedhorsky-2
Dear Wikitechnicians,

My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens
Research, which is the human-computer interaction group at the
University of Minnesota.

We are currently working on some research which is investigating
Wikipedia contribution and vandalism. To this end, statistics on the
view rate of different articles would be extremely helpful to us --
something along the lines of Leon Weber's WikiCharts tool, but with a
larger limit (ideally all 1.7 million articles).

It seems to me that the easiest way to accomplish this would be to get
copies of your sampled Squid logs (as described on
<http://lists.wikimedia.org/pipermail/wikitech-l/2007-January/029000.html>
and its links). We do not need the client IP or any other similarly
sensitive data, though if you gave it to us we would protect it
carefully as we protect the other sensitive research data we handle.

Would it be possible for us to have access to these log files?

If not, I would love to begin a discussion on what it would be possible
for us to access.

Your help would be greatly appreciated. Please let me know if you have
any questions.

Thanks,

Reid

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Article hit rates -- research at the University of Minnesota

Gregory Maxwell
Greetings, describe for me what your ideal data would look like.


On 3/28/07, Reid Priedhorsky <[hidden email]> wrote:

> Dear Wikitechnicians,
>
> My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens
> Research, which is the human-computer interaction group at the
> University of Minnesota.
>
> We are currently working on some research which is investigating
> Wikipedia contribution and vandalism. To this end, statistics on the
> view rate of different articles would be extremely helpful to us --
> something along the lines of Leon Weber's WikiCharts tool, but with a
> larger limit (ideally all 1.7 million articles).
>
> It seems to me that the easiest way to accomplish this would be to get
> copies of your sampled Squid logs (as described on
> <http://lists.wikimedia.org/pipermail/wikitech-l/2007-January/029000.html>
> and its links). We do not need the client IP or any other similarly
> sensitive data, though if you gave it to us we would protect it
> carefully as we protect the other sensitive research data we handle.
>
> Would it be possible for us to have access to these log files?
>
> If not, I would love to begin a discussion on what it would be possible
> for us to access.
>
> Your help would be greatly appreciated. Please let me know if you have
> any questions.
>
> Thanks,
>
> Reid
>
> _______________________________________________
> Wikitech-l mailing list
> [hidden email]
> http://lists.wikimedia.org/mailman/listinfo/wikitech-l
>


Re: Article hit rates -- research at the University of Minnesota

Tim Starling-2
In reply to this post by Reid Priedhorsky-2
Reid Priedhorsky wrote:

> Dear Wikitechnicians,
>
> My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens
> Research, which is the human-computer interaction group at the
> University of Minnesota.
>
> We are currently working on some research which is investigating
> Wikipedia contribution and vandalism. To this end, statistics on the
> view rate of different articles would be extremely helpful to us --
> something along the lines of Leon Weber's WikiCharts tool, but with a
> larger limit (ideally all 1.7 million articles).

Producing such statistics will be a Google Summer of Code project this
summer. If you can't wait that long, then we can give you a sampled,
anonymised log stream to analyse.

-- Tim Starling



Re: Article hit rates -- research at the University of Minnesota

Reid Priedhorsky-2
Tim Starling wrote:

> Reid Priedhorsky wrote:
>> Dear Wikitechnicians,
>>
>> My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens
>> Research, which is the human-computer interaction group at the
>> University of Minnesota.
>>
>> We are currently working on some research which is investigating
>> Wikipedia contribution and vandalism. To this end, statistics on the
>> view rate of different articles would be extremely helpful to us --
>> something along the lines of Leon Weber's WikiCharts tool, but with a
>> larger limit (ideally all 1.7 million articles).
>
> Producing such statistics will be a Google Summer of Code project this
> summer. If you can't wait that long, then we can give you a sampled,
> anonymised log stream to analyse.

Yes, summer would be too late; anonymised logs would be excellent for
our purposes. Does "stream" mean that we would need to write a program
to listen to the real-time log stream, or could you give us files?

Gregory Maxwell wrote:
 > Greetings, describe for me what your ideal data would look like.

Ideal data would be log files that just looked like:

   Main Page\t1169499304.066

i.e., article titles as they appear in the XML dumps and request time.

A close second choice would be simply-anonymized logs, e.g.:

   sq18.wikimedia.org 1715898 1169499304.066 0 - TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - -

If the logs still contain duplicates due to requests being forwarded
between squids, we'd need pointers on how to resolve those.

Please let me know what the next step is. Thanks for your help!

Reid


Re: Article hit rates -- research at the University of Minnesota

Oldak
On 29/03/07, Reid Priedhorsky <[hidden email]> wrote:

> Tim Starling wrote:
> > Reid Priedhorsky wrote:
> >> Dear Wikitechnicians,
> >>
> >> My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens
> >> Research, which is the human-computer interaction group at the
> >> University of Minnesota.
> >>
> >> We are currently working on some research which is investigating
> >> Wikipedia contribution and vandalism. To this end, statistics on the
> >> view rate of different articles would be extremely helpful to us --
> >> something along the lines of Leon Weber's WikiCharts tool, but with a
> >> larger limit (ideally all 1.7 million articles).
> >
> > Producing such statistics will be a Google Summer of Code project this
> > summer. If you can't wait that long, then we can give you a sampled,
> > anonymised log stream to analyse.
>
> Yes, summer would be too late; anonymised logs would be excellent for
> our purposes. Does "stream" mean that we would need to write a program
> to listen to the real-time log stream, or could you give us files?
>
> Gregory Maxwell wrote:
>  > Greetings, describe for me what your ideal data would look like.
>
> Ideal data would be log files that just looked like:
>
>    Main Page\t1169499304.066
>
> i.e., article titles as they appear in the XML dumps and request time.
>
> A close second choice would be simply-anonymized logs, e.g.:
>
>    sq18.wikimedia.org 1715898 1169499304.066 0 - TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - -
>
> If the logs still contain duplicates due to requests being forwarded
> between squids, we'd need pointers on how to resolve those.
>
> Please let me know what the next step is. Thanks for your help!
>
> Reid

Just a small aside: please keep us up-to-date on the outcome of the
research over on the Wiki-research-l mailing list. It's always
interesting (and potentially useful) to see how Wikipedia is used.

--
Oldak Quill ([hidden email])


Re: Article hit rates -- research at the University of Minnesota

Steve Bennett-8
On 3/30/07, Oldak Quill <[hidden email]> wrote:
> Just a small aside: please keep us up-to-date on the outcome of the
> research over on the Wiki-research-l mailing list. It's always
> interesting (and potentially useful) to see how Wikipedia is used.

Yes, I second this. We seem to get quite a few posts of the type "We
have been intensively researching some feature of Wikipedia for the
last 2 years, and we just need one detail to continue our research".
And that's the last we hear of them. I'd love to hear the results of
it - it would benefit our project a lot to have some hard
statistics.[1]

Steve
[1] Is that an oxymoron?


Re: Article hit rates -- research at the University of Minnesota

Reid Priedhorsky-2
Steve Bennett wrote:

> On 3/30/07, Oldak Quill <[hidden email]> wrote:
>> Just a small aside: please keep us up-to-date on the outcome of the
>> research over on the Wiki-research-l mailing list. It's always
>> interesting (and potentially useful) to see how Wikipedia is used.
>
> Yes, I second this. We seem to get quite a few posts of the type "We
> have been intensively researching some feature of Wikipedia for the
> last 2 years, and we just need one detail to continue our research".
> And that's the last we hear of them. I'd love to hear the results of
> it - it would benefit our project a lot to have some hard
> statistics.[1]

Certainly. Our goal is to publish in a standard HCI venue, and those
publications are public info. I've put it on my to-do list to send a
note to wiki-research-l when our results are available.

Take care,

Reid

p.s. Thanks for the pointer to that list -- I wasn't aware of it, and
its content looks quite interesting.


Re: Article hit rates -- research at the University of Minnesota

Tim Starling-2
In reply to this post by Reid Priedhorsky-2
Reid Priedhorsky wrote:

> Tim Starling wrote:
>> Reid Priedhorsky wrote:
>>> Dear Wikitechnicians,
>>>
>>> My name is Reid Priedhorsky, and I'm a Ph.D. student at GroupLens
>>> Research, which is the human-computer interaction group at the
>>> University of Minnesota.
>>>
>>> We are currently working on some research which is investigating
>>> Wikipedia contribution and vandalism. To this end, statistics on the
>>> view rate of different articles would be extremely helpful to us --
>>> something along the lines of Leon Weber's WikiCharts tool, but with a
>>> larger limit (ideally all 1.7 million articles).
>> Producing such statistics will be a Google Summer of Code project this
>> summer. If you can't wait that long, then we can give you a sampled,
>> anonymised log stream to analyse.
>
> Yes, summer would be too late; anonymised logs would be excellent for
> our purposes. Does "stream" mean that we would need to write a program
> to listen to the real-time log stream, or could you give us files?
>
> Gregory Maxwell wrote:
>  > Greetings, describe for me what your ideal data would look like.
>
> Ideal data would be log files that just looked like:
>
>    Main Page\t1169499304.066
>
> i.e., article titles as they appear in the XML dumps and request time.
>
> A close second choice would be simply-anonymized logs, e.g.:
>
>    sq18.wikimedia.org 1715898 1169499304.066 0 - TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - -
>
> If the logs still contain duplicates due to requests being forwarded
> between squids, we'd need pointers on how to resolve those.
>
> Please let me know what the next step is. Thanks for your help!
>
> Reid

We received a very similar request from Vrije Universiteit, and we're now
sending them a 1/10 sampled stream consisting of timestamp and URL, with
duplicates removed, real-time via UDP. It would be easier for us if we
could send you roughly the same thing. So for example:

1169499304.066 http://en.wikipedia.org/wiki/Main_Page
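
As an illustration only, a consumer of such a stream might look roughly like the Python sketch below. It assumes each UDP datagram carries one or more newline-separated "timestamp URL" lines; the framing, the function names and any port number are hypothetical and not taken from this thread.

```python
import socket
from collections import Counter


def parse_line(line):
    """Split one 'timestamp URL' line into (timestamp as float, URL)."""
    timestamp, url = line.strip().split(" ", 1)
    return float(timestamp), url


def tally(lines):
    """Count how many times each URL appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            _, url = parse_line(line)
            counts[url] += 1
    return counts


def listen(port, counts):
    """Receive datagrams forever and add their lines to counts.

    Assumption: each datagram holds one or more newline-separated lines.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        datagram, _sender = sock.recvfrom(65535)
        text = datagram.decode("utf-8", errors="replace")
        counts.update(tally(text.splitlines()))
```

Calling listen() with an agreed port and an empty Counter would accumulate per-URL counts until interrupted. With a 1/10 sample, multiplying each count by 10 gives a rough estimate of the true request count.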

We don't have any system yet for periodically rotating, analysing and
sending logs, so streams are certainly easier for us. We get somewhere on
the order of 1.5 billion requests per day, and the simplified log line
above has an average length of 97 bytes, so it's an unsampled data rate of
about 135 GB per day. You'll probably want us to sample that down before
we send it to you.
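
The arithmetic behind that figure can be reproduced with a short Python sketch. It uses only the two numbers quoted above; the 1/10 rate passed to the helper is just an example value, and the constant and function names are made up for illustration. The product is 145.5 billion bytes, which is about 135.5 GiB, so the "135 GB" figure appears to be in binary units.

```python
REQUESTS_PER_DAY = 1_500_000_000  # "on the order of 1.5 billion requests per day"
BYTES_PER_LINE = 97               # average length of the simplified log line

# Unsampled volume in bytes and in binary gigabytes (GiB).
bytes_per_day = REQUESTS_PER_DAY * BYTES_PER_LINE
gib_per_day = bytes_per_day / 2**30


def sampled_gib_per_day(sample_rate):
    """Daily volume in GiB when only a fraction of requests is forwarded."""
    return gib_per_day * sample_rate


print(f"unsampled:   {gib_per_day:.1f} GiB/day")
print(f"1/10 sample: {sampled_gib_per_day(1 / 10):.1f} GiB/day")
```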

Extracting the title as it appears in the XML dump is just a matter of
finding the right part of the URL and then unescaping it.
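
A minimal Python sketch of that extraction follows. It assumes article URLs use the /wiki/ path prefix seen in the example above and that dump titles use spaces where URLs use underscores (as in the "Main Page" example earlier in the thread). The function name is hypothetical, and other URL forms such as index.php?title=... are deliberately not handled.

```python
from urllib.parse import unquote, urlsplit


def title_from_url(url, prefix="/wiki/"):
    """Return the article title encoded in a /wiki/ URL, or None.

    Assumptions: titles live after the /wiki/ prefix, are percent-encoded
    UTF-8, and use underscores where the dump title has spaces.
    """
    path = urlsplit(url).path
    if not path.startswith(prefix):
        return None  # not a plain article URL (e.g. index.php, images)
    title = unquote(path[len(prefix):])  # undo %XX escaping
    return title.replace("_", " ")
```

For example, title_from_url("http://en.wikipedia.org/wiki/Main_Page") returns "Main Page".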

You can contact me privately to get the technical details sorted out.

--  Tim Starling



Re: Article hit rates -- research at the University of Minnesota

Steve Bennett-8
On 4/1/07, Tim Starling <[hidden email]> wrote:
> sending logs, so streams are certainly easier for us. We get somewhere on
> the order of 1.5 billion requests per day, and the simplified log line

Can I be the first to say "holy crap!"

Steve


Re: Article hit rates -- research at the University of Minnesota

Rob Church
On 04/04/07, Steve Bennett <[hidden email]> wrote:
> Can I be the first to say "holy crap!"

You're really that shocked?


Rob Church


Re: Article hit rates -- research at the University of Minnesota

Steve Bennett-8
On 4/4/07, Rob Church <[hidden email]> wrote:
> On 04/04/07, Steve Bennett <[hidden email]> wrote:
> > Can I be the first to say "holy crap!"
>
> You're really that shocked?

Had I sat down to think about it, perhaps, perhaps not. But I've never
heard of a daily pageview figure expressed in *billions* before.

I had a webpage once. It got 400 pageviews in a year.

Steve


Re: Article hit rates -- research at the University of Minnesota

Artur Fijałkowski
2007/4/4, Steve Bennett <[hidden email]>:
> On 4/4/07, Rob Church <[hidden email]> wrote:
> > On 04/04/07, Steve Bennett <[hidden email]> wrote:
> > > Can I be the first to say "holy crap!"
> >
> > You're really that shocked?
>
> Had I sat down to think about it, perhaps, perhaps not. But I've never
> heard of a daily pageview figure expressed in *billions* before.

It's not a count of pageviews, but of HTTP requests, is it?

AJF/WarX


Re: Article hit rates -- research at the University of Minnesota

Rob Church
On 04/04/07, Artur Fijałkowski <[hidden email]> wrote:
> It's not a count of pageviews, but of HTTP requests, is it?

That's right.


Rob Church

Re: Article hit rates -- research at the University of Minnesota

Steve Bennett-8
On 4/4/07, Rob Church <[hidden email]> wrote:
> On 04/04/07, Artur Fijałkowski <[hidden email]> wrote:
> > It's not a count of pageviews, but of HTTP requests, is it?
>
> That's right.

Ah, ok. So a page with a hundred images is like 105 http requests
including CSS etc, but only one page view.

Steve

Re: Article hit rates -- research at the University of Minnesota

evanprodromou
On Thu, 2007-05-04 at 11:09 +1000, Steve Bennett wrote:
> On 4/4/07, Rob Church <[hidden email]> wrote:
> > On 04/04/07, Artur Fijałkowski <[hidden email]> wrote:
> > > It's not a count of pageviews, but of HTTP requests, is it?
> >
> > That's right.
>
> Ah, ok. So a page with a hundred images is like 105 http requests
> including CSS etc, but only one page view.

It gets kind of complicated, since CSS and JS files, as well as skin
images, are usually pretty well cached. We do about a 4-to-1 ratio of
hits-to-pages on Wikitravel; I'd be surprised if that varied by more
than 2x in either direction for Wikipedia.
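
As a rough illustration only, that ratio can be combined with the 1.5 billion requests per day quoted earlier in the thread. The 4-to-1 figure is a Wikitravel number and the factor of 2 is a guess, so the result is a back-of-envelope range, not a measurement; the names below are invented for the sketch.

```python
REQUESTS_PER_DAY = 1_500_000_000  # HTTP requests per day, quoted earlier in the thread
HITS_PER_PAGE = 4                 # Wikitravel's hits-to-pages ratio (assumed to carry over)
SPREAD = 2                        # "by more than 2x in either direction"


def page_views(requests, hits_per_page):
    """Estimate page views from a request count and a hits-per-page ratio."""
    return requests / hits_per_page


central = page_views(REQUESTS_PER_DAY, HITS_PER_PAGE)
low = page_views(REQUESTS_PER_DAY, HITS_PER_PAGE * SPREAD)   # 8 hits per page
high = page_views(REQUESTS_PER_DAY, HITS_PER_PAGE / SPREAD)  # 2 hits per page

print(f"{low:,.0f} to {high:,.0f} page views/day (central {central:,.0f})")
```

Under these assumptions the estimate spans 187.5 million to 750 million page views per day, with 375 million in the middle.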

-Evan


________________________________________________________________________
Evan Prodromou <[hidden email]>
http://evan.prodromou.name/


Re: Article hit rates -- research at the University of Minnesota

Brad Patrick
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Evan Prodromou wrote:

> It gets kind of complicated, since CSS and JS files, as well as skin
> images, are usually pretty well cached. We do about a 4-to-1 ratio of
> hits-to-pages on Wikitravel; I'd be surprised if that varied by more
> than 2x in either direction for Wikipedia.
>
> -Evan


Which of these statistics has relevance, and at what degree of
granularity?  The USA Today readers are looking for something as simple
as "# of daily page views" which any surfer can appreciate.  The http
request tally makes sense to developers who are concerned about the load
on our servers, tweaks in performance, etc.  Marketers want uniques per
day or month, etc.

Which of these stats should be developed to give accurate information to
the world about what performance is being achieved, in an
apples-to-apples comparison to existing suites?  What will give WMF the
most credibility in reporting in the future?

(and a HUGE thank you to Tim for making this happen)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGFQGp5txwQhyxnbIRAow0AJ4r9jy+r/wteLaqbRXiK7a+KeviwwCfdP7R
uqFnhzsOoRi7Za+TKJb+i2Q=
=fiy3
-----END PGP SIGNATURE-----


Re: Article hit rates -- research at the University of Minnesota

Jay Ashworth-2
In reply to this post by Rob Church
On Wed, Apr 04, 2007 at 01:30:57AM +0100, Rob Church wrote:
> On 04/04/07, Steve Bennett <[hidden email]> wrote:
> > Can I be the first to say "holy crap!"
>
> You're really that shocked?

A couple of years ago, when I was temporarily in the running for Local
Hands (I live about 20 miles west of the datacenter), we were *just
starting* to bump our heads on a 100Mb/s port.

So *I* was a bit shocked.  :-)

Cheers,
-- jra
--
Jay R. Ashworth                                                [hidden email]
Designer                          Baylink                             RFC 2100
Ashworth & Associates        The Things I Think                        '87 e24
St Petersburg FL USA      http://baylink.pitas.com             +1 727 647 1274


Re: Article hit rates -- research at the University of Minnesota

evanprodromou
In reply to this post by Brad Patrick
On Thu, 2007-05-04 at 10:03 -0400, Brad Patrick wrote:

> Which of these statistics has relevance, and at what degree of
> granularity?  The USA Today readers are looking for something as simple
> as "# of daily page views" which any surfer can appreciate.  The http
> request tally makes sense to developers who are concerned about the load
> on our servers, tweaks in performance, etc.  Marketers want uniques per
> day or month, etc.

That's about it: page views per day, hits per day, and unique visitors
per month are the three main stats people care about.

> Which of these stats should be developed to give accurate information to
> the world about what performance is being achieved, in an
> apples-to-apples comparison to existing suites?  What will give WMF the
> most credibility in reporting in the future?

Page views per day, I think.

-Evan


________________________________________________________________________
Evan Prodromou <[hidden email]>
http://evan.prodromou.name/


Re: Article hit rates -- research at the University of Minnesota

Aryeh Gregor
On 4/6/07, Evan Prodromou <[hidden email]> wrote:
> > Which of these stats should be developed to give accurate information to
> > the world about what performance is being achieved, in an
> > apples-to-apples comparison to existing suites?  What will give WMF the
> > most credibility in reporting in the future?
>
> Page views per day, I think.

Better be a bit more granular than that.  Article views, edit-related
views, RC views, front-page views, etc.  As Yahoo! found out recently,
"page views per day" tends to drop rather noticeably when you Ajax
away some of the unnecessary ones.  I would say it's not a very useful
statistic.
