hitcount stats

Steve Summit
One significant potential source of error in Leon's (marvelous!)
new hitcount stats is the possibility that one reader is for
whatever reason fetching the same page multiple times (perhaps
due to nothing more than a prolonged edit).

Obviously it would be best to filter out multiple fetches of the
same page from the same IP address over some interval, probably
one day.  (Yes, this could then undercount hits from behind NAT
firewalls and proxies, but I think it'd still be worth it overall.)

I know that Leon's scheme is currently not logging IP addresses,
and given AOL's recent high-profile screwup I have to agree that
not logging IP addresses in this context is probably a good idea.
But what if we logged a one-way hash of the IP address, that
couldn't be correlated with anything else?

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Re: hitcount stats

Simetrical
On 8/30/06, Steve Summit <[hidden email]> wrote:
> But what if we logged a one-way hash of the IP address, that
> couldn't be correlated with anything else?

There are only about four billion possible IP addresses.  Anyone could
just do a brute-force execution of whatever hashing algorithm we use
on every IP address.  Really, though, there's no harm in storing IP
address-pageview links for a short period of time, like a day.

However, this wouldn't require that, and indeed, a server-side
solution would be impossible: 99.9% of page hits won't go to the
server to start with.  Since JavaScript is being used anyway, you can
just have the script only run the first time you visit a given page
per session.
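
A minimal sketch of the brute-force point above, assuming the log stores an unsalted SHA-1 of the dotted-quad address. The thread names no algorithm, so SHA-1, the function names, and the example address (from a documentation range) are all assumptions. The search covers a single /24 to stay fast; the full IPv4 space is the same loop over roughly four billion candidates.

```python
import hashlib
import ipaddress

def hash_ip(ip: str) -> str:
    """Unsalted one-way hash of a dotted-quad IP (SHA-1 is an assumption)."""
    return hashlib.sha1(ip.encode("ascii")).hexdigest()

def brute_force(target_digest: str, network: str):
    """Return the address in `network` whose hash matches, or None."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if hash_ip(candidate) == target_digest:
            return candidate
    return None

# Hypothetical logged value, then an attempt to recover the address from it.
logged = hash_ip("192.0.2.77")
recovered = brute_force(logged, "192.0.2.0/24")
print(recovered)
```

Nothing in the loop depends on the hash being weak: the input space is simply small enough to enumerate.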

Re: hitcount stats

Steve Bennett-4
In reply to this post by Steve Summit
I think it's unlikely to significantly skew the results. A few extra hits
compared to the thousands received on popular pages isn't an issue. I even
tried to artificially inflate the results for one page and had no luck
whatsoever :) The only concern would be if certain "types" of pages
encouraged rapid refreshing, like if for some reason Pokémon pages were
refreshed much faster than normal pages, they would be over-reported. But if
it's just individual random editors who skew the results of whatever page
they edit, there should be no overall bias.

Steve


Re: hitcount stats

Steve Bennett-4
In reply to this post by Simetrical
On 8/30/06, Simetrical <[hidden email]> wrote:
>
>
> However, this wouldn't require that, and indeed, a server-side
> solution would be impossible: 99.9% of page hits won't go to the
> server to start with.  Since JavaScript is being used anyway, you can
> just have the script only run the first time you visit a given page
> per session.


Actually now that I think about this, does this actually sufficiently model
the data we want to collect? Are we interested only in "how many people
visit a certain page" and not also in "how many times a certain page is
viewed"? If 5 users spend a whole day arguing back and forth on Wikipedia
talk:Pokémon, is 5 or 200 a more interesting/useful/relevant metric for that
page?

We should probably start thinking about exactly why we want this data, and
what we should do with the results of it.

Steve

Re: hitcount stats

Jay Ashworth-2
On Wed, Aug 30, 2006 at 05:24:51PM +0200, Steve Bennett wrote:

> On 8/30/06, Simetrical <[hidden email]> wrote:
> > However, this wouldn't require that, and indeed, a server-side
> > solution would be impossible: 99.9% of page hits won't go to the
> > server to start with.  Since JavaScript is being used anyway, you can
> > just have the script only run the first time you visit a given page
> > per session.
>
>
> Actually now that I think about this, does this actually sufficiently model
> the data we want to collect? Are we interested only in "how many people
> visit a certain page" and not also in "how many times a certain page is
> viewed"? If 5 users spend a whole day arguing back and forth on Wikipedia
> talk:Pokémon, is 5 or 200 a more interesting/useful/relevant metric for that
> page?

Yes.

> We should probably start thinking about exactly why we want this data, and
> what we should do with the results of it.

Indeed; they're two separate, and both useful, measurements needed by
different audiences.

Cheers,
-- jra
--
Jay R. Ashworth                                                [hidden email]
Designer                          Baylink                             RFC 2100
Ashworth & Associates        The Things I Think                        '87 e24
St Petersburg FL USA      http://baylink.pitas.com             +1 727 647 1274

        The Internet: We paved paradise, and put up a snarking lot.

Re: hitcount stats

Steve Summit
In reply to this post by Simetrical
Simetrical wrote:
> On 8/30/06, Steve Summit <[hidden email]> wrote:
>> But what if we logged a one-way hash of the IP address, that
>> couldn't be correlated with anything else?
>
> There are only about four billion possible IP addresses.  Anyone could
> just do a brute-force execution of whatever hashing algorithm we use
> on every IP address.

Well, no, not just "anyone". :-)

> Really, though, there's no harm in storing IP
> address-pageview links for a short period of time, like a day.

I would tend to agree.  But three people at AOL lost their jobs
because of something they honestly thought there was "no harm" in
doing.  And it's very difficult (if not impossible) to guarantee
that something gets kept for only a day.

> However, this wouldn't require that, and indeed, a server-side
> solution would be impossible: 99.9% of page hits won't go to the
> server to start with.

Not sure what you mean here.

> Since JavaScript is being used anyway, you can just have the script
> only run the first time you visit a given page per session.

But that would be considerably more work to implement, and would
require arbitrary amounts of state kept in the browser, and would
break down if the browser were restarted (or perhaps just if the
tab or window were closed).

Re: hitcount stats

Simetrical
On 8/30/06, Steve Summit <[hidden email]> wrote:
> Well, no, not just "anyone". :-)

Anyone *could*.  Most people just wouldn't know *how*.

> I would tend to agree.  But three people at AOL lost their jobs
> because of something they honestly thought there was "no harm" in
> doing.  And it's very difficult (if not impossible) to guarantee
> that something gets kept for only a day.

If it's possible to guarantee it gets kept, it's possible to guarantee
it only gets kept for a day.

> > However, this wouldn't require that, and indeed, a server-side
> > solution would be impossible: 99.9% of page hits won't go to the
> > server to start with.
>
> Not sure what you mean here.

What effect would it have if I reloaded the page fifty times?  I
wouldn't send fifty messages to the view-logging server instead of
one; I would have a 4.88% chance of sending *one* message, rather than
a 0.1% chance.  The server doesn't know that I reloaded the page fifty
times: it just knows that it was told I visited it an *average* of
1000 times (averaging it with non-hits).  It can't, therefore, discard
the extra 49 page loads; it never received them.  The client has to
discard them if anyone's going to.
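
The arithmetic behind those percentages can be checked with a short sketch. It assumes each page load is reported independently with probability 1/N, where N is the sampling factor (1000 in the example above); the function name is hypothetical.

```python
def chance_of_any_report(sampling_factor: int, views: int) -> float:
    """Probability that at least one of `views` page loads is reported,
    if each load is reported independently with chance 1/sampling_factor."""
    p = 1.0 / sampling_factor
    return 1.0 - (1.0 - p) ** views

single = chance_of_any_report(1000, 1)    # one load
fifty = chance_of_any_report(1000, 50)    # fifty reloads of the same page
print(f"{single:.2%} {fifty:.2%}")        # prints "0.10% 4.88%"
```

Strictly, the second figure is the chance of sending at least one message; with a factor of 1000, sending two or more out of fifty is rare enough that it barely changes the number.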

> But that would be considerably more work to implement, and would
> require arbitrary amounts of state kept in the browser, and would
> break down if the browser were restarted (or perhaps just if the
> tab or window were closed).

That's not a bug, it's a feature: it shouldn't be the same page hit if
I leave and then return.  And more work than "impossible" is rather
difficult.

Re: hitcount stats

Steve Summit
In reply to this post by Steve Bennett-4
Steve Bennett wrote:
> Actually now that I think about this, does this actually sufficiently model
> the data we want to collect? Are we interested only in "how many people
> visit a certain page" and not also in "how many times a certain page is
> viewed"? If 5 users spend a whole day arguing back and forth on Wikipedia
> talk:Pokémon, is 5 or 200 a more interesting/useful/relevant metric
> for that page?

Me, I think I'm much more interested in the former.  Among other
things, it's an objective measure of something at least vaguely
akin to the elusive concept of "notability", and one big reason
for filtering out multiple hits from the same browser is therefore
to make it harder for people to deliberately skew the statistic.

The latter statistic -- assuming the argument takes the form of
actual edits -- is already derivable directly from the page
history, isn't it?


Re: hitcount stats

Rob Church
In reply to this post by Steve Summit
On 30/08/06, Steve Summit <[hidden email]> wrote:
> I would tend to agree.  But three people at AOL lost their jobs
> because of something they honestly thought there was "no harm" in
> doing.  And it's very difficult (if not impossible) to guarantee
> that something gets kept for only a day.

I plead ignorance, sir. Do provide a URL?

Wait, since when have AOL had ethics of any sort!?


Rob Church

Re: hitcount stats

Gregory Maxwell
In reply to this post by Steve Summit
On 8/30/06, Steve Summit <[hidden email]> wrote:
> One significant potential source of error in Leon's (marvelous!)
> new hitcount stats is the possibility that one reader is for
> whatever reason fetching the same page multiple times (perhaps
> due to nothing more than a prolonged edit).
[snip]

No.

The tool measures estimated views rather than unique impressions, thus
this is not an error. Someone sitting around hitting reload over and
over again is additional views, thus such actions should be sampled
just as often as any other view.

The major source of error in this is that we are sampling at far too
low a rate.  We believe that the page view distribution for Wikipedia
follows a power law.  As such, coarse sampling will only be able to tell
us useful data about the relative ranking of the most popular items, so
it's good that the web interface only displays the top 1000 at most...
However, with only 34,000 samples collected of enwiki viewing over
four days, many of the samples are scattered randomly around pages deep
in the tail, which we cannot speak about accurately.
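
A rough sketch of why the tail is unreliable at this sample size. It assumes page views follow a Zipf distribution with exponent 1 (one simple form of power law) over a hypothetical one million pages; only the 34,000-sample figure comes from the message, and the function names are illustrative.

```python
import math

TOTAL_SAMPLES = 34_000    # figure quoted in the message above
NUM_PAGES = 1_000_000     # hypothetical page count, for illustration only

def zipf_normalizer(num_pages: int) -> float:
    """Sum of 1/rank over all pages (a harmonic number)."""
    return sum(1.0 / rank for rank in range(1, num_pages + 1))

def expected_samples(rank: int, total_samples: int, normalizer: float) -> float:
    """Expected samples for the page at `rank`, if view share is ~ 1/rank."""
    return total_samples / (rank * normalizer)

def relative_error(expected: float) -> float:
    """Rough relative standard error of a count with this expectation."""
    return 1.0 / math.sqrt(expected)

normalizer = zipf_normalizer(NUM_PAGES)
for rank in (1, 1_000, 100_000):
    exp = expected_samples(rank, TOTAL_SAMPLES, normalizer)
    print(f"rank {rank}: expected {exp:.3f} samples, "
          f"relative error ~{relative_error(exp):.0%}")
```

Under these assumptions the expected count shrinks in proportion to rank, so a page far down the tail expects less than one sample and its observed count is mostly noise.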


Changes will be made so that we can substantially increase the sampling rate.

I pity the journalist who sees our data and runs an absolutely idiotic
story based on it.

Re: hitcount stats

Simetrical
In reply to this post by Rob Church
On 8/30/06, Rob Church <[hidden email]> wrote:
> I plead ignorance, sir. Do provide a URL?

I believe you introduced me to www.justfuckinggoogleit.com, Rob.  ;)

http://news.google.com/news?q=AOL+logs

Re: hitcount stats

Rob Church
On 30/08/06, Simetrical <[hidden email]> wrote:
> On 8/30/06, Rob Church <[hidden email]> wrote:
> > I plead ignorance, sir. Do provide a URL?
>
> I believe you introduced me to www.justfuckinggoogleit.com, Rob.  ;)

Yes, but because I'm special and so well loved, and so damn irritating
when I don't get my own way, you'd never DREAM of using that on me,
would you now?


Rob Church

Re: hitcount stats

Steve Summit
In reply to this post by Simetrical
Simetrical wrote:
> On 8/30/06, Steve Summit <[hidden email]> wrote:
>> Well, no, not just "anyone". :-)
>
> Anyone *could*.  Most people just wouldn't know *how*.

Ah.  So you can high jump 8 feet, can you?

>> And it's very difficult (if not impossible) to guarantee
>> that something gets kept for only a day.
>
> If it's possible to guarantee it gets kept, it's possible to guarantee
> it only gets kept for a day.

False (unless you're splitting hairs).

>>> However, this wouldn't require that, and indeed, a server-side
>>> solution would be impossible: 99.9% of page hits won't go to the
>>> server to start with.
>>
>> Not sure what you mean here.
>
> What effect would it have if I reloaded the page fifty times?
> I wouldn't send fifty messages to the view-logging server instead
> of one; I would have a 4.88% chance of sending *one* message,
> rather than a 0.1% chance.

Okay, but that's true only as long as (a) the stats factor is in
the thousands, which it doesn't have to be (and isn't for some
Wikimedia projects), and (2) nobody's trying to deliberately skew
the results.  But also, it only *matters* if you're trying to
keep (not discard) the extra hits, i.e. if you do want to say
something like "M people viewed it N times" as opposed to
"M people viewed it at least once".  If you're interested in
discarding redundant hits, it obviously doesn't matter whether
the browser or the server does it.


Re: hitcount stats

Gregory Maxwell
In reply to this post by Simetrical
On 8/30/06, Simetrical <[hidden email]> wrote:
> There are only about four billion possible IP addresses.  Anyone could
> just do a brute-force execution of whatever hashing algorithm we use
> on every IP address.  Really, though, there's no harm in storing IP
> address-pageview links for a short period of time, like a day.
[snip]

H(secret + ip)  can only be inverted by exhaustive search of both the
secret and the IP (or the secret if you happen to have some known H(),
IP pairs)... and the secret can be much longer than 32 bits.

However the fuss about the AOL logs showed that, at least for search
strings, mere correlation of requests was enough to leak too much
data.   I do not believe that the same is true for page hits, but
that's the consideration.

To me it seems a bit foolish of an argument though... any one of our
admins could add such a bug... any upstream ISP could sniff the
traffic.... and we all know that the US Government is already doing
so. ;)  but it is what it is..... and for some reason people don't
like the prospects of the world figuring out that they have a venereal
disease. Silly people.
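
A minimal sketch of the keyed-hash idea, H(secret + ip). The message describes plain concatenation; this sketch swaps in HMAC-SHA256, a standard keyed-hash construction, and the function name, the 32-byte secret length, and the example address are assumptions.

```python
import hashlib
import hmac
import secrets

def keyed_ip_hash(secret: bytes, ip: str) -> str:
    """Keyed one-way hash of an IP address (HMAC-SHA256 is an assumption)."""
    return hmac.new(secret, ip.encode("ascii"), hashlib.sha256).hexdigest()

# A 256-bit secret: far longer than the 32 bits of an IPv4 address.
secret = secrets.token_bytes(32)

first = keyed_ip_hash(secret, "192.0.2.77")
again = keyed_ip_hash(secret, "192.0.2.77")
other_key = keyed_ip_hash(secrets.token_bytes(32), "192.0.2.77")
```

With the same secret, repeat visits from one address map to the same value, so duplicates can still be filtered. Without the secret, enumerating all IPv4 addresses no longer recovers the address, because the secret has to be guessed as well.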

Re: hitcount stats

Steve Summit
In reply to this post by Simetrical
Simetrical wrote:
>On 8/30/06, Rob Church <[hidden email]> wrote:
>>On 30/08/06, Steve Summit <[hidden email]> wrote:
>>> I would tend to agree.  But three people at AOL lost their jobs
>>> because of something they honestly thought there was "no harm" in
>>> doing.  And it's very difficult (if not impossible) to guarantee
>>> that something gets kept for only a day.
>>
>> I plead ignorance, sir. Do provide a URL?

To the personnel implications of the AOL debacle:

        http://www.theregister.co.uk/2006/08/30/online_anonymity/
        http://www.ovum.com/news/euronews.asp?id=4770

To my claim about data retention:

        http://catless.ncl.ac.uk/Risks/23.76.html#subj1

> I believe you introduced me to www.justfuckinggoogleit.com, Rob.  ;)

Heh.  Hadn't come across that one before.  Thanks.

Actually, even though the story was (it seemed) all over the
media last week, it was surprisingly hard to find more than a
couple of hits today.  Besides "AOL" and "CTO" and "resign",
another couple of keywords to use for anyone who wants to read
further would be "Maureen McGovern".

Re: hitcount stats

Simetrical
In reply to this post by Steve Summit
On 8/30/06, Steve Summit <[hidden email]> wrote:
> > If it's possible to guarantee it gets kept, it's possible to guarantee
> > it only gets kept for a day.
>
> False (unless you're splitting hairs).

// If you remove the line after the next or refactor this code, we will
// flay the living flesh from your bones
$db->write("$IP visited this page, yay");
$db->check_if_stuff_is_over_a_day_old_and_deal_with_it();
// If you remove the above line or refactor this code, we will flay the
// living flesh from your bones
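
The write-then-purge idea above could be sketched as follows, using Python's standard sqlite3 module. The table layout, function names, and in-memory database are assumptions for illustration, not anything the thread specifies.

```python
import sqlite3
import time

RETENTION_SECONDS = 24 * 60 * 60   # "a day", as discussed in the thread

def open_log() -> sqlite3.Connection:
    """Create a throwaway in-memory hit log (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (ip TEXT, page TEXT, seen_at REAL)")
    return conn

def record_hit(conn: sqlite3.Connection, ip: str, page: str, now=None) -> None:
    """Insert the hit and purge expired rows in a single transaction."""
    now = time.time() if now is None else now
    with conn:  # commits both statements together, or rolls both back
        conn.execute("INSERT INTO hits VALUES (?, ?, ?)", (ip, page, now))
        conn.execute("DELETE FROM hits WHERE seen_at < ?",
                     (now - RETENTION_SECONDS,))

conn = open_log()
record_hit(conn, "192.0.2.77", "Main_Page", now=0.0)
record_hit(conn, "192.0.2.78", "Main_Page", now=RETENTION_SECONDS + 1.0)
remaining = conn.execute("SELECT ip FROM hits").fetchall()
print(remaining)   # only the second hit is still stored
```

Tying the purge to the write means the table itself cannot hold rows older than a day once any new hit arrives. It says nothing about copies made outside the table, such as backups, which is where a retention guarantee is harder to give.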

> Okay, but that's true only as long as (a) the stats factor is in
> the thousands,

No, it's true as long as it's above one.  Even if it's just two,
someone making two page views would have a 75% chance of getting one
hit through, instead of a 50% chance: a major difference.

> (2) nobody's trying to deliberately skew
> the results.

If anybody is, we're screwed anyway if we're doing sampling.

> But also, it only *matters* if you're trying to
> keep (not discard) the extra hits, i.e. if you do want to say
> something like "M people viewed it N times" as opposed to
> "M people viewed it at least once".

Um, this entire discussion is about the latter.

> If you're interested in
> discarding redundant hits, it obviously doesn't matter whether
> the browser or the server does it.

Except that the server can't do it.

On 8/30/06, Gregory Maxwell <[hidden email]> wrote:
> H(secret + ip)  can only be inverted by exhaustive search of both the
> secret and the IP (or the secret if you happen to have some known H(),
> IP pairs)... and the secret can be much longer than 32 bits.

Except that presumably anyone with access to the actual encoded IPs
will have access to the secret as well, yes?  Or are we talking about
letting *anyone* see the encoded IP-pageview correlations?  In which
case, that is kind of a privacy violation, in the AOL style.

(You could always change the secret, of course . . . first check if
H(secret(1) + ip) exists, and if it does, use H(secret(2) + ip)
instead if that doesn't exist, and so forth . . . but then there's no
point in making it public, and we're back to the "anyone who knows the
encoded IPs knows the secret anyway" thing.)

Re: hitcount stats

Gregory Maxwell
On 8/30/06, Simetrical <[hidden email]> wrote:
> On 8/30/06, Gregory Maxwell <[hidden email]> wrote:
> > H(secret + ip)  can only be inverted by exhaustive search of both the
> > secret and the IP (or the secret if you happen to have some known H(),
> > IP pairs)... and the secret can be much longer than 32 bits.
>
> Except that presumably anyone with access to the actual encoded IPs
> will have access to the secret as well, yes?  Or are we talking about
> letting *anyone* see the encoded IP-pageview correlations?  In which
> case, that is kind of a privacy violation, in the AOL style.

It can be easily configured so that anyone with access to the secret
has privileged access to the server and, already, anyone with
privileged access to the server could be logging IPs.

Re: hitcount stats

Simetrical
On 8/30/06, Gregory Maxwell <[hidden email]> wrote:
> It can be easily configured so that anyone with access to the secret
> has privileged access to the server and, already, anyone with
> privileged access to the server could be logging IPs.

Yes, but again, there's no good reason to allow anyone without
privileged access to the server to see the IPs in the first place,
encoded or not, so why bother encoding them for storage?  *If* you're
going to allow people to view the connections the way AOL did, you may
as well assign arbitrary numbers (say, chronologically) rather than
some encoded form of the IP, since that's easier to implement *and*
more secure, if only marginally.

Re: hitcount stats

Gregory Maxwell
On 8/30/06, Simetrical <[hidden email]> wrote:

> On 8/30/06, Gregory Maxwell <[hidden email]> wrote:
> > It can be easily configured so that anyone with access to the secret
> > has privileged access to the server and, already, anyone with
> > privileged access to the server could be logging IPs.
>
> Yes, but again, there's no good reason to allow anyone without
> privileged access to the server to see the IPs in the first place,
> encoded or not, so why bother encoding them for storage?  *If* you're
> going to allow people to view the connections the way AOL did, you may
> as well assign arbitrary numbers (say, chronologically) rather than
> some encoded form of the IP, since that's easier to implement *and*
> more secure, if only marginally.

It's not easier to implement numbering IPs, actually. Hashing is memoryless.

The reason to use it for storage is the above mentioned paranoia about
being able to make sure things are not retained too long....

It's all a silly and pointless argument in my view, and it's really
off topic for this list.

Re: hitcount stats

Simetrical
On 8/30/06, Gregory Maxwell <[hidden email]> wrote:
> It's not easier to implement numbering IPs, actually.

Autoblock messages use the table row number.  :)