UvA what they need in log data from the Wikimedia Foundation


Gerard Meijssen-3
Hoi,
The University of Amsterdam (UvA) is getting log information that is
anonymised so thoroughly that it is less useful than it should be. The UvA
is working on what they call a "peer to peer Wikipedia". Their interest in
the data is not in the specific IP address of a requester; their interest
is in where a request is coming from. The point is that it is best, fastest
and cheapest when information is available from a peer that is close by.

Consider wikipedia.ky, a project outside of the WMF whose justification is
that it is expensive to get information from outside the country. A cache
in Kyrgyzstan would make the reason for this project's existence disappear.
A peer to peer Wikipedia allows for peers in all parts of the world, so the
information could be available locally in countries like Kyrgyzstan, but
also in Africa and China.

In order to build this system it is necessary to understand how the demand
for information develops. To build efficient routines that bring enough
copies of the information to the caches that are local to the requesters,
and to ensure that data stays persistently available, the UvA needs
geographically relevant information on requests to the WMF servers.

The UvA is one of the top universities in this field. In Andrew Tanenbaum
they have on their staff one of the leading thinkers and architects in
computer and Internet architecture. It is for this reason that I again,
and this time publicly, ask for the UvA to have this information.

With a peer to peer Wikipedia infrastructure the need for funding of the
Wikimedia Foundation will be significantly reduced. It may, however, be
two years before this project is finished.

Thanks,
      GerardM
_______________________________________________
foundation-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/foundation-l

Re: UvA what they need in log data from the Wikimedia Foundation

Gregory Maxwell
On 9/15/07, GerardM <[hidden email]> wrote:
> Hoi,
> The University of Amsterdam (UvA) is getting log information that is
> anonymised so thoroughly that it is less useful than it should be. The UvA
> is working on what they call a "peer to peer Wikipedia". Their interest in
> the data is not in the specific IP address of a requester; their interest
> is in where a request is coming from. The point is that it is best, fastest
> and cheapest when information is available from a peer that is close by.

Would a simple breakdown of bytes and requests per autonomous system
number (ASN) over a fairly wide time window (say, days) fit their needs?

Example data:

Collection Span               ASN   REQs  Bytes sent  hit-rate
20070801000000-20070801235959 14907 1000  10289000    .99987
20070802000000-20070802235959 14907 2000  20578013    .99916

Or perhaps by hour and AS over some span:

Collection Span               HrGMT ASN   REQs  Bytes sent  hit-rate
20070801000000-20070814235959 00    14907   40  411560      .9688
20070801000000-20070814235959 01    14907   20  205780      .9832

I don't see any reason why we couldn't release aggregates like these.
We should be generating them for our own planning purposes in any
case.
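
As a rough illustration of how aggregates like the two tables above could be
produced, here is a minimal sketch. It assumes each log record has already
been reduced to a (timestamp, ASN, bytes sent, cache hit) tuple, with the
timestamp in the same YYYYMMDDhhmmss form as the example data; the function
name and record layout are hypothetical, and the IP-to-ASN lookup is assumed
to happen before this step so that no IP address reaches the output.

```python
from collections import defaultdict


def aggregate(records, by_hour=False):
    """Sum requests, bytes and cache hits per (day, ASN) or (hour, ASN).

    records: iterable of (timestamp, asn, bytes_sent, cache_hit) tuples,
    where timestamp is a 'YYYYMMDDhhmmss' string in GMT (an assumption
    made for this sketch, not a description of the real log format).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [requests, bytes, hits]
    for timestamp, asn, bytes_sent, cache_hit in records:
        # Day bucket is YYYYMMDD; hour-of-day bucket is hh.
        bucket = timestamp[8:10] if by_hour else timestamp[:8]
        row = totals[(bucket, asn)]
        row[0] += 1
        row[1] += bytes_sent
        row[2] += 1 if cache_hit else 0
    return {
        key: {"reqs": reqs, "bytes": nbytes, "hit_rate": hits / reqs}
        for key, (reqs, nbytes, hits) in sorted(totals.items())
    }


# Made-up sample records, for illustration only.
sample = [
    ("20070801000000", 14907, 1000, True),
    ("20070801010000", 14907, 3000, False),
    ("20070802000000", 14907, 500, True),
]
per_day = aggregate(sample)
per_hour = aggregate(sample, by_hour=True)
```

Only the bucket, the ASN and the three totals survive the aggregation, which
is the property that makes this kind of release low-risk.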

If they wanted details about object locality and things like that, we
could anonymize the requested objects by replacing them with unique IDs,
but doing that would require a lot more care.
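
One possible way to do the unique-ID replacement is a keyed hash, sketched
below under the assumption that the key stays private to whoever produces
the logs. The function names are hypothetical and this is not an existing
WMF tool. Note that a keyed hash alone does not hide how often each object
is requested, so very popular pages could still be guessed from their
counts; that may be part of the extra care such a release would need.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer(key=None):
    """Return a function mapping object names to stable opaque IDs.

    The same name always maps to the same ID under one key, so locality
    analysis still works, but the name cannot be recomputed without the key.
    """
    if key is None:
        key = secrets.token_bytes(32)  # fresh random key per release

    def pseudonymize(name):
        digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]  # truncated for readability

    return pseudonymize


# A fixed key is used here only to make the example repeatable.
pseudonymize = make_pseudonymizer(b"k" * 32)
id_a = pseudonymize("Main_Page")
id_b = pseudonymize("Amsterdam")
```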


Re: UvA what they need in log data from the Wikimedia Foundation

Gerard Meijssen-3
Hoi,
It is the distribution in time of geographical needs that is critical. When
a particular article becomes relevant in a certain area, you want to address
the flood of requests by distributing the article to the peers that are
close to where the demand is. In this peering system a node may hold an
article, but no node needs to hold all articles.
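
To make the idea concrete, here is a minimal sketch of picking where to
place copies of an article. It assumes the requests seen in one time window
are available as (article, region) pairs and that a fixed threshold decides
when demand counts as a flood; the function name, the region labels and the
threshold are all hypothetical, and this is not the UvA design.

```python
from collections import Counter


def hot_placements(requests, threshold):
    """Map each article to the regions whose demand reached the threshold.

    requests: iterable of (article, region) pairs from one time window.
    """
    counts = Counter(requests)
    placements = {}
    for (article, region), n in counts.items():
        if n >= threshold:
            # Only regions with enough demand get a copy, so no single
            # node has to hold every article.
            placements.setdefault(article, set()).add(region)
    return placements


# Made-up window of requests, for illustration only.
window = [("A", "eu")] * 3 + [("A", "asia")] + [("B", "asia")] * 2
placements = hot_placements(window, threshold=2)
```

Running this per time window shows why the distribution of demand in time
matters: the set of regions that need a copy changes from window to window.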

As was mentioned before, work has been done on the data received from the
WMF. Based on the data a paper will be published in the near future. When it
is available, I will post a link.
Thanks,
    Gerard


Re: UvA what they need in log data from the Wikimedia Foundation

Gregory Maxwell
On 9/15/07, GerardM <[hidden email]> wrote:
> Hoi,
> It is the distribution in time of geographical needs that is critical. When
> a particular article becomes relevant in a certain area, you want to address
> the flood of requests by distributing the article to the peers that are
> close to where the demand is. In this peering system a node may hold an
> article, but no node needs to hold all articles.

It would perhaps be more interesting to have this discussion with
someone who understands Internet-scale networking.

I strongly expect the data I proposed would match their needs, if you
passed that on to them.

If they are actually insisting that they would rather have some kind of
physical geographic data than network topological data for the
purposes of the design of a decentralized content distribution
framework, ... then, in the absence of additional information, I would
be left questioning their competence.
