Re: [Wikimedia-l] wikipedia access traces ?

Pine W
Hi Valerio,

This kind of request is a better fit for the Research mailing list. I've included the address for that list in the To: line of this reply.

Pine

On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni <[hidden email]> wrote:
Dear Wikimedia Foundation,
in the context of an EU research project [1], we are interested in
accessing Wikipedia access traces. In the past, such traces were
provided for research purposes to other groups [2]. Unfortunately,
only a small percentage (10%) of that trace has been made available.
We are interested in accessing the totality of that same trace (or,
even better, a more recent one, but the same one will do).

If this is not the correct mailing list for such requests, could
anyone please redirect me to the correct one?

Thanks again for your attention,

Valerio Schiavoni
Post-Doc Researcher
University of Neuchatel, Switzerland


1 - http://www.leads-project.eu
2 - http://www.wikibench.eu/?page_id=60

Valerio Schiavoni
Hello,
just bumping my email from last week, since I have not received any answer so far.

Should I consider that dataset lost?

I've also contacted the researchers who released part of it; making it publicly available is tricky for them due to its size (12 TB), although that volume is presumably routine for Wikipedia's own servers.
  
Thanks again,
Valerio 

Aaron Halfaker-2
Just to confirm, https://dumps.wikimedia.org/other/pagecounts-raw/ won't work for you?

Giovanni Luca Ciampaglia-3
In reply to this post by Valerio Schiavoni
Valerio,

I didn't know such data existed. As an alternative, perhaps you could have a look at our click datasets, which contain requests to the Web at large (i.e., not just Wikipedia) generated from within the campus of Indiana University over a period of several months. HTH


Cheers

G

Giovanni Luca Ciampaglia

✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA
http://www.glciampaglia.com/
✆ +1 812 855-7261
[hidden email]

Valerio Schiavoni
In reply to this post by Aaron Halfaker-2
Hello Aaron,
thanks for your reply.

On Wed, Sep 17, 2014 at 4:03 PM, Aaron Halfaker <[hidden email]> wrote:
Just to confirm, https://dumps.wikimedia.org/other/pagecounts-raw/ won't work for you?

Unfortunately, no. Those logs only provide hourly page counts, without the associated timestamps ("when" each page was accessed). If logs with timestamps exist, they would do perfectly.
 

By comparison, the logs in that dataset look like this:

3325795636 1191194118.711 http://en.wikipedia.org/w/index.php?title=MediaWiki:Monobook.css&usemsgcache=yes&action=raw&ctype=text/css&smaxage=2678400 -
3325795635 1191194118.803 http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Icono_aviso_borrar.png/50px-Icono_aviso_borrar.png -
3325795639 1191194118.671 http://de.wikipedia.org/w/index.php?title=MediaWiki:Monobook.css&usemsgcache=yes&action=raw&ctype=text/css&smaxage=2678400 -

The first token is a counter, the second is a Unix timestamp, the third is the requested Wikipedia URL, and the last is a flag indicating whether the request triggered a database update (none of these three did).
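
For anyone who wants to work with these lines, here is a minimal
Python sketch of a parser (the field layout is as just described;
the input file name is a placeholder):

from urllib.parse import urlsplit

def parse_trace_line(line):
    counter, ts, url, flag = line.split()
    return {
        "counter": int(counter),
        "timestamp": float(ts),          # Unix epoch, sub-second resolution
        "host": urlsplit(url).hostname,  # e.g. "en.wikipedia.org"
        "url": url,
        "db_update": flag != "-",        # "-" appears to mean no DB update
    }

with open("trace_sample.txt") as f:  # placeholder file name
    for line in f:
        rec = parse_trace_line(line)
        print(rec["timestamp"], rec["host"])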

best,
Valerio

Valerio Schiavoni
In reply to this post by Giovanni Luca Ciampaglia-3
Hello Giovanni, 
thanks for the pointer to the Click datasets. 
I'd have to take a look at the complete dataset to see how many of those requests touch Wikipedia.

Also, one of the requirements for accessing the data is:
"The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. "

I have to check whether this is possible, and how long it would take to ship a hard drive from Switzerland and get it back.
I'll let you know!

Best,
Valerio

Valerio Schiavoni
Hello Giovanni, 
on second thought, I think the Click dataset won't do either. 
I've parsed the smaller sample [1], which is said to be extracted from the bigger one.

In that dataset there are ~34k entries related to Wikipedia, but they look like the following:

{"count": 1, "timestamp": 1257181201, "from": "en.wikipedia.org", "to": "ko.wikipedia.org"} 


That is, the log only reports the host/domain accessed, not the specific URL requested (i.e., the one in the HTTP request issued by the client).
 
This is what is of main interest to me. 
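
The kind of quick filter involved is roughly this (a sketch, not my
exact script; it assumes one JSON object per line with the "from"/"to"
fields shown above, in a placeholder file "click_sample.json"):

import json

wiki_hits = 0
with open("click_sample.json") as f:
    for line in f:
        rec = json.loads(line)
        if "wikipedia.org" in rec.get("from", "") or \
           "wikipedia.org" in rec.get("to", ""):
            wiki_hits += 1
print(wiki_hits)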

Thanks for your interest anyway!
Valerio

Aaron Halfaker-2
Hi Valerio,

The page counts dataset has a time resolution of one hour. Is that too coarse? How fine a resolution do you need?

Valerio Schiavoni
Hello Aaron,
one hour is way too coarse.
A resolution of 1 second would be fine.
Is that available?

Aaron Halfaker-3
I don't think we keep those logs historically. analytics-l (CC'd) might have more insight.

Do we have anything more granular than the hourly view logs available here? https://dumps.wikimedia.org/other/pagecounts-raw/

Benj. Mako Hill-2
In reply to this post by Valerio Schiavoni
The pagecount data /has/ timing data, but it is "binned" by the
hour.
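
To make "binned" concrete, here is a toy sketch (not the actual WMF
aggregation code) of how per-request timestamps collapse into hourly
counts:

from collections import Counter

def hour_bin(ts):
    # truncate a Unix timestamp to the start of its hour
    return int(ts) - int(ts) % 3600

requests = [1191194118.711, 1191194118.803, 1191197000.5]  # made-up sample
print(Counter(hour_bin(ts) for ts in requests))
# Counter({1191193200: 2, 1191196800: 1})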

I don't think more comprehensive data (all pages, all languages,
nearly all viewers) over a long period of time exists anywhere and I
don't think any similarly comprehensive data exists before 2007 at
all.

You might find more granular data for short periods of time (like the
WikiBench data or maybe stuff that's been collected more recently by
WMF but isn't published) or much more detailed data from longer
periods of time for a subset of users on a particular network (perhaps
like the Indiana data, or toolbar data like the Yahoo data that some
WP researchers have used).

I would /love/ to hear that I am wrong about this and that there's
some wonderful, granular, broad, long-term dataset of pageviews I just
don't know about. :)

Later,
Mako



--
Benjamin Mako Hill
http://mako.cc/

Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto

Pine W
I suppose you could get more granular data by conducting an opt-in study of some kind; you would need to be careful that users who haven't opted in are not accidentally included and do not have their privacy indirectly affected. I agree that collection at intervals shorter than an hour is going to raise a lot of privacy considerations for users who have not opted in.

Pine

Benj. Mako Hill-2
That would certainly work for some research questions and that's more
or less what most toolbar data is.

The problem is that questions answered with view data are often about
the overall popularity or visibility of pages, which requires data that
is representative. There are lots of reasons to believe that people who
opt in aren't going to be representative of all Wikipedia readers.

Regards,
Mako


--
Benjamin Mako Hill
http://mako.cc/

Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto

Pine W

Yes, but supposedly phone survey companies are able to get representative samples of broad populations despite many people refusing to respond to phone surveys. If opt-in users were chosen using similar methods, could arguably representative data be obtained?

Pine

aaron shaw
In reply to this post by Benj. Mako Hill-2

In theory, sure, but that's a high bar. Responsible phone survey firms that generate high quality data generally work very hard to draw random samples of the population under consideration, follow up with non-respondents numerous times to maximize the response rate, develop nuanced survey weights for their data in order to adjust the responses relative to known parameters of larger populations (when possible) and - at least recently - often conduct ongoing studies to ensure that their data quality remains high (e.g., in response to the transition away from land-lines toward cell-phone only users among some demographic groups).

Many of these practices are very difficult to map into contexts like Wikipedia, WMF projects, or online communities more broadly. Even the most sophisticated web-metrics data providers (e.g., ComScore, Quantcast) struggle with the issues of non-response and data quality. Those firms do not publish much about their methodologies and do not share their data with non-paying members of the public.

Mako and I have written about some of these issues in a PLoS ONE article [1], where we also attempt to correct some existing Wikipedia survey data using an interesting technique that draws on overlapping questions in an opt-in survey and a nationally representative phone survey of US adults. I've also talked with a few communities about conducting surveys in a manner that would be more likely to generate high-quality data along these lines, but without much to show for it yet. It would be great to see more people (scholars/communities/observers) move in this direction.

all the best,
a

1 - http://www.plosone.org/article/info:doi/10.1371/journal.pone.0065782


Benj. Mako Hill-2
In reply to this post by Pine W

The way that people build representative surveys from
non-representative data is by understanding quite a lot about the
nature and structure of the bias in the sample. You can think of
what people do here as building a very complicated system of
weights.
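
A toy sketch of a single layer of such weights (post-stratification
on one variable, with made-up numbers; real weighting schemes use
many variables plus nonresponse modeling):

# Reweight a biased sample so its strata match known population shares.
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}
sample = ["18-29"] * 60 + ["30-49"] * 30 + ["50+"] * 10  # biased opt-in sample

n = len(sample)
weights = {g: population_share[g] / (sample.count(g) / n)
           for g in population_share}
print(weights)  # ≈ {'18-29': 0.33, '30-49': 1.17, '50+': 4.5}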

Folks who do this for US phone surveys, for example, have spent many
decades and many millions of dollars on research to understand how to
get reliable results, and even then it's a quickly moving target. They
still sometimes miss things and get things wrong.

That said, there are certainly things we can learn. Aaron Shaw and I
actually did something related with one of the big Wikipedia surveys
in this article:

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0065782

In our case, our study was only possible because (a) we had very good
luck finding "ground truth" data from the right point in time, (b) we
had detailed demographic data on folks from the WP survey, and (c) we
made a series of untestable assumptions. After all that work, we still
can't know that we've got it right. We can really only suggest that
there are reasons to believe our estimates are better than pretending
that the opt-in survey is unbiased.

In the case of signing up for a Wikipedia toolbar, we might not even
attract a sub-population that /can/ reliably be used to build
representative estimates. :-(

Regards,
Mako


--
Benjamin Mako Hill
http://mako.cc/

Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto

Jonathan Morgan
See what you started, Pine? *This* is what happens when you get professors talking about research methods.

:P

- J

--
Jonathan T. Morgan
Learning Strategist
Wikimedia Foundation


Benj. Mako Hill-2
What, you get nearly identical messages written simultaneously by
serial co-authors? ;)

Later,
Mako


--
Benjamin Mako Hill
http://mako.cc/

Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [Wikimedia-l] wikipedia access traces ?

Richard Jensen
In reply to this post by Benj. Mako Hill-2
The basic issue in sampling is to decide what the target population T
actually is. Then you weight the sample so that each person in the
target population carries an equal weight w, and people not in it have
weight zero.
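
In symbols: with inclusion probability \pi_i for person i, the standard
inverse-probability weight is

    w_i = 1/\pi_i  if i is in T,    w_i = 0  otherwise,

so under equal-probability sampling from T, every sampled member carries
the same weight.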

So what is the target population we want to study?
--the world's population?
--the world's educated population?
--everyone with internet access?
--everyone who ever uses Wikipedia?
--everyone who uses it a lot?
--everyone who has knowledge to contribute in a positive fashion?
--everyone who has internet access, the skills, and the potential to contribute?
--everyone who has the potential to contribute but does not do so?

Richard Jensen
[hidden email]

