Wikipedia aggregate clickstream data released


Wikipedia aggregate clickstream data released

Dario Taraborelli-3
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:

http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

This data can be used for various purposes (a short loading sketch follows the list):
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
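
The following is a minimal sketch, in Python with pandas, of the first and last items; it is not part of the release itself. It assumes the tab-separated prev/curr/type/n column layout used by the later monthly dumps; the January 2015 figshare snapshot may use a different column order, so check the file's documentation and adjust the column names and the placeholder file name accordingly.

    import pandas as pd

    # Placeholder local copy of one dump; adjust the file name and column
    # list to match the release you actually downloaded.
    df = pd.read_csv(
        "clickstream-enwiki-2018-01.tsv.gz",
        sep="\t",
        names=["prev", "curr", "type", "n"],
        quoting=3,  # csv.QUOTE_NONE: article titles may contain quote characters
    )

    # Most frequent links people click on for a given article.
    top_links = (
        df[(df["prev"] == "London") & (df["type"] == "link")]
        .sort_values("n", ascending=False)
        .head(10)[["curr", "n"]]
    )
    print(top_links)

    # Row-normalising the (prev, curr) counts over internal links gives the
    # transition probabilities of a Markov chain over article navigation.
    links = df[df["type"] == "link"].copy()
    links["p"] = links["n"] / links.groupby("prev")["n"].transform("sum")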

We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream

Ellery and Dario

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: [Analytics] Wikipedia aggregate clickstream data released

Leila Zia
Hi all,

For archive happiness:

The clickstream dataset is now generated on a monthly basis for five
Wikipedia languages (English, Russian, German, Spanish, and Japanese). You
can access the data at https://dumps.wikimedia.org/other/clickstream/ and
read more about the release and those who contributed to it at
https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/
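
For anyone scripting against the new location, here is a hedged sketch of fetching one month for one wiki. The directory and file-name pattern below is an assumption based on the listing at the URL above; verify it against the listing before relying on it.

    import urllib.request

    lang, month = "enwiki", "2018-01"  # assumed naming; check the directory listing
    url = (
        "https://dumps.wikimedia.org/other/clickstream/"
        f"{month}/clickstream-{lang}-{month}.tsv.gz"
    )
    urllib.request.urlretrieve(url, f"clickstream-{lang}-{month}.tsv.gz")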

Best,
Leila



--
Leila Zia
Senior Research Scientist
Wikimedia Foundation

On Tue, Feb 17, 2015 at 11:00 AM, Dario Taraborelli <
[hidden email]> wrote:

> We’re glad to announce the release of an aggregate clickstream dataset
> extracted from English Wikipedia
>
> http://dx.doi.org/10.6084/m9.figshare.1305770
>
> This dataset contains counts of *(referer, article) *pairs aggregated
> from the HTTP request logs of English Wikipedia. This snapshot captures 22
> million *(referer, article)* pairs from a total of 4 billion requests
> collected during the month of January 2015.
>
> This data can be used for various purposes:
> • determining the most frequent links people click on for a given article
> • determining the most common links people followed to an article
> • determining how much of the total traffic to an article clicked on a
> link in that article
> • generating a Markov chain over English Wikipedia
>
> We created a page on Meta for feedback and discussion about this release:
> https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
>
> Ellery and Dario
>
> _______________________________________________
> Analytics mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>

Re: [Analytics] Wikipedia aggregate clickstream data released

Taha Yasseri
Just wanted to quickly thank Dario et al. for releasing these data, which are
gold! And also to (self-)promote a paper we wrote based on earlier releases of
the data, to be presented at CompleNet'18 in Boston (March):

Inspiration, Captivation, and Misdirection: Emergent Properties in Networks
of Online Navigation <https://arxiv.org/abs/1710.03326>
P. Gildersleve, T. Yasseri. arXiv preprint arXiv:1710.03326, 2017.

Best
Taha

On Tue, Jan 16, 2018 at 7:21 PM, Leila Zia <[hidden email]> wrote:

> Hi all,
>
> For archive happiness:
>
> Clickstream dataset is now being generated on a monthly basis for 5
> Wikipedia languages (English, Russian, German, Spanish, and Japanese). You
> can access the data at https://dumps.wikimedia.org/other/clickstream/ and
> read more about the release and those who contributed to it at
> https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/
>
> Best,
> Leila
>
>
>
> --
> Leila Zia
> Senior Research Scientist
> Wikimedia Foundation
>
> On Tue, Feb 17, 2015 at 11:00 AM, Dario Taraborelli <
> [hidden email]> wrote:
>
> > We’re glad to announce the release of an aggregate clickstream dataset
> > extracted from English Wikipedia
> >
> > http://dx.doi.org/10.6084/m9.figshare.1305770
> >
> > This dataset contains counts of *(referer, article) *pairs aggregated
> > from the HTTP request logs of English Wikipedia. This snapshot captures
> 22
> > million *(referer, article)* pairs from a total of 4 billion requests
> > collected during the month of January 2015.
> >
> > This data can be used for various purposes:
> > • determining the most frequent links people click on for a given article
> > • determining the most common links people followed to an article
> > • determining how much of the total traffic to an article clicked on a
> > link in that article
> > • generating a Markov chain over English Wikipedia
> >
> > We created a page on Meta for feedback and discussion about this release:
> > https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
> >
> > Ellery and Dario
> >
> > _______________________________________________
> > Analytics mailing list
> > [hidden email]
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
> >
> _______________________________________________
> Wiki-research-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>



--
Dr Taha Yasseri
@TahaYasseri <https://twitter.com/TahaYasseri>
http://www.oii.ox.ac.uk/people/yasseri/

Senior Research Fellow in Computational Social Science, Oxford Internet
Institute,
Research Fellow in Humanities and Social Sciences, Wolfson College,
University of Oxford,
and
Turing Fellow, Alan Turing Institute for Data Science.

Tel. +44-1865-287229
1 St. Giles
Oxford OX1 3JS
UK

Re: [Analytics] Wikipedia aggregate clickstream data released

Gerard Meijssen-3
In reply to this post by Leila Zia
Hoi,
Do I understand correctly that the 3% of "other" links are the ones that
have articles at *this* time but did not exist at the time of the dump, so
in effect they are not red links?

Is there any way to find the articles people were seeking but could not
find?
Thanks,
     GerardM

On 16 January 2018 at 20:21, Leila Zia <[hidden email]> wrote:

> Hi all,
>
> For archive happiness:
>
> Clickstream dataset is now being generated on a monthly basis for 5
> Wikipedia languages (English, Russian, German, Spanish, and Japanese). You
> can access the data at https://dumps.wikimedia.org/other/clickstream/ and
> read more about the release and those who contributed to it at
> https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/
>
> Best,
> Leila
>
>
>
> --
> Leila Zia
> Senior Research Scientist
> Wikimedia Foundation
>
> On Tue, Feb 17, 2015 at 11:00 AM, Dario Taraborelli <
> [hidden email]> wrote:
>
> > We’re glad to announce the release of an aggregate clickstream dataset
> > extracted from English Wikipedia
> >
> > http://dx.doi.org/10.6084/m9.figshare.1305770
> >
> > This dataset contains counts of *(referer, article) *pairs aggregated
> > from the HTTP request logs of English Wikipedia. This snapshot captures
> 22
> > million *(referer, article)* pairs from a total of 4 billion requests
> > collected during the month of January 2015.
> >
> > This data can be used for various purposes:
> > • determining the most frequent links people click on for a given article
> > • determining the most common links people followed to an article
> > • determining how much of the total traffic to an article clicked on a
> > link in that article
> > • generating a Markov chain over English Wikipedia
> >
> > We created a page on Meta for feedback and discussion about this release:
> > https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
> >
> > Ellery and Dario
> >
> > _______________________________________________
> > Analytics mailing list
> > [hidden email]
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
> >
> _______________________________________________
> Wiki-research-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>

Re: [Analytics] Wikipedia aggregate clickstream data released

Leila Zia
On Tue, Jan 16, 2018 at 10:38 PM, Gerard Meijssen
<[hidden email]> wrote:
> Hoi,
> Do I understand well that the 3% of "other" links are the ones that have
> articles at *this *time but they did not exist at the time of the dump. So
> in effect they are not red links?

Per the description of "other" in
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Format,
the lines in the data labeled "other" are those where both the
referrer and the requested article exist in Wikipedia at the time the
dumps are created, but the referrer article does not link to the
requested article. This can happen, for example, when the user runs an
internal search and gets to the requested article from the referrer
page.
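
A minimal sketch, in Python with pandas, of isolating those rows; it assumes the prev/curr/type/n column layout of the monthly dumps and a placeholder local file name.

    import pandas as pd

    df = pd.read_csv(
        "clickstream-enwiki-2018-01.tsv.gz",  # placeholder file name
        sep="\t",
        names=["prev", "curr", "type", "n"],
        quoting=3,
    )

    # Pairs where both pages exist but the referer does not link to the
    # requested article at dump time ("other"), e.g. internal-search jumps.
    other = df[df["type"] == "other"]
    print(f"'other' share of pair traffic: {other['n'].sum() / df['n'].sum():.1%}")
    print(other.sort_values("n", ascending=False).head(10))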

The question about redlinks is a separate one, I think, and it goes
back to your question at
https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream#Not_found
. Dario or others closer to the data will be able to comment on
whether redlink data is included in these recurring releases.

> Is there any way to find the articles people were seeking but could not
> find??

If redlinks are included, part of this question can be addressed by
this dataset, but not all.

Best,
Leila

> Thanks,
>      GerardM
>
> On 16 January 2018 at 20:21, Leila Zia <[hidden email]> wrote:
>
>> Hi all,
>>
>> For archive happiness:
>>
>> Clickstream dataset is now being generated on a monthly basis for 5
>> Wikipedia languages (English, Russian, German, Spanish, and Japanese). You
>> can access the data at https://dumps.wikimedia.org/other/clickstream/ and
>> read more about the release and those who contributed to it at
>> https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/
>>
>> Best,
>> Leila
>>
>>
>>
>> --
>> Leila Zia
>> Senior Research Scientist
>> Wikimedia Foundation
>>
>> On Tue, Feb 17, 2015 at 11:00 AM, Dario Taraborelli <
>> [hidden email]> wrote:
>>
>> > We’re glad to announce the release of an aggregate clickstream dataset
>> > extracted from English Wikipedia
>> >
>> > http://dx.doi.org/10.6084/m9.figshare.1305770
>> >
>> > This dataset contains counts of *(referer, article) *pairs aggregated
>> > from the HTTP request logs of English Wikipedia. This snapshot captures
>> 22
>> > million *(referer, article)* pairs from a total of 4 billion requests
>> > collected during the month of January 2015.
>> >
>> > This data can be used for various purposes:
>> > • determining the most frequent links people click on for a given article
>> > • determining the most common links people followed to an article
>> > • determining how much of the total traffic to an article clicked on a
>> > link in that article
>> > • generating a Markov chain over English Wikipedia
>> >
>> > We created a page on Meta for feedback and discussion about this release:
>> > https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
>> >
>> > Ellery and Dario
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > [hidden email]
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>> >
>> >
>> _______________________________________________
>> Wiki-research-l mailing list
>> [hidden email]
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
> _______________________________________________
> Wiki-research-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Analytics] Wikipedia aggregate clickstream data released

Gerard Meijssen-3
In reply to this post by Gerard Meijssen-3
Hoi,
I am a big fan of suggesting that people write articles and do work that
will be read and used. In a blogpost [1], I suggest accumulating these
clickstreams and using the missing popular articles as suggestions for new
articles. Articles that people seek and that are truly missing are obvious
candidates for such suggestions.

My question: how hard is it to do this accumulation and analysis for
missing new articles and combine it with suggestions to authors to write
something that is likely to prove popular? Does this idea have merit?
Thanks,
        GerardM



[1]
https://ultimategerardm.blogspot.nl/2018/01/wikipedia-entering-rabbit-hole.html
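
For what it is worth, the accumulation part of this is straightforward on the published files. Here is a hedged sketch, assuming the monthly-dump column layout and a placeholder file-name pattern; note that truly missing articles never appear as targets in these files, so the dataset by itself cannot list them.

    import glob
    import pandas as pd

    # Accumulate several monthly dumps (placeholder glob pattern) into one
    # per-pair count table; only pages that exist appear as targets, so this
    # measures demand for existing articles, not for missing ones.
    frames = [
        pd.read_csv(path, sep="\t", names=["prev", "curr", "type", "n"], quoting=3)
        for path in glob.glob("clickstream-enwiki-*.tsv.gz")
    ]
    combined = (
        pd.concat(frames)
        .groupby(["prev", "curr", "type"], as_index=False)["n"]
        .sum()
    )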

On 18 January 2018 at 21:37, Joseph Allemandou <[hidden email]>
wrote:

> Hi Gerard,
> Here are my two cents on your questions.
>
> About redlinks, you are correct in saying that the 3% of "other" link-type
> entries are jumps from one page to another (based on the HTTP referer), where
> the hyperlink from the origin to the target allowing such a jump doesn't
> exist in the origin page at the moment of computation.
> From my exploration of the dataset, such "other" links occur with the
> "manually-edited-with-error" URL class (the "-" article has a lot of such
> incoming links, for instance), as well as with links that I think have been
> edited out of the origin page (for instance, in the November 2017 dataset
> there are "other" links from the page "Kevin Spacey" to "Dan Savage",
> "hebephilia", "pedophilia" or "Harvey_Weinstein"; those links are confirmed
> as existing at some point in the page in November, but no longer at the
> beginning of December when the pages' hyperlinks are snapshot).
>
> As for your question about what people are looking for and don't find, the
> one way I can think of to get ideas is to use detailed session analysis
> correlated with search results, in order to try to get a signal of pages
> reached from search and not being visited for long. Even if I think we have
> data we could use in that respect on the cluster, we can't publish such
> details externally for privacy concerns, obviously.
>
> Please let me know if what I say makes sense :)
> Many thanks
> Joseph Allemandou
>
>
>> Hoi,
>> Do I understand well that the 3% of "other" links are the ones that have
>> articles at *this *time but they did not exist at the time of the dump. So
>>
>> in effect they are not red links?
>>
>> Is there any way to find the articles people were seeking but could not
>> find??
>> Thanks,
>>      GerardM
>>
>> On 16 January 2018 at 20:21, Leila Zia <[hidden email]> wrote:
>>
>> > Hi all,
>> >
>> > For archive happiness:
>> >
>> > Clickstream dataset is now being generated on a monthly basis for 5
>> > Wikipedia languages (English, Russian, German, Spanish, and Japanese).
>> You
>> > can access the data at https://dumps.wikimedia.org/other/clickstream/
>> and
>> > read more about the release and those who contributed to it at
>> > https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-
>> clickstream/
>> >
>> > Best,
>> > Leila
>> >
>> >
>> >
>> > --
>> > Leila Zia
>> > Senior Research Scientist
>> > Wikimedia Foundation
>> >
>> > On Tue, Feb 17, 2015 at 11:00 AM, Dario Taraborelli <
>> > [hidden email]> wrote:
>> >
>> > > We’re glad to announce the release of an aggregate clickstream dataset
>> > > extracted from English Wikipedia
>> > >
>> > > http://dx.doi.org/10.6084/m9.figshare.1305770
>> > >
>> > > This dataset contains counts of *(referer, article) *pairs aggregated
>> > > from the HTTP request logs of English Wikipedia. This snapshot
>> captures
>> > 22
>> > > million *(referer, article)* pairs from a total of 4 billion requests
>> > > collected during the month of January 2015.
>> > >
>> > > This data can be used for various purposes:
>> > > • determining the most frequent links people click on for a given
>> article
>> > > • determining the most common links people followed to an article
>> > > • determining how much of the total traffic to an article clicked on a
>> > > link in that article
>> > > • generating a Markov chain over English Wikipedia
>> > >
>> > > We created a page on Meta for feedback and discussion about this
>> release:
>> > > https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
>> > >
>> > > Ellery and Dario
>> > >
>> > > _______________________________________________
>> > > Analytics mailing list
>> > > [hidden email]
>> > > https://lists.wikimedia.org/mailman/listinfo/analytics
>> > >
>> > >
>> > _______________________________________________
>> > Wiki-research-l mailing list
>> > [hidden email]
>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>> >
>> _______________________________________________
>> Wiki-research-l mailing list
>> [hidden email]
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
>>
>> --
>>
>> *Dario Taraborelli  *Director, Head of Research, Wikimedia Foundation
>> wikimediafoundation.org • nitens.org • @readermeter
>> <http://twitter.com/readermeter>
>>
>
>
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal
>