Fwd: [Analytics] Wikipedia aggregate clickstream data released

Fwd: [Analytics] Wikipedia aggregate clickstream data released

Dario Taraborelli-3
Forwarding a reply from Joseph that somehow didn't go through.

---------- Forwarded message ----------
From: Joseph Allemandou <[hidden email]>
To: Research into Wikimedia content and communities <wiki-research-l@lists.wikimedia.org>, [hidden email]

Hi Gerard,
Here are my two cents on your questions.

About redlinks, you are correct in saying that the 3% of "other" link-type
entries are jumps from one page to another (detected using the HTTP referer)
where the hyperlink from the origin to the target allowing for such a jump
doesn't exist in the origin page at the moment of computation.
From my exploration of the dataset, such "other" links happen with the
"manually-edited-with-error" URL class (the "-" article has a lot of such
incoming links, for instance), as well as with links that I think have since
been edited out of the origin page (for instance, in the November 2017
dataset there are "other" links from the page "Kevin Spacey" to "Dan Savage",
"hebephilia", "pedophilia" or "Harvey_Weinstein". Those links are confirmed
as having existed at some point in the page in November, but no longer at the
beginning of December, when the pages' hyperlinks are snapshotted).
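
To check this kind of thing on a monthly file, a minimal Python sketch
follows. It assumes each line of the clickstream file has four tab-separated
fields (prev, curr, type, n), with type being one of "link", "external" or
"other". The function names are hypothetical and the sample counts are
invented for illustration; they are not taken from the real dataset.

```python
from collections import Counter

# Illustrative rows only: the counts below are invented for this sketch,
# and "Some_Linked_Article" is a placeholder title.
SAMPLE = "\n".join([
    "Kevin_Spacey\tDan_Savage\tother\t30",
    "Kevin_Spacey\tSome_Linked_Article\tlink\t900",
    "other-search\tKevin_Spacey\texternal\t70",
])

def parse_rows(tsv_text):
    """Yield (prev, curr, link_type, n) tuples, skipping malformed lines."""
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 4 or not fields[3].isdigit():
            continue
        prev, curr, link_type, n = fields
        yield prev, curr, link_type, int(n)

def link_type_shares(tsv_text):
    """Return each link type's share of the total click count."""
    counts = Counter()
    for _prev, _curr, link_type, n in parse_rows(tsv_text):
        counts[link_type] += n
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

def other_targets(tsv_text, origin):
    """List (target, clicks) for "other" jumps leaving origin, busiest first."""
    hits = [(curr, n) for prev, curr, link_type, n in parse_rows(tsv_text)
            if prev == origin and link_type == "other"]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Calling link_type_shares(SAMPLE) gives the share of each link type in the
sample, and other_targets(SAMPLE, "Kevin_Spacey") lists the "other" jumps
leaving that page.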

As for your question about what people are looking for and don't find, the
one way I can think of to get ideas is to use detailed session analysis
correlated with search results, in order to try to get a signal of pages
reached from search and not visited for long. Even though I think we have
data on the cluster that we could use for this, we can't publish such
details externally because of privacy concerns, obviously.

Please let me know if what I say makes sense :)
Many thanks
Joseph Allemandou


> Hoi,
> Do I understand correctly that the 3% of "other" links are the ones that
> have articles at *this* time but did not exist at the time of the dump? So
> in effect they are not red links?
>
> Is there any way to find the articles people were seeking but could not
> find?
> Thanks,
>      GerardM
>
> On 16 January 2018 at 20:21, Leila Zia <[hidden email]> wrote:
>
> > Hi all,
> >
> > For archive happiness:
> >
> > Clickstream dataset is now being generated on a monthly basis for 5
> > Wikipedia languages (English, Russian, German, Spanish, and Japanese). You
> > can access the data at https://dumps.wikimedia.org/other/clickstream/ and
> > read more about the release and those who contributed to it at
> > https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/
> >
> > Best,
> > Leila
> >
>
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: Fwd: [Analytics] Wikipedia aggregate clickstream data released

Jonathan Cardy
I don't see a privacy issue in creating a listing of common/frequent search terms. Obviously we don't need the data on who is making these searches, nor do we need the "long tail" of things only searched for by a small number of people. Aside from being clutter, some of those search terms could well involve privacy violations.

But a list of frequently accessed redlinks, or indeed frequent null-result searches, would be useful. In the past we have had lists of frequently occurring redlinks on English Wikipedia. Once such a list is created and published, people will come forward to process it and create articles or redirects where it is sensible to do so.
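
As a rough illustration of the kind of listing I mean, here is a minimal Python sketch. It assumes the input is simply a sequence of search strings with no user information attached. The function name and the min_count and top parameters are hypothetical, and the right threshold for cutting off the long tail would need to be decided by people who know the data.

```python
from collections import Counter

def frequent_terms(terms, min_count=100, top=50):
    """Return up to `top` (term, count) pairs seen at least `min_count` times.

    Terms are lower-cased and stripped so trivial variants are merged.
    Anything below min_count (the "long tail") is dropped entirely.
    """
    counts = Counter(t.strip().lower() for t in terms if t.strip())
    return [(term, c) for term, c in counts.most_common(top) if c >= min_count]
```

For example, frequent_terms(searches, min_count=1000) would keep only terms searched for at least 1000 times in whatever period the input covers.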

Regards

Jonathan


> On 19 Jan 2018, at 20:36, Dario Taraborelli <[hidden email]> wrote:
>
> Forwarding a reply from Joseph that somehow didn't go through.
