Announcing WikiConv dataset


Yiqing Hua-2
Hi all,

We’re thrilled to announce the release of WikiConv, a multilingual corpus
reconstructing the complete conversational history of multiple Wikipedia
language editions.

The corpus (a collaboration between Jigsaw, Cornell, and the Wikimedia
Foundation) includes over 100M individual conversation threads and 300M
conversational actions extracted from the English, Chinese, German, Greek,
and Russian Wikipedia talk pages.

WikiConv can be used to understand and model conversational turns in online
collaborative spaces, as we showed in an earlier study on predicting when
conversations go awry.

The reconstruction methodology and its possible applications are described
in a paper by Hua et al. recently presented at EMNLP 2018. You can also
watch a video presentation of this work from the Wikimedia Research
Showcase in June 2018.

The corpus is released under CC0 (CC BY-SA for individual comments). All
the underlying code is available in this GitHub repository.

If you have any questions about the dataset, feel free to contact us at
[hidden email].


Wiki-research-l mailing list
[hidden email]