Diff logic used in Wikipedia Detox project

Pinkesh Badjatiya
Hello,

I was exploring the dataset shared by the Wikipedia Detox
<https://meta.wikimedia.org/wiki/Research:Modeling_Talk_Page_Abuse>
project. I tried to use similar diff logic to obtain the changes made to a
page by a given *revid*, but realized that the Wikipedia API provides only
the diff between a revision and its previous version. I am able to fetch the
diffs for a set of *revids* using the Wikipedia API, but I am unable to
extract only the sentences changed in each revision. I found this script
<https://github.com/ewulczyn/wiki-detox/blob/master/src/data_generation/diff_utils.py>
in the project source files. It contains bits of what might have been used
in the actual data collection process to obtain the changes from the Talk
pages, but I am unable to figure out the high-level details, such as the
input and output formats.
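
One way to get the text on both sides of a change, rather than the API's rendered diff, is to request the wikitext of each revision together with its parent ID and diff the two locally. The sketch below assumes the English Wikipedia endpoint and the standard `action=query` / `prop=revisions` parameters; the function names and the User-Agent string are illustrative and are not taken from the Detox code.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia; other wikis expose the same API at their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def revision_query_url(revids):
    """Build a query URL asking for the IDs and wikitext of the given revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(r) for r in revids),
        "rvprop": "ids|content",   # ids includes revid and parentid
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_revisions(revids):
    """Return {revid: (parentid, wikitext)}. Performs a network request."""
    request = urllib.request.Request(
        revision_query_url(revids),
        headers={"User-Agent": "diff-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    result = {}
    for page in data["query"]["pages"]:
        for rev in page.get("revisions", []):
            # Hidden or deleted revisions may have no content; fall back to "".
            text = rev["slots"]["main"].get("content", "")
            result[rev["revid"]] = (rev["parentid"], text)
    return result
```

Calling `fetch_revisions` once for the *revids* of interest and once more for the parent IDs it returns gives the before and after text of each edit, which can then be passed to any diff routine.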

Can anyone provide a solution or suggest how to proceed? It would also be
really helpful if I could use the same diff logic as the original authors,
to ensure consistency.

Meanwhile, I have asked a similar question on Stack Overflow
<https://stackoverflow.com/questions/46010675/extract-changes-from-wikipedia-wikimedia-revision-pages>
and emailed the original Wikimedia author of the paper.


Regards,
Pinkesh Badjatiya
[hidden email]
IIIT Hyderabad
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Diff logic used in Wikipedia Detox project

Aaron Halfaker-2
I believe that Ellery's work used my mwdiffs library, which is largely based
on deltas.

http://pythonhosted.org/mwdiffs/
http://pythonhosted.org/deltas/
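
To illustrate the kind of token-level diff such libraries compute, here is a minimal sketch that uses Python's standard `difflib` as a stand-in. It is not the mwdiffs or deltas API, and it will not necessarily reproduce the Detox project's output; the tokenizer and the function names are assumptions made for the example.

```python
import difflib
import re


def tokenize(text):
    """Split text into words, whitespace runs, and single punctuation characters."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def inserted_text(parent_text, revision_text):
    """Return the runs of text present in the revision but not in its parent."""
    a = tokenize(parent_text)
    b = tokenize(revision_text)
    # autojunk=False stops difflib from ignoring very frequent tokens on long pages.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    added = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute new tokens on the revision side.
        if tag in ("insert", "replace"):
            added.append("".join(b[b1:b2]))
    return added


parent = "Hello there.\n"
revision = "Hello there.\nI disagree with this edit.\n"
print(inserted_text(parent, revision))
```

Given the wikitext of a Talk page revision and of its parent, the returned strings approximate the comment text added by that edit. Moved text is treated as a deletion plus an insertion here, which is one of the cases a dedicated diff library handles more carefully.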

On Sun, Sep 3, 2017 at 2:54 PM, Pinkesh Badjatiya <[hidden email]> wrote:
> [...]