Most reverted pages in the en-wikipedia (enwiki-20100130 dump)


Dmitry Chichkov-3
If anybody is interested, I've made a list of the 'most reverted pages' in the English Wikipedia, based on an analysis of the enwiki-20100130 dump. Here is the list:
http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
http://wpcvn.com/enwiki-20100130.most.reverted.txt

This list was calculated using the following sampling criteria:
* All pages from the enwiki-20100130 dump;
** Keeping only pages with more than 1000 revisions;
** Keeping only pages with a revert ratio > 0.3;
* Sorted by revert ratio, in descending order.
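
As a rough illustration of the criteria above, here is a minimal sketch in Python 3. It assumes that per-page revision and revert counts have already been computed. The `PageStats` type, the `most_reverted` function and the sample pages are hypothetical and are not taken from the actual script.

```python
from typing import NamedTuple


class PageStats(NamedTuple):
    """Hypothetical per-page counters (not the script's real data structure)."""
    title: str
    revisions: int  # total number of revisions of the page
    reverts: int    # number of revisions classified as reverts


def most_reverted(pages, min_revisions=1000, min_ratio=0.3):
    """Return (ratio, title) pairs for heavily reverted pages, highest ratio first."""
    selected = []
    for page in pages:
        if page.revisions <= min_revisions:
            continue  # keep only pages with more than min_revisions revisions
        ratio = page.reverts / page.revisions
        if ratio > min_ratio:
            selected.append((ratio, page.title))
    return sorted(selected, reverse=True)


# Made-up sample data, for illustration only.
sample = [
    PageStats("Page A", 2000, 900),   # ratio 0.45, kept
    PageStats("Page B", 4000, 1600),  # ratio 0.40, kept
    PageStats("Page C", 500, 400),    # too few revisions, dropped
    PageStats("Page D", 5000, 100),   # ratio 0.02, dropped
]
ranking = most_reverted(sample)
```

Both thresholds are strict inequalities here, matching "more than 1000" and "> 0.3" in the criteria.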

A page revision is considered a revert if an earlier revision of the page has a matching MD5 checksum.
BTW, if anybody needs it, the Python code that identifies reverts, revert wars, self-reverts, etc. is available (LGPL).
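
The MD5 rule can be sketched in a few lines of Python 3. This is only an illustration of the rule as stated, not the released code. The function name `find_reverts` and the sample history are made up, and revision texts are assumed to be already loaded as strings in chronological order.

```python
import hashlib


def find_reverts(revision_texts):
    """Return indices of revisions whose text matches some earlier revision."""
    seen = set()   # MD5 digests of all revisions seen so far
    reverts = []
    for index, text in enumerate(revision_texts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            reverts.append(index)  # same content existed before: a revert
        seen.add(digest)
    return reverts


# Made-up page history: revisions 2 and 4 restore the text of revision 0.
history = ["A", "A vandalised", "A", "A plus fix", "A"]
revert_indices = find_reverts(history)
```

Note that under this rule a revision identical to the one immediately before it would also be counted as a revert.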

-- Regards, Dmitry

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

Luca de Alfaro-3
Thanks, this is great fun!  As an Italian, let me quote: 

(0.42525520906166969, (7151, 3041, 59, 514, 63, 2519, 955), 'Penis')
(0.42516069788797062, (1089, 463, 29, 27, 16, 470, 84), 'Inner core')
(0.42490272373540855, (1285, 546, 11, 64, 27, 515, 122), 'Stuff')
(0.42477231329690346, (2745, 1166, 28, 110, 46, 1054, 341), 'Gun')
(0.42474916387959866, (2990, 1270, 37, 149, 23, 1190, 321), 'Monkey')
(0.42443438914027148, (1105, 469, 20, 21, 2, 427, 166), 'Incas')
(0.42433090024330899, (2055, 872, 39, 45, 15, 825, 259), 'Italian Renaissance')
(0.42375950742484608, (2761, 1170, 34, 94, 24, 978, 461), 'Watermelon')
(0.42362613587191694, (2311, 979, 22, 121, 19, 937, 233), 'Puppy')
(0.4235686492495831, (1799, 762, 20, 83, 34, 669, 231), 'Crap')

It is absolutely great to see that Italian Renaissance (with Incas) is one of the few cultural topics that makes it as high in the list as the usual excrement-sex-infantile type of things!!

Luca

On Fri, Aug 13, 2010 at 1:12 PM, Dmitry Chichkov <[hidden email]> wrote:

Re: Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

Dmitry Chichkov-3
Yes, working with large data sets is fun. There are always surprises. For example, the top of the list of 'words most reverted by trusted users' is not the expected infantile type of thing either. I haven't done the analysis on the full dump yet, but on a subset (the full histories of the articles in the PAN 10 LAB test set) the following tokens came out on top. They are sorted by chi-square; note that this is very preliminary, and the tokenization/regularization might be wrong:

token                 chi-sq         regular-diff-tok-cnt  revert-diff-tok-cnt
Image:Example.jpg|Ca  87959701.9043  113                   7568
[[Media:Example.ogg]  5549492.56549  62                    2196
title]][http://www.e  606182.771025  0                     363
aaaaaaaaaaaaaaaaaaaa  305908.640902  0                     365
[http://youtube.com/  253267.5237    189                   407
pooooooooooooooooooo  214921.014803  0                     375
you                   154597.596739  18655                 102007
ffffffffffffffffffff  129822.419517  1                     238
value="transparent">  129575.702482  1                     168
____________________  126503.143626  23                    166
language|Macedonian]  123467.452157  121                   164
hhhhhhhhhhhhhhhhhhhh  119613.359035  0                     280
!!!!!!!!!!!!!!!!!!!!  118479.373501  5                     686
AAAAAAAAAAAAAAAAAAAA  114581.582068  2                     158
oooooooooooooooooooo  110263.074451  0                     155
i                     109590.406785  2620                  55971
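
For readers unfamiliar with the ranking, one common way to score a token is a Pearson chi-square test on a 2x2 table (this token vs. all other tokens, revert diffs vs. regular diffs). The Python 3 sketch below shows that textbook statistic only. Whether the numbers above were computed exactly this way is an assumption, and the total token counts used in the example are made up, since the post does not give them.

```python
def chi_square_2x2(tok_regular, tok_revert, total_regular, total_revert):
    """Pearson chi-square statistic for one token.

    tok_regular / tok_revert: occurrences of the token in regular / revert diffs.
    total_regular / total_revert: total token counts in regular / revert diffs
    (hypothetical inputs; the post does not list these totals).
    """
    a = tok_revert                   # this token, revert diffs
    b = tok_regular                  # this token, regular diffs
    c = total_revert - tok_revert    # other tokens, revert diffs
    d = total_regular - tok_regular  # other tokens, regular diffs
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denominator


# Made-up counts: the token is twice as frequent in revert diffs.
score = chi_square_2x2(tok_regular=10, tok_revert=20,
                       total_regular=100, total_revert=100)
```

A token that is equally frequent in both kinds of diffs scores 0; the more lopsided the counts, the higher the score.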

-- Cheers, Dmitry




On Fri, Aug 13, 2010 at 4:06 PM, Luca de Alfaro <[hidden email]> wrote:

Re: Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

John Mark Vandenberg
On Sat, Aug 14, 2010 at 6:12 AM, Dmitry Chichkov <[hidden email]> wrote:
> If anybody is interested, I've made a list of 'most reverted pages' in the
> english wikipedia based on the analysis of the enwiki-20100130 dump. Here is
> the list:
> http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
> http://wpcvn.com/enwiki-20100130.most.reverted.txt

Lovely!

This could be used to add semi-protection or pending-changes to reduce
the amount of unnecessary work.

Is it easy to limit this to reverts within a period, such as the last 12 months?

It would also be useful to filter out irregular edit-wars, or pages
which were subject to frequent reverts, but have become stable.

--
John Vandenberg


Re: Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

Dmitry Chichkov-3

Yes. It is fairly easy to produce the list limited to a time period, or any other custom stats (e.g. 'reverted edit ratios' for anonymous users, etc.). It's just several hours of processing. But it is limited to the time frame of the most recent database dump, which for en-wiki is 2010/01/30. Send your complaints to xmldatadumps-l ([hidden email]) ;)
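
A minimal Python 3 sketch of restricting the statistic to a time period follows. It is not code from the script. The function name and the `(timestamp, text)` input format are assumptions, and it reuses the MD5 rule from earlier in the thread. Revisions before the window are still hashed, so a revert inside the window to older content is counted.

```python
import hashlib
from datetime import datetime


def revert_ratio_in_window(revisions, start, end):
    """Revert ratio over revisions with start <= timestamp < end.

    revisions: iterable of (timestamp, text) pairs in chronological order
    (a hypothetical input format chosen for this sketch).
    """
    seen = set()
    total = 0
    reverts = 0
    for timestamp, text in revisions:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if start <= timestamp < end:
            total += 1
            if digest in seen:
                reverts += 1
        seen.add(digest)  # older revisions remain valid revert targets
    return reverts / total if total else 0.0


# Made-up history; the window is the 12 months before the 2010-01-30 dump.
history = [
    (datetime(2008, 5, 1), "A"),
    (datetime(2009, 3, 1), "B"),
    (datetime(2009, 6, 1), "A"),
    (datetime(2009, 9, 1), "C"),
    (datetime(2009, 12, 1), "A"),
]
ratio = revert_ratio_in_window(history, datetime(2009, 1, 30), datetime(2010, 1, 30))
```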

By the way, I've posted a (somewhat cleaned-up) Python script that I used to calculate that list. It's available here:
http://code.google.com/p/pymwdat/

Processing the en-wiki dump requires:
* a 31 GB download (enwiki-20100130-pages-meta-history.xml.7z);
* 250 GB of free disk space (for intermediate data & dump);
* about a week to pre-process the dump (on a modern desktop);
* ~3 hours to do a simple run (e.g. calculate a list like the one I did).

Preprocessing the dump basically means extracting/parsing the .xml.7z, calculating MD5s for page revisions, calculating page diffs, and pickling the results (along with other metadata) to disk. It uses a custom diff algorithm optimized for Wikipedia (a regular diff is way too slow and doesn't handle copy editing well).
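
To give an idea of the parsing and MD5 part of that step, here is a minimal Python 3 sketch using only the standard library. It is not the pymwdat code. The function names are made up, the sample XML is a tiny hand-written stand-in for a dump, and its namespace URI is a placeholder. The diffing and pickling stages are left out.

```python
import hashlib
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revision_md5s(stream):
    """Yield (page_title, revision_id, md5_hex) from a MediaWiki XML export stream."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = None
            text = ""
            for child in elem:  # direct children only, so contributor ids are skipped
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "text":
                    text = child.text or ""
            yield title, rev_id, hashlib.md5(text.encode("utf-8")).hexdigest()
            elem.clear()  # drop the revision text to keep memory use down
        elif name == "page":
            elem.clear()


# Tiny hand-written sample; the namespace URI is a placeholder.
SAMPLE = b"""<mediawiki xmlns="http://example.org/export/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision><id>10</id><contributor><id>7</id></contributor><text>first</text></revision>
    <revision><id>11</id><text>second</text></revision>
    <revision><id>12</id><text>first</text></revision>
  </page>
</mediawiki>"""
rows = list(iter_revision_md5s(io.BytesIO(SAMPLE)))
```

In the sample, revisions 10 and 12 get the same digest, which is exactly what the revert rule looks for. For a real dump the stream would have to come from the decompressed .xml.7z rather than an in-memory buffer.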

It needs a lot of memory if you want to calculate/hold stats for every editor/page (4 GB minimum, 8 GB recommended, 24 GB+ preferred).
But obviously you can filter out a data subset for yourself, or even work on a single page.

System/library requirements:
* Python 2.6+, Linux (I've never tried it on Windows);
* PyWikipedia/Trunk ( http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/ );
* OrderedDict (available in Python 2.7 or from http://pypi.python.org/pypi/ordereddict/);
* 7-Zip (command line 7za).

-- Dmitry




On Thu, Aug 19, 2010 at 8:46 AM, John Vandenberg <[hidden email]> wrote: