Working with edit history dump


Working with edit history dump

Behzad Tabibian
Hi all,

I am new to working with Wikipedia dumps. I am trying to obtain the full revision history of all articles on Wikipedia. I downloaded enwiki-20140707-pages-meta-history1.xml-*.7z from https://dumps.wikimedia.org/enwiki/20140707/. However, when I look at the XML files, the revision histories of individual articles do not match the revision history one sees on an article's history page on the Wikipedia website. The dump seems to contain a significantly smaller number of revisions than can be found on Wikipedia.

Has anyone experienced this? Am I downloading the wrong files?

Best,
Behzad


Re: Working with edit history dump

Jeremy Baron


This may be a decent place to ask (actually I don't read this list much, so I'm just guessing), but it's probably more relevant at [hidden email]. FYI

-Jeremy



Re: Working with edit history dump

Aaron Halfaker-2
Behzad,

The XML dumps should be complete and reflect the full history of pages in Wikipedia.  Could you give an example of a page from the XML dump that doesn't have the full set of revisions?

-Aaron


Re: Working with edit history dump

Behzad Tabibian
Hi Aaron,

Thanks for your reply. The particular file I worked with is "enwiki-20140707-pages-meta-history5.xml-p000183366p000184999".

The first article appearing in this batch is "Dude Ranch (album)", http://en.wikipedia.org/wiki/Dude_Ranch_%28album%29

The total number of "{http://www.mediawiki.org/xml/export-0.8/}revision" tags for this article in the dump is 20, while Wikipedia's statistics show more than 900 edits, only a very small portion of which were made after 07/2014.

Best,
Behzad
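
A self-contained way to reproduce this kind of count with only the Python standard library might look like the sketch below. It is not from the thread: it assumes the decompressed XML is piped in on stdin (e.g. via bzcat) and reuses the namespace URI quoted above.

"""
Hypothetical sketch: count revisions per page using only the stdlib.
Usage: bzcat dump.xml.bz2 | python count_revs_stdlib.py
"""
import sys
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.8/}"

title = None
revisions = 0
for event, elem in ET.iterparse(sys.stdin.buffer, events=("end",)):
    if elem.tag == NS + "title":
        title = elem.text
    elif elem.tag == NS + "revision":
        revisions += 1
    elif elem.tag == NS + "page":
        print(title, revisions)
        revisions = 0
        elem.clear()  # drop the finished page so memory stays flat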


Re: Working with edit history dump

Behzad Tabibian
In reply to this post by Jeremy Baron
Hi Jeremy,

Thanks for the reply. I will look into that list.

Best,
Behzad


Re: Working with edit history dump

Aaron Halfaker-3
Behzad, 

I just ran a quick script to check the count and it comes out as expected.  Here's the call. 

$ bzcat <snip>/enwiki-20140707-pages-meta-history5.xml-p000183366p000184999.bz2 | python count_revs.py 
183366 Dude Ranch (album) 854
183369 Scrambling 149
183370 Home cinema 821
183371 Hamilton Hume 184
183372 Gertrude Gadwall 13
183373 Critical chain project management 286
183375 Luke The Goose 6
183376 George Robertson, Baron Robertson of Port Ellen 350
183378 Lord Robertson 9
183379 List of rivers of Nova Scotia 120
183380 Scottish whisky 1
183381 Louis XV 1
183382 Talk:List of artists who died of drug-related causes 16
^C

Here's what count_revs.py looks like:

"""
Counts the revisions per page in an XML dump and prints a nice format
"""
import sys
from mw import xml_dump

dump = xml_dump.Iterator.from_file(sys.stdin)

for page in dump:
    revisions = sum(1 for revision in page)
    print(page.id, page.title, revisions)
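
Summing over the page iterator streams revisions one at a time, so even a long page history never has to be held in memory all at once.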


Re: Working with edit history dump

Behzad Tabibian
Thank you so much for looking into this. 
I am going to use the mw package and see why my XML decoding is not producing this result.

Best,
Behzad

Re: Working with edit history dump

Aaron Halfaker-3

pip install mediawiki-utilities  :) 

python 3.x only :/  
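
For a minimal end-to-end example, a sketch built on the one API call shown in this thread (xml_dump.Iterator.from_file) might look like the following; the path and title are illustrative, and piping from bzcat as in the earlier call works just as well.

# Hypothetical sketch using the mw.xml_dump API demonstrated above.
# Requires Python 3 and: pip install mediawiki-utilities
from mw import xml_dump

# The path is illustrative; any decompressed pages-meta-history file works.
with open("enwiki-20140707-pages-meta-history5.xml") as f:
    dump = xml_dump.Iterator.from_file(f)
    for page in dump:
        if page.title == "Dude Ranch (album)":
            # Iterating a page yields its revisions in order.
            print(page.id, sum(1 for _ in page))
            break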

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l