Access to older wikipedia dumps


Access to older wikipedia dumps

Delip Rao
Hi,

How do I get access to older wikipedia dumps? In particular, I am looking for the dump from 9/11/2006. Any help is much appreciated.

Thanks,
Delip 

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Access to older wikipedia dumps

Cormac Lawler
2009/11/20 Delip Rao <[hidden email]>
> Hi,
>
> How do I get access to older wikipedia dumps? In particular, I am looking for the dump from 9/11/2006. Any help is much appreciated.

I had thought all dumps were archived at download.wikimedia.org - but clearly not. Internet Archive have some <http://www.archive.org/search.php?query=enwiki>, but not the specific one you're looking for. Perhaps one of the techies (eg Greg Maxwell, Domas) keep historical records? But this would be quite an oversight if there are no backups on Wikimedia's servers.

Cormac


Re: Access to older wikipedia dumps

Gregory Maxwell
On Fri, Nov 20, 2009 at 9:04 AM, Cormac Lawler <[hidden email]> wrote:
> I had thought all dumps were archived at download.wikimedia.org - but
> clearly not. Internet Archive have some
> <http://www.archive.org/search.php?query=enwiki>, but not the specific one
> you're looking for. Perhaps one of the techies (eg Greg Maxwell, Domas) keep
> historical records? But this would be quite an oversight if there are no
> backups on Wikimedia's servers.

The closest I appear to have is enwiki-20060702-pages-meta-history.xml.7z


Re: Access to older wikipedia dumps

Gerard Meijssen-3
In reply to this post by Cormac Lawler
Hoi,
At the time, it was announced that the WMF would not retain old backups. This was due to the cost of storage then and the low perceived value of these backups.
Thanks,
     GerardM

2009/11/20 Cormac Lawler <[hidden email]>
> 2009/11/20 Delip Rao <[hidden email]>
>> Hi,
>>
>> How do I get access to older wikipedia dumps? In particular, I am looking for the dump from 9/11/2006. Any help is much appreciated.
>
> I had thought all dumps were archived at download.wikimedia.org - but clearly not. Internet Archive have some <http://www.archive.org/search.php?query=enwiki>, but not the specific one you're looking for. Perhaps one of the techies (eg Greg Maxwell, Domas) keep historical records? But this would be quite an oversight if there are no backups on Wikimedia's servers.
>
> Cormac


Re: Access to older wikipedia dumps

Denny Vrandecic-3
In reply to this post by Delip Rao
The newer dump should include almost all material from the older dumps, so the older dumps are redundant.

You can just get the fresh dumps and query appropriately.

Hth, denny
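
A minimal sketch of what "query appropriately" might look like, assuming a full-history dump (pages-meta-history) in the standard MediaWiki XML export format is at hand: for each page, keep the latest revision whose timestamp is at or before a cutoff. The function name snapshot_at, the sample XML, and the cutoff date are illustrative assumptions and do not come from the thread. This approach cannot recover pages or revisions that were deleted after the older dump was made.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, which differs between export schema versions."""
    return tag.rsplit("}", 1)[-1]


def snapshot_at(source, cutoff):
    """Yield (title, timestamp, text) for each page's latest revision at or before cutoff.

    `cutoff` is an ISO 8601 UTC string such as "2006-09-11T00:00:00Z"; timestamps
    in that fixed format compare correctly as plain strings.
    """
    title, best = None, None
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            fields = {local(child.tag): child.text for child in elem}
            ts = fields.get("timestamp")
            if ts and ts <= cutoff and (best is None or ts > best[0]):
                best = (ts, fields.get("text") or "")
            elem.clear()  # drop revision text once inspected; history dumps are large
        elif name == "page":
            if best is not None:
                yield title, best[0], best[1]
            title, best = None, None
            elem.clear()


# Made-up sample data in the shape of a MediaWiki export (hypothetical content).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <revision><id>1</id><timestamp>2006-07-01T10:00:00Z</timestamp><text>old</text></revision>
    <revision><id>2</id><timestamp>2006-09-10T08:30:00Z</timestamp><text>middle</text></revision>
    <revision><id>3</id><timestamp>2009-11-01T00:00:00Z</timestamp><text>new</text></revision>
  </page>
  <page>
    <title>Created later</title>
    <revision><id>4</id><timestamp>2008-01-01T00:00:00Z</timestamp><text>x</text></revision>
  </page>
</mediawiki>"""

# The cutoff below is only an example date, not a confirmed dump date.
result = list(snapshot_at(io.BytesIO(SAMPLE.encode("utf-8")), "2006-09-11T00:00:00Z"))
print(result)
```

Pages created after the cutoff are skipped entirely, so the second sample page does not appear in the result. For a real dump, the source would be an opened (and decompressed) file object rather than an in-memory string.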



On Nov 20, 2009, at 6:43, Delip Rao wrote:

> Hi,
>
> How do I get access to older wikipedia dumps? In particular, I am looking for the dump from 9/11/2006. Any help is much appreciated.
>
> Thanks,
> Delip

Re: Access to older wikipedia dumps

Delip Rao
In reply to this post by Gregory Maxwell


On Fri, Nov 20, 2009 at 9:09 AM, Gregory Maxwell <[hidden email]> wrote:
> On Fri, Nov 20, 2009 at 9:04 AM, Cormac Lawler <[hidden email]> wrote:
>> I had thought all dumps were archived at download.wikimedia.org - but
>> clearly not. Internet Archive have some
>> <http://www.archive.org/search.php?query=enwiki>, but not the specific one
>> you're looking for. Perhaps one of the techies (eg Greg Maxwell, Domas) keep
>> historical records? But this would be quite an oversight if there are no
>> backups on Wikimedia's servers.
>
> The closest I appear to have is enwiki-20060702-pages-meta-history.xml.7z

Thanks everyone for the replies. I was looking for the 9/11/2006 dump to reproduce a previous research result. But I think 20060702 should be close enough. 

Gregory, do you have a URL where I can access it?

Best,
Delip
 

Re: Access to older wikipedia dumps

Anthony-73
In reply to this post by Denny Vrandecic-3
On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic
<[hidden email]> wrote:
> The newer dump should include almost all material from the older dumps, so the older dumps are redundant.

Almost redundant :).

> You can just get the fresh dumps and query appropriately.

Except for the one that you can't get.


Re: Access to older wikipedia dumps

Denny Vrandecic-3

On Nov 20, 2009, at 16:38, Anthony wrote:

> On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic
> <[hidden email]> wrote:
>> The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
>
> Almost redundant :).
>

Correct -- there is a small amount of data that is *really* deleted, but my gut feeling is that this is less than 0.1% of all revisions. This would need some evaluation, though.

Or do you mean something else?

>> You can just get the fresh dumps and query appropriately.
>
> Except for the one that you can't get.

Re: Access to older wikipedia dumps

Anthony-73
On Fri, Nov 20, 2009 at 10:42 AM, Denny Vrandecic
<[hidden email]> wrote:

>
> On Nov 20, 2009, at 16:38, Anthony wrote:
>
>> On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic
>> <[hidden email]> wrote:
>>> The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
>>
>> Almost redundant :).
>>
>
> Correct -- there is a small amount of data that is *really* deleted, but my gut feeling is that this is less than 0.1% of all revisions. This would need some evaluation, though.
>
> Or do you mean something else?

No, that's what I mean, though I'm not sure if it's less than 0.1% (I
don't have any guess at all on the percentage).  When an article is
"deleted" (set as deleted by an admin, which isn't even *really*
deleted), all revisions are removed from the public portion of the
database, which is where the dump comes from.  Then, making up a much
much smaller portion of the material that isn't there, there are
oversighted revisions and individually deleted revisions.

I believe page moves (after a certain date?) are recorded in the logs.
They wouldn't be in the history dump itself, but they could
potentially be reconstructed by reading the logs.

The main thing that would be missing, and that can't be reconstructed
from the newer dumps, would be deleted articles.  0.1%, weighted by
number of revisions?  I have absolutely no idea.  I think the number
of deleted revisions is available to the public (through a toolserver
app) though, so we could probably calculate it.


Re: Access to older wikipedia dumps

Anthony-73
On Fri, Nov 20, 2009 at 10:57 AM, Anthony <[hidden email]> wrote:
> The main thing that would be missing, and that can't be reconstructed
> from the newer dumps, would be deleted articles.  0.1%, weighted by
> number of revisions?  I have absolutely no idea.

By the way, depending on what you're using the data for, this may or
may not be significant.  For instance, if you're measuring vandalism,
even a small percentage of missing data might be significant, because
there is likely to be a high correlation between articles which are
deleted and articles which were vandalized.


Re: Access to older wikipedia dumps

Jona Christopher Sahnwaldt
In reply to this post by Anthony-73
On Fri, Nov 20, 2009 at 16:38, Anthony <[hidden email]> wrote:
> On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic <[hidden email]> wrote:
>> The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
>
> Almost redundant :).
>
>> You can just get the fresh dumps and query appropriately.
>
> Except for the one that you can't get.

I think the main problem is that for enwiki, only the current
page text is included in the dump, not the older revisions.

pages-meta-history.xml is supposed to contain the old
revisions, but for enwiki, it can't be downloaded anymore.
I believe it simply got too big. For example, the current enwiki
dump progress page [1] displays "ETA 2010-02-12 17:21:11"
for pages-meta-history.xml.bz2, and the pages for completed
dumps, e.g. [2], don't include pages-meta-history.xml at all.

For the smaller wikis, e.g. dewiki [3], pages-meta-history.xml
is still available.

Christopher

[1] http://download.wikimedia.org/enwiki/20091103/
[2] http://download.wikimedia.org/enwiki/20091026/
[3] http://download.wikimedia.org/dewiki/20091110/


Re: Access to older wikipedia dumps

Felipe Ortega


--- On Fri, 20/11/09, Jona Christopher Sahnwaldt <[hidden email]> wrote:

> From: Jona Christopher Sahnwaldt <[hidden email]>
>
> pages-meta-history.xml is supposed to contain the old
> revisions, but for enwiki, it can't be downloaded anymore.

As far as I know, this is not a precise statement. It's not that they can't be downloaded *anymore*, but they can't be retrieved *yet*.

WMF tech staff have been working on this issue over the past months, and they have promised us several times that, as soon as they find an appropriate solution for the complexity of this task, complete dumps for enwiki will be available again for all of us researchers eagerly waiting to get our hands on them :-).

Best,
Felipe.
