XML dumps

XML dumps

Lars Aronsson
The XML database dumps are missing all through May, apparently
because of a memory leak that is being worked on, as described
here,
https://phabricator.wikimedia.org/T98585

However, that information doesn't reach the person who wants to
download a fresh dump and looks here,
http://dumps.wikimedia.org/backup-index.html

I think it should be possible to make a regular schedule for
when these dumps should be produced, e.g. once each month or
once every second month, and treat any delay as a bug. The
process to produce them has been halted by errors many times
in the past, and even when it runs as intended the interval
is unpredictable. Now, when there is a bug, all dumps are
halted, i.e. much delayed. For a user of the dumps, this is
extremely frustrating. With proper release management, it
should be possible to run the old version of the process
until the new version has been tested, first on some smaller
wikis, and gradually on the larger ones.


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se



_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: XML dumps

Gerard Meijssen-3
hear hear
Gerard

On 29 May 2015 at 01:52, Lars Aronsson <[hidden email]> wrote:

> [...]

Re: XML dumps

Matthew Flaschen-2
In reply to this post by Lars Aronsson


On 05/28/2015 07:52 PM, Lars Aronsson wrote:
> With proper release management, it
> should be possible to run the old version of the process
> until the new version has been tested, first on some smaller
> wikis, and gradually on the larger ones.

I understand your frustration; however, release management was not the
issue in this case. According to Ariel Glenn on the task
(https://phabricator.wikimedia.org/T98585#1284441), "It's not a new
leak, it's just that the largest single stubs file in our dumps runs is
now produced by wikidata!".

I.e., it was caused by changes to the input data (our projects), not by
changes to the code.

Matt Flaschen


Wikipedia dumps

Neil Harris
Hello! I've noticed that no enwiki dump seems to have been generated so
far this month. Is this by design, or has there been some sort of dump
failure? Does anyone know when the next enwiki dump might happen?

Kind regards.

Neil Harris



Re: Wikipedia dumps

Bernardo Sulzbach
On Sun, Jan 10, 2016 at 9:55 PM, Neil Harris <[hidden email]> wrote:
> Hello! I've noticed that no enwiki dump seems to have been generated so far
> this month. Is this by design, or has there been some sort of dump failure?
> Does anyone know when the next enwiki dump might happen?
>

I would also be interested.

--
Bernardo Sulzbach


Re: Wikipedia dumps

Tilman Bayer
On Sun, Jan 10, 2016 at 4:05 PM, Bernardo Sulzbach <[hidden email]> wrote:

> I would also be interested.

CCing the Xmldatadumps mailing list
<https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l>, where someone
has already posted
<https://lists.wikimedia.org/pipermail/xmldatadumps-l/2016-January/001214.html>
about what might be the same issue.

--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB

Re: Wikipedia dumps

Bernardo Sulzbach
On Mon, Jan 11, 2016 at 3:22 AM, Tilman Bayer <[hidden email]> wrote:
> CCing the Xmldatadumps mailing list
> <https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l>, where
> someone has already posted
> <https://lists.wikimedia.org/pipermail/xmldatadumps-l/2016-January/001214.html>
> about what might be the same issue.

For some reason, I did not subscribe to that list. Thanks for pointing it out.

--
Bernardo Sulzbach


Re: Wikipedia dumps

Ariel Glenn WMF
That would be me; I need to push some changes through for this month, but I
was either travelling or at the dev summit/allstaff. I'm pretty jetlagged,
but I'll likely be doing that tonight, given I woke up at 5 pm :-D

A.

On Mon, Jan 11, 2016 at 4:20 PM, Bernardo Sulzbach <[hidden email]> wrote:

> For some reason, I did not subscribe to that list. Thanks for pointing it
> out.

Re: [Xmldatadumps-l] Wikipedia dumps

gnosygnu
In reply to this post by Tilman Bayer
Basically, the XML dumps have two IDs: page_id and revision_id.

The page_id points to the article. In this case, 14640471 is the page_id
for Mars (https://en.wikipedia.org/wiki/Mars)

The revision_id points to the latest revision of the article. For Mars,
the latest revision_id is 699008434, which was generated on 2016-01-09
(https://en.wikipedia.org/w/index.php?title=Mars&oldid=699008434). Note that
a new revision_id is generated every time a page is edited.

So, to answer your question, the IDs never change. 14640471 will always
point to Mars, while 699008434 points to the 2016-01-09 revision for Mars.

That said, different dumps will have different revision_ids, because an
article may be updated. If Mars gets updated tomorrow, and the English
Wikipedia dump is generated afterwards, then that dump will list Mars with
a new revision_id (something higher than 699008434). However, that dump
will still show Mars with a page_id of 14640471. You're probably better off
using the page_id.

Finally, you can also use the MediaWiki API to get a view similar to the
dump's. For example:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Mars&rvprop=content|ids

Hope this helps.
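To make the two IDs concrete, here is a minimal sketch (not from the thread)
that pulls page_id and revision_id out of a MediaWiki export-style snippet
using Python's standard library. Real dump files declare an XML namespace on
the <mediawiki> root, which is omitted here for brevity and would require
namespace-qualified tag names in the find() calls.

```python
# Minimal sketch: read page_id and revision_id from a MediaWiki
# export-style XML snippet. Real dumps add an xmlns declaration,
# which this simplified example leaves out.
import xml.etree.ElementTree as ET

SNIPPET = """<mediawiki>
  <page>
    <title>Mars</title>
    <id>14640471</id>
    <revision>
      <id>699008434</id>
      <timestamp>2016-01-09T00:00:00Z</timestamp>
    </revision>
  </page>
</mediawiki>"""

root = ET.fromstring(SNIPPET)
page = root.find("page")
page_id = int(page.find("id").text)           # stable: identifies the article
rev_id = int(page.find("revision/id").text)   # changes with every edit
print(page.find("title").text, page_id, rev_id)
```

Scripts that track articles across dumps would key their data on page_id,
per the advice above, and treat rev_id only as a marker of which snapshot
of the text they have.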


On Mon, Jan 11, 2016 at 5:09 AM, Luigi Assom <[hidden email]> wrote:

> yep, same here!
>
> Also another question about consistency of _IDs in time.
> I was working with an old version of a Wikipedia dump, and testing some
> data models I built on the dump, using a few topics as pivots.
> I might have corrupted data on my side, but just to be sure:
> are the _IDs of articles *persistent* over time, or are they subject to
> change?
>
> Might it happen that, due to a rollback or merge in an article's history,
> the ID would change?
> E.g. the test article "Mars" would first point to a version _ID "4285430"
> and then change to "14640471"
>
> I need to ensure _IDs will persist.
> thank you!
>
>
> *P.S. sorry for cross posting - I've replied from wrong email - could you
> please delete the other message and keep only this email address? thank
> you! *
>
> --
> *Luigi Assom*
>
> T +39 349 3033334 | +1 415 707 9684
> Skype oggigigi
>
> _______________________________________________
> Xmldatadumps-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
>

Re: [Xmldatadumps-l] Wikipedia dumps

Bartosz Dziewoński
On 2016-01-11 22:06, gnosygnu wrote:
> So, to answer your question, the IDs never change. 14640471 will always
> point to Mars, while 699008434 points to the 2016-01-09 revision for Mars.

While it's unlikely/rare, I think the page id can change when a page is
deleted and re-created, and maybe some other cases. MediaWiki tries to
keep it constant (for example, I think it's preserved after deletion and
undeletion), but it's not always possible. It should be fine to use to
track pages across renames, though, at least most of the time.


> That said, different dumps will have different revision_ids, because an
> article may be updated. If Mars gets updated tomorrow, and the English
> Wikipedia dump is generated afterwards, then that dump will list Mars with
> a new revision_id (something higher than 699008434).

Please don't assume that revision ids are increasing. Weird things can
happen with import, export, and page history merges :)
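A safe pattern, then, is to order revisions by timestamp rather than by id.
A hypothetical sketch (the field names mirror the dump/API schema, but the
data is invented for illustration):

```python
# Sketch: pick the latest revision by timestamp, not by revision id,
# since ids are not guaranteed to increase after imports or history
# merges. MediaWiki timestamps are ISO 8601 in UTC, so they compare
# correctly as plain strings.
revisions = [
    {"revid": 699008434, "timestamp": "2016-01-09T01:30:00Z"},
    # an imported revision can carry a lower id but a later timestamp
    {"revid": 650000000, "timestamp": "2016-01-09T02:00:00Z"},
]

latest = max(revisions, key=lambda r: r["timestamp"])
print(latest["revid"])
```

Here the revision with the *lower* id is the most recent one, which is
exactly the case id-based ordering would get wrong.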


--
Bartosz Dziewoński


Re: Wikipedia dumps

MZMcBride-2
In reply to this post by Neil Harris
Bartosz Dziewoński wrote:
>On 2016-01-11 22:06, gnosygnu wrote:
>> So, to answer your question, the IDs never change. 14640471 will always
>> point to Mars, while 699008434 points to the 2016-01-09 revision for
>>Mars.
>
>While it's unlikely/rare, I think the page id can change when a page is
>deleted and re-created, and maybe some other cases. MediaWiki tries to
>keep it constant (for example, I think it's preserved after deletion and
>undeletion), but it's not always possible.

It looks like <https://phabricator.wikimedia.org/T28123> was just fixed,
so page IDs changing should now hopefully be rarer.

MZMcBride




Re: [Xmldatadumps-l] Wikipedia dumps

Gergo Tisza
In reply to this post by Bartosz Dziewoński
On Mon, Jan 11, 2016 at 10:37 PM, Bartosz Dziewoński <[hidden email]>
wrote:

> While it's unlikely/rare, I think the page id can change when a page is
> deleted and re-created, and maybe some other cases. MediaWiki tries to keep
> it constant (for example, I think it's preserved after deletion and
> undeletion), but it's not always possible.


The patch to preserve IDs over undeletion was merged today (so don't expect
IDs to be unchanging in older dumps). Also, pages can be split and joined
through partial undeletion of revisions, in which case it is hard to tell
what staying constant even means. You can swap the ID of any two pages, for
example, without any changes in their text or history, with the right
sequence of deletions, undeletions and page moves. Also, when a page is
moved, the ID and the title are disassociated (which is probably what you'd
want in that case).