Question about 2-phase dump


Question about 2-phase dump

vitalif
Hello!

While working on my improvements to MediaWiki Import&Export, I've
discovered a feature that is totally new to me: the 2-phase backup dump.
That is, the first-pass dumper creates an XML file without page texts,
and the second-pass dumper adds the page texts.

I have several questions about it: what is it intended for? Is it a
sort of optimisation for large databases, and why was this method of
optimisation chosen?

Also, does anyone use it? (Does Wikimedia use it?)



Re: Question about 2-phase dump

Mark
On 11/21/12 1:54 PM, [hidden email] wrote:

> While working on my improvements to MediaWiki Import&Export, I've
> discovered a feature that is totally new to me: the 2-phase backup dump.
> That is, the first-pass dumper creates an XML file without page texts,
> and the second-pass dumper adds the page texts.
>
> I have several questions about it: what is it intended for? Is it a
> sort of optimisation for large databases, and why was this method of
> optimisation chosen?
>
> Also, does anyone use it? (Does Wikimedia use it?)

I'm not sure if this is the reason it was created, but one useful
outcome is that Wikimedia can make the output of both passes available
at dumps.wikimedia.org. This can be useful for researchers (myself
included), because the metadata-only (pass 1) dump is sufficient for
doing some kinds of analyses, while being *much* smaller than the full dump.

-Mark



Re: Question about 2-phase dump

Brion Vibber
In reply to this post by vitalif
On Wed, Nov 21, 2012 at 4:54 AM, <[hidden email]> wrote:

> Hello!
>
> While working on my improvements to MediaWiki Import&Export, I've
> discovered a feature that is totally new to me: the 2-phase backup dump.
> That is, the first-pass dumper creates an XML file without page texts,
> and the second-pass dumper adds the page texts.
>
> I have several questions about it: what is it intended for? Is it a sort
> of optimisation for large databases, and why was this method of
> optimisation chosen?
>

While generating a full dump, we're holding the database connection
open... for a long, long time: hours, days, or weeks in the case of
English Wikipedia.

There are two issues with this:
* the DB server needs to maintain a consistent snapshot of the data as of
when we started the connection, so it's doing extra work to keep old data
around
* the DB connection needs to actually remain open; if the DB goes down or
the dump process crashes, whoops! You just lost all your work.

So, grabbing just the page and revision metadata lets us generate a file
with a consistent snapshot as quickly as possible. We get to let the
databases go, and the second pass can die and restart as many times as it
needs while fetching actual text, which is immutable (thus no worries about
consistency in the second pass).
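
In code terms, the second pass boils down to something like this. This is
a minimal sketch of the idea only, not the actual dumper code; the exact
stub placeholder format and fetchTextFromStorage() are stand-ins made up
for illustration.

<?php
// Sketch only: pass 1 has already written a stub XML where each revision
// carries an empty <text id="..."/> placeholder instead of its content.
// Pass 2 streams through the stub and fills the text in. Because stored
// text never changes once written, this loop can crash and be re-run
// without breaking the consistency of the snapshot taken in pass 1.

// Hypothetical helper: look the text row up in whatever text storage
// the wiki uses (database, external store, ...).
function fetchTextFromStorage( $textId ) {
    return "text for row $textId"; // placeholder
}

$stub = fopen( 'stub-dump.xml', 'r' );
$full = fopen( 'full-dump.xml', 'w' );

while ( ( $line = fgets( $stub ) ) !== false ) {
    if ( preg_match( '/<text id="(\d+)"[^>]*\/>/', $line, $m ) ) {
        $text = fetchTextFromStorage( (int)$m[1] );
        $line = '      <text xml:space="preserve">' .
            htmlspecialchars( $text, ENT_XML1 ) . "</text>\n";
    }
    fwrite( $full, $line );
}

fclose( $stub );
fclose( $full );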

We definitely use this system for Wikimedia's data dumps!

-- brion

Re: Question about 2-phase dump

vitalif
Brion Vibber wrote 2012-11-21 23:20:

> While generating a full dump, we're holding the database connection
> open... for a long, long time: hours, days, or weeks in the case of
> English Wikipedia.
>
> There are two issues with this:
> * the DB server needs to maintain a consistent snapshot of the data as
> of when we started the connection, so it's doing extra work to keep old
> data around
> * the DB connection needs to actually remain open; if the DB goes down
> or the dump process crashes, whoops! You just lost all your work.
>
> So, grabbing just the page and revision metadata lets us generate a
> file with a consistent snapshot as quickly as possible. We get to let
> the databases go, and the second pass can die and restart as many times
> as it needs while fetching actual text, which is immutable (thus no
> worries about consistency in the second pass).
>
> We definitely use this system for Wikimedia's data dumps!

Oh, thanks, now I understand!
But the revisions are also immutable: isn't it simpler just to select the
maximum revision ID at the beginning of the dump and discard newer page
and image revisions during dump generation?

Also, I have the same question about the 'spawn' feature of
backupTextPass.inc :) What is it intended for?


Re: Question about 2-phase dump

Brion Vibber
On Wed, Nov 21, 2012 at 12:31 PM, <[hidden email]> wrote:

> Oh, thanks, now I understand!
> But the revisions are also immutable: isn't it simpler just to select
> the maximum revision ID at the beginning of the dump and discard newer
> page and image revisions during dump generation?
>

Page history structure isn't quite immutable; revisions may be added or
deleted, pages may be renamed, and so on.



> Also, I have the same question about the 'spawn' feature of
> backupTextPass.inc :) What is it intended for?


Shelling out to an external process means that when that process dies due
to a dead database connection or similar, we can restart it cleanly.
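
The pattern is roughly like this. It's a sketch of the idea only, not the
real backupTextPass.inc code, and 'fetch-text.php' is a made-up name for
the helper script:

<?php
// Sketch only: the parent keeps the dump going while a child process
// does the DB-touching text fetches. If the child dies, we respawn it
// and retry, instead of losing the whole dump run.

function spawnFetcher() {
    $descriptors = [
        0 => [ 'pipe', 'r' ], // parent writes text row IDs here
        1 => [ 'pipe', 'w' ], // child writes the fetched text back
    ];
    $proc = proc_open( 'php fetch-text.php', $descriptors, $pipes );
    return [ $proc, $pipes ];
}

$textIds = [ 101, 102, 103 ]; // example text row IDs from the stub dump

list( $proc, $pipes ) = spawnFetcher();

foreach ( $textIds as $id ) {
    fwrite( $pipes[0], "$id\n" );
    $text = fgets( $pipes[1] );
    if ( $text === false ) {
        // Child died (dead DB connection, crash, ...): respawn and retry.
        proc_close( $proc );
        list( $proc, $pipes ) = spawnFetcher();
        fwrite( $pipes[0], "$id\n" );
        $text = fgets( $pipes[1] );
    }
    // ... write $text into the output XML here ...
}

proc_close( $proc );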

-- brion

Re: Question about 2-phase dump

Platonides
In reply to this post by vitalif
You may also be interested in the xmldatadumps mailing list.



Re: Question about 2-phase dump

vitalif
In reply to this post by Brion Vibber
> Page history structure isn't quite immutable; revisions may be added or
> deleted, pages may be renamed, and so on.
>
> Shelling out to an external process means that when that process dies
> due to a dead database connection or similar, we can restart it cleanly.

Brion, thanks for clarifying it.

Also, I want to ask you and other developers about the idea of packing
the export XML file, along with all exported uploads, into a ZIP archive
(instead of putting them into the XML as base64). What do you think
about it? We use it in our MediaWiki installations ("mediawiki4intranet")
and find it quite convenient. Actually, ZIP was Tim Starling's idea;
before ZIP we used rather strange "multipart/related" archives (I don't
know why we did that :))
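
Roughly, the layout is like this; a simplified sketch with made-up paths
rather than our exact format:

<?php
// Simplified sketch of the idea: the export XML goes into the archive
// as-is (without base64 <upload> bodies), and the uploaded files sit
// next to it as ordinary ZIP entries.

$zip = new ZipArchive();
$zip->open( 'export.zip', ZipArchive::CREATE | ZipArchive::OVERWRITE );

// The XML dump, with uploads referenced by name instead of embedded.
$zip->addFile( '/tmp/export/pages.xml', 'pages.xml' );

// The exported uploads themselves.
foreach ( glob( '/tmp/export/uploads/*' ) as $path ) {
    $zip->addFile( $path, 'uploads/' . basename( $path ) );
}

$zip->close();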

I'd like to finally get this change reviewed... What do you think
about it?

Other improvements include advanced page selection (based on
namespaces, categories, dates, imagelinks, templatelinks and pagelinks)
and an advanced import report (including some sort of "conflict
detection"). Should I split them into separate patches in Gerrit for
ease of review?

Also, do all the archiving methods (7z) really need to be built into
Export.php as dump filters (especially when using ZIP)? With simple XML
dumps you could just pipe the output to the compressor.

Or are they really needed to save temporary disk space during export?
I ask because my version of import/export does not build the archive
on the fly: it puts all the contents into a temporary directory and
then archives it as a whole. Is that an acceptable method?
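
To illustrate the piping alternative I mean above (a rough sketch only; I
haven't checked how the existing Export.php filters do it exactly):

<?php
// Rough sketch: stream the dump straight into an external compressor
// via a pipe, so no uncompressed temporary file is needed.
// ('7za a -si' tells 7-Zip to take the archive member from stdin.)

$pipe = popen( '7za a -si /tmp/dump.xml.7z > /dev/null', 'w' );

fwrite( $pipe, "<mediawiki>\n" );
// ... stream pages and revisions here as they are generated ...
fwrite( $pipe, "</mediawiki>\n" );

pclose( $pipe );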

--
With best regards,
   Vitaliy Filippov



Re: Question about 2-phase dump

Platonides
On 11/25/12 22:16, [hidden email] wrote:

> Also, I want to ask you and other developers about the idea of packing
> the export XML file, along with all exported uploads, into a ZIP archive
> (instead of putting them into the XML as base64). What do you think
> about it? We use it in our MediaWiki installations ("mediawiki4intranet")
> and find it quite convenient. Actually, ZIP was Tim Starling's idea;
> before ZIP we used rather strange "multipart/related" archives (I don't
> know why we did that :))
>
> I'd like to finally get this change reviewed... What do you think
> about it?

Looks like a better solution than base64 files. :)


> Other improvements include advanced page selection (based on namespaces,
> categories, dates, imagelinks, templatelinks and pagelinks) and an
> advanced import report (including some sort of "conflict detection").
> Should I split them into separate patches in Gerrit for ease of review?

I don't see a need to split e.g. templatelinks selection from pagelinks
selection. But if you provide a 64K patch, you may have a hard time
getting people to review it :)
I would probably generate a couple of patches, one with the selection
parameters and the other with the advanced report.
Depending on how big those changes are, YMMV.



> Also, do all the archiving methods (7z) really need to be built into
> Export.php as dump filters (especially when using ZIP)? With simple XML
> dumps you could just pipe the output to the compressor.
>
> Or are they really needed to save temporary disk space during export?
> I ask because my version of import/export does not build the archive
> on the fly: it puts all the contents into a temporary directory and
> then archives it as a whole. Is that an acceptable method?

Probably not the best method, but a suboptimal implementation that works
is better than no implementation at all. So go ahead and submit it; we
can be picky later, once the code is in front of us :)

Regards




_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l