Re: [Xmldatadumps-l] Suggested file format of new incremental dumps


Re: [Xmldatadumps-l] Suggested file format of new incremental dumps

Neil Harris
On 01/07/13 23:21, Nicolas Torzec wrote:

> Hi there,
>
> In principle, I understand the need for binary formats and compression in a context with limited resources.
> On the other hand, plain text formats are easy to work with, especially for third-party users and organizations.
>
> Playing devil's advocate, I could even argue that you should keep the data dumps in plain text, keep your processing dead simple, and let distributed processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the scale, computing diffs on the fly whenever needed.
>
> Reading the wiki page mentioned at the beginning of this thread, it is not clear to me what the requirements for this new incremental dump format are, or why they were chosen.
> Therefore, it is not easy to provide input and help.
>
>
> Cheers.
> - Nicolas Torzec.
>

+1

The simplest possible dump format is the best, and there's already a
thriving ecosystem around the current XML dumps, which would be broken
by moving to a binary format. Binary file formats and APIs defined by
code are not the way to go if you want long-term archival that can
endure through decades of technological change.

If more money is needed for dump processing, it should be budgeted for
and added to the IT budget, instead of over-optimizing by using a
potentially fragile, and therefore risky, binary format.

Archival in a stable format is not a luxury or an optional extra; it's a
core part of the Foundation's mission. The value is in the data, which
is priceless. Computers and storage are (relatively) cheap by
comparison, and Wikipedia is growing significantly more slowly than the
year-on-year improvements in storage, processing and communication
links.  Moreover, re-making the dumps from scratch every time provides
defence in depth against subtle database corruption slowly creeping
into the dumps.

Please keep the dumps themselves simple and their format stable, and, as
Nicolas says, do the clever stuff elsewhere, where you can use
whatever efficient representation you like for the processing.
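As a toy illustration of Nicolas's "diff downstream" suggestion, here is
a minimal sketch, assuming the current bz2-compressed XML dump layout,
that computes per-page diffs between two dumps with only the Python
standard library. Holding one dump's pages in a dict is a toy shortcut;
a real deployment would shard this across Hadoop or Spark:

import bz2
import difflib
import xml.etree.ElementTree as ET

def page_texts(path):
    """Yield (title, revision text) for each page in a bz2 XML dump."""
    with bz2.open(path, "rb") as stream:
        for _, elem in ET.iterparse(stream):
            if elem.tag.split("}")[-1] == "page":
                title = elem.findtext("{*}title") or ""
                text = elem.findtext("{*}revision/{*}text") or ""
                yield title, text
                elem.clear()  # keep memory bounded while streaming

def dump_diffs(old_path, new_path):
    """Yield (title, unified diff) for pages that changed between dumps."""
    old = dict(page_texts(old_path))  # toy: real dumps need sharding
    for title, new_text in page_texts(new_path):
        old_text = old.get(title, "")
        if new_text != old_text:
            diff = difflib.unified_diff(
                old_text.splitlines(), new_text.splitlines(),
                fromfile="old/" + title, tofile="new/" + title,
                lineterm="")
            yield title, "\n".join(diff)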

Neil



Re: [Xmldatadumps-l] Suggested file format of new incremental dumps

Ariel Glenn WMF
On Tue 02-07-2013 at 11:47 +0100, Neil Harris wrote:


> The simplest possible dump format is the best, and there's already a
> thriving ecosystem around the current XML dumps, which would be broken
> by moving to a binary format. Binary file formats and APIs defined by
> code are not the way to go if you want long-term archival that can
> endure through decades of technological change.
>
> If more money is needed for dump processing, it should be budgeted for
> and added to the IT budget, instead of over-optimizing by using a
> potentially fragile, and therefore risky, binary format.
>
> Archival in a stable format is not a luxury or an optional extra; it's a
> core part of the Foundation's mission. The value is in the data, which
> is priceless. Computers and storage are (relatively) cheap by
> comparison, and Wikipedia is growing significantly more slowly than the
> year-on-year improvements in storage, processing and communication
> links.  Moreover, re-making the dumps from scratch every time provides
> defence in depth against subtle database corruption slowly creeping
> into the dumps.

A point of information: we already do not produce the dumps from scratch
every time; we re-use old revisions, because if we did not, generating
the en wikipedia dumps would take months and months, which is clearly
untenable.

The question now is how we are going to use those old revisions.  Right
now we uncompress the entire previous dump, write new information where
needed, and recompress it all (which would take several weeks for en
wikipedia history dumps if we didn't run 27 jobs at once).
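To make that cost concrete, here is a deliberately naive sketch of
today's scheme; the regex-based page splicing and the whole-dump read
are toy assumptions, but they show why the work is proportional to the
size of the entire dump rather than to the number of edits:

import bz2
import re

PAGE_RE = re.compile(r"<page>.*?</page>", re.S)
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def rebuild_full(prev_path, out_path, updated_pages):
    """updated_pages maps title -> replacement <page>...</page> XML."""
    with bz2.open(prev_path, "rt", encoding="utf-8") as prev:
        old_xml = prev.read()               # decompress the ENTIRE old dump

    def splice(match):
        m = TITLE_RE.search(match.group(0))
        title = m.group(1) if m else ""
        return updated_pages.get(title, match.group(0))

    new_xml = PAGE_RE.sub(splice, old_xml)  # touch every page...
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        out.write(new_xml)                  # ...and recompress everything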

What I hope for is a format that allows dumps to be produced much more
rapidly, where the time to produce the incrementals grows only as the
number of edits per time frame grows, and where the time to produce new
fulls from the incrementals is much more tightly bounded than it is
now.
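In other words, applying an incremental should cost time proportional
to the edits it carries, not to the size of the full. A minimal sketch
of that property, where the dict-as-dump and the (op, page_id, payload)
records are purely illustrative assumptions, not a proposed format:

def apply_incremental(full, incremental):
    """full: page_id -> page data; incremental: iterable of records."""
    for op, page_id, payload in incremental:   # O(edits), not O(pages)
        if op == "upsert":
            full[page_id] = payload
        elif op == "delete":
            full.pop(page_id, None)
    return full

snapshot = {1: "First page", 2: "Second page"}
incr = [("upsert", 2, "Second page, revised"), ("delete", 1, None)]
assert apply_incremental(snapshot, incr) == {2: "Second page, revised"}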

And I expect that we would have a library or scripts that provide for
conversion of a new-format dump to the good old XML, so that all the
tools folks use now will continue to work.
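Such a converter could be quite small. A hedged sketch, with an
invented (page_id, title, text) record layout standing in for the
yet-to-be-designed format:

from xml.sax.saxutils import escape

def records_to_xml(records):
    """Render (page_id, title, text) records as dump-style XML lines."""
    yield "<mediawiki>"
    for page_id, title, text in records:
        yield ("  <page><id>%d</id><title>%s</title>"
               "<revision><text>%s</text></revision></page>"
               % (page_id, escape(title), escape(text)))
    yield "</mediawiki>"

print("\n".join(records_to_xml([(1, "Example", "Hello & goodbye")])))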

Ariel
 
>
> Please keep the dumps themselves simple and their format stable, and, as
> Nicolas says, do the clever stuff elsewhere, where you can use
> whatever efficient representation you like for the processing.
>
> Neil





Re: [Xmldatadumps-l] Suggested file format of new incremental dumps

Ariel Glenn WMF
On Sun 07-07-2013 at 21:09 -0700, Randall Farmer wrote:

> Sorry, reading back over this thread late.
>
>
> > What I hope for is a format that allows dumps to be produced much
> > more rapidly, where the time to produce the incrementals grows only
> > as the number of edits per time frame grows
>
>
> Curious: what's happening currently that makes the time to produce
> incrementals grow more quickly than that?
>

We don't produce true incrementals now; we produce 'adds/changes' dumps,
which don't account for deletions, oversights, page moves, etc.  And you
can't add them onto a full to get a new full.  When true incrementals
are produced, I want them to behave as described above.
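One way to picture the gap: a true incremental needs a record type for
every event that can change a full, not just additions. The names below
are hypothetical, purely to illustrate what today's adds/changes dumps
are missing:

from dataclasses import dataclass

@dataclass
class AddRevision:        # covered by today's adds/changes dumps
    page_id: int
    rev_id: int
    text: str

@dataclass
class DeletePage:         # not covered today
    page_id: int

@dataclass
class SuppressRevision:   # not covered today (oversight)
    rev_id: int

@dataclass
class MovePage:           # not covered today
    page_id: int
    new_title: str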

Ariel



