Distributed bzip2


Distributed bzip2

Brion Vibber

Among other things, I've been working on a distributed bzip2 compression tool
which could help speed up generation of data dumps.

By trading LAN bandwidth for idle CPU elsewhere in the server cluster, an
order-of-magnitude improvement in throughput seems reasonably practical; this
could cut bzip2 compression time for the large English Wikipedia history dumps
by a full day.
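The trick this exploits can be sketched locally: bzip2 compresses in independent ~900 KB blocks, and a concatenation of complete bzip2 streams is itself decompressible by ordinary bunzip2. A minimal Python sketch of that idea (threads stand in for the remote workers a distributed version would farm blocks out to; this is an illustration, not dbzip2's actual code):

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

BLOCK = 900_000  # bzip2's native block size at the default -9 setting

def parallel_bzip2(data: bytes, workers: int = 4) -> bytes:
    # Each block becomes a complete, self-contained bzip2 stream.
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    # CPython's bz2 releases the GIL while compressing, so threads get
    # real parallelism here; a distributed version would instead ship
    # blocks to idle CPUs elsewhere on the LAN.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(bz2.compress, blocks))
    # Concatenated bzip2 streams are still a valid .bz2 file.
    return b"".join(compressed)
```

Standard bunzip2 (and Python's own bz2.decompress) handles the multi-stream result transparently.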

Status/documentation:
http://www.mediawiki.org/wiki/dbzip2

Source:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/dbzip2

Updates on my (*blush*) development blog:
http://leuksman.com/


I'm hoping something similar can be accomplished with 7zip as well...

-- brion vibber (brion @ pobox.com)
_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Re: Distributed bzip2

Evan Martin
On 5/31/06, Brion Vibber <[hidden email]> wrote:
> Among other things, I've been working on a distributed bzip2 compression tool
> which could help speed up generation of data dumps.

Alternatively, have you considered generating deltas?  (Sorry if this
has been brought up before...)

It seems to me there are two main consumption cases of the wikipedia data:
 - one-off copies ("most recent" doesn't really matter)
 - mirrors (will want to continually update)
If you did a full snapshot once a month, and then daily/weekly deltas
on top of that, you could maybe save yourself both processing time and
external bandwidth.
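The shape of that scheme is easy to sketch: a delta is just copy-ranges into the old snapshot plus literal runs of new material. A toy in-memory version (`make_delta`/`apply_delta` are hypothetical helpers for illustration; a real implementation over dumps this size would need a streaming delta tool):

```python
import difflib

def make_delta(old: str, new: str) -> list:
    # Encode `new` as copy-ranges from `old` plus literal inserts.
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # already in the snapshot
        else:
            ops.append(("data", new[j1:j2]))  # genuinely new material
    return ops

def apply_delta(old: str, ops: list) -> str:
    # Rebuild the newer dump from the full snapshot plus the delta.
    out = []
    for op in ops:
        if op[0] == "copy":
            out.append(old[op[1]:op[2]])
        else:
            out.append(op[1])
    return "".join(out)
```

Mirrors would fetch only the (small) ops list; one-off downloaders take the monthly snapshot and stop there.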
Re: Distributed bzip2

Brion Vibber
Evan Martin wrote:
> On 5/31/06, Brion Vibber <[hidden email]> wrote:
>> Among other things, I've been working on a distributed bzip2 compression tool
>> which could help speed up generation of data dumps.
>
> Alternatively, have you considered generating deltas?  (Sorry if this
> has been brought up before...)

Many times, but that's not necessarily clear or simple. The generic
delta-generation tools we've tried in the past just choke on our files; note
that the full-history dump of English Wikipedia -- the one we're most concerned
about having archival copies of available -- is over 350 gigabytes uncompressed.

(Clean XML-wrapped text with no scary internal compression or diffing, and a
well-known standard compression format on the outside, is simple and relatively
future-proof for third-party textual analysis, reuse, and long-term archiving.)

Something application-specific might be possible.
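One application-specific shape (purely illustrative -- the field names and the scheme itself are assumptions, not anything settled in this thread): page histories are append-mostly, so a delta could simply carry every revision added since the last full snapshot, with no generic binary diffing at all.

```python
def incremental_dump(revisions, since):
    # "Delta" = revisions newer than the last full snapshot. Mirrors
    # replay these on top of the monthly dump; one-off downloaders
    # just take the snapshot and ignore the deltas.
    # `revisions` is assumed to be dicts carrying ISO-8601 'timestamp'
    # strings, which compare correctly as plain strings.
    return [r for r in revisions if r["timestamp"] > since]
```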

> It seems to me there are two main consumption cases of the wikipedia data:
>  - one-off copies ("most recent" doesn't really matter)
>  - mirrors (will want to continually update)
> If you did a full snapshot once a month, and then daily/weekly deltas
> on top of that, you could maybe save yourself both processing time and
> external bandwidth.

Even if I only did full snapshots a quarter as often, I'd still want them to
take two days instead of ten. :)

-- brion vibber (brion @ pobox.com)


Re: Distributed bzip2

Jay Ashworth
On Wed, May 31, 2006 at 10:08:12PM -0700, Brion Vibber wrote:
> Even if I only did full snapshots a quarter as often, I'd still want them to
> take two days instead of ten. :)

Yeah, I was a little queasy when I heard you were going to *shorten*
them by a day; that's like hearing they're going to give you $5000 off
the price of the car -- what does the car *cost*??

Cheers,
-- jra
--
Jay R. Ashworth                                                [hidden email]
Designer                          Baylink                             RFC 2100
Ashworth & Associates        The Things I Think                        '87 e24
St Petersburg FL USA      http://baylink.pitas.com             +1 727 647 1274

     A: Because it messes up the order in which people normally read text.
     Q: Why is top-posting such a bad thing?
     
     A: Top-posting.
     Q: What is the most annoying thing on Usenet and in e-mail?
Re: Distributed bzip2

mdd4696
In reply to this post by Brion Vibber
> Many times, but that's not necessarily clear or simple. The generic
> delta-generation tools we've tried in the past just choke on our files; note
> that the full-history dump of English Wikipedia -- the one we're most concerned
> about having archival copies of available -- is over 350 gigabytes uncompressed.

Actually, I just downloaded enwiki-20060518-pages-meta-history.xml.7z
and when I did a "7za l enwiki-20060518-pages-meta-history.xml.7z" it
said that the archive would be 692686106434 bytes (645 GB)
uncompressed. Is that inaccurate?

~MDD4696
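(The byte count and the gigabyte figure are at least internally consistent -- 7-Zip lists sizes in bytes, and the conversion to binary gigabytes matches:)

```python
size = 692_686_106_434         # uncompressed size reported by `7za l`
print(round(size / 2**30, 1))  # 645.1 -- the "645 GB" in the message
```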
Re: Distributed bzip2

Brion Vibber
[hidden email] wrote:
> Actually, I just downloaded enwiki-20060518-pages-meta-history.xml.7z
> and when I did a "7za l enwiki-20060518-pages-meta-history.xml.7z" it
> said that the archive would be 692686106434 bytes (645 GB)
> uncompressed. Is that inaccurate?

Hopefully. :)

-- brion vibber (brion @ pobox.com)

