Data dumps


Brion Vibber
I'm doing a test run of the new data dump script on our Korean cluster;
currently jawiki (ja.wikipedia.org) is in progress:
http://amaryllis.yaseo.wikimedia.org/backup/jawiki/20060118/

Any comments on the page layout and information included in the progress page?

A couple of notes:

* The file naming has been changed so that the files include the database name and date.
This should make it easier to figure out what the hell you just downloaded.

* The directory structure is different; database names are used instead of the weird
mix of sites, languages, and database names, which made it hard to get the scripts to
run reliably. Each database has a subdirectory for each day it was dumped, plus a
'latest' subdirectory with symbolic links to the files from the last completed dump
(see the first sketch after this list).

* I renamed 'pages_current' and 'pages_full' to 'pages-meta-current' and
'pages-meta-history'. In addition to the big explanatory labels, this should
emphasize that these dumps contain metapages such as discussion and user pages,
distinguishing them from the pages-articles dump.

* I've discontinued 7-Zip compression for the current-versions dumps, since it doesn't
do better than bzip2 for those. 7-Zip archives are still generated for the history
dump, where the format compresses significantly better (about 3 GB vs 11 GB for
enwiki); see the second sketch after this list.

* Upload tarballs are still not included at the moment.

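Roughly, the dated-directories-plus-'latest'-symlinks layout looks something like the
following (an illustrative sketch, not the actual module code; the helper name and the
'latest' link naming are assumptions):

import os

def publish_dump(base_dir, db_name, date, filenames):
    # Hypothetical helper: lay files out under <base>/<db>/<date>/ and
    # refresh <base>/<db>/latest/ with symlinks into the dated directory.
    dump_dir = os.path.join(base_dir, db_name, date)
    latest_dir = os.path.join(base_dir, db_name, "latest")
    for d in (dump_dir, latest_dir):
        if not os.path.isdir(d):
            os.makedirs(d)
    for name in filenames:
        # File names carry the database and date, e.g.
        # jawiki-20060118-pages-articles.xml.bz2
        link = os.path.join(latest_dir, name.replace(date, "latest"))
        if os.path.lexists(link):
            os.remove(link)
        # Relative link, so the whole tree can be mirrored as-is
        os.symlink(os.path.join("..", date, name), link)

# e.g. publish_dump('/backup', 'jawiki', '20060118',
#                   ['jawiki-20060118-pages-articles.xml.bz2'])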

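And the compression decision, schematically; both tools are just shelled out to, and
the function names here are made up for illustration:

import subprocess

def bzip2_compress(path):
    # bzip2 replaces <path> with <path>.bz2 in place
    subprocess.check_call(["bzip2", "-f", path])
    return path + ".bz2"

def sevenzip_compress(path):
    # 7za writes <path>.7z alongside the input
    subprocess.check_call(["7za", "a", path + ".7z", path])
    return path + ".7z"

# The extra 7-Zip pass is only worth it for pages-meta-history (roughly
# 3 GB vs 11 GB of bzip2 for enwiki); the current-versions dumps come out
# about the same size either way.
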
The backup runner script is written in Python, and is in our CVS in the 'backup'
module, should anyone feel like laughing at my code.

A few more things need to be fixed up before I start running it on the main
cluster, but it's pretty close! (A list of databases in progress, some locking,
emailing me on error, and finding the prior XML dump to speed dump generation.)
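
On that last item, finding the prior XML dump could be as simple as something like this
(an illustrative sketch built on the directory layout above, not the real prefetch code;
the eight-digit date check is an assumption):

import os

def find_prior_dump(base_dir, db_name, current_date, suffix):
    # Hypothetical: scan <base>/<db>/ for dated runs older than the one in
    # progress and return the newest matching file, or None if there isn't one.
    db_dir = os.path.join(base_dir, db_name)
    dates = sorted(d for d in os.listdir(db_dir)
                   if len(d) == 8 and d.isdigit() and d < current_date)
    for date in reversed(dates):
        candidate = os.path.join(db_dir, date,
                                 "%s-%s-%s" % (db_name, date, suffix))
        if os.path.exists(candidate):
            return candidate
    return None

# e.g. find_prior_dump('/backup', 'jawiki', '20060118',
#                      'pages-meta-history.xml.bz2')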

-- brion vibber (brion @ pobox.com)



Re: Data dumps

Emilio Gonzalez

Brion Vibber wrote:

> I'm doing a test run of the new data dump script on our Korean cluster;
> currently jawiki (ja.wikipedia.org) is in progress:
> http://amaryllis.yaseo.wikimedia.org/backup/jawiki/20060118/
> [...]

Just a comment: why not add the user_groups table to the public dump? It could be
useful for statistical purposes, and I can't see any risk in publishing this
information.

Regards,

Emilio Gonzalez

Re: Re: Data dumps

Brion Vibber
Emilio Gonzalez wrote:
> Just a comment. Why not adding user_groups table in the public dump? It
> can be useful for some statistical purposes and I can't see any risk in
> publishing this information.

Ummm, because I forgot? :)

I'll switch it.

(The information it contains is already available through Special:Listusers, so it's
not terribly secret.)
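
Schematically it's just one more entry in the list of tables dumped as SQL; an
illustrative sketch assuming mysqldump is used, not the actual backup module
configuration:

import subprocess

# Illustrative list of tables dumped as SQL; not the real configuration.
PUBLIC_SQL_TABLES = [
    "site_stats",
    "interwiki",
    "pagelinks",
    "categorylinks",
    "imagelinks",
    "image",
    "user_groups",  # group membership, already visible via Special:Listusers
]

def dump_table(db_name, table, out_path):
    # Pipe mysqldump through gzip into <out_path>
    with open(out_path, "wb") as out:
        mysqldump = subprocess.Popen(["mysqldump", "--opt", db_name, table],
                                     stdout=subprocess.PIPE)
        subprocess.check_call(["gzip", "-c"], stdin=mysqldump.stdout,
                              stdout=out)
        mysqldump.stdout.close()
        if mysqldump.wait() != 0:
            raise RuntimeError("mysqldump failed for %s.%s" % (db_name, table))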

-- brion vibber (brion @ pobox.com)

