dataset1, xml dumps


dataset1, xml dumps

Ariel Glenn WMF
For folks who have not been following the saga on
http://wikitech.wikimedia.org/view/Dataset1
we were able to get the raid array back in service last night on the XML
data dumps server, and we are now busily copying data off of it to
another host.  There's about 11T of dumps to copy over; once that's done
we will start serving these dumps read-only to the public again.
Because the state of the server hardware is still uncertain, we don't
want to do anything that might put the data at risk until that copy has
been made.

The replacement server is on order and we are watching that closely.

We have also been working on deploying a server to run one round of
dumps in the interim.

Thanks for your patience (which is a way of saying, I know you are all
out of patience, as am I, but hang on just a little longer).

Ariel
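
For reference, a minimal sketch of the kind of resumable bulk copy
described above. The tool and the paths are assumptions; the thread does
not say how the 11T was actually moved:

    # copy the dump tree to another host, preserving attributes and keeping
    # partial files so an interrupted transfer can resume where it left off
    rsync -a --partial --progress /data/xmldatadumps/ backuphost:/data/xmldatadumps/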




Re: dataset1, xml dumps

Brion Vibber
Great news! Thanks for the update and thanks for all you guys' work getting
it beaten back into shape. Keeping fingers crossed for all going well on the
transfer...

-- brion
On Dec 14, 2010 1:12 AM, "Ariel T. Glenn" <[hidden email]> wrote:


Re: dataset1, xml dumps

Diederik van Liere
+1
Diederik

On 2010-12-14, at 12:02, Brion Vibber <[hidden email]> wrote:


Re: dataset1, xml dumps

Emilio J. Rodríguez-Posada
In reply to this post by Ariel Glenn WMF
Thanks.

Double good news:
http://lists.wikimedia.org/pipermail/foundation-l/2010-December/063088.html

2010/12/14 Ariel T. Glenn <[hidden email]>


Re: dataset1, xml dumps

Ariel Glenn WMF
In reply to this post by Ariel Glenn WMF
We now have a copy of the dumps on a backup host.  Although we are still
resolving hardware issues on the XML dumps server, we think it is safe
enough to serve the existing dumps read-only.  DNS was updated to that
effect already; people should see the dumps within the hour.  

Ariel




Re: dataset1, xml dumps

masti-2
Good news, but from a professional point of view, keeping them on just
one array will keep leading to outages like this.
Any plans for a tape backup or a mirror?

masti

On 12/15/2010 08:57 PM, Ariel T. Glenn wrote:


Re: dataset1, xml dumps

Ariel Glenn WMF
The files have now been copied off the server onto a backup host, which
is the only reason we feel safe serving them again.

We will be getting a new host (it is due to be shipped soon) which will
hold the live data; the current server will then keep a backup copy.
That is the short-term answer to your question. In the longer term we
expect to have a redundant copy elsewhere and to stop relying on
dataset1 altogether.

We are interested in other mirrors of the dumps; see

http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

Ariel
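
For anyone considering a mirror, a minimal sketch of fetching one
project's dumps with wget. The project directory is an example, and a
production mirror would need more careful flag choices:

    # recursively fetch one project's dump directory, without ascending
    # to the parent directory on the server
    wget --mirror --no-parent http://download.wikimedia.org/fowiki/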

On Wed 15-12-2010 at 21:16 +0100, masti wrote:


Re: [Xmldatadumps-l] dataset1, xml dumps

Emilio J. Rodríguez-Posada
In reply to this post by Ariel Glenn WMF
Good work.

2010/12/15 Ariel T. Glenn <[hidden email]>


Re: dataset1, xml dumps

Anthony-73
In reply to this post by Ariel Glenn WMF
On Wed, Dec 15, 2010 at 3:30 PM, Ariel T. Glenn <[hidden email]> wrote:
> We are interested in other mirrors of the dumps; see
>
> http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

On the talk page, it says "torrents are useful to save bandwidth,
which is not our problem".  If bandwidth is not the problem, then what
*is* the problem?

If the problem is just to get someone to store the data on hard
drives, then it's a much easier problem than actually *hosting* that
data.


Re: dataset1, xml dumps

Ariel Glenn WMF
On Wed 15-12-2010 at 15:57 -0500, Anthony wrote:

We certainly want people to host it as well.  It's not a matter of
bandwidth but of protection: if someone can't get to our copy for
whatever reason, another copy is accessible.  

Ariel





Re: dataset1, xml dumps

Bryan Tong Minh
On Wed, Dec 15, 2010 at 10:03 PM, Ariel T. Glenn <[hidden email]> wrote:
Is there a copy in Amsterdam? That seems like the most obvious place to
put a backup, as WMF already has a lot of servers there.


Re: dataset1, xml dumps

Ariel Glenn WMF
On Wed 15-12-2010 at 22:50 +0100, Bryan Tong Minh wrote:


We want people besides us to host it. We also expect to put a copy at
the new data center, at least.

Ariel




Re: dataset1, xml dumps

Lars Aronsson
In reply to this post by Ariel Glenn WMF
On 12/15/2010 09:30 PM, Ariel T. Glenn wrote:
> We are interested in other mirrors of the dumps; see
>
> http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

Just as a small-scale experiment, I tried to mirror the
Faroese (fowiki) and Sami (sewiki) language projects.
But "wget -m" says that timestamps are turned off,
so it keeps downloading the same files again. Is this
an error on my side or on the server side?

This happens for some files, but not for all.
Here is one example:

--2010-12-15 23:59:54--  http://download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2
Reusing existing connection to download.wikimedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 95974 (94K) [application/octet-stream]
Last-modified header missing -- time-stamps turned off.
--2010-12-15 23:59:54--  http://download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2
Reusing existing connection to download.wikimedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 95974 (94K) [application/octet-stream]
Saving to: `download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2'

100%[======================================>] 95,974       156K/s   in 0.6s



--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se
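
For reference, one way to check whether the server sends a
Last-modified header at all, which is what wget's timestamping (implied
by -m) relies on. A sketch using the same URL as above:

    # -S prints the server's response headers; --spider skips the download
    wget -S --spider \
      http://download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2 \
      2>&1 | grep -i 'Last-Modified'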




Re: [Xmldatadumps-l] dataset1, xml dumps

Felipe Ortega
In reply to this post by Ariel Glenn WMF
Yeah, great work Ariel. Thanks a lot for the effort.

Best,
F.

--- On Wed, 15/12/10, Ariel T. Glenn <[hidden email]> wrote:


Re: dataset1, xml dumps

yegg
In reply to this post by Ariel Glenn WMF
Ariel T. Glenn <ariel <at> wikimedia.org> writes:


Hi, thank you for working so hard on this issue, but I'm still having
trouble with the latest en.wikipedia dump. I downloaded
http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2
and am running into trouble decompressing it.

In particular, bzip2 -d enwiki-20101011-pages-articles.xml.bz2 fails.

And bzip2 -tvv enwiki-20101011-pages-articles.xml.bz2 reports:

    [2752: huff+mtf data integrity (CRC) error in data

I ran bzip2recover & then bzip2 -t rec* and got the following:

bzip2: rec02752enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC) error in data
bzip2: rec08881enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC) error in data
bzip2: rec26198enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC) error in data
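
For reference, the test-and-recover sequence reported above, collected
into one sketch (the filename is the one from this thread):

    bzip2 -tvv enwiki-20101011-pages-articles.xml.bz2    # test integrity block by block
    bzip2recover enwiki-20101011-pages-articles.xml.bz2  # split into per-block rec*.bz2 files
    bzip2 -t rec*enwiki-20101011-pages-articles.xml.bz2  # test each recovered block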




Re: dataset1, xml dumps

Emilio J. Rodríguez-Posada
Have you checked the md5sum?

2010/12/16 Gabriel Weinberg <[hidden email]>


Re: dataset1, xml dumps

yegg
The md5sum doesn't match: I get e74170eaaedc65e02249e1a54b1087cb, as
opposed to the 7a4805475bba1599933b3acd5150bd4d listed at
http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt.

I've downloaded it twice now and have gotten the same md5sum. Can anyone
else confirm?
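
For reference, a minimal way to run that check against the published
checksum list (URL and filename as given in this thread):

    # fetch the published checksums and compare against the local file;
    # the two hashes printed should be identical
    wget -q http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt
    grep pages-articles.xml.bz2 enwiki-20101011-md5sums.txt
    md5sum enwiki-20101011-pages-articles.xml.bz2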

On Thu, Dec 16, 2010 at 5:41 PM, emijrp <[hidden email]> wrote:


Re: dataset1, xml dumps

Emilio J. Rodríguez-Posada
If the md5sums don't match, the files are obviously different; one of
them is corrupt.

What is the size of your local file? I usually download dumps with the
wget UNIX command and I don't get errors. If you are using FAT32, the
file size is limited to 4 GB and larger files are truncated. Is that
your case?
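
A quick sketch of those checks: compare the local file size against the
published one, and resume rather than restart a truncated download (URL
from this thread; wget -c continues a partial file):

    ls -l enwiki-20101011-pages-articles.xml.bz2   # local size in bytes
    wget -c http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2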

2010/12/16 Gabriel Weinberg <[hidden email]>


Re: dataset1, xml dumps

yegg
I've been downloading this file (using wget on Ubuntu or fetch on FreeBSD)
with no issues for years. The current one is 6.2 GB, as it should be.

On Thu, Dec 16, 2010 at 5:53 PM, emijrp <[hidden email]> wrote:


Re: dataset1, xml dumps

Ariel Glenn WMF
In reply to this post by yegg
I was able to decompress a copy of the file on another host (taken from
the same location) without problems. On the download host itself I get
the correct md5sum: 7a4805475bba1599933b3acd5150bd4d.

Ariel
