Massive image loss

Re: Massive image loss

Gregory Maxwell
On Sat, Sep 6, 2008 at 12:26 PM, Huji <[hidden email]> wrote:
> An "out of the blue" idea that I haven't checked: Are those pages stored in
> archive.org? Because if yes, then a copy of the image may also be there.

I think what you'll find is that most mirrors and copies do not have
the full resolution image.   I think we had thumbs for all of the
remainders already.

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Massive image loss

Huji Lee
On 9/6/08, Gregory Maxwell <[hidden email]> wrote:

>
> On Sat, Sep 6, 2008 at 12:26 PM, Huji <[hidden email]> wrote:
> > An "out of the blue" idea that I haven't checked: Are those pages stored in
> > archive.org? Because if yes, then a copy of the image may also be there.
>
>
> I think what you'll find is that most mirrors and copies do not have
> the full resolution image.   I think we had thumbs for all of the
> remainders already.
>

Well, I was hoping otherwise. I hoped that crawlers like archive.org might
store not only the image page but also the full-resolution image (which is
linked from the image page). I tested some examples, and it seems they don't
even store the image pages!

Huji

Re: Massive image loss

MinuteElectron
Huji wrote:
> I tested some examples, and it seems they don't even store the image pages!

The Wayback Machine does not release archived data until six months
after it is captured.

MinuteElectron.



Re: Massive image loss

Jay Ashworth-2
On Mon, Sep 08, 2008 at 04:10:48PM +0100, MinuteElectron wrote:
> Huji wrote:
> > I tested some examples, and it seems they don't even store the image pages!
>
> The Wayback Machine does not release archived data until six months
> after it is captured.

And, annoyingly enough, they also block access to data if a current
robots.txt says to... even if the domain has changed hands.  That makes
little sense to me, but what can you do; they have a staff of, what, 6?

Cheers,
-- jra
--
Jay R. Ashworth                   Baylink                      [hidden email]
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com                     '87 e24
St Petersburg FL USA      http://photo.imageinc.us             +1 727 647 1274

             Those who cast the vote decide nothing.
             Those who count the vote decide everything.
               -- (Josef Stalin)


Re: Massive image loss

Brevam
In reply to this post by Tim Starling-2
Tim Starling wrote:
> A list of missing images can be found here:
>
> http://noc.wikimedia.org/~tstarling/missing-images-2008-09
>

Some images on ja.wikipedia seem to be lost too.
Is it due to the same accident?

/wikipedia/ja/0/01/Shinkiryu_station.jpg
/wikipedia/ja/0/02/Totsuka_station_nishiguchi.jpg
/wikipedia/ja/0/04/Tsurugamineeki.jpg
... and at least a hundred more missing.





Re: Massive image loss

Ilmari Karonen
Brevam wrote:

> Tim Starling wrote:
>> A list of missing images can be found here:
>>
>> http://noc.wikimedia.org/~tstarling/missing-images-2008-09
>
> Some images on ja.wikipedia seem to be lost too.
> Is it due to the same accident?
>
> /wikipedia/ja/0/01/Shinkiryu_station.jpg
> /wikipedia/ja/0/02/Totsuka_station_nishiguchi.jpg
> /wikipedia/ja/0/04/Tsurugamineeki.jpg
> ... and at least a hundred more missing.

That's an... interesting failure mode.  That makes three different types
of breakage I've seen so far: missing files, empty files and now
directory entries where there should be files.  Are these really all
from the same bug?

--
Ilmari Karonen


Re: Massive image loss

Tim Starling-2
Ilmari Karonen wrote:

> Brevam wrote:
>> Tim Starling wrote:
>>> A list of missing images can be found here:
>>>
>>> http://noc.wikimedia.org/~tstarling/missing-images-2008-09
>> Some images on ja.wikipedia seem to be lost too.
>> Is it due to the same accident?
>>
>> /wikipedia/ja/0/01/Shinkiryu_station.jpg
>> /wikipedia/ja/0/02/Totsuka_station_nishiguchi.jpg
>> /wikipedia/ja/0/04/Tsurugamineeki.jpg
>> ... and at least a hundred more missing.
>
> That's an... interesting failure mode.  That makes three different types
> of breakage I've seen so far: missing files, empty files and now
> directory entries where there should be files.  Are these really all
> from the same bug?

Yes. The bug itself deleted the file and put a directory entry in its
place. I wrote a shell script to remove the directory entry and then do a
wget to fetch the file from the squid cache. Wget created a zero-length
file for all the cache misses. Some of those files were subsequently deleted.

-- Tim Starling
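
[Editor's sketch] The recovery step described above (remove the stray directory entry, then fetch the file from the cache) left zero-length files behind on cache misses. The following is a minimal Python sketch of one way to avoid that. It is not the script that was actually run. The function names, the `fetch` parameter, and the URL in the usage example are assumptions made for illustration. The idea is to fetch first and touch the filesystem only when data actually came back, so a cache miss leaves the directory entry in place as a marker.

```python
import os
import tempfile
import urllib.request


def fetch_bytes(url, timeout=30):
    """Return the response body, or None on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None


def restore_file(path, url, fetch=fetch_bytes):
    """Replace a stray directory at `path` with the file fetched from `url`.

    Returns True if a non-empty file was written, False on a cache miss.
    On a miss nothing is changed, so no zero-length file is created and
    the stray directory stays in place for a later scan to find.
    """
    data = fetch(url)
    if not data:
        return False
    if os.path.isdir(path):
        os.rmdir(path)  # raises OSError if the directory is not empty
    # Write to a temporary file in the same directory, then rename it
    # into place, so a partial write is never visible under `path`.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.replace(tmp, path)
    return True
```

Usage would look like `restore_file("/mnt/upload/wikipedia/ja/0/01/Example.jpg", "http://cache.example/wikipedia/ja/0/01/Example.jpg")`, where both the path and the URL are hypothetical. Note that `tempfile.mkstemp` creates the file readable only by its owner, so a real script would probably also need to set permissions.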




Re: Massive image loss

Gregory Maxwell
On Sat, Sep 20, 2008 at 9:38 PM, Tim Starling <[hidden email]> wrote:
> Yes. The bug itself deleted the file and put a directory entry in its
> place. I wrote a shell script to remove the directory entry and then do a
> wget to fetch the file from the squid cache. Wget created a zero-length
> file for all the cache misses. Some of those files were subsequently deleted.

So does that mean there are more files which were not included in the
prior list of missing files that I should check for?


Re: Massive image loss

Tim Starling-2
Gregory Maxwell wrote:
> On Sat, Sep 20, 2008 at 9:38 PM, Tim Starling <[hidden email]> wrote:
>> Yes. The bug itself deleted the file and put a directory entry in its
>> place. I wrote a shell script to remove the directory entry and then do a
>> wget to fetch the file from the squid cache. Wget created a zero-length
>> file for all the cache misses. Some of those files were subsequently deleted.
>
> So does that mean there are more files which were not included in the
> prior list of missing files that I should check for?

The list of missing files was derived from the initial scan for directory
entries where files should have been. I'm not sure why files would have
been missing from that list. We'll probably have to check the whole file
repository against the DB. You could write a script for that if you feel
like doing something to help.

-- Tim Starling
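
[Editor's sketch] A check of the file repository against the DB could start from something like the following. This is a hedged Python sketch, not an existing tool. It assumes the expected relative paths have already been exported from the database into a list (how to produce that export is not covered here), and the function name and the report keys are made up for illustration. It sorts each expected path into the three kinds of breakage mentioned earlier in the thread: missing files, empty files, and directories where files should be.

```python
import os


def classify_expected_files(root, expected_paths):
    """Compare paths the database expects with what is on disk.

    `root` is the top of the upload directory. `expected_paths` is an
    iterable of paths relative to it, e.g. "/wikipedia/ja/0/01/Name.jpg".
    Healthy files are not reported.
    """
    report = {"missing": [], "empty": [], "directory": []}
    for rel in expected_paths:
        full = os.path.join(root, rel.lstrip("/"))
        if os.path.isdir(full):
            # The bug's signature: a directory entry in place of the file.
            report["directory"].append(rel)
        elif not os.path.exists(full):
            report["missing"].append(rel)
        elif os.path.getsize(full) == 0:
            # Possibly a leftover from a failed cache fetch.
            report["empty"].append(rel)
    return report
```

Because the report is keyed by failure type, files that were already deleted after a failed fetch show up under "missing" even though a scan for directory entries alone would no longer find them.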



Re: Massive image loss

Ilmari Karonen
Tim Starling wrote:

> Gregory Maxwell wrote:
>> On Sat, Sep 20, 2008 at 9:38 PM, Tim Starling <[hidden email]> wrote:
>>> Yes. The bug itself deleted the file and put a directory entry in its
>>> place. I wrote a shell script to remove the directory entry and then do a
>>> wget to fetch the file from the squid cache. Wget created a zero-length
>>> file for all the cache misses. Some of those files were subsequently deleted.
>> So does that mean there are more files which were not included in the
>> prior list of missing files that I should check for?
>
> The list of missing files was derived from the initial scan for directory
> entries where files should have been. I'm not sure why files would have
> been missing from that list. We'll probably have to check the whole file
> repository against the DB. You could write a script for that if you feel
> like doing something to help.

Just running something like "find -type d" on the image directory and
filtering out the expected legitimate entries would be a good start.

By the way, there also seem to be plenty of these under the "archive"
directory, e.g.

/wikipedia/en/archive/0/00/20060414204303!Uakari_male.jpg/

This probably has something to do with the problems we've been having with
thumbnail generation in image histories.

--
Ilmari Karonen
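
[Editor's sketch] The "find -type d" suggestion above, with the filtering step included, might look roughly like this in Python. The heuristic is an assumption: a directory is treated as suspicious when its name carries a file extension, which the project, language, hash, and "archive" directories seen in this thread do not. The set of subtrees to skip is also an assumption; "thumb" is skipped on the assumption that thumbnail storage legitimately contains directories named after files.

```python
import os

# Assumption: subtrees where directories named like files are expected.
SKIP_DIRS = {"thumb"}


def looks_like_file(name):
    """True if a directory name carries a file extension, e.g. 'Foo.jpg'."""
    return os.path.splitext(name)[1] != ""


def stray_directories(root):
    """Yield paths (relative to `root`) of directories named like files."""
    for dirpath, dirnames, _filenames in os.walk(root):
        # Prune skipped subtrees in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in list(dirnames):
            if looks_like_file(name):
                yield os.path.relpath(os.path.join(dirpath, name), root)
                dirnames.remove(name)  # nothing useful below a stray entry
```

Since the "archive" tree is walked like any other, entries such as the one quoted above would be reported too. A legitimate directory with a dot in its name would be a false positive, so the output is a list of candidates to review, not a list of confirmed losses.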
