New tool - User dupes

9 messages

New tool - User dupes

Magnus Manske-2
Fulfilling a request, I added "User dupes" to my set of toys. For a
user and a wiki (Wikipedia or Commons), it can find uploaded files
identical in size (pixels and bytes) but with different names.

Magnus


[1] http://tools.wikimedia.de/~magnus/userdupes.php
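For the curious, the kind of matching the tool describes can be sketched in a few lines. This is only an illustration of the idea, not the actual userdupes.php code; `uploads` is a hypothetical stand-in for the rows one would fetch from a wiki's image table.

```python
from collections import defaultdict

def find_user_dupes(uploads):
    """Group a user's uploads by (width, height, byte size); any group
    holding more than one file name is a likely duplicate set.
    `uploads` is a list of (name, width, height, size_bytes) tuples --
    a hypothetical stand-in for a query against the image table."""
    groups = defaultdict(list)
    for name, width, height, size in uploads:
        groups[(width, height, size)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Matching on size alone is cheap but can yield false positives, which is presumably why the tool reports candidates for a human to check rather than acting on them.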
_______________________________________________
Commons-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/commons-l

Re: New tool - User dupes

Bence Damokos
Great tool!
Is there a way to make it cross-wiki, so as to find Commons duplicates on a local wiki that are not under the same name and not from the same user?

Bence

On 12/3/06, Magnus Manske <[hidden email]> wrote:
Fulfilling a request, I added "User dupes" to my set of toys. For a
user and a wiki (Wikipedia or Commons), it can find uploaded files
identical in size (pixels and bytes) but with different names.

Magnus


[1] http://tools.wikimedia.de/~magnus/userdupes.php

Re: New tool - User dupes

Magnus Manske-2
On 12/4/06, Bence Damokos <[hidden email]> wrote:
> Great tool!

Thanks!

> Is there a way to make it cross wiki, so to find commons duplicates on a
> local wiki, that are not under the same name, and not from the same user?

Well, cross-checking one million commons images against a few hundred
thousand on one of the larger wikipedias might kill the toolserver
quite efficiently ;-)

Or did you mean, given a wikipedia and a user, find duplicates on the
local wiki and commons alike? That would be possible.

Magnus

Re: New tool - User dupes

Alexandre NOUVEL
Hi all,

--- According to Magnus Manske <[hidden email]>:
> Well, cross-checking one million commons images against a few hundred
> thousand on one of the larger wikipedias might kill the toolserver
> quite efficiently ;-)

Well, I agree that image processing is a very CPU-consuming task, and
cross-checking adds to the difficulty.

However, I think that it may be possible to build a kind of hash
signature for each file and sort them to find duplicates. The hashing
process itself would require some time but could be split among
several servers. The resulting hash lists could then be sorted, so that
matching signatures would lead to further checking of their initial
images.

One drawback for this solution is to maintain a huge index of all the
signatures (each one associated with the image name and the originating
wiki).

Or perhaps I'm just writing bullshit :)

Best regards from France,
--
[hidden email]
|-> http://www.alnoprods.net
|-> Private copying and self-distribution under threat: http://eucd.info
\ I hate spam. I kill spammers. Honestly.

Re: New tool - User dupes

Magnus Manske-2
On 12/4/06, Alexandre NOUVEL <[hidden email]> wrote:

> Hi all,
>
> --- According to Magnus Manske <[hidden email]>:
> > Well, cross-checking one million commons images against a few hundred
> > thousand on one of the larger wikipedias might kill the toolserver
> > quite efficiently ;-)
>
> Well, I agree that image processing is a very CPU-consuming task, and
> cross-checking adds to the difficulty.
>
> However, I think that it may be possible to build a kind of hash
> signature for each file and sort them to find duplicates. The hashing
> process itself would require some time but could be split among
> several servers. The resulting hash lists could then be sorted, so that
> matching signatures would lead to further checking of their initial
> images.

There was a discussion somewhere (maybe on this list? I don't
remember) to store MD5-hashes of image data in the table with the
other image information (size etc.). Nothing came of it, I'm afraid.
Too bad.

> One drawback for this solution is to maintain a huge index of all the
> signatures (each one associated with the image name and the originating
> wiki).

With images being replaced, deleted, undeleted, etc., the only
practical place is indeed the image table on the respective wiki. An
outside solution (i.e. toolserver) is out of the question IMHO.

> Or perhaps I'm just writing bullshit :)

Nope :-)

Magnus

Re: New tool - User dupes

Bryan Tong Minh
On 12/4/06, Magnus Manske <[hidden email]> wrote:
> There was a discussion somewhere (maybe on this list? I don't
> remember) to store MD5-hashes of image data in the table with the
> other image information (size etc.). Nothing came of it, I'm afraid.
> Too bad.
What was the reason for this? Was it the technical difficulties, or
just a lack of willingness?

Re: New tool - User dupes

Magnus Manske-2
On 12/12/06, Bryan Tong Minh <[hidden email]> wrote:
> On 12/4/06, Magnus Manske <[hidden email]> wrote:
> > There was a discussion somewhere (maybe on this list? I don't
> > remember) to store MD5-hashes of image data in the table with the
> > other image information (size etc.). Nothing came of it, I'm afraid.
> > Too bad.
> What was the reason for this? Was it the technical difficulties, or
> just a lack of willingness?

I'd say 10% the former (MD5ing all existing images on all wikipedias
and commons would mean server stress), and 90% the latter.

Magnus

Re: New tool - User dupes

Sherool
On Tue, 12 Dec 2006 14:06:41 +0100, Magnus Manske  
<[hidden email]> wrote:

> On 12/12/06, Bryan Tong Minh <[hidden email]> wrote:
>> On 12/4/06, Magnus Manske <[hidden email]> wrote:
>> > There was a discussion somewhere (maybe on this list? I don't
>> > remember) to store MD5-hashes of image data in the table with the
>> > other image information (size etc.). Nothing came of it, I'm afraid.
>> > Too bad.
>> What was the reason for this? Was it the technical difficulties, or
>> just a lack of willingness?
>
> I'd say 10% the former (MD5ing all existing images on all wikipedias
> and commons would mean server stress), and 90% the latter.

I'm sure the job queue could be tweaked to handle hashing of all existing
images in the "background"; it would probably take a few days to complete,
but it would hardly need to be a major stress factor. Did anyone file an
actual feature request in Bugzilla, or did the discussion just fizzle out
before anyone got around to it?
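A background job along those lines might look like this sketch, which throttles itself by hashing in small batches with a pause in between. The `read_file` and `store_hash` callbacks are hypothetical stand-ins for the wiki's file store and image table, not real MediaWiki APIs.

```python
import hashlib
import time

def hash_backlog(images, read_file, store_hash, batch_size=100, pause=1.0):
    """Work through the existing images in small batches, sleeping
    between batches so the hashing stays a low-priority background
    job. `read_file(name)` returns an image's bytes; `store_hash`
    writes the computed MD5 back -- both are assumed callbacks."""
    for i in range(0, len(images), batch_size):
        for name in images[i:i + batch_size]:
            store_hash(name, hashlib.md5(read_file(name)).hexdigest())
        time.sleep(pause)
```

Tuning `batch_size` and `pause` trades total completion time against load, which is how "a few days" with little server stress could plausibly be achieved.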

--
[[:en:User:Sherool]]


Re: New tool - User dupes

Bryan Tong Minh
In reply to this post by Magnus Manske-2
That's a pity, since it would have been very handy for auto-comparing images.

Bryan

On 12/12/06, Magnus Manske <[hidden email]> wrote:

> On 12/12/06, Bryan Tong Minh <[hidden email]> wrote:
> > On 12/4/06, Magnus Manske <[hidden email]> wrote:
> > > There was a discussion somewhere (maybe on this list? I don't
> > > remember) to store MD5-hashes of image data in the table with the
> > > other image information (size etc.). Nothing came of it, I'm afraid.
> > > Too bad.
> > What was the reason for this? Was it the technical difficulties, or
> > just a lack of willingness?
>
> I'd say 10% the former (MD5ing all existing images on all wikipedias
> and commons would mean server stress), and 90% the latter.
>
> Magnus