Duplicate removal?

Jonas Öberg
Hi everyone,

In our work with Elog.io[1], we've come across a number of duplicate
files in Commons. Some of them are explainable, such as PNGs which
also have a thumbnail as JPG[2], but others seem to be more clear-cut
duplicated uploads, like [3] and [4], and yet others are the same work
but different sizes like [5] and [6].

Going through this is quite an effort and likely requires a bit of
manual work. Is there any organised structure or group of people that
deals with duplicate works? We'd love to contribute our findings to
such an effort once we clean up our data a bit.

[1] http://elog.io/
[2] Like https://commons.wikimedia.org/wiki/File:Island_House,_Bellows_Falls,_by_P._W._Taft.png
[3] https://commons.wikimedia.org/wiki/File:Defense.gov_News_Photo_090910-N-8420M-038.jpg
[4] https://commons.wikimedia.org/wiki/File:US_Navy_090910-N-8420M-038_Students_in_Basic_Underwater_Demolition-SEAL_(BUD-S)_class_279_participate_in_a_surf_passage_exercise_during_the_first_phase_of_training_at_Naval_Amphibious_Base_Coronado.jpg
[5] https://commons.wikimedia.org/wiki/File:P0772931871(37827)(NRCS_Photo_Gallery).jpg
[6] https://commons.wikimedia.org/wiki/File:NRCSMT01082(18769)(NRCS_Photo_Gallery).jpg

--
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | [hidden email]
E-mail is the fastest way to my attention


Re: Duplicate removal?

Federico Leva (Nemo)
Jonas Öberg, 04/12/2014 08:31:
> In our work with Elog.io[1], we've come across a number of duplicate
> files in Commons.

Great!

> Some of them are explainable, such as PNGs which
> also have a thumbnail as JPG[2], but others seem to be more clear-cut
> duplicated uploads, like [3] and [4], and yet others are the same work
> but different sizes like [5] and [6].

Are most of the cases you find perfect duplicates like these?

>
> Going through this is quite an effort and likely requires a bit of
> manual work. Is there any organised structure or group of people that
> deals with duplicate works? We'd love to contribute our findings to
> such an effort once we clean up our data a bit.

Sure. You can edit the files and add
https://commons.wikimedia.org/wiki/Template:Duplicate
If you need to report many thousands of files, it may be better to
use a flagged bot account:
https://commons.wikimedia.org/wiki/Commons:Bots/Requests
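
For illustration, here is a minimal pywikibot sketch of what such a
bot edit might look like. The file names are hypothetical placeholders,
and the exact parameter expected by {{duplicate}} is an assumption --
check the template documentation before running anything like this.

# Hedged sketch (not an existing bot): tag a file as an exact duplicate.
import pywikibot

site = pywikibot.Site("commons", "commons")
page = pywikibot.FilePage(site, "File:Example_duplicate.jpg")  # hypothetical
page.text = "{{duplicate|File:Example_original.jpg}}\n" + page.text
page.save(summary="Tagging exact duplicate of [[:File:Example_original.jpg]]")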

Nemo

_______________________________________________
Commons-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/commons-l
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate removal?

James Heald
We really need a better way to mark duplicates on Commons (and images
that are details from a larger work).  A structure to record this is
something that probably ought to be on the radar for the new Structured
Data project.

As well as exact duplicates, there may often also be different versions
of the same painting with different lighting, or scans of slightly
different reproductions of the same work.  I don't know whether the
algorithm is permissive enough to pick all of these up, but as many as
can be picked up would be good to tag as "other versions" of the same
underlying image.

In general, we probably wouldn't *remove* duplicate images, but we would
want to identify them as versions of each other.

All best,

    James.


On 04/12/2014 08:25, Federico Leva (Nemo) wrote:

> Jonas Öberg, 04/12/2014 08:31:
>> In our work with Elog.io[1], we've come across a number of duplicate
>> files in Commons.
>
> Great!
>
>> Some of them are explainable, such as PNGs which
>> also have a thumbnail as JPG[2], but others seem to be more clear-cut
>> duplicated uploads, like [3] and [4], and yet others are the same work
>> but different sizes like [5] and [6].
>
> Are most of the cases you find perfect duplicates like these?
>
>>
>> Going through this is quite an effort, and likely requires a bit of
>> manual work. Is there any organised structure/group of people, that
>> deal with duplicate works? We'd love to contribute our findings to
>> such an effort once we clean up our data a bit.
>
> Sure. You can edit the files and add
> https://commons.wikimedia.org/wiki/Template:Duplicate
> If you need to report many thousands of files, it may be better to
> use a flagged bot account:
> https://commons.wikimedia.org/wiki/Commons:Bots/Requests
>
> Nemo
>

Re: Duplicate removal?

Hay (Husky)
On Thu, Dec 4, 2014 at 10:08 AM, James Heald <[hidden email]> wrote:
> As well as exact duplicates, there may often also be different versions of
> the same painting with different lighting, or scans of slightly different
> reproductions of the same work.  I don't know whether the algorithm is
> permissive enough to pick all of these up, but as many as can be picked up
> would be good to tag as "other versions" of the same underlying image.
>
> In general, we probably wouldn't *remove* duplicate images, but we would
> want to identify them as versions of each other.

We probably need a good definition of all these terms, because people
tend to have different interpretations of a 'duplicate'. E.g., for me
a lower-quality reproduction of a painting is a duplicate, but other
people on Commons define it more strictly: only 'downsized' versions
of a reproduction (that could also be made using the thumbnail
service) are considered duplicates. We also need definitions for
things like details, alternate angles, etcetera.

-- Hay


Re: Duplicate removal?

David Gerard-2
In reply to this post by James Heald
On 4 December 2014 at 09:08, James Heald <[hidden email]> wrote:

> As well as exact duplicates, there may often also be different versions of
> the same painting with different lighting, or scans of slightly different
> reproductions of the same work.  I don't know whether the algorithm is
> permissive enough to pick all of these up, but as many as can be picked up
> would be good to tag as "other versions" of the same underlying image.


Careful here - algorithms that spot almost-duplicates will happily
flag different shots from the same shoot. Definitely not something to
act upon without close human inspection.


> In general, we probably wouldn't *remove* duplicate images, but we would
> want to identify them as versions of each other.



Oh yeah, this'll be useful.


- d.


Re: Duplicate removal?

Jonas Öberg
Hi everyone,

> Careful here - algorithms that spot almost-duplicates will happily
> flag different shots from the same shoot. Definitely not something to
> act upon without close human inspection.

I agree, and I wouldn't want to flag anything automatically based on
our findings.

The algorithm we use is meant to capture verbatim re-use, not
derivative works. This means it does a very poor job of matching
images that are different photographic reproductions of the same work
(light conditions, angles, borders, etc., will all differ). It does a
fairly good job of matching images that are verbatim copies, allowing
for resizing and format changes, but it's not perfect, and we
definitely end up with the same hash for some images even if they're
not identical. This often happens with maps, for instance: two maps of
US states, one marking Washington in red and one marking California in
red, will end up hashed very close to each other if there are no other
differences.
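
To make that behaviour concrete, here is a deliberately simplified
perceptual-hash sketch in the same spirit -- it is not the elog.io
blockhash implementation, and the 16x16 grid and median threshold are
illustrative assumptions. A small local change, like one state
recoloured on a map, flips only a handful of the 256 bits:

# Simplified perceptual hash sketch: downscale to a 16x16 grayscale grid
# and set one bit per cell depending on whether it is brighter than the
# median cell. Resizing or recompressing barely changes the result.
from PIL import Image

def tiny_perceptual_hash(path, grid=16):
    img = Image.open(path).convert("L").resize((grid, grid))
    pixels = list(img.getdata())
    median = sorted(pixels)[len(pixels) // 2]
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > median else 0)
    return bits  # a 256-bit integer for the default 16x16 grid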

Sincerely,
Jonas


Re: Duplicate removal?

Jonas Öberg
In reply to this post by Federico Leva (Nemo)
Hi Federico, and others,

> Are most of the cases you find perfect duplicates like these?

I'm still running the comparison, but I made a first list of ~500
duplicate works available here:

   http://belar.coyote.org/~jonas/wmcdups.html

It would be very useful to get some feedback on that. Looking through
some of those will give an idea of the kind of "duplicates" we find.


Sincerely,
Jonas


Re: Duplicate removal?

James Heald
In reply to this post by Jonas Öberg
Just out of interest, would your algorithm identify, e.g.,

https://commons.wikimedia.org/wiki/File:Portrait_du_Bienheureux_Pierre_de_Luxembourg_-_Mus%C3%A9e_du_Petit_Palais_d%27Avignon.jpg

and

https://commons.wikimedia.org/wiki/File:Master_Of_The_Avignon_School_-_Vision_of_Peter_of_Luxembourg_-_WGA14511.jpg

as duplicates?  (Just as a pair of images I happen to have run across
this morning).

They're very similar, though the smaller image is in fact sharper, a
little darker, and slightly differently framed.

So I'd be interested in whether they would ping the algorithm or not.

All best,

   James.


On 04/12/2014 09:43, Jonas Öberg wrote:

> Hi everyone,
>
>> Careful here - algorithms that spot almost-duplicates will happily
>> flag different shots from the same shoot. Definitely not something to
>> act upon without close human inspection.
>
> I agree, and I wouldn't want to flag anything automatically based on
> our findings.
>
> The algorithm we use is meant to capture verbatim re-use, not
> derivative works. This means it does a very poor job of matching
> images that are different photographic reproductions of the same work
> (light conditions, angles, borders, etc., will all differ). It does a
> fairly good job of matching images that are verbatim copies, allowing
> for resizing and format changes, but it's not perfect, and we
> definitely end up with the same hash for some images even if they're
> not identical. This often happens with maps, for instance: two maps of
> US states, one marking Washington in red and one marking California in
> red, will end up hashed very close to each other if there are no other
> differences.
>
> Sincerely,
> Jonas
>

Re: Duplicate removal?

Jonas Öberg
Hi James,

> They're very similar, though the smaller image is in fact sharper, a little
> darker, and slightly differently framed.

This wouldn't trigger any bells for us. They're too different for us
to say, mathematically, that they are similar without also triggering
a lot of false positives.

If we look at the hashes generated by our blockhash[1] algorithm for
those two images, we end up with this:

8000bc409f7c9ffd9cd096689fe883e4f3fd83c583c101e183e101e60073e7bf
80019c819ff99ff18cc1944197e19fe9f7e9c3c983c103c1a3e183ee217004ff

You can see that there is some commonality, but that they're also
quite far apart. If we convert these to bits and calculate the Hamming
distance (the number of bits that differ) between the two, we end up
with a distance of 48 bits (out of 256). So far, we've found that a
maximum distance of 10 is usually distinctive enough to call a match,
though with the draft query for duplicate Commons works that I linked
to, I've been even more restrictive and haven't allowed even one bit
to differ, just to get a better match for those that do match, at the
expense of not matching as many.
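
For reference, that distance can be recomputed from the two hex
strings above with a few lines of Python (just a sketch):

# Hamming distance between the two 256-bit blockhash hex digests above.
h1 = "8000bc409f7c9ffd9cd096689fe883e4f3fd83c583c101e183e101e60073e7bf"
h2 = "80019c819ff99ff18cc1944197e19fe9f7e9c3c983c103c1a3e183ee217004ff"

def hamming_distance(a, b):
    # number of differing bits between two equal-length hex strings
    return bin(int(a, 16) ^ int(b, 16)).count("1")

print(hamming_distance(h1, h2))  # the message above reports 48 out of 256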

Sincerely,
Jonas


Re: Duplicate removal?

James Heald
In reply to this post by Jonas Öberg
Would it be possible to split the list into images that are

* byte-for-byte identical
* very different sizes (e.g. more than a 2x difference -- this is often
intentional, especially for large TIFFs)
* others?

I think this would be useful.

It would also be useful to do some further processing to identify images
which, though probably related, are *not* in fact duplicates, e.g. due to
a notable difference somewhere (arrows or a legend added, or a difference
in some local blocks of colour), such as:

https://commons.wikimedia.org/wiki/File:Map_-_NL_-_Putten_-_Wijk_00_Putten_-_Buurt_01_Putten-Zuid-Oost.svg

https://commons.wikimedia.org/wiki/File:Map_-_NL_-_Putten_-_Wijk_00_Putten_-_Buurt_03_Putten-Zuid-West.svg


-- James.




On 04/12/2014 09:44, Jonas Öberg wrote:

> Hi Federico, and others,
>
>> Are most of the cases you find perfect duplicates like these?
>
> I'm still running the comparison, but I made a first list of ~500
> duplicate works available here:
>
>     http://belar.coyote.org/~jonas/wmcdups.html
>
> It would be very useful to get some feedback on that. Looking through
> some of those will give an idea of the kind of "duplicates" we find.
>
>
> Sincerely,
> Jonas
>

Re: Duplicate removal?

Jonas Öberg
Hi James,

> * byte-for-byte identical

That's probably best done by WMF staff themselves; I think a simple
MD5 comparison would give quite a few matches. Doing it on the WMF
side would avoid the need to transfer large amounts of data.

For the rest, that would only require a few API lookups to get the
relevant information (size etc.). I can also imagine that it might be
useful to take the results we've gotten and apply some secondary
matching to the pairs we've identified. Such secondary matching could
be more specific than ours, to narrow down to true duplicates, and
could also take size into consideration.
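
As a rough illustration of those lookups, here is a sketch (not our
actual script) that uses the public imageinfo API to fetch the SHA1
and pixel size of a candidate pair and buckets them roughly along the
lines James suggested; the function names are made up for the example.

# Sketch: classify a candidate pair of Commons files via the public API.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def file_info(titles):
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "prop": "imageinfo",
        "iiprop": "sha1|size",
        "format": "json",
        "formatversion": 2,
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return {p["title"]: p["imageinfo"][0] for p in pages if "imageinfo" in p}

def classify(title_a, title_b):
    info = file_info([title_a, title_b])
    a, b = info[title_a], info[title_b]
    if a["sha1"] == b["sha1"]:
        return "byte-for-byte identical"
    if max(a["width"], b["width"]) > 2 * min(a["width"], b["width"]):
        return "very different sizes"
    return "other (needs a human to look)"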

That's beyond our needs though: we're happy with the information we
have, and while eliminating duplicates in Commons would contribute to
our work, it's not critical right now. But if someone is interested in
working with our results or our data, we'd be happy to collaborate on
that if it would benefit Commons.

Sincerely,
Jonas


Re: Duplicate removal?

Fæ
On 4 December 2014 at 18:39, Jonas Öberg <[hidden email]> wrote:
>> * byte-for-byte identical
>
> That's probably best done by WMF staff themselves; I think a simple
> MD5 comparison would give quite a few matches. Doing it on the WMF
> side would avoid the need to transfer large amounts of data.

Volunteers can do this using simple database queries, which is a lot
more efficient than pulling data out of the API. For example, while
writing this email I knocked out a query to show all non-trivial
images (>2 pixels wide) on Commons with at least *3* files having the
same SHA1 checksum, showing each image just once. The matching
files are listed at the bottom of each image page on Commons.
Interestingly, this shows that most of the 226 files come from an
upload of Gospel illustrations. The low number seems reassuring
considering the size of Commons. The files are reported in descending
order of image resolution.

Report: http://commons.wikimedia.org/w/index.php?title=User:F%C3%A6/sandbox&oldid=141460887

On its own this is an interesting list to use as a backlog for fixes.
Listing identical duplicates with 2 or more files matching would be
simpler but longer; at the moment I count 3,279 such files on
Commons, with a query that took over 9 minutes to run. :-)

Fae
--
[hidden email] https://commons.wikimedia.org/wiki/User:Fae


Re: Duplicate removal?

Jonas Öberg
Hi Fae,

> Listing identical duplicates with 2 or more files matching would be
> simpler but longer; at the moment I count 3,279 files like this on
> Commons which took over 9 minutes to run. :-)

This is very interesting. I had a closer look at our matches and it
seems that many of them are files where there are slight color
variations, or where the JPEG has simply been compressed differently,
so a SHA1 wouldn't match them against each other. But that reinforces
the point that the matches we find need a human to validate them case
by case. My Python script is still processing :-) but it has currently
recorded 12,475 matches, which also includes your 3,279.

But your 3,279 should be fairly uncomplicated to act on, it seems,
though perhaps there too a human needs to assist, since the metadata
and usage may vary?


Sincerely,
Jonas


Re: Duplicate removal?

Sreejith K.
In reply to this post by Fæ
I am using Wikimedia APIs to create a gallery of duplicates and routinely clean them. You can see the results here:

https://commons.wikimedia.org/wiki/User:Sreejithk2000/Duplicates

The page also has a link to the script. If anyone is interested in using this script, let me know and I can work with you to customize it.

- Sreejith K.


On Thu, Dec 4, 2014 at 2:46 PM, Fæ <[hidden email]> wrote:
On 4 December 2014 at 18:39, Jonas Öberg <[hidden email]> wrote:
>> * byte-for-byte identical
>
> That's probably best done by WMF staff themselves; I think a simple
> MD5 comparison would give quite a few matches. Doing it on the WMF
> side would avoid the need to transfer large amounts of data.

Volunteers can do this using simple database queries, which is a lot
more efficient than pulling data out of the API. For example, while
writing this email I knocked out a query to show all non-trivial
images (>2 pixels wide) on Commons with at least *3* files having the
same SHA1 checksum, showing each image just once. The matching
files are listed at the bottom of each image page on Commons.
Interestingly, this shows that most of the 226 files come from an
upload of Gospel illustrations. The low number seems reassuring
considering the size of Commons. The files are reported in descending
order of image resolution.

Report: http://commons.wikimedia.org/w/index.php?title=User:F%C3%A6/sandbox&oldid=141460887

On its own this is an interesting list to use as a backlog for fixes.
Listing identical duplicates with 2 or more files matching would be
simpler but longer; at the moment I count 3,279 such files on
Commons, with a query that took over 9 minutes to run. :-)

Fae
--
[hidden email] https://commons.wikimedia.org/wiki/User:Fae


Re: Duplicate removal?

Daniel Kinzler
In reply to this post by Jonas Öberg
On 04/12/2014 19:39, Jonas Öberg wrote:
> Hi James,
>
>> * byte-for-byte identical
>
> That's probably best done by WMF staff themselves; I think a simple
> MD5 comparison would give quite a few matches. Doing it on the WMF
> side would avoid the need to transfer large amounts of data.

This is happening automatically: the SHA1 hash of every file is computed on
upload, and placed in the img_sha1 field on the database. I believe this is used
to warn users who try to upload an exact duplicate, but I'm not sure this is
true. Anyway, *exact* duplicates can easily be found in the database by anyone
who has an account on toollabs. The relevant query is:

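-- each pair of distinct file names sharing the same SHA1, listed once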
select A.img_name, A.img_sha1, B.img_name
from image as A
join image as B
  on A.img_sha1 = B.img_sha1
 and A.img_name < B.img_name;

Having a list of "effective" duplicates, such as the same image in slightly
different resolution or compression, would of course be very interesting.

-- daniel


Re: Duplicate removal?

Jane Darnell
In reply to this post by Jonas Öberg
Just casually clicking on some of the results you pulled, I can see at a glance that lots of these duplicate uploads happen because the oldest version is virtually "unfindable" for Wikimedians, i.e. it is not in any category whatsoever.

On Thu, Dec 4, 2014 at 7:39 PM, Jonas Öberg <[hidden email]> wrote:
Hi James,

> * byte-for-byte identical

That's probably best done by WMF staff themselves; I think a simple
MD5 comparison would give quite a few matches. Doing it on the WMF
side would avoid the need to transfer large amounts of data.

For the rest, that would only require a few API lookups to get the
relevant information (size etc.). I can also imagine that it might be
useful to take the results we've gotten and apply some secondary
matching to the pairs we've identified. Such secondary matching could
be more specific than ours, to narrow down to true duplicates, and
could also take size into consideration.

That's beyond our needs though: we're happy with the information we
have, and while eliminating duplicates in Commons would contribute to
our work, it's not critical right now. But if someone is interested in
working with our results or our data, we'd be happy to collaborate on
that if it would benefit Commons.

Sincerely,
Jonas


Re: Duplicate removal?

Daniel Kinzler
In reply to this post by Daniel Kinzler
On 04/12/2014 22:49, Daniel Kinzler wrote:
> select A.img_name, A.img_sha1, B.img_name from image as A join image as B on
> A.img_sha1 =  B.img_sha1 and A.img_name < B.img_name;

FWIW, I get 197,495 results for that query (it took about 4 minutes).
So about 200k images could be deleted, but all the categories and meta-info
should be merged, and any usages moved to one file...

-- daniel


Re: Duplicate removal?

Gergo Tisza
In reply to this post by Daniel Kinzler
On Thu, Dec 4, 2014 at 1:49 PM, Daniel Kinzler <[hidden email]> wrote:
This is happening automatically: the SHA1 hash of every file is computed on
upload, and placed in the img_sha1 field on the database. I believe this is used
to warn users who try to upload an exact duplicate, but I'm not sure this is
true.

Indeed; so chances are exact duplicates have been uploaded intentionally. (It's not certain because they might have been uploaded before this warning existed, or uploaded by a bot configured to ignore warnings.)



Re: Duplicate removal?

bawolff
In reply to this post by Jonas Öberg


> >
> > From: "Sreejith K." <[hidden email]>
> > Subject: Re: [Commons-l] Duplicate removal?
> >
> > I am using Wikimedia APIs to create a gallery of duplicates and routinely
> > clean them. You can see the results here.
> >
> > https://commons.wikimedia.org/wiki/User:Sreejithk2000/Duplicates
> >
> > The page also has a link to the script. If anyone is interested in using
> > this script, let me know and I can work with you to customize it.
> >
> > - Sreejith K.
> >
> >
>
See also https://commons.wikimedia.org/wiki/Special:ListDuplicatedFiles, which lists the files that have the most byte-for-byte duplicates (really, most of the time those should use file redirects).

--

Thanks Jonas for experimenting with this sort of thing. I've always wished we did something with perceptual hashes internally, in addition to the SHA1 hashes we use currently.

--bawolff



Re: Duplicate removal?

Federico Leva (Nemo)
In reply to this post by Gergo Tisza
Gergo Tisza, 04/12/2014 23:27:

> On Thu, Dec 4, 2014 at 1:49 PM, Daniel Kinzler wrote:
>
>     This is happening automatically: the SHA1 hash of every file is
>     computed on upload, and placed in the img_sha1 field on the database.
>     I believe this is used to warn users who try to upload an exact
>     duplicate, but I'm not sure this is true.
>
>
> Indeed; so chances are exact duplicates have been uploaded
> intentionally. (It's not certain because they might have been uploaded
> before this warning existed, or uploaded by a bot configured to ignore
> warnings.)

This warning is shown at Special:Upload. I don't remember if all users
have the permission to click the "upload anyway"/"ignore warnings"
button. On UploadWizard, the upload just fails in the "upload" step and
I can only remove the file.

It's possible users didn't understand that duplicate means byte-for-byte
duplicate, but I checked a dozen examples and most seem to be the result
of an upload failure. For instance: two pages of a book are given the
same image; an image uploaded twice within 3 minutes (and used only with
the second title).

Nemo
