Hi everyone,
In our work with Elog.io[1], we've come across a number of duplicate files in Commons. Some of them are explainable, such as PNGs which also have a thumbnail as a JPG[2], but others seem to be more clear-cut duplicated uploads, like [3] and [4], and yet others are the same work in different sizes, like [5] and [6].

Going through this is quite an effort and likely requires a bit of manual work. Is there any organised structure or group of people that deals with duplicate works? We'd love to contribute our findings to such an effort once we clean up our data a bit.

[1] http://elog.io/
[2] Like https://commons.wikimedia.org/wiki/File:Island_House,_Bellows_Falls,_by_P._W._Taft.png
[3] https://commons.wikimedia.org/wiki/File:Defense.gov_News_Photo_090910-N-8420M-038.jpg
[4] https://commons.wikimedia.org/wiki/File:US_Navy_090910-N-8420M-038_Students_in_Basic_Underwater_Demolition-SEAL_(BUD-S)_class_279_participate_in_a_surf_passage_exercise_during_the_first_phase_of_training_at_Naval_Amphibious_Base_Coronado.jpg
[5] https://commons.wikimedia.org/wiki/File:P0772931871(37827)(NRCS_Photo_Gallery).jpg
[6] https://commons.wikimedia.org/wiki/File:NRCSMT01082(18769)(NRCS_Photo_Gallery).jpg

--
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | [hidden email]
E-mail is the fastest way to my attention
Jonas Öberg, 04/12/2014 08:31:
> In our work with Elog.io[1], we've come across a number of duplicate
> files in Commons.

Great!

> Some of them are explainable, such as PNGs which also have a thumbnail
> as a JPG[2], but others seem to be more clear-cut duplicated uploads,
> like [3] and [4], and yet others are the same work in different sizes,
> like [5] and [6].

Are most of the cases you find perfect duplicates like these?

> Going through this is quite an effort and likely requires a bit of
> manual work. Is there any organised structure or group of people that
> deals with duplicate works? We'd love to contribute our findings to
> such an effort once we clean up our data a bit.

Sure. You can edit the files and add
https://commons.wikimedia.org/wiki/Template:Duplicate
If you need to report many thousands of files, it may be better to use a
flagged bot account:
https://commons.wikimedia.org/wiki/Commons:Bots/Requests

Nemo
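For reference, bulk tagging along those lines might look something like the rough pywikibot sketch below. It assumes a flagged bot account, and it assumes that Template:Duplicate takes the file to keep as its first parameter -- check the template's documentation before using it; the file names are placeholders.

    # Rough sketch only: tag a file as a duplicate of another with pywikibot.
    # Assumes a flagged bot account and that Template:Duplicate takes the
    # file to keep as its first parameter -- verify against the template docs.
    import pywikibot

    site = pywikibot.Site("commons", "commons")

    def tag_duplicate(dupe_title, keep_title):
        """Prepend {{duplicate|...}} to the page of the redundant file."""
        page = pywikibot.FilePage(site, dupe_title)
        page.text = "{{duplicate|%s}}\n%s" % (keep_title, page.text)
        page.save(summary="Tagging exact duplicate of [[%s]]" % keep_title)

    # Hypothetical pair from a match list:
    # tag_duplicate("File:Example_duplicate.jpg", "File:Example.jpg")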
We really need a better way to mark duplicates on Commons (and images
that are details from a larger work).

A structure to record this is something that probably ought to be on the radar for the new Structured Data project.

As well as exact duplicates, there may often also be different versions of the same painting with different lighting, or scans of slightly different reproductions of the same work. I don't know whether the algorithm is permissive enough to pick all of these up, but as many as can be picked up would be good to tag as "other versions" of the same underlying image.

In general, we probably wouldn't *remove* duplicate images, but we would want to identify them as versions of each other.

All best,

   James.
On Thu, Dec 4, 2014 at 10:08 AM, James Heald <[hidden email]> wrote:
> As well as exact duplicates, there may often also be different versions
> of the same painting with different lighting, or scans of slightly
> different reproductions of the same work. I don't know whether the
> algorithm is permissive enough to pick all of these up, but as many as
> can be picked up would be good to tag as "other versions" of the same
> underlying image.
>
> In general, we probably wouldn't *remove* duplicate images, but we
> would want to identify them as versions of each other.

We probably need a good definition of all these terms, because people tend to have different interpretations of a 'duplicate'. E.g., for me a lower-quality reproduction of a painting is a duplicate, but other people on Commons define it more strictly: only 'downsized' versions of a reproduction (that could also be made using the thumbnail service) are considered to be duplicates. We also need to have definitions for things like details, alternate angles, etcetera.

-- Hay
In reply to this post by James Heald
On 4 December 2014 at 09:08, James Heald <[hidden email]> wrote:
> As well as exact duplicates, there may often also be different versions
> of the same painting with different lighting, or scans of slightly
> different reproductions of the same work. I don't know whether the
> algorithm is permissive enough to pick all of these up, but as many as
> can be picked up would be good to tag as "other versions" of the same
> underlying image.

Careful here - algorithms that spot almost-duplicates will happily flag different shots from the same shoot. Definitely not something to act upon without close human inspection.

> In general, we probably wouldn't *remove* duplicate images, but we
> would want to identify them as versions of each other.

Oh yeah, this'll be useful.

- d.
Hi everyone,
> Careful here - algorithms that spot almost-duplicates will happily
> flag different shots from the same shoot. Definitely not something to
> act upon without close human inspection.

I agree, and I wouldn't want to flag anything automatically based on our findings.

The algorithm we use is meant to capture verbatim re-use, not derivative works. This means that it does a very poor job at matching images that are different photographic reproductions of the same work (light conditions, angles, borders, etc. will all differ). It does a fairly good job at matching images that are verbatim copies, allowing for resizing and format changes, but it's not perfect, and we definitely end up with the same hash for some images even if they're not identical. This happens often with maps, for instance: take two maps of US states, one marking Washington in red and one marking California in red. With no other differences, they'll end up hashed very close to each other.

Sincerely,
Jonas
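To illustrate why near-identical images end up with near-identical fingerprints, here is a minimal average-hash sketch in Python -- this is not the blockhash algorithm Elog.io actually uses, just a simplified stand-in that shows the same effect; the map file names are hypothetical.

    # Simplified average-hash, for illustration only (not Elog.io's blockhash):
    # downscale to an 8x8 grayscale grid and set one bit per cell depending on
    # whether that cell is brighter than the mean. Two maps that differ only in
    # one highlighted state share almost all of their bits.
    from PIL import Image

    def average_hash(path, size=8):
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        mean = sum(pixels) / float(len(pixels))
        return "".join("1" if p > mean else "0" for p in pixels)

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    # Hypothetical file names for two otherwise-identical state maps:
    # print(hamming(average_hash("map_washington.png"),
    #               average_hash("map_california.png")))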
In reply to this post by Federico Leva (Nemo)
Hi Federico, and others,
> Are most of the cases you find perfect duplicates like these?

I'm still running the comparison, but I made a first list of ~500 duplicate works available here:

http://belar.coyote.org/~jonas/wmcdups.html

It would be very useful to get some feedback on that. Looking through some of those will give an idea of the kind of "duplicates" we find.

Sincerely,
Jonas
In reply to this post by Jonas Öberg
Just out of interest, would your algorithm identify, e.g.,

https://commons.wikimedia.org/wiki/File:Portrait_du_Bienheureux_Pierre_de_Luxembourg_-_Mus%C3%A9e_du_Petit_Palais_d%27Avignon.jpg

and

https://commons.wikimedia.org/wiki/File:Master_Of_The_Avignon_School_-_Vision_of_Peter_of_Luxembourg_-_WGA14511.jpg

as duplicates? (Just as a pair of images I happen to have run across this morning.)

They're very similar, though the smaller image is in fact sharper, a little darker, and slightly differently framed. So I'd be interested whether they would ping the algorithm or not.

All best,

   James.
Hi James,
> They're very similar, though the smaller image is in fact sharper, a
> little darker, and slightly differently framed.

This wouldn't trigger any bells for us. They're too different for us to say, mathematically, that they are similar without also triggering a lot of false positives.

If we look at the hashes generated by our blockhash[1] algorithm for those two images, we end up with this:

  8000bc409f7c9ffd9cd096689fe883e4f3fd83c583c101e183e101e60073e7bf
  80019c819ff99ff18cc1944197e19fe9f7e9c3c983c103c1a3e183ee217004ff

You can see that there is some commonality, but also that they're quite far apart. If we convert these to bits and calculate the Hamming distance (the number of bits that differ) between the two, we end up with a distance of 48 bits (out of 256).

So far, we've found that a maximum distance of 10 is usually sufficiently unique to be called a match, though with the draft query for duplicate Commons works that I linked to, I've been even more restrictive and not allowed even one bit to differ, just to get a better match for those that do match, at the expense of not matching as many.

Sincerely,
Jonas
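For anyone who wants to reproduce that figure, a minimal Python sketch of the Hamming-distance calculation over the two hashes quoted above:

    # Hamming distance between two hex-encoded 256-bit hashes:
    # XOR them as integers and count the bits that differ.
    hash_a = "8000bc409f7c9ffd9cd096689fe883e4f3fd83c583c101e183e101e60073e7bf"
    hash_b = "80019c819ff99ff18cc1944197e19fe9f7e9c3c983c103c1a3e183ee217004ff"

    def hamming_distance(hex_a, hex_b):
        return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

    print(hamming_distance(hash_a, hash_b))  # 48 for the pair above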
In reply to this post by Jonas Öberg
Would it be possible to split the list into images that are
* byte-for-byte identical
* of very different sizes (e.g. > 2x difference -- this is often intentional, especially for large TIFFs)
* others?

I think this would be useful.

It would also be useful to do some further processing to identify images which, though probably related, are *not* in fact duplicates, for example due to a notable difference somewhere (arrows or a legend added, or a difference in some local blocks of colour), e.g.:

https://commons.wikimedia.org/wiki/File:Map_-_NL_-_Putten_-_Wijk_00_Putten_-_Buurt_01_Putten-Zuid-Oost.svg
https://commons.wikimedia.org/wiki/File:Map_-_NL_-_Putten_-_Wijk_00_Putten_-_Buurt_03_Putten-Zuid-West.svg

-- James.
Hi James,
> * byte-for-byte identical

That's something probably best done by WMF staff themselves; I think a simple md5 comparison would give quite a few matches. Doing it on the WMF side would avoid the need to transfer large amounts of data.

For the rest, that's something that requires only a few API lookups to get the relevant information (size etc.).

I can also imagine that it might be useful to take the results we've gotten and apply some secondary matching to the pairs that we've identified. Such a secondary matching could be more specific than ours, to narrow down to true duplicates, and also take size into consideration.

That's beyond our need though: we're happy with the information we have, and while it would contribute to our work to eliminate duplicates in Commons, it's not critical right now. But if someone is interested in working with our results or our data, we'd be happy to collaborate around that if it would benefit Commons.

Sincerely,
Jonas
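As a sketch of what those API lookups might look like: the imageinfo properties sha1, width and height are standard MediaWiki API fields, so a candidate pair could be classified as byte-identical, very different in size, or "other" along these lines (the file titles below are hypothetical):

    # Classify a candidate pair using the Commons API: compare SHA1 checksums
    # for byte-identical files, and pixel dimensions for large size differences.
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def file_info(title):
        params = {
            "action": "query",
            "prop": "imageinfo",
            "iiprop": "sha1|size",
            "titles": title,
            "format": "json",
        }
        pages = requests.get(API, params=params).json()["query"]["pages"]
        return next(iter(pages.values()))["imageinfo"][0]

    def classify(title_a, title_b):
        a, b = file_info(title_a), file_info(title_b)
        if a["sha1"] == b["sha1"]:
            return "byte-for-byte identical"
        if max(a["width"], b["width"]) > 2 * min(a["width"], b["width"]):
            return "very different sizes"
        return "other"

    # Hypothetical pair:
    # print(classify("File:Example.jpg", "File:Example_2.jpg"))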
On 4 December 2014 at 18:39, Jonas Öberg <[hidden email]> wrote:
>> * byte-for-byte identical
>
> That's something probably best done by WMF staff themselves; I think a
> simple md5 comparison would give quite a few matches. Doing it on the
> WMF side would avoid the need to transfer large amounts of data.

Volunteers can do this using simple database queries, which is a lot more efficient than pulling data out of the API. For example, while writing this email I knocked out a query to show all non-trivial images (>2 pixels wide) on Commons with at least *3* files having the same SHA1 checksum, showing each image just once. The matching files are listed at the bottom of each image page on Commons.

Interestingly, this shows that most of the 226 files are from an upload of Gospel illustrations. The low number seems reassuring considering the size of Commons. The files are reported in descending order of image resolution.

Report: http://commons.wikimedia.org/w/index.php?title=User:F%C3%A6/sandbox&oldid=141460887

On its own this is an interesting list to use as a backlog for fixes. Listing identical duplicates with 2 or more matching files would be simpler but longer; at the moment I count 3,279 files like this on Commons, which took over 9 minutes to run. :-)

Fae
--
[hidden email]
https://commons.wikimedia.org/wiki/User:Fae
Hi Fae,
> Listing identical duplicates with 2 or more matching files would be
> simpler but longer; at the moment I count 3,279 files like this on
> Commons, which took over 9 minutes to run. :-)

This is very interesting. I had a closer look at our matches and it seems that many of them are files where there are slight color variations, or where the JPG has simply been compressed differently, so a SHA1 comparison wouldn't match them against each other. But that supports the point that the matches we find need a human to validate them case by case.

My Python script is still processing :-) but it has currently recorded 12,475 matches, which then also includes your 3,279. Your 3,279 should be fairly uncomplicated to act on, it seems, though perhaps there too a human needs to assist, since the metadata and usage may vary?

Sincerely,
Jonas
In reply to this post by Fæ
I am using Wikimedia APIs to create a gallery of duplicates and routinely clean them. You can see the results here. The page also has a link to the script. If anyone is interested in using this script, let me know and I can work with you to customize it.

- Sreejith K.
In reply to this post by Jonas Öberg
On 04.12.2014 19:39, Jonas Öberg wrote:
> Hi James,
>
>> * byte-for-byte identical
>
> That's something probably best done by WMF staff themselves; I think a
> simple md5 comparison would give quite a few matches. Doing it on the
> WMF side would avoid the need to transfer large amounts of data.

This is happening automatically: the SHA1 hash of every file is computed on upload and placed in the img_sha1 field in the database. I believe this is used to warn users who try to upload an exact duplicate, but I'm not sure this is true.

Anyway, *exact* duplicates can easily be found in the database by anyone who has an account on toollabs. The relevant query is:

  select A.img_name, A.img_sha1, B.img_name
  from image as A
  join image as B
    on A.img_sha1 = B.img_sha1
   and A.img_name < B.img_name;

Having a list of "effective" duplicates, such as the same image in slightly different resolution or compression, would of course be very interesting.

-- daniel
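For completeness, a small Python sketch of running that query from a Tool Labs account. The replica hostname and credential file follow the Tool Labs conventions of the time and are assumptions here; adjust them to whatever the current infrastructure uses.

    # Run the duplicate-SHA1 query above against the Commons database replica.
    # Assumes a Tool Labs account with the usual ~/replica.my.cnf credentials;
    # the replica hostname below follows the naming used at the time.
    import os
    import pymysql

    conn = pymysql.connect(
        host="commonswiki.labsdb",
        db="commonswiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )

    query = """
        select A.img_name, A.img_sha1, B.img_name
        from image as A
        join image as B
          on A.img_sha1 = B.img_sha1
         and A.img_name < B.img_name
    """

    with conn.cursor() as cursor:
        cursor.execute(query)
        for name_a, sha1, name_b in cursor.fetchall():
            print(name_a, sha1, name_b)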
In reply to this post by Jonas Öberg
Just casually clicking on some of the results you pulled, I can see at a glance that lots of these duplicate uploads happen because the oldest version is virtually "unfindable" for Wikimedians; i.e. it is not in any category whatsoever.
In reply to this post by Daniel Kinzler
On 04.12.2014 22:49, Daniel Kinzler wrote:
> select A.img_name, A.img_sha1, B.img_name
> from image as A
> join image as B
>   on A.img_sha1 = B.img_sha1
>  and A.img_name < B.img_name;

FWIW, I get 197,495 results for that query (it took about 4 minutes). So about 200k images could be deleted, but all the categories and meta-info should be merged, and any usages moved to one file...

-- daniel
In reply to this post by Daniel Kinzler
On Thu, Dec 4, 2014 at 1:49 PM, Daniel Kinzler <[hidden email]> wrote:

> This is happening automatically: the SHA1 hash of every file is
> computed on upload and placed in the img_sha1 field in the database.

Indeed; so chances are exact duplicates have been uploaded intentionally. (It's not certain, because they might have been uploaded before this warning existed, or uploaded by a bot configured to ignore warnings.)

There is also an automated report:
https://commons.wikimedia.org/wiki/Special:ListDuplicatedFiles
In reply to this post by Jonas Öberg
Thanks Jonas for experimenting with this sort of thing. I've always wished we did something with perceptual hashes internally, in addition to the SHA1 hashes we do currently.

--bawolff
In reply to this post by Gergo Tisza
Gergo Tisza, 04/12/2014 23:27:
> On Thu, Dec 4, 2014 at 1:49 PM, Daniel Kinzler wrote:
>
>> This is happening automatically: the SHA1 hash of every file is
>> computed on upload and placed in the img_sha1 field in the database.
>> I believe this is used to warn users who try to upload an exact
>> duplicate, but I'm not sure this is true.
>
> Indeed; so chances are exact duplicates have been uploaded
> intentionally. (It's not certain, because they might have been uploaded
> before this warning existed, or uploaded by a bot configured to ignore
> warnings.)

This warning is shown at Special:Upload. I don't remember if all users have the permission to click the "upload anyway"/"ignore warnings" button. On UploadWizard, the upload just fails in the "upload" step and I can only remove the file.

It's possible users didn't understand that "duplicate" means byte-by-byte duplicate, but I checked a dozen examples and most seem to be the result of an upload failure. For instance: two pages of a book are given the same image, or an image is uploaded twice within 3 minutes (and used only under the second title).

Nemo