Image ID. Looking for community input.


Image ID. Looking for community input.

Gregory Maxwell
Someone came to me with an image they believed they had obtained from
Commons, but they were unsure of exactly where they had found it...

I was eventually able to locate the image, but it took a lot of work.

Had the image been deleted for copyright problems, finding it would
have been nearly impossible, yet that is exactly the sort of situation
in which we most need the ability to find an image.

I have locally a database of image fingerprints (quantized color
histograms) which could be used to locate images... it didn't work for
this case because the image was newer than our last image backup
(which was last year). It might be useful but it's not a complete
solution.
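
A minimal sketch of what a quantized color histogram fingerprint could look like, for readers unfamiliar with the idea. It is not the code behind the database mentioned above: the function names, the choice of 4 levels per channel, and the L1 distance are all assumptions made for illustration, and pixels are taken as plain (r, g, b) tuples so that no imaging library is needed.

```python
from collections import Counter


def quantized_histogram(pixels, levels=4):
    """Return a normalized histogram over levels**3 coarse RGB bins.

    `pixels` is a list of (r, g, b) tuples with 0-255 channel values.
    `levels` (hypothetical default: 4) should divide 256 evenly.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // levels
    counts = Counter()
    for r, g, b in pixels:
        # Map each channel to a coarse level, then combine into one bin index.
        bin_index = (r // step) * levels * levels + (g // step) * levels + (b // step)
        counts[bin_index] += 1
    total = len(pixels)
    # Normalizing by pixel count makes images of different sizes comparable.
    return [counts.get(i, 0) / total for i in range(levels ** 3)]


def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = same, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Toy data: a "full-size" image, a smaller and slightly shifted copy, and
# an unrelated image.
original = [(255, 0, 0)] * 90 + [(0, 0, 255)] * 10
thumbnail = [(250, 5, 5)] * 9 + [(5, 5, 250)] * 1
unrelated = [(0, 255, 0)] * 50

near = histogram_distance(quantized_histogram(original), quantized_histogram(thumbnail))
far = histogram_distance(quantized_histogram(original), quantized_histogram(unrelated))
```

Because the histogram is normalized and coarse, a rescaled or lightly recompressed copy tends to land close to its source, which an exact checksum cannot offer.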

I wanted to get input from the community on a couple of actions we
might like to take to improve the situation in the future:

# On upload we could embed the URL under which the image was uploaded
in an EXIF tag.  There are a great many EXIF tags defined and
I'm sure we could find a fitting one. This would only work for .JPG
but it would be easy to implement.  As a separate topic, we should
consider adding license data to our EXIF tags (I do it for my images,
but we should perhaps do it more generally).  This would be fairly
easy to do and I don't think it would be controversial, although
there would be some complexity with respect to image moves once we
gain that ability in the future.  Does anyone object to this?

# We could also add the same in the PNG comments... although such use
of PNG comments is non-standard, I don't think it would break
anything. Does anyone have any thoughts on that?

# We could add some RDF tags to SVGs for the same purpose, although I
think the PNG rasterizations of SVGs would be more important.

# Finally, something that might be somewhat controversial:  I think
it would be a good idea to add some text to the (raster) thumbnail
image on the image page. My idea is that we would add an extra white
area below the image, large enough to contain a line of text which
mentions where the image came from. This would have a twofold
benefit: 1) it would encourage people to use the full resolution image
for reuse, 2) it would cause automated scraping processes which hit
our image pages to preserve human-readable tracking information.
Unlike a classic watermark, this addition could be removed via
cropping.  Technically this would require just a few more arguments to
ImageMagick during thumbnailing, but we'd have to make a few other
changes to handle smaller images and to treat the image page thumb
differently from other same-sized thumbs.
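
The PNG comment idea in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a proposal for the actual upload code: the helper names are made up, the "Comment" keyword is just one of the keywords the PNG specification predefines for tEXt chunks, and the URL in the example is a hypothetical placeholder.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunk(png_bytes, keyword, text):
    """Return a copy of the PNG with a tEXt chunk inserted right after IHDR."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR is always the first chunk: 4 (length) + 4 (type) + 13 (data) + 4 (CRC).
    ihdr_end = len(PNG_SIGNATURE) + 25
    # tEXt payload is keyword, a NUL separator, then Latin-1 text.
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png_bytes[:ihdr_end] + make_chunk(b"tEXt", payload) + png_bytes[ihdr_end:]


def read_text_chunks(png_bytes):
    """Collect all tEXt chunks as a {keyword: text} dict."""
    pos = len(PNG_SIGNATURE)
    found = {}
    while pos + 8 <= len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        data = png_bytes[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        pos += 12 + length  # length field + type + data + CRC
    return found


# A minimal 1x1 grayscale PNG built in memory, so the example needs no files.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
tiny_png = (PNG_SIGNATURE
            + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
            + make_chunk(b"IEND", b""))

# Hypothetical source URL, used only for illustration.
source_url = "http://commons.wikimedia.org/wiki/Image:Example.png"
tagged_png = add_text_chunk(tiny_png, "Comment", source_url)
```

The image data itself is left untouched, so a tagged file decodes to the same pixels; only tools that strip ancillary chunks would lose the comment.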


In general I think we need to think about how to push our metadata
into the image files themselves. Only if the metadata is embedded in
the images will downstream users have a hope of keeping track of the
images. If the rest of the world did this our lives would be much
easier, so let us do unto others as we would have others do unto us.
_______________________________________________
Commons-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/commons-l

Re: Image ID. Looking for community input.

Magnus Manske-2
On 12/5/06, Gregory Maxwell <[hidden email]> wrote:

> Someone came to me with an image they believed they had obtained from
> Commons, but they were unsure of exactly where they had found it...
>
> I was eventually able to locate the image, but it took a lot of work.
>
> Had the image been deleted for copyright problems it would have been
> nearly impossible but it is in exactly that sort of situation which we
> may need the ability to find the image the most.
>
> I have locally a database of image fingerprints (quantized color
> histograms) which could be used to locate images... it didn't work for
> this case because the image was newer than our last image backup
> (which was last year). It might be useful but it's not a complete
> solution.

A few hours ago, I wrote about adding an MD5 hash (or the like) to
each image entry in the database. That would have helped finding the
image in question as well, except if it has been altered.
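
As a rough sketch of how such a hash lookup could work (the function names and the in-memory dict are assumptions for illustration; the real thing would be a database column):

```python
import hashlib


def md5_of_bytes(data):
    """Hex MD5 digest of the raw file contents."""
    return hashlib.md5(data).hexdigest()


def build_index(images):
    """Map digest -> list of image names; `images` is {name: file bytes}."""
    index = {}
    for name, data in images.items():
        index.setdefault(md5_of_bytes(data), []).append(name)
    return index


def find_image(index, data):
    """Return the names of stored images whose bytes match `data` exactly."""
    return index.get(md5_of_bytes(data), [])


# Toy stand-ins for file contents; names are hypothetical.
stored = {"A.jpg": b"abc", "B.jpg": b"abc", "C.jpg": b"xyz"}
index = build_index(stored)

exact_hits = find_image(index, b"abc")    # byte-identical copies are found
altered_hits = find_image(index, b"abd")  # a single changed byte finds nothing
```

The second lookup shows the limitation mentioned above: any alteration, however small, changes the digest.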

> I wanted to get input from the community on a couple of actions we
> might like to take to improve the situation in the future:
>
> # On upload we could attach the URL the image was uploaded as to the
> image in an EXIF tag.  There are a great many EXIF tags defined and
> I'm sure we could find a fitting one. This would only work for .JPG
> but it would be easy to implement.

That could be done as part of the upload process. If we eventually
enable copy-from-web (again, some code of mine deactivated for unknown
reasons; next time, I'll set these things to "on" by default, so the
gods in charge can't ignore it forever like they do now) we could also
include the "original" (pre-commons) URL.

> As a separate topic, we should
> consider adding license data to our EXIF tags (I do it for my images,
> but we should perhaps do it more generally).  This would be fairly
> easy to do and I don't think this would be controversial, although
> there would be some complexity with respect to image moves once we
> gain that ability in the future.   Does anyone object to this?
>
> # We could also add the same in the PNG comments... although such use
> of png comments is non-standard .. I don't think it would break
> anything. Anyone have any thoughts on that?
>
> # We could add some RDF tags to SVGs for the same purpose, although I
> think the PNG rasterizations of SVGs would be more important.

Adding licenses to the image will require changing the image on any
license-altering edit to the description page. It also means we need
to parse said description for license tags. Unless, of course, we
limit this function to the license set on the upload page.

That said, I think either is a good idea.

> # Finally, something that might be somewhat controversial:   I think
> it would be a good idea to add some text to the (raster) thumbnail
> image on the image page. My idea is that we would add an extra white
> area below the image large enough to contain a line of text which
> mentions where the image came from. This would have a two fold
> benefit: 1) it would encourage people to use the full resolution image
> for reuse, 2) it would cause automated scraping processes which hit
> our image pages to preserve human readable tracking information.
> Unlike a classic watermark this addition could be removed via
> cropping.  Technically this would require just a few more arguments to
> ImageMagick during thumbnailing, but we'd have to make a few other
> changes to handle smaller images and to treat the image page thumb
> differently from other same-sized thumbs.

I'm not sure it's worth the effort.
1) We already link to the high-res version in the line below the
image. Altering the thumbnail requires people to edit the image if
they don't want the high-res version (maybe they're on a modem?)
2) That would be useful for automated image-scrapers that don't use
the page as well and don't link back to the commons. Do you have an
example for this?
Also, IMHO such a bar would uglify (is that a word?) most images. And,
our JPG thumbnails are JPGs as well; depending on the compression,
JPGs don't render (small) text very well.

> In general I think we need to think about how to push our metadata
> into the image files themselves. Only if the metadata is embedded in
> the images will downstream users have a hope of keeping track of the
> images. If the rest of the world did this our lives would be much
> easier, so let us do unto others as we would have others do unto us.

Agreed. But I'd also like for us to use existing data more within the
system. We already use EXIF data to categorize camera models, IIRC?
The images themselves contain data (color etc.); how about "similar
images"? (yes, I know that's a big one, just dreaming here;-)

I created a new flickr account a few days ago, and I very much like
the "feel" of it. The whole site screams that it's designed for
images. Maybe we should think about tag/category clouds, pre-link
various image sizes, integrate mass-organization (like "show me my
images, select this and that, tag them with category XYZ"). I'm not
saying we should become flickr, but we should learn from them.

Magnus

Re: Image ID. Looking for community input.

Alphax (Wikipedia email)
Magnus Manske wrote:

> On 12/5/06, Gregory Maxwell <[hidden email]> wrote:
>> Someone came to me with an image they believed they obtained from
>> commons but were unsure of exactly where they found it...
>>
>> I was eventually able to locate the image, but it took a lot of work.
>>
>> Had the image been deleted for copyright problems it would have been
>> nearly impossible but it is in exactly that sort of situation which we
>> may need the ability to find the image the most.
>>
>> I have locally a database of image fingerprints (quantized color
>> histograms) which could be used to locate images... it didn't work for
>> this case because the image was newer than our last image backup
>> (which was last year). It might be useful but it's not a complete
>> solution.
>
> A few hours ago, I wrote about adding an MD5 hash (or the like) to
> each image entry in the database. That would have helped finding the
> image in question as well, except if it has been altered.
>
A Google search for "image hash" gives 878 results (curses, [[Image:Hash
function.svg]] shows up several times); surely there's a good one with a
free implementation /somewhere/?

--
Alphax - http://en.wikipedia.org/wiki/User:Alphax
Contributor to Wikipedia, the Free Encyclopedia
"We make the internet not suck" - Jimbo Wales
Public key: http://en.wikipedia.org/wiki/User:Alphax/OpenPGP



Re: Image ID. Looking for community input.

Gregory Maxwell
On 12/5/06, Alphax (Wikipedia email) <[hidden email]> wrote:
> A Google search for "image hash" gives 878 results (curses, [[Image:Hash
> function.svg]] shows up several times); surely there's a good one with a
> free implementation /somewhere/?

The subject you really want to search is "image indexing" and it's a
surprisingly immature area.


A simple exact-match hash will be kept by the later image storage
system. It would be useful for some things (detecting exact digital
duplicates), but wouldn't have helped with my image (which was a copy
of the image page thumbnail).
_______________________________________________