Hashing Wikimedia Commons


Hashing Wikimedia Commons

Jonas Öberg
Dear all,

some of you may have been at our presentation during Wikimania and will find this familiar, but for the rest of you: I'm working with Commons Machinery on software that aims to identify images on the web, even when they are used outside of their original context, in order to provide automatic attribution and a referral back to their origin. Imagine a blogger using a photo from Commons: you visit that blog and a browser plugin overlays a small icon showing that the image is from Commons and inviting you to find out more - even if the blogger forgot to attribute it.

We're currently working on a Firefox add-on to do just this, and we've previously built a backend to store the information we need to make these matches, some utilities for perceptual image hashing, and so on. We would love to work with images from Wikimedia Commons as a first dataset to explore how this will all work in practice.

But in order to do so, we need information from Commons, and we want to make this as easy on the WMF servers as possible, so we'd appreciate some help and pointers. What we're looking at retrieving is information about (1) title, (2) author, (3) license, and (4) thumbnails of medium size.

The first three we can get from pretty much any of the APIs, or extract directly from a dump file. The last one is eluding us though, for two reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, actually lives in the /b/ba/ directory - but where this /b/ba/ comes from (a hash?) is unclear to us right now, and it's not something we find in the dumps - though we can get it from one of the APIs.

The other is thumbnail sizes. We need to retrieve a reasonably sized image (in many cases smaller than the original) of about 640px width, so that we can then run a perceptual hash algorithm on that file.

From what we can understand, you can request any size thumbnail of an image simply by prefixing the file name with the size you want (like 640px-Filename.jpg). But it seems really silly to always request 640px, for instance, since the WMF servers would then need to generate that size specifically for us whenever it doesn't already exist.

What we'd find much more appealing is to be able to determine, before making the call, which sizes already exist and can be retrieved without the WMF servers needing to rescale them for us. And while the viewer on Commons does seem to offer thumbnails in various sizes, we can't seem to get that information from any API.

We can scrape the Commons web page for this information, but we figured that people here might have good ideas for how we approach this with minimal impact on the WMF servers :)
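(Editorial aside: for readers unfamiliar with the perceptual-hash step mentioned above, here is one common variant - a difference hash, or "dHash" - as a minimal Python sketch. It is an illustration only, not necessarily the algorithm Commons Machinery uses, and it requires the Pillow library.)

    from PIL import Image

    def dhash(path, size=8):
        # Shrink to (size+1) x size grayscale pixels, then compare horizontal neighbours.
        img = Image.open(path).convert("L").resize((size + 1, size))
        px = list(img.getdata())
        bits = 0
        for row in range(size):
            for col in range(size):
                left = px[row * (size + 1) + col]
                right = px[row * (size + 1) + col + 1]
                bits = (bits << 1) | (left > right)
        return bits  # a 64-bit fingerprint; visually similar images differ in few bits

Comparing two images then amounts to counting the differing bits (the Hamming distance) between their fingerprints.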

Sincerely,
Jonas


Re: Hashing Wikimedia Commons

Jean-Frédéric
Hi Jonas,

Awesome project!

I’m cc-ing the WMF Multimedia team, who might have some more answers :)


--
Jean-Frédéric

Re: Hashing Wikimedia Commons

Daniel Kinzler
In reply to this post by Jonas Öberg
On 04.09.2014 12:26, Jonas Öberg wrote:
> The first three we can get from pretty much any of the APIs, or extract directly
> from a dump file. The last one is eluding us though, for two reasons. One is that
> a file, like 30C3_Commons_Machinery_2.jpg, actually lives in the /b/ba/ directory -
> but where this /b/ba/ comes from (a hash?) is unclear to us right now, and it's not
> something we find in the dumps - though we can get it from one of the APIs.

Yes, /b/ba is based on the first hex digit and the first two hex digits of the MD5 hash of the title:

md5( "30C3_Commons_Machinery_2.jpg" ) -> ba253c78d894a80788940a3ca765debb

But this is "arcane knowledge" which nobody should really rely on. The canonical
way would be to use
https://commons.wikimedia.org/wiki/Special:Redirect/file/30C3_Commons_Machinery_2.jpg

Which generates a redirect to
https://upload.wikimedia.org/wikipedia/commons/b/ba/30C3_Commons_Machinery_2.jpg

To get a thumbnail, you can directly manipulate that URL, by inserting "thumb/"
and the desired size in the correct location (maybe Special:Redirect can do that
for you, but I do not know how):

https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/30C3_Commons_Machinery_2.jpg/640px-30C3_Commons_Machinery_2.jpg

HTH
Daniel
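(Editorial aside: a minimal Python sketch of the scheme Daniel describes - the MD5-based /b/ba/ path plus the thumb/.../640px-... pattern. It assumes the standard public upload.wikimedia.org layout shown above and ignores edge cases such as non-image formats or very long names.)

    import hashlib
    import urllib.parse

    def commons_urls(filename, width=640):
        name = filename.replace(" ", "_")               # titles are stored with underscores
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        prefix = digest[0] + "/" + digest[0:2]          # e.g. "b/ba"
        quoted = urllib.parse.quote(name)
        base = "https://upload.wikimedia.org/wikipedia/commons"
        original = "%s/%s/%s" % (base, prefix, quoted)
        thumbnail = "%s/thumb/%s/%s/%dpx-%s" % (base, prefix, quoted, width, quoted)
        return original, thumbnail

    # commons_urls("30C3_Commons_Machinery_2.jpg") reproduces the two URLs above.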

Re: Hashing Wikimedia Commons

Jeremy Baron
On Thu, Sep 4, 2014 at 10:40 AM, Daniel Kinzler <[hidden email]> wrote:
> To get a thumbnail, you can directly manipulate that URL, by inserting "thumb/"
> and the desired size in the correct location (maybe Special:Redirect can do that
> for you, but I do not know how):
>
> https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/30C3_Commons_Machinery_2.jpg/640px-30C3_Commons_Machinery_2.jpg

You should use one of the standard sizes that we're already using anyway.

(maybe a MediaViewer size; not sure what those are offhand)

-Jeremy

Re: Hashing Wikimedia Commons

Jean-Frédéric
In reply to this post by Daniel Kinzler


If I am not mistaken you can use thumb.php to get the needed thumb?
<https://commons.wikimedia.org/w/thumb.php?f=Example.jpg&width=100>

(That’s what I used in my CommonsDownloader [1])

[1] <https://github.com/Commonists/CommonsDownloader/blob/master/commonsdownloader/thumbnaildownload.py>


Hope that helps,
--
Jean-Frédéric
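(Editorial aside: fetching through thumb.php as Jean-Frédéric suggests is a one-liner; the sketch below assumes only the f and width parameters visible in the URL above. Note the caveats about thumb.php raised later in the thread.)

    import requests

    resp = requests.get(
        "https://commons.wikimedia.org/w/thumb.php",
        params={"f": "Example.jpg", "width": 100},
        timeout=30,
    )
    resp.raise_for_status()
    with open("Example_100px.jpg", "wb") as fh:
        fh.write(resp.content)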

Re: Hashing Wikimedia Commons

Daniel Schwen

I was told thumb.php is evil (for lack of caching).
I'm using special:redirect with the width=640 parameter.
Daniel



Re: Hashing Wikimedia Commons

Gergo Tisza
On Thu, Sep 4, 2014 at 3:05 PM, Daniel Schwen <[hidden email]> wrote:

> I was told thumb.php is evil (for lack of caching).
> I'm using special:redirect with the width=640 parameter.

I'm not sure there is a difference (both hit PHP and neither will recreate the image if it exists already), but thumb.php will result in an error if the size is too large, while Special:Redirect will return the original file, so probably that's the cleanest solution:


(this form will work better if the file name contains a question mark)
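(Editorial aside: the link Gergo posted here is not preserved in this archive. Purely as an illustration, the query-string form of Special:Redirect with the width parameter mentioned earlier in the thread might be used like this; the redirect target is the actual thumbnail on upload.wikimedia.org.)

    import requests

    resp = requests.get(
        "https://commons.wikimedia.org/w/index.php",
        params={
            "title": "Special:Redirect/file/30C3_Commons_Machinery_2.jpg",
            "width": 640,
        },
        timeout=30,            # requests follows the redirect by default
    )
    resp.raise_for_status()
    print(resp.url)            # final thumbnail URL on upload.wikimedia.org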

Re: Hashing Wikimedia Commons

Gergo Tisza
In reply to this post by Jonas Öberg
On Thu, Sep 4, 2014 at 12:26 PM, Jonas Öberg <[hidden email]> wrote:
> But in order to do so, we need information from Commons, and we want to make this as easy on the WMF servers as possible, so we'd appreciate some help and pointers. What we're looking at retrieving is information about (1) title, (2) author, (3) license, and (4) thumbnails of medium size.

Not sure what you mean by title (the filename might be the closest thing - files have a description, but that's long and can contain arbitrary HTML), but for author/license you can use the imageinfo/extmetadata API. [1][2] (The author name might also contain HTML, and in many cases a complex structure like a listing of multiple authors.)

The ongoing structured data project [3] will hopefully make it easier to process image attribution metadata.

> From what we can understand, you can request any size thumbnail of an image simply by prefixing the file name with the size you want (like 640px-Filename.jpg). But it seems really silly to always request 640px, for instance, since the WMF servers would then need to generate that size specifically for us whenever it doesn't already exist.
>
> What we'd find much more appealing is to be able to determine, before making the call, which sizes already exist and can be retrieved without the WMF servers needing to rescale them for us. And while the viewer on Commons does seem to offer thumbnails in various sizes, we can't seem to get that information from any API.

There is no way to know whether a size exists already (other than trying). Standardizing some common sizes is an ongoing effort, so the result is not set in stone yet, but we will probably end up with the standard screen widths, such as 640 and 800 pixels; various tools use those already.


[2] https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&titles=File:30C3_Commons_Machinery_2.jpg&prop=imageinfo&iiprop=extmetadata&iiextmetadatafilter=ObjectName|Artist|LicenseShortName
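(Editorial aside: a minimal Python version of the query in [2], run against commons.wikimedia.org instead of en.wikipedia.org - both work for Commons files - with iiprop=url and iiurlwidth added so the same call also returns a 640px thumbnail URL, as in DJ Hartman's example later in the thread.)

    import requests

    params = {
        "format": "json",
        "action": "query",
        "titles": "File:30C3_Commons_Machinery_2.jpg",
        "prop": "imageinfo",
        "iiprop": "extmetadata|url",
        "iiurlwidth": 640,
        "iiextmetadatafilter": "ObjectName|Artist|LicenseShortName",
    }
    data = requests.get("https://commons.wikimedia.org/w/api.php",
                        params=params, timeout=30).json()
    for page in data["query"]["pages"].values():
        info = page["imageinfo"][0]
        meta = info["extmetadata"]
        print(meta.get("Artist", {}).get("value"),
              meta.get("LicenseShortName", {}).get("value"),
              info.get("thumburl"))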

Re: Hashing Wikimedia Commons

Daniel Schwen-2
In reply to this post by Gergo Tisza
> I'm not sure there is a difference (both hit PHP and neither will recreate
> the image if it exists already), but thumb.php will result in an error if

Well, here is what Krinkle wrote me
http://commons.wikimedia.org/w/index.php?title=Help_talk:FastCCI&diff=127317992&oldid=125445681

Cheers,
Daniel

Re: Hashing Wikimedia Commons

Gergo Tisza
On Thu, Sep 4, 2014 at 5:56 PM, Daniel Schwen <[hidden email]> wrote:
>> I'm not sure there is a difference (both hit PHP and neither will recreate
>> the image if it exists already), but thumb.php will result in an error if

> Well, here is what Krinkle wrote me
> http://commons.wikimedia.org/w/index.php?title=Help_talk:FastCCI&diff=127317992&oldid=125445681

The relevant bit is around line 300 in thumb.php (after the comment "Stream the file if it exists already"). Also, it does return a 304 when appropriate (of course your browser needs to send an If-Modified-Since header for that to happen, which it probably won't do). thumb.php streams the file from a PHP process, while Special:Redirect just sends the browser to a new location which is served directly by the web server, so that's indeed less overhead, especially for large files.

Anyway, thumb.php is internals, while Special:Redirect is a public URL, so it is always more appropriate to use the latter (or the API).

Re: Hashing Wikimedia Commons

Daniel Schwen
I'm wondering if thumb.php (although it internally reuses existing
thumbnails) plays nicely with the Varnish cache layer. If you get to the
point where you have to execute a PHP script, you are already
generating more load than necessary.


Re: Hashing Wikimedia Commons

Gergo Tisza
On Thu, Sep 4, 2014 at 10:38 PM, Daniel Schwen <[hidden email]> wrote:
> I'm wondering if thumb.php (although it internally reuses existing
> thumbnails) plays nicely with the Varnish cache layer. If you get to the
> point where you have to execute a PHP script, you are already
> generating more load than necessary.

Special:Redirect is a PHP script as well. If you want to avoid bypassing Varnish, you need to guess the thumbnail URL on your own. That's very hard if you want to cover all the edge cases.

Re: [Multimedia] Hashing Wikimedia Commons

Jonas Öberg
In reply to this post by Jean-Frédéric
Thanks to everyone who took time to contribute here!

Let me try to sum up, from my understanding. For metadata about an
image, using the imageinfo/extmetadata API is sensible for the moment.
We're aware of the structured data project and followed the talks about
it during Wikimania, and we're quite keen to see the results when and
if it becomes useful.

For thumbnails, there's no way to know whether a given size has already
been rendered, but given that MediaViewer has a default list of widths
that correspond to popular screen resolutions[1], it's a fair bet that,
for instance, 640px and 800px would work, except when the image file is
smaller than the requested thumbnail size.

It's possible to use Special:Redirect or thumb.php to get the thumbnail
or its URL, but both are actually PHP scripts that need to run. So while
perhaps not ideal, it seems to make the most sense here to generate the
thumbnail URLs ourselves and hit the web server directly.


Sincerely,
Jonas

[1] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FMultimediaViewer/e7ea5cb25d285bc7685838a6f375bcf0d9b4b6ff/resources%2Fmmv%2Fmmv.ThumbnailWidthCalculator.js
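(Editorial aside: a rough sketch of "pick a standard width". The bucket list below is a guess for illustration only - check the MediaViewer source in [1] for the real values - and it never requests a size larger than the original.)

    STANDARD_WIDTHS = [320, 640, 800, 1024, 1280, 1920, 2560, 2880]  # hypothetical values

    def pick_width(wanted, original_width):
        # Snap up to the next bucket, but never upscale past the original.
        larger = [w for w in STANDARD_WIDTHS if w >= wanted]
        chosen = larger[0] if larger else STANDARD_WIDTHS[-1]
        return min(chosen, original_width)

    # pick_width(640, 4000) -> 640; pick_width(640, 500) -> 500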



On 4 September 2014 21:47, Derk-Jan Hartman <[hidden email]> wrote:

> Correct, better not to rely on thumb.php; the servers will just generate
> the thumb if it is not yet present at the canonical address that
> Special:Redirect can point you at.
>
> Also, almost all this info can be retrieved in one go from the api.php
> of course:
>
> http://commons.wikimedia.org/w/api.php?action=query&titles=File:30C3_Commons_Machinery_2.jpg&prop=imageinfo&iilimit=50&iiprop=sha1|url|thumbmime|extmetadata|archivename&iiurlwidth=640&iilimit=1
>
> Lists almost all the info of the latest revision of the file.
>
> DJ

Re: [Multimedia] Hashing Wikimedia Commons

Gergo Tisza
On Fri, Sep 5, 2014 at 10:21 AM, Jonas Öberg <[hidden email]> wrote:
> It's possible to use Special:Redirect or thumb.php to get the thumbnail
> or its URL, but both are actually PHP scripts that need to run. So while
> perhaps not ideal, it seems to make the most sense here to generate the
> thumbnail URLs ourselves and hit the web server directly.

That can work if you don't mind getting errors in some % of cases where the file format would require a more complex URL scheme. Otherwise, you have three options:
  • just use Special:Redirect. Depending on your request frequency, it might be fine. We can ask ops what speed limit would be reasonable; for bots using the API, the general recommendation is 12 requests per minute.
  • scrape file description pages. The HTML page is cached in varnish and it has links to various standard image sizes, so you won't hit PHP this way; of course, HTML scraping is not the most reliable way of retrieving data.
  • use the API in batches. You can retrieve the information (including thumbnail URL) for 500 files in a single request (5000 if you get a bot flag):
https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&titles=File:30C3_Commons_Machinery_1.jpg|File:30C3_Commons_Machinery_2.jpg|File:30C3_Commons_Machinery_3.jpg&prop=imageinfo&iiprop=extmetadata|url&iiextmetadatafilter=ObjectName|Artist|LicenseShortName&iiurlwidth=640

IMO the last option is the cleanest one.
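(Editorial aside: a sketch of the batched query in the third option. Titles are joined with "|", up to 50 per request without a bot flag - see the correction later in the thread - and the same iiurlwidth/extmetadata parameters as above apply.)

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def fetch_batch(titles, width=640):
        params = {
            "format": "json",
            "action": "query",
            "titles": "|".join(titles),      # up to 50 titles per request
            "prop": "imageinfo",
            "iiprop": "extmetadata|url",
            "iiurlwidth": width,
            "iiextmetadatafilter": "ObjectName|Artist|LicenseShortName",
        }
        return requests.get(API, params=params, timeout=60).json()

    batch = ["File:30C3_Commons_Machinery_1.jpg",
             "File:30C3_Commons_Machinery_2.jpg",
             "File:30C3_Commons_Machinery_3.jpg"]
    print(list(fetch_batch(batch)["query"]["pages"]))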

Re: [Multimedia] Hashing Wikimedia Commons

Jonas Öberg
Hi Gergo,


> That can work if you don't mind getting errors in some % of cases where the
> file format would require a more complex URL scheme.

I forgot to mention an important aspect: at this point we're only
concerned with JPG and PNG formats, which I suppose should be fairly
uncomplicated.

> use the API in batches. You can retrieve the information (including
> thumbnail URL) for 500 files in a single request (5000 if you get a bot
> flag):

That's a neat idea - I didn't know that the API took multiple file
names in one query. If we could do 500 files per request, 10-12 per
minute, that's more than adequate - but it feels like something we
should validate with ops?

Sincerely,
Jonas

Re: [Multimedia] Hashing Wikimedia Commons

Gergo Tisza
On Fri, Sep 5, 2014 at 1:54 PM, Jonas Öberg <[hidden email]> wrote:
> That's a neat idea - I didn't know that the API took multiple file
> names in one query. If we could do 500 files per request, 10-12 per
> minute, that's more than adequate - but it feels like something we
> should validate with ops?

12 per minute is the default setting for the standard bot framework; there are lots of bots processing at that speed with the maximum allowed item limit. I don't think you need to ask anyone before doing that. If you want to be extra nice, you can use the maxlag parameter.
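(Editorial aside: a minimal politeness sketch combining the ~12 requests/minute guideline with the maxlag parameter Gergo mentions. When the databases are lagged, the API replies with a "maxlag" error; the client should wait and retry.)

    import time
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def polite_get(params, lag=5, pause=5.0):
        params = dict(params, format="json", maxlag=lag)
        while True:
            data = requests.get(API, params=params, timeout=60).json()
            if data.get("error", {}).get("code") == "maxlag":
                time.sleep(pause)        # servers are lagged; back off and retry
                continue
            time.sleep(pause)            # ~12 requests per minute overall
            return data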

Re: [Multimedia] Hashing Wikimedia Commons

Daniel Schwen
In reply to this post by Gergo Tisza


> use the API in batches. You can retrieve the information (including thumbnail URL) for 500 files in a single request (5000 if you get a bot flag):

It's actually 50, and 500 with a bot flag.
Unless the API docs aren't up to date and this was changed since the last time I tried (a few months ago).
Daniel


Re: [Multimedia] Hashing Wikimedia Commons

Gergo Tisza
On Fri, Sep 5, 2014 at 4:53 PM, Daniel Schwen <[hidden email]> wrote:

>> use the API in batches. You can retrieve the information (including thumbnail URL) for 500 files in a single request (5000 if you get a bot flag):

> It's actually 50, and 500 with a bot flag.
> Unless the API docs aren't up to date and this was changed since the last time I tried (a few months ago).


You are right, I was confusing it with the revision-per-image limit. 

Re: [Multimedia] Hashing Wikimedia Commons

Federico Leva (Nemo)
IMHO worrying about load for such a small number of downloads (less than
25 million) is silly. Just use a sensible, common size, for instance
800px wide, which is the default. There's also an outdated list of
requested thumbnail sizes at
<https://www.mediawiki.org/wiki/Requests_for_comment/Standardized_thumbnails_sizes/thumb_sizes_requested>.
Alternatively you can download/mirror the whole of
https://archive.org/details/wikimediacommons and do some scaling
yourself. :P

Nemo

Re: [Multimedia] Hashing Wikimedia Commons

Jonas Öberg
Thanks Federico! I love working with projects where 25 million
downloads is a small number :-)

Sincerely,
Jonas

