commons.wikimedia.org allowing directory indexes and web robots

commons.wikimedia.org allowing directory indexes and web robots

Alexandre Dulaunoy-5
Hi All,

Commons.wikimedia.org is growing and provides a fairly complete set
of media files, including a lot of interesting historical documents.
Contributors rely on the availability and persistence of
commons.wikimedia.org, but currently the full export is only
available on download.wikimedia.org (OK, not today ;-).

I was wondering if it would be possible to allow web robots to access
http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror
the media files. As this is plain HTTP, the mirroring could benefit from
HTTP's caching mechanisms, whereas a single large dump containing all
the media files is more difficult to cache and update.
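
To illustrate the caching point: a mirror that remembers each file's
ETag and Last-Modified values can revalidate with a conditional GET and
skip the download when the server answers 304 Not Modified. A minimal
sketch in Python, assuming the server sends those validators; the
function names and the layout of the cached metadata are made up for
illustration.

```python
import urllib.error
import urllib.request


def conditional_headers(cached):
    """Build revalidation headers from validators saved on a previous fetch.

    `cached` is a hypothetical dict such as
    {"etag": '"abc"', "last_modified": "Sat, 18 Jul 2009 12:00:00 GMT"}.
    """
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def fetch_if_changed(url, cached):
    """Return (body, validators), or (None, cached) if the copy is still fresh."""
    request = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(request) as response:
            validators = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
            return response.read(), validators
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the mirrored copy
            return None, cached
        raise
```

Only files that actually changed are transferred again, which is the
advantage over re-downloading one monolithic dump.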

Maybe this could allow a more distributed backup approach to ensure
the resilience of commons.wikimedia.org?

Thanks a lot for your work,

adulau

--
--                   Alexandre Dulaunoy (adulau) -- http://www.foo.be/
--                             http://www.foo.be/cgi-bin/wiki.pl/Diary
--         "Knowledge can create problems, it is not through ignorance
--                                that we can solve them" Isaac Asimov

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: commons.wikimedia.org allowing directory indexes and web robots

David Gerard-2
2009/7/18 Alexandre Dulaunoy <[hidden email]>:

> I was wondering if it would be possible to allow web robots to access
> http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror
> the media files. As this is pure HTTP, the mirroring could benefit from
> the caching mechanisms of HTTP object (instead of having a large dump
> containing all the media files, that is more difficult to cache/update).


I see lots of files on upload.wikimedia.org on Google Image Search
already. Is that actually forbidden by our robots.txt?
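
One way to answer that kind of question is to run the rules through a
robots.txt parser. A small sketch using Python's standard
urllib.robotparser; the rules below are invented for the example and are
not the real robots.txt of upload.wikimedia.org, and the "MirrorBot"
user agent is likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the real robots.txt.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /wikipedia/commons/archive/
"""


def allowed(rules_text, user_agent, url):
    """Return True if `rules_text` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)
```

Against a live site one would call set_url() and read() on the parser
instead of parsing a string.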

It'd actually be better if Google properly indexed text pages whose
name ends in .jpg or whatever ... but they're aware we'd like that, so
it's up to them.


- d.


Re: commons.wikimedia.org allowing directory indexes and web robots

Alexandre Dulaunoy-5
On Sat, Jul 18, 2009 at 3:20 PM, David Gerard<[hidden email]> wrote:

> 2009/7/18 Alexandre Dulaunoy <[hidden email]>:
>
>> I was wondering if it would be possible to allow web robots to access
>> http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror
>> the media files. As this is pure HTTP, the mirroring could benefit from
>> the caching mechanisms of HTTP object (instead of having a large dump
>> containing all the media files, that is more difficult to cache/update).
>
>
> I see lots of files on upload.wikimedia.org on Google Image Search
> already. Is that actually forbidden by our robots.txt?
>
> It'd actually be better if Google properly indexed text pages whose
> name ends in .jpg or whatever ... but they're aware we'd like that, so
> it's up to them.

But directory listing of the upload directory is currently disallowed, for example:

http://upload.wikimedia.org/wikipedia/commons/8/8c/

Of course, a bot can still get the media files by
following the links from other pages, but that is not a
very handy or efficient way to make an exact mirror of just
the current media file repository.

Would it be possible to enable directory listing of
http://upload.wikimedia.org/wikipedia/commons
and its subdirectories?
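
For comparison, a mirror that already knows the file names (say, from a
database dump) can build the URLs without any directory listing. A
sketch, assuming the usual MediaWiki hashed upload layout, in which the
two directory levels are the first one and the first two hex digits of
the MD5 of the file name with spaces replaced by underscores. The helper
name is made up, and the name is assumed to be already normalised (first
letter capitalised).

```python
import hashlib
from urllib.parse import quote

BASE = "http://upload.wikimedia.org/wikipedia/commons"


def upload_url(file_name, base=BASE):
    """Guess the upload URL for a file name under the assumed hashed layout."""
    key = file_name.replace(" ", "_")
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # Two directory levels: first hex digit, then first two hex digits.
    return f"{base}/{digest[0]}/{digest[:2]}/{quote(key)}"
```

This only covers current versions of files whose names are known; it
does not discover files, which is what a directory index would add.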

Thanks for the feedback,


--
--                   Alexandre Dulaunoy (adulau) -- http://www.foo.be/
--                             http://www.foo.be/cgi-bin/wiki.pl/Diary
--         "Knowledge can create problems, it is not through ignorance
--                                that we can solve them" Isaac Asimov


Re: commons.wikimedia.org allowing directory indexes and web robots

Robert Rohde
In reply to this post by David Gerard-2
On Sat, Jul 18, 2009 at 6:20 AM, David Gerard<[hidden email]> wrote:

> 2009/7/18 Alexandre Dulaunoy <[hidden email]>:
>
>> I was wondering if it would be possible to allow web robots to access
>> http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror
>> the media files. As this is pure HTTP, the mirroring could benefit from
>> the caching mechanisms of HTTP object (instead of having a large dump
>> containing all the media files, that is more difficult to cache/update).
>
>
> I see lots of files on upload.wikimedia.org on Google Image Search
> already. Is that actually forbidden by our robots.txt?
>
> It'd actually be better if Google properly indexed text pages whose
> name ends in .jpg or whatever ... but they're aware we'd like that, so
> it's up to them.

Which is why my personal wiki is patched to translate the ".jpg" into
"_jpg", etc. for all references to image description pages.

-Robert Rohde


Re: commons.wikimedia.org allowing directory indexes and web robots

David Gerard-2
2009/7/18 Robert Rohde <[hidden email]>:
> On Sat, Jul 18, 2009 at 6:20 AM, David Gerard<[hidden email]> wrote:

>> It'd actually be better if Google properly indexed text pages whose
>> name ends in .jpg or whatever ... but they're aware we'd like that, so
>> it's up to them.

> Which is why my personal wiki is patched to translate the ".jpg" into
> "_jpg", etc. for all references to image description pages.


Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need
for _jpg to be the default image page name and .jpg an alias for
backward compatibility? That'd be really helpful in all sorts of ways
- on pretty much any website *not* running MediaWiki, something ending
".jpg" is going to be the image, not a text page.


- d.


Re: commons.wikimedia.org allowing directory indexes and web robots

Dmitriy Sintsov
* David Gerard <[hidden email]> [Sat, 18 Jul 2009 14:55:28 +0100]:

> 2009/7/18 Robert Rohde <[hidden email]>:
> > On Sat, Jul 18, 2009 at 6:20 AM, David Gerard<[hidden email]>
> wrote:
>
> >> It'd actually be better if Google properly indexed text pages whose
> >> name ends in .jpg or whatever ... but they're aware we'd like that,
> so
> >> it's up to them.
>
> > Which is why my personal wiki is patched to translate the ".jpg"
into
> > "_jpg", etc. for all references to image description pages.
>
>
> Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need
> for _jpg to be the default image page name and .jpg an alias for
> backward compatibility? That'd be really helpful in all sorts of ways
> - on pretty much any website *not* running MediaWiki, something ending
> ".jpg" is going to be the image, not a text page.
>
I am not sure that the underscore is the most suitable character,
because in MediaWiki it's interchangeable with the space character. The
type of the document should be determined by its MIME type. If Google
uses the web path "extension" (which is meaningless, by the way, because
that's a virtual path) instead of the MIME type to determine whether the
page should be indexed, that's an amazing bug for Google.
Dmitriy


Re: commons.wikimedia.org allowing directory indexes and web robots

David Gerard-2
2009/7/20 Dmitriy Sintsov <[hidden email]>:
> * David Gerard <[hidden email]> [Sat, 18 Jul 2009 14:55:28 +0100]:

>> Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need
>> for _jpg to be the default image page name and .jpg an alias for
>> backward compatibility? That'd be really helpful in all sorts of ways
>> - on pretty much any website *not* running MediaWiki, something ending
>> ".jpg" is going to be the image, not a text page.

> I am not sure that the underscore is the most suitable character,
> because in MediaWiki it's interchangable with the space character.


Or whatever, as long as it isn't ending .jpg .


> The
> type of the document should be determined by it's mime-type. If Google
> uses the web path "extension" (which is meaningless by the way, because
> that's a virtual path) instead of mime-type to determine whether the
> page should be indexed, that's amazing bug for Google.


Yes, it's an amazing bug for Google. It's also the way they do it.


- d.


Re: commons.wikimedia.org allowing directory indexes and web robots

Nikola Smolenski
In reply to this post by Dmitriy Sintsov
Dmitriy Sintsov wrote:
> because in MediaWiki it's interchangable with the space character. The
> type of the document should be determined by it's mime-type. If Google
> uses the web path "extension" (which is meaningless by the way, because
> that's a virtual path) instead of mime-type to determine whether the
> page should be indexed, that's amazing bug for Google.

It's a necessary evil however, because of a number of servers that serve
incorrect mime types. IIRC, previously Google didn't index our images at
all, but later added MediaWiki as an exception.


Re: commons.wikimedia.org allowing directory indexes and web robots

Robert Rohde
In reply to this post by David Gerard-2
On Sat, Jul 18, 2009 at 6:55 AM, David Gerard<[hidden email]> wrote:

> 2009/7/18 Robert Rohde <[hidden email]>:
>> On Sat, Jul 18, 2009 at 6:20 AM, David Gerard<[hidden email]> wrote:
>
>>> It'd actually be better if Google properly indexed text pages whose
>>> name ends in .jpg or whatever ... but they're aware we'd like that, so
>>> it's up to them.
>
>> Which is why my personal wiki is patched to translate the ".jpg" into
>> "_jpg", etc. for all references to image description pages.
>
>
> Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need
> for _jpg to be the default image page name and .jpg an alias for
> backward compatibility? That'd be really helpful in all sorts of ways
> - on pretty much any website *not* running MediaWiki, something ending
> ".jpg" is going to be the image, not a text page.

Honestly, I'm not entirely sure how large a hack it would be.  In my
particular case, the hack I added was in the link generator very late
in the process and fairly ugly.  Really, one should probably be
modifying Title.php to change the URL form of the image description page
name, but there may be unexpected dependencies associated with doing
that.  Also, in my hack I used Apache's mod_rewrite to get the right
destination for incoming queries, but for a general application this
should also be handled by Title.php or something similar.

As long as one requires that files have explicit type suffixes (e.g.
".jpg", ".svg", etc.), one can use the allowed list to determine which
file names to translate without generating conflicts.  I believe all
Wikimedia sites require such suffixes, but MediaWiki can be configured
to remove that requirement, which would need to be considered for a
general application (i.e. what to do if the configuration allows
separate files named "Foo_jpg" and "Foo.jpg").
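
The translation described above can be sketched in a few lines of
Python. This is not the actual patch (that lives in PHP and
mod_rewrite); the extension list and function names are assumptions made
for the example.

```python
# Hypothetical allowed-extension list; a real wiki would read its own config.
ALLOWED_EXTENSIONS = ("jpg", "jpeg", "png", "gif", "svg")


def mask_extension(title, allowed=ALLOWED_EXTENSIONS):
    """Turn 'Foo.jpg' into 'Foo_jpg' when the suffix is an allowed type."""
    stem, dot, ext = title.rpartition(".")
    if dot and stem and ext.lower() in allowed:
        return f"{stem}_{ext}"
    return title


def unmask_extension(url_title, allowed=ALLOWED_EXTENSIONS):
    """Inverse mapping for incoming requests: 'Foo_jpg' -> 'Foo.jpg'."""
    stem, sep, ext = url_title.rpartition("_")
    if sep and stem and ext.lower() in allowed:
        return f"{stem}.{ext}"
    return url_title
```

The inverse is only unambiguous while every file carries an allowed
suffix: a file really named "Foo_jpg" would be mapped to "Foo.jpg",
which is exactly the conflict mentioned above.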

I'd definitely like to see MediaWiki include a configuration option so
that image description pages would handle the suffix differently, so
maybe I'll think about it a bit more.  This is one of a half dozen or
so issues that I end up repatching on my local install every time I
decide to upgrade.

-Robert Rohde


Re: commons.wikimedia.org allowing directory indexes and web robots

Aryeh Gregor
In reply to this post by Dmitriy Sintsov
On Mon, Jul 20, 2009 at 6:20 AM, Dmitriy Sintsov<[hidden email]> wrote:
> I am not sure that the underscore is the most suitable character,
> because in MediaWiki it's interchangable with the space character. The
> type of the document should be determined by it's mime-type. If Google
> uses the web path "extension" (which is meaningless by the way, because
> that's a virtual path) instead of mime-type to determine whether the
> page should be indexed, that's amazing bug for Google.

Maybe they don't retrieve the page in the first place, because they
don't want to waste bandwidth and processing time getting images.  It
would be rather a waste to send dozens or hundreds of HEAD requests on
every Flickr page (or whatever) just to make sure that all those
things ending in a suffix universally accepted to designate images
really *are* images.

On Mon, Jul 20, 2009 at 9:45 AM, Nikola Smolenski<[hidden email]> wrote:
> It's a necessary evil however, because of a number of servers that serve
> incorrect mime types.

Well, that would make no difference if you actually downloaded the
content, or the first handful of bytes.  It's easy to *very* reliably
distinguish binary image data from HTML if you get to look at the
first several bytes of the file.
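
As a rough illustration of how little is needed, here is a Python
sketch that classifies content from its leading bytes. The signatures
for JPEG, PNG and GIF are the well-known ones; the HTML check is a
deliberately crude assumption, and the function name is made up.

```python
# Leading "magic" bytes of a few common image formats.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff(first_bytes):
    """Classify content from its first bytes, ignoring URL and headers."""
    for signature, mime in IMAGE_SIGNATURES.items():
        if first_bytes.startswith(signature):
            return mime
    # Crude HTML check: skip leading whitespace, compare case-insensitively.
    head = first_bytes.lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return None  # unknown: fall back to other heuristics
```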

Anyway, I think the "right" way to do this would be to omit the suffix
from the page name entirely, treating the format as an implementation
detail.  That way you could, for instance, upload an SVG over a PNG or
a PNG over a JPEG, and have all users be automatically updated without
manually changing the references.  This does get a little confusing
when you consider totally different types of media, though, like audio
or video or PDF or whatnot.  If NS_FILE (NS_IMAGE) weren't hardcoded
in thirty million places both in code and templates, I might suggest
different namespaces for different media types instead of one unified
File: namespace, but that seems impractical at this point.
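
To make the "implementation detail" idea concrete, here is a toy data
model in Python. It is purely hypothetical and not how MediaWiki stores
files: the page name carries no suffix, and each upload records its own
format.

```python
from dataclasses import dataclass, field


@dataclass
class MediaItem:
    """One on-wiki name with a history of stored revisions (hypothetical model)."""
    title: str
    revisions: list = field(default_factory=list)  # (stored_name, mime) pairs

    def upload(self, stored_name, mime):
        """Record a new revision; the format may differ from earlier ones."""
        self.revisions.append((stored_name, mime))

    @property
    def current(self):
        """The most recent (stored_name, mime) pair."""
        return self.revisions[-1]


item = MediaItem("Foo")
item.upload("a1b2.png", "image/png")
item.upload("c3d4.svg", "image/svg+xml")  # new format, same title
```

Every page that embeds File:Foo would pick up the SVG without being
edited.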


Re: commons.wikimedia.org allowing directory indexes and web robots

Chad
If making different namespaces per file type isn't
feasible, what about making [[File:]] smarter, so that it
automatically picks the best way to embed the media: an <img> tag for
images, video/audio tags (or fallbacks)
as appropriate? That way, if a file is changed (e.g. Ogg
over PNG), it still displays properly.

This is all dependent on stripping extensions from
uploads, though.
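
A rough Python sketch of that dispatch, assuming the wiki knows the
stored file's MIME type; the function name and the exact markup are
made up for the example, and real output would need sizes, fallbacks
and so on.

```python
from html import escape


def render_embed(url, mime):
    """Pick an HTML element from the stored file's MIME type (illustrative)."""
    src = escape(url, quote=True)
    major = mime.split("/", 1)[0]
    if major == "image":
        return f'<img src="{src}">'
    if major == "video":
        return f'<video src="{src}" controls></video>'
    if major == "audio":
        return f'<audio src="{src}" controls></audio>'
    return f'<a href="{src}">{src}</a>'  # fallback: plain link
```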

-Chad


Re: commons.wikimedia.org allowing directory indexes and web robots

Aryeh Gregor
On Mon, Jul 20, 2009 at 10:28 PM, Chad<[hidden email]> wrote:
> If making different namespaces per filetype wasn't
> feasible, what about making [[File:]] better so it
> automatically returns the best way to use the media--<img> tag for images,
> video/audio tags (or fallbacks)
> as appropriate. That way if a file is changed (ie ogg
> over png) it still displays properly.

That would work, it would just be slightly confusing.  Consider the
following wikimarkup:

[[File:Anthony Kennedy]]

Is that a picture, a video, a sound clip?  Maybe even a PDF of a book
by that name?  It's not obvious.  Of course, hitting preview would
immediately tell you, so I think the confusion would be tolerable.
But it would be a little weird.


Re: commons.wikimedia.org allowing directory indexes and web robots

Robert Rohde
In reply to this post by David Gerard-2
On Sat, Jul 18, 2009 at 6:55 AM, David Gerard<[hidden email]> wrote:

> 2009/7/18 Robert Rohde <[hidden email]>:
>> On Sat, Jul 18, 2009 at 6:20 AM, David Gerard<[hidden email]> wrote:
>
>>> It'd actually be better if Google properly indexed text pages whose
>>> name ends in .jpg or whatever ... but they're aware we'd like that, so
>>> it's up to them.
>
>> Which is why my personal wiki is patched to translate the ".jpg" into
>> "_jpg", etc. for all references to image description pages.
>
>
> Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need
> for _jpg to be the default image page name and .jpg an alias for
> backward compatibility? That'd be really helpful in all sorts of ways
> - on pretty much any website *not* running MediaWiki, something ending
> ".jpg" is going to be the image, not a text page.

I've created bug:19874 for this enhancement request.  As it is of
personal utility to me, I may also work on writing a patch, though
probably not in the near term.

-Robert Rohde


Re: commons.wikimedia.org allowing directory indexes and web robots

Mark Clements (HappyDog)
In reply to this post by Robert Rohde

"Robert Rohde" <[hidden email]> wrote in message
news:[hidden email]...
> On Sat, Jul 18, 2009 at 6:55 AM, David Gerard<[hidden email]> wrote:
>> Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need
>> for _jpg to be the default image page name and .jpg an alias for
>> backward compatibility? That'd be really helpful in all sorts of ways
>> - on pretty much any website *not* running MediaWiki, something ending
>> ".jpg" is going to be the image, not a text page.
[SNIP]
>
> As long as one requires files have explicit type suffixes (e.g.
> ".jpg", ".svg", etc), one can use the allowed list to determine what
> file names to translate without generating conflicts.  I believe all
> Wikimedia sites require such suffixes, but Mediawiki can be configured
> to remove that requirement which would need to be considered for a
> general application (i.e. what to do if the configuration allows
> separate files named "Foo_jpg" and "Foo.jpg")

How about making the type a prefix?  E.g.  Image:jpg:Foo (or File:jpg:Foo).
It would be a bit more work I suspect, but would retain the information that
the extension gives as well as resolving the indexing problem which started
this thread.

It would also be theoretically possible to use a one-to-many mapping here
(so uploading Foo.jpg or Foo.jpeg results in the same File:jpg:Foo - though
there might be an issue with naming conflicts here).  Or go even more
general, e.g. File:Image:Foo, File:Video:Foo, etc. All the name-conflict
problems that would occur in any attempt to resolve the "changing image file
format" problem would obviously apply here, but that might be better than
dropping the type information given by an extension altogether.
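
The prefix scheme, including the one-to-many mapping, might look like
this in Python. It is only a sketch of the proposal; the alias table and
function name are assumptions.

```python
# Hypothetical alias table: several upload extensions share one prefix.
PREFIX_FOR_EXTENSION = {"jpg": "jpg", "jpeg": "jpg", "png": "png", "gif": "gif"}


def prefixed_title(upload_name, namespace="File"):
    """Map an upload name such as 'Foo.jpeg' to 'File:jpg:Foo'."""
    stem, dot, ext = upload_name.rpartition(".")
    prefix = PREFIX_FOR_EXTENSION.get(ext.lower()) if dot and stem else None
    if prefix is None:
        return f"{namespace}:{upload_name}"  # unknown type: leave unchanged
    return f"{namespace}:{prefix}:{stem}"
```

Foo.jpg and Foo.jpeg both land on File:jpg:Foo here, which shows the
naming conflict such a mapping introduces.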

- Mark Clements (HappyDog)




Re: commons.wikimedia.org allowing directory indexes and web robots

Brion Vibber-3
On 7/27/09 6:05 AM, Mark Clements (HappyDog) wrote:
>> As long as one requires files have explicit type suffixes (e.g.
>> ".jpg", ".svg", etc), one can use the allowed list to determine what
>> file names to translate without generating conflicts.  I believe all
>> Wikimedia sites require such suffixes, but Mediawiki can be configured
>> to remove that requirement which would need to be considered for a
>> general application (i.e. what to do if the configuration allows
>> separate files named "Foo_jpg" and "Foo.jpg")
>
> How about making the type a prefix?

Really there's no reason to expose the file type at all at this level;
it's an implementation detail which shouldn't be forced onto the on-wiki
identifier for a media item.

-- brion


Re: commons.wikimedia.org allowing directory indexes and web robots

Mark Clements (HappyDog)
"Brion Vibber" <[hidden email]> wrote in message
news:[hidden email]...

> On 7/27/09 6:05 AM, Mark Clements (HappyDog) wrote:
>>> As long as one requires files have explicit type suffixes (e.g.
>>> ".jpg", ".svg", etc), one can use the allowed list to determine what
>>> file names to translate without generating conflicts.  I believe all
>>> Wikimedia sites require such suffixes, but Mediawiki can be configured
>>> to remove that requirement which would need to be considered for a
>>> general application (i.e. what to do if the configuration allows
>>> separate files named "Foo_jpg" and "Foo.jpg")
>>
>> How about making the type a prefix?
>
> Really there's no reason to expose the file type at all at this level;
> it's an implementation detail which shouldn't be forced onto the on-wiki
> identifier for a media item.
>

This suggestion was meant to solve the problem of serving HTML documents that
appear to have a non-HTML file extension (e.g. page names which end in .jpg).
This would provide a one-to-one mapping that is more sensible (imho) than
replacing the final period with another character (underscore was
suggested).

There is a separate issue of whether this information should be removed
altogether, which in theory is a good idea, but leads to a practical problem
of naming conflicts which has not yet been addressed to my knowledge (e.g.
when "File:Foo.jpg" and "File:Foo.gif" both exist).  If that could be
resolved then yes, the file's type information would not be required (either
as a file extension, or elsewhere).  In this case though, my second
suggestion ("File:Video:Foo", "File:Image:Bar") might be useful, as we
probably still want to know what type of file we are embedding, even if we
don't need to know the exact file format.

In the absence of a solution to the second problem, and in light of the fact
that solutions to the first issue are currently being considered, I think my
original suggestion is quite relevant, and has the added bonus of still
being useful if/when the second problem is solved.

- Mark Clements (HappyDog)




Re: commons.wikimedia.org allowing directory indexes and web robots

Aryeh Gregor
On Mon, Jul 27, 2009 at 11:46 AM, Mark Clements
(HappyDog)<[hidden email]> wrote:
> There is a separate issue of whether this information should be removed
> altogether, which in theory is a good idea, but leads to a practical problem
> of naming conflicts which has not yet been addressed to my knowledge (e.g.
> when "File:Foo.jpg" and "File:Foo.gif" both exist).

We'd have to keep the existing page names working anyway to avoid
breaking everything, so we could just use the new convention for new
uploads.  Then old files could be moved to appropriate names manually
over time, with conflicts resolved manually.

> If that could be
> resolved then yes, the file's type information would not be required (either
> as a file extension, or elsewhere).  In this case though, my second
> suggstion ("File:Video:Foo", "File:Image:Bar") might be useful, as we
> probably still want to know what type of file we are embedding, even if we
> don't need to know the exact file format.

Maybe, but there are potentially a lot of very specific formats, like
DjVu, PDF, document formats, spreadsheets... It might be simplest
to just drop the format info totally and assume it won't cause big
problems if the format isn't obvious from the name.


Re: commons.wikimedia.org allowing directory indexes and web robots

Robert Rohde
On Mon, Jul 27, 2009 at 9:47 AM, Aryeh
Gregor<[hidden email]> wrote:

> On Mon, Jul 27, 2009 at 11:46 AM, Mark Clements
> (HappyDog)<[hidden email]> wrote:
>> There is a separate issue of whether this information should be removed
>> altogether, which in theory is a good idea, but leads to a practical problem
>> of naming conflicts which has not yet been addressed to my knowledge (e.g.
>> when "File:Foo.jpg" and "File:Foo.gif" both exist).
>
> We'd have to keep the existing page names working anyway to avoid
> breaking everything, so we could just use the new convention for new
> uploads.  Then old files could be moved to appropriate names manually
> over time, with conflicts resolved manually.
<snip>

Forgive me, but that seems like you'd be asking the community to do a
huge amount of work (moving images and updating [[File:]] calls) in
order to address a problem that could be solved on purely technical
grounds.

At least, that is, if we agree that the problem is principally having
"misleading" file extensions in URLs for HTML content.
http://en.wikipedia.org/wiki/File:Foo.jpg could be translated into any
number of things through a completely unambiguous one-to-one mapping
that would remove or mask the ".jpg" extension.  That is something I
would like to see and encourage.

However, if the "solution" is to manually rename everything to an
extension-less structure then I would be opposed to that.  It is more
trouble than it is worth, and does little to benefit the existing
wikis owned by Wikimedia or those controlled by third parties.
Personally, I think it is actually a good thing that files have
file-like nomenclature in general.  It seems less confusing for
uploaders that way.  I'd prefer the current nomenclature be preserved
but some additional system of naming, minus the confusing extensions, be
placed on top as the default.

-Robert Rohde


Re: commons.wikimedia.org allowing directory indexes and web robots

Aryeh Gregor
On Mon, Jul 27, 2009 at 1:03 PM, Robert Rohde<[hidden email]> wrote:
> Forgive me, but that seems like you'd be asking the community to do a
> huge amount of work (moving images and updating [[File:]] calls) in
> order to address a problem that could be solved on purely technical
> grounds.

Well, we could automatically move everything to the new names and
leave redirects, and only leave conflicts to be manually resolved.

> At least, that is, if we agree that the problem is principally having
> "misleading" file extensions in urls for HTML content.

I don't think that's the only problem we should be solving here.  We
should also allow an image in one format to be replaced by an image in
another format without changing the name.  That requires getting rid
of the extensions entirely.  (Allowing an image to be replaced by a
video or such, however, wouldn't make much sense.)


Re: commons.wikimedia.org allowing directory indexes and web robots

Robert Rohde
On Mon, Jul 27, 2009 at 10:09 AM, Aryeh
Gregor<[hidden email]> wrote:
> On Mon, Jul 27, 2009 at 1:03 PM, Robert Rohde<[hidden email]> wrote:
>> Forgive me, but that seems like you'd be asking the community to do a
>> huge amount of work (moving images and updating [[File:]] calls) in
>> order to address a problem that could be solved on purely technical
>> grounds.
>
> Well, we could automatically move everything to the new names and
> leave redirects, and only leave conflicts to be manually resolved.

Last I checked image moves weren't actually working and I thought
image redirects were disabled as well, though I could be mistaken.
Those are technical issues that it would be good to solve for their
own reasons though.

However, if redirects work in the traditional way, then that wouldn't
solve my problem.  Namely, File:Foo.jpg might draw its content from
File:Foo, but it still lives at a URL for File:Foo.jpg.  In order to
avoid the extensions in URLs you need to change where the links
actually go, which at present requires changing each actual
call.

Beyond that, it strikes me that it would be very hard to do the kind
of automatic resolution you have in mind without breaking things.  You
can arguably do it on a single wiki, but with Commons in the mix it
gets considerably harder.  If Commons has Foo.jpg and Enwiki has
Foo.gif, then who gets to live at File:Foo?  Either you have to check
for conflicts across all wikis or you are likely to end up with at
least some wikis with unexpected links.
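
The cross-wiki check could be sketched like this in Python. The data is
invented and the function name is made up; a real run would read the
title lists from each wiki's image table.

```python
from collections import defaultdict


def find_conflicts(titles_by_wiki):
    """Group titles that collapse to the same extension-less name.

    `titles_by_wiki` maps a wiki name to its file titles, e.g.
    {"commons": ["Foo.jpg"], "enwiki": ["Foo.gif"]} (made-up data).
    Returns {stem: [(wiki, title), ...]} for stems claimed more than once.
    """
    claims = defaultdict(list)
    for wiki, titles in titles_by_wiki.items():
        for title in titles:
            stem, dot, _ext = title.rpartition(".")
            # Titles without a usable suffix keep their full name as the key.
            claims[stem if dot and stem else title].append((wiki, title))
    return {stem: owners for stem, owners in claims.items() if len(owners) > 1}
```

Anything this returns would still need a human (or a policy) to decide
who gets the bare name.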

>> At least, that is, if we agree that the problem is principally having
>> "misleading" file extensions in urls for HTML content.
>
> I don't think that's the only problem we should be solving here.  We
> should also allow an image in one format to be replaced by an image in
> another format without changing the name.  That requires getting rid
> of the extensions entirely.  (Allowing an image to be replaced by a
> video or such, however, wouldn't make much sense.)

From my point of view that's a much less annoying bug than the link
formatting one.  Not to mention that there are cases when it is
beneficial to explicitly provide different file formats for the same
material (for example if an SVG renders poorly on the WMF system).

They aren't antagonistic proposals though.  One could make changes
that allow extension-agnostic file names, e.g. File:Foo, while also
coming up with an automatic way to hide file extensions on existing
works regardless of whether they are moved/redirected.  Any reason not
to allow both?  As mentioned earlier in the thread, I've been patching
my own wikis to mask extensions for years.

-Robert Rohde
