commons.wikimedia.org allowing directory indexes and web robots

Re: commons.wikimedia.org allowing directory indexes and web robots

Brion Vibber-3
On 7/27/09 10:03 AM, Robert Rohde wrote:
> Forgive me, but that seems like you'd be asking the community to do a
> huge amount of work (moving images and updating [[File:]] calls) in
> order to address a problem that could be solved on purely technical
> grounds.

There's no technical need to change anything; ***Google already knows
what our site looks like and has indexed our image pages just fine for
years***.

We're just talking about what would look nicer going forward, which
would be to do things more sanely and not spam a file extension onto the
on-wiki page name when it's really not necessary.

-- brion

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: commons.wikimedia.org allowing directory indexes and web robots

Robert Rohde
On Mon, Jul 27, 2009 at 11:12 AM, Brion Vibber<[hidden email]> wrote:
> On 7/27/09 10:03 AM, Robert Rohde wrote:
>> Forgive me, but that seems like you'd be asking the community to do a
>> huge amount of work (moving images and updating [[File:]] calls) in
>> order to address a problem that could be solved on purely technical
>> grounds.
>
> There's no technical need to change anything; ***Google already knows
> what our site looks like and has indexed our image pages just fine for
> years***.

Google indexes the WMF this way.  ***They do not index third-party
MediaWiki sites this way.***

Compare:

http://www.google.com/search?q=file:*.jpg+site:wikimedia.org

Which shows no end of *.jpg image description pages on Commons, to

http://www.google.com/search?q=file:*.jpg+site:mediawiki.org
http://www.google.com/search?q=file:*.jpg+site:memory-alpha.org
http://www.google.com/search?q=file:*.jpg+site:stargate.wikia.com

Which show no *.jpg image description pages on mediawiki.org,
memory-alpha.org, or stargate.wikia.com.

So congratulations, Google treats the WMF specially, but the rest of the
MediaWiki user base, myself included, still have a problem that we
would like to see solved.

-Robert Rohde


Re: commons.wikimedia.org allowing directory indexes and web robots

Brion Vibber-3
On 7/27/09 11:37 AM, Robert Rohde wrote:
> Google indexes the WMF this way.  ***They do not index third-party
> MediaWiki sites this way.***
[snip]
> So congratulations, Google treats the WMF specially, but the rest of the
> MediaWiki user base, myself included, still have a problem that we
> would like to see solved.

Feel free to rename your files as you like once sane naming is supported
natively. :)

-- brion


Re: commons.wikimedia.org allowing directory indexes and web robots

Brion Vibber-3
In reply to this post by Robert Rohde
On 7/27/09 10:39 AM, Robert Rohde wrote:

> On Mon, Jul 27, 2009 at 10:09 AM, Aryeh
> Gregor<[hidden email]>  wrote:
>> On Mon, Jul 27, 2009 at 1:03 PM, Robert Rohde<[hidden email]>  wrote:
>>> Forgive me, but that seems like you'd be asking the community to do a
>>> huge amount of work (moving images and updating [[File:]] calls) in
>>> order to address a problem that could be solved on purely technical
>>> grounds.
>>
>> Well, we could automatically move everything to the new names and
>> leave redirects, and only leave conflicts to be manually resolved.
>
> Last I checked image moves weren't actually working and I thought
> image redirects were disabled as well, though I could be mistaken.
> Those are technical issues that it would be good to solve for their
> own reasons though.

Image redirects are quite active. Renames were re-disabled due to
breakage with images that had missing past versions (e.g., a lot of
those in production), which I think has since been fixed to handle this
case cleanly.

Anyway, don't consider that an impediment.

> However, if redirects work in the traditional way, then it wouldn't
> solve my problem.  Namely File:Foo.jpg might draw its content from
> File:Foo, but it still lives at a URL for File:Foo.jpg.  In order to
> avoid the extensions in urls you need to change where the links
> actually go, which at the present time requires changing each actual
> call.

You wouldn't care if anybody indexed File:Foo.jpg, since the content
would be indexed at File:Foo.

> Beyond that, it strikes me that it would be very hard to do the kind
> of automatic resolution you have in mind without breaking things.  You
> can arguably do it on a single wiki, but with Commons in the mix it
> gets considerably harder.  If Commons has Foo.jpg and Enwiki has
> Foo.gif, then who gets to live at File:Foo?  Either you have to check
> for conflicts across all wikis or you are likely to end up with at
> least some wikis with unexpected links.

This is hardly an insurmountable problem; automated renames can easily
detect the existence of such conflicts and either leave them for
eventual manual attention or give them disambiguating suffixes.
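
A minimal sketch of that kind of conflict detection, in Python. This is
not MediaWiki code: the (wiki, filename) input format and the function
name plan_renames are assumptions made up for illustration. It strips the
extension from each name, renames anything whose stripped name is unique
across all wikis, and sets the rest aside for manual attention.

```python
from collections import defaultdict


def plan_renames(titles):
    """Split (wiki, filename) pairs into safe renames and conflicts.

    Hypothetical helper: returns (renames, conflicts), where `renames`
    maps each (wiki, filename) to its extension-free name and `conflicts`
    maps each contested base name to the pairs competing for it.
    """
    groups = defaultdict(list)
    for wiki, name in titles:
        base, dot, _ext = name.rpartition(".")
        if not dot:  # no extension at all, keep the name as-is
            base = name
        groups[base].append((wiki, name))

    renames, conflicts = {}, {}
    for base, members in groups.items():
        if len(members) == 1:
            renames[members[0]] = base  # unique across wikis: safe to move
        else:
            conflicts[base] = members  # leave for manual resolution
    return renames, conflicts
```

With Commons holding Foo.jpg and Enwiki holding Foo.gif, both land in
`conflicts["Foo"]`, while an uncontested Bar.png is simply renamed to Bar.
Giving conflicting files disambiguating suffixes instead would be a small
extension of the second loop.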

> They aren't antagonistic proposals though.  One could make changes
> that allow extension agnostic file names, e.g. File:Foo, while also
> coming up with an automatic way to hide file extensions on existing
> works regardless of whether they are moved/redirected.  Any reason not
> to allow both?

There's no particular reason to do the latter when its results are
equivalent to the former.

-- brion


Re: commons.wikimedia.org allowing directory indexes and web robots

Aryeh Gregor
In reply to this post by Robert Rohde
On Mon, Jul 27, 2009 at 1:39 PM, Robert Rohde<[hidden email]> wrote:
> Last I checked image moves weren't actually working and I thought
> image redirects were disabled as well, though I could be mistaken.
> Those are technical issues that it would be good to solve for their
> own reasons though.

Well, I was actually thinking that in this case we could do a proper
301 if you try directly visiting the page, and actually change all
generated links.  Since upload of new files under names with
extensions would be forbidden, the redirect would be immutable and
there would be no need to support the redirect notice.
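
Roughly, and only as a sketch: assume a hypothetical `renamed` mapping
from old titles (with extension) to new ones. Neither the mapping nor the
function names below exist in MediaWiki. A direct visit to an old title
gets a 301, and generated links point straight at the new title.

```python
from urllib.parse import quote


def page_url(title):
    """Build a /wiki/ path for a title (spaces become underscores)."""
    return "/wiki/" + quote(title.replace(" ", "_"), safe=":/")


def respond(title, renamed):
    """Return (status, headers) for a direct visit to `title`.

    Old names answer with a permanent redirect and no redirect notice;
    everything else is served normally.
    """
    target = renamed.get(title)
    if target is None:
        return 200, {}
    return 301, {"Location": page_url(target)}


def link_href(title, renamed):
    """Return the href for a generated link, skipping the redirect hop."""
    return page_url(renamed.get(title, title))
```

Because uploads under the old names would be forbidden, the mapping never
changes once written, which is what makes a permanent (301) redirect safe
here.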

> Beyond that, it strikes me that it would be very hard to do the kind
> of automatic resolution you have in mind without breaking things.  You
> can arguably do it on a single wiki, but with Commons in the mix it
> gets considerably harder.  If Commons has Foo.jpg and Enwiki has
> Foo.gif, then who gets to live at File:Foo?  Either you have to check
> for conflicts across all wikis or you are likely to end up with at
> least some wikis with unexpected links.

We'd have to check for conflicts across wikis, sure.

> From my point of view that's a much less annoying bug than the link
> formatting one.

My opinion is the opposite.  The issue with indexing isn't a bug on
our side at all; it's a deficiency in how Google indexes pages.  If
Google doesn't want to needlessly retrieve zillions of images and
needs a hint that we're linking to an HTML page, then the correct fix
on our side would be to do

<a href="/wiki/File:WTM_sheila_0015.jpg" class="image"
   title="The Beth Hamedrash Hagadol congregation building"
   type="text/html">

and then find some Googlers to poke with pointy sticks if they don't
respect the type="" attribute.  We could do that immediately, in fact.
I'm sure they'd be happy to remove their special-case code.  (I
really wish they'd talk to us about things like this instead of trying
to hack around our less-than-ideal behavior...)
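
As an illustration only, here is a Python sketch of emitting such a link.
The function name file_page_link is made up and this is not how MediaWiki
builds its links; the point is just that the type hint is one extra
attribute on the opening tag.

```python
from html import escape
from urllib.parse import quote


def file_page_link(title, description):
    """Return an opening <a> tag for a file description page.

    The type="text/html" attribute is an advisory hint that the link
    target is an HTML page even though the URL ends in an image extension.
    """
    href = "/wiki/" + quote(title.replace(" ", "_"), safe=":/")
    return '<a href="{}" class="image" title="{}" type="text/html">'.format(
        escape(href, quote=True),
        escape(description, quote=True),
    )


print(file_page_link(
    "File:WTM_sheila_0015.jpg",
    "The Beth Hamedrash Hagadol congregation building",
))
```

Whether any crawler actually honors the hint is a separate question; the
attribute is advisory.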

On the other hand, having the file format be part of the page name is
a pain in the neck.

> Not to mention that there are cases when it is
> beneficial to explicitly provide different file formats for the same
> material (for example if an SVG renders poorly on the WMF system).

Then they could just be at different names, so nothing's lost.  On the
other hand it's very common for people to upload things that should be
PNG as JPEG, or things that should be SVG as PNG/JPEG, and currently
we have to rename.  Plus we can currently have Foo.jpg and Foo.jpeg
and Foo.JPG and Foo.JPEG and Foo.png and Foo.PNG and Foo.svg and
Foo.SVG, or whatever, which is unreasonable.
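
To make the duplication concrete, a small Python sketch (the helper name
and the jpeg-to-jpg alias table are assumptions for illustration) that
folds extension spelling and case, showing how many distinct titles
collapse onto one base name:

```python
from collections import defaultdict

# Assumed alias table: spellings that denote the same format.
ALIASES = {"jpeg": "jpg"}


def formats_by_base(names):
    """Group file names by base name, collecting normalized formats."""
    grouped = defaultdict(set)
    for name in names:
        base, dot, ext = name.rpartition(".")
        if not dot:  # no extension present
            base, ext = name, ""
        ext = ext.lower()
        grouped[base].add(ALIASES.get(ext, ext))
    return dict(grouped)


names = ["Foo.jpg", "Foo.jpeg", "Foo.JPG", "Foo.JPEG",
         "Foo.png", "Foo.PNG", "Foo.svg", "Foo.SVG"]
print(formats_by_base(names))
```

All eight titles share the base name Foo and reduce to just three actual
formats: jpg, png, and svg.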
