Daniel Brandt Links to Google

Daniel Brandt Links to Google

jmerkey-3

Brion/Gregory,

Would it be possible to block Brandt's article from being indexed by
search engines via the main site's robots.txt file? It would help
alleviate the current conflict and hopefully resolve the remaining
issues between Daniel and Wikipedia. Once this final issue is
addressed, I feel we will have done all we can to correct Daniel's bio
and address his concerns. That being said, there is a limit to how far
good Samaritanism should go, and I think we have done about all we can
here. The rest is up to Daniel.

Jeff
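
For reference, a robots.txt rule of the sort proposed here would look
roughly like this (the path shown is the usual English Wikipedia article
URL form; treat the entry as an illustrative sketch, not a worked-out
proposal):

    # Hypothetical robots.txt entry blocking a single article for all crawlers
    User-agent: *
    Disallow: /wiki/Daniel_Brandt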


Re: Daniel Brandt Links to Google

David Gerard-2
On 07/05/07, Jeff V. Merkey <[hidden email]> wrote:

> Would it be possible to block Brandt's article from being indexed by
> search engines via the main site's robots.txt file? [...] The rest is
> up to Daniel.


This idea has actually been suggested on wikien-l and met with a
mostly positive response - selective noindexing of some living
biographies. This would actually cut down the quantity of OTRS
complaints tremendously. The response was not unanimous, I must note.


- d.


Re: Daniel Brandt Links to Google

jmerkey-3
David Gerard wrote:

>On 07/05/07, Jeff V. Merkey <[hidden email]> wrote:
>
>>Would it be possible to block Brandt's article from being indexed by
>>search engines via the main site's robots.txt file? [...]
>
>This idea has actually been suggested on wikien-l and met with a
>mostly positive response - selective noindexing of some living
>biographies. This would actually cut down the quantity of OTRS
>complaints tremendously. The response was not unanimous, I must note.
>
>- d.

David,

I think in this case we should consider doing it, particularly if the
subject of a BLP asks us to do so as a courtesy. I realize we do not
have to accommodate anyone, but still, I think it would be the polite
and considerate thing to do.

Jeff


Re: Daniel Brandt Links to Google

Gregory Maxwell
In reply to this post by David Gerard-2
On 5/7/07, David Gerard <[hidden email]> wrote:

> On 07/05/07, Jeff V. Merkey <[hidden email]> wrote:
>
> > Would it be possible to block Brandt's article from being indexed by
> > search engines via the main site's robots.txt file? [...]
>
> This idea has actually been suggested on wikien-l and met with a
> mostly positive response - selective noindexing of some living
> biographies. This would actually cut down the quantity of OTRS
> complaints tremendously. The response was not unanimous, I must note.

This would be very useful for another use case:
Sometimes Google will pick up a cached copy of a vandalized page. In
order to purge the Google cache you need to make the page return a 404
(which deletion doesn't do), put the page into a robots.txt deny, or
include some directive in the page that stops indexing.

If we provided some directive to do one of the latter two (ideally the
last), we could use it temporarily to purge Google's cached copies of
vandalism... so it would even be useful for pages that we normally
want to keep indexed.
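
The "directive in the page" mentioned above is the standard robots meta
tag; a minimal example of what the skin would need to emit in the page
head (the "noarchive" value, which asks engines to also drop their
cached copy, is an assumption about what would be wanted here):

    <!-- asks compliant crawlers not to index the page or keep a cached copy -->
    <meta name="robots" content="noindex,noarchive" />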


Re: Daniel Brandt Links to Google

Jay Ashworth-2
On Mon, May 07, 2007 at 03:51:48PM -0400, Gregory Maxwell wrote:

> This would be very useful for another use case:
> Sometimes Google will pick up a cached copy of a vandalized page. [...]
> so it would even be useful for pages that we normally want to keep
> indexed.

With all due respect to... oh, whoever the hell thinks they deserve
some: aren't we big enough to get a little special handling from
Google? I should think that if we have a page get cached that's either
been vandalised or in some other way exposes us to liability, then, as
big as we are, and with as many high-ranked search results as we return
on Google (we're often the top hit, and *very* often in the top 20),
perhaps we might be able to access some *slightly* more prompt
deindexing facility? At least for, say, our top 10 administrators?

Cheers,
-- jra
--
Jay R. Ashworth                                                [hidden email]
Designer                          Baylink                             RFC 2100
Ashworth & Associates        The Things I Think                        '87 e24
St Petersburg FL USA      http://baylink.pitas.com             +1 727 647 1274


Re: Daniel Brandt Links to Google

Gregory Maxwell
On 5/7/07, Jay R. Ashworth <[hidden email]> wrote:
> With all due respect to... oh, whoever the hell thinks they deserve
> some: aren't we big enough to get a little special handling from
> Google? [...] perhaps we might be able to access some *slightly* more
> prompt deindexing facility? At least for, say, our top 10
> administrators?

They've provided a reasonable API: "put something on the page that you
want deindexed and tell them to visit it". I don't know what more we
could ask for, since anything else would have to involve some complex
authentication system; using meta tags in the page avoids that problem
nicely. We just need to support the API.


Re: Daniel Brandt Links to Google

David Gerard-2
On 07/05/07, Gregory Maxwell <[hidden email]> wrote:

> They've provided a reasonable API: "put something on the page that you
> want deindexed and tell them to visit it". [...] We just need to
> support the API.


Add a tickybox that admins can use to noindex a page? When the box is
ticked or unticked, the URL is sent to Google.


- d.
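
A rough sketch of how the server side of such a tickybox might hang
together. The hook and method names (OutputPageBeforeHTML,
OutputPage::setRobotPolicy()) are taken from later MediaWiki releases,
and $wgNoIndexTitles plus wfApplyNoIndexFlag() are purely hypothetical
stand-ins for whatever storage and glue the tickybox would actually use:

    <?php
    // LocalSettings.php sketch. $wgNoIndexTitles stands in for wherever the
    // admin tickybox would record its state; it is not a real setting.
    $wgNoIndexTitles = array( 'Daniel Brandt' );

    $wgHooks['OutputPageBeforeHTML'][] = 'wfApplyNoIndexFlag';

    function wfApplyNoIndexFlag( $out, &$text ) {
        global $wgNoIndexTitles;
        $title = $out->getTitle();
        if ( $title && in_array( $title->getText(), $wgNoIndexTitles, true ) ) {
            // Emits <meta name="robots" content="noindex,nofollow"> in the
            // page head, so the next crawl or a manual removal request drops
            // the page from the index.
            $out->setRobotPolicy( 'noindex,nofollow' );
        }
        return true;
    }

Ticking or unticking the box would then just toggle the list entry and,
per Gregory's note, ping Google's removal tool with the page's URL.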


Re: Daniel Brandt Links to Google

Steve Sanbeg
In reply to this post by Jay Ashworth-2
On Mon, 07 May 2007 17:12:52 -0400, Jay R. Ashworth wrote:

> On Mon, May 07, 2007 at 03:51:48PM -0400, Gregory Maxwell wrote:
>> [...]
>
> With all due respect to... oh, whoever the hell thinks they deserve
> some: aren't we big enough to get a little special handling from
> Google? I should think that if we have a page get cached that's either
> been vandalised or in some other way exposes us to liability, then, as
> big as we are, and with as many high-ranked search results as we return
> on Google (we're often the top hit, and *very* often in the top 20),
> perhaps we might be able to access some *slightly* more prompt
> deindexing facility? At least for, say, our top 10 administrators?
>
> Cheers,
> -- jra


Doesn't this assume that:

1) The foundation is willing to self-censor its content.

2) Google will recognize that a URL marked in robots.txt as if it were a
crawler trap (when it obviously isn't one) means the corresponding
censored article shouldn't be crawled or extracted from the syndication
dumps.

3) The foundation wants to set up a private channel of information
exclusively for Google.

Misusing robots.txt is somewhat dubious when you don't publish XML dumps
for syndication, and seems somewhat pointless when you do. Having a team
of censors maintaining a secret blacklist to be sent to one corporation
seems somewhat contrary to the foundation's goals.

There may be better ways to do it, but they wouldn't be as simple as
adding a name to a file; and some may consider the ramifications of
hiding an article like this to be more serious than deleting it, not
less so.


Re: Daniel Brandt Links to Google

jmerkey-3
Steve Sanbeg wrote:

>On Mon, 07 May 2007 17:12:52 -0400, Jay R. Ashworth wrote:
>
>>[...]
>
>Doesn't this assume that:
>
>1) The foundation is willing to self-censor its content.
Preventing Google from scraping content at the request of BLP subjects
is not censorship, and it sounds reasonable. It does not compromise
Wikipedia, just external engines creating biased link summaries.

>2) Google will recognize that a URL marked in robots.txt as if it were
>a crawler trap (when it obviously isn't one) means the corresponding
>censored article shouldn't be crawled or extracted from the syndication
>dumps.
>
>3) The foundation wants to set up a private channel of information
>exclusively for Google.
Insert noindex into the HTML output -- very easy and straightforward.

Jeff


Re: Daniel Brandt Links to Google

Jay Ashworth-2
In reply to this post by David Gerard-2
On Mon, May 07, 2007 at 10:49:46PM +0100, David Gerard wrote:
> On 07/05/07, Gregory Maxwell <[hidden email]> wrote:
> > They've provided a reasonable API: "put something on the page that
> > you want deindexed and tell them to visit it". [...]
>
> Add a tickybox that admins can use to noindex a page? When the box is
> ticked or unticked, the URL is sent to Google.

Well, my point was more "what, *exactly*, happens when you ask Google to
deindex a page?"  The issue sounded like "could we please get this
pulled offline *RIGHT NOW*!?", and I have no reason to believe that the
publicly available Google API for this has anything like that degree
of control...

Cheers,
-- jra
--
Jay R. Ashworth                                                [hidden email]
Designer                          Baylink                             RFC 2100
Ashworth & Associates        The Things I Think                        '87 e24
St Petersburg FL USA      http://baylink.pitas.com             +1 727 647 1274


Re: Daniel Brandt Links to Google

Gregory Maxwell
On 5/7/07, Jay R. Ashworth <[hidden email]> wrote:
> Well, my point was more "what, *exactly*, happens when you ask Google to
> deindex a page?"  The issue sounded like "could we please get this
> pulled offline *RIGHT NOW*!?", and I have no reason to believe that the
> publicly available Google API for this has anything like that degree
> of control...

It does exactly what I said it does in my prior post. It pulls pages
*right* *now*, iff you've marked them as noindexed (through any of
several mechanisms) and you tell it to hit them.


Re: Daniel Brandt Links to Google

Erik Moeller-4
In reply to this post by jmerkey-3
On 5/7/07, Jeff V. Merkey <[hidden email]> wrote:
> Brion/Gregory,
>
> Would it be possible to block Brandt's article from being indexed by
> search engines via the main site's robots.txt file?

I oppose this. We should not have articles invisible to search
engines. Either the page gets deleted, or it stays. None of this
monkey business.


--
Peace & Love,
Erik

DISCLAIMER: This message does not represent an official position of
the Wikimedia Foundation or its Board of Trustees.

"An old, rigid civilization is reluctantly dying. Something new, open,
free and exciting is waking up." -- Ming the Mechanic


Re: Daniel Brandt Links to Google

Gregory Maxwell
On 5/7/07, Erik Moeller <[hidden email]> wrote:
> I oppose this. We should not have articles invisible to search
> engines. Either the page gets deleted, or it stays. None of this
> monkey business.

Last I tried, Google wouldn't remove 'deleted' pages from the index
because they still 'existed'. :(


Re: Daniel Brandt Links to Google

Rob Church
In reply to this post by Erik Moeller-4
On 08/05/07, Erik Moeller <[hidden email]> wrote:
> I oppose this. We should not have articles invisible to search
> engines. Either the page gets deleted, or it stays. None of this
> monkey business.

I concur, for a change. ;) Wikipedia is supposed to be an information
resource, and search engines are the main means of finding such
resources on the web. While I strongly sympathise with any victims of
libel, and join the call to tighten up on it, I feel it's important to
continue supporting indexing of all articles.

I would suggest that a better idea would be to find some means of asking
Google to update its caches and indexes for a particular page "right
now" in these cases.


Rob Church


Re: Daniel Brandt Links to Google

Thomas Dalton
In reply to this post by Erik Moeller-4
> I oppose this. We should not have articles invisible to search
> engines. Either the page gets deleted, or it stays. None of this
> monkey business.

I agree. If the information is good enough to be on Wikipedia, it's
good enough to appear in a Google search. We have no obligation to
protect subjects of articles from getting upset (beyond libel law, of
course). While it is always nice to try and keep people happy, we
shouldn't be going out of our way to do so.


Re: Daniel Brandt Links to Google

Thomas Dalton
In reply to this post by Gregory Maxwell
> Last I tried, Google wouldn't remove 'deleted' pages from the index
> because they still 'existed'. :(

That's a good point. Why doesn't MediaWiki return a 404 when a page
isn't found? As far as I know, we could show exactly the same page;
it's just a matter of changing the status header from 200 to 404. The
same applies to redirects - HTTP has a "Moved Permanently" code (301),
or something similar, which we should return.
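
For what it's worth, a bare-bones sketch of the two status codes being
suggested, in plain PHP rather than in MediaWiki's own output layer
(where the real change would live); the variables are hypothetical:

    <?php
    $pageExists  = false;   // result of a title lookup (hypothetical)
    $redirectsTo = null;    // e.g. '/wiki/New_Title' for a redirect page

    if ( $redirectsTo !== null ) {
        // "Page moved": 301 Moved Permanently points crawlers at the new URL
        header( 'Location: ' . $redirectsTo, true, 301 );
    } elseif ( !$pageExists ) {
        // Same "no such page" body as today, but with a 404 status so search
        // engines drop the URL from their index
        header( 'HTTP/1.1 404 Not Found' );
        echo 'Wikipedia does not have an article with this exact name.';
    }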


Re: Daniel Brandt Links to Google

Aryeh Gregor
On 5/7/07, Thomas Dalton <[hidden email]> wrote:
> > Last I tried, Google wouldn't remove 'deleted' pages from the index
> > because they still 'existed'. :(
>
> That's a good point. Why doesn't MediaWiki return a 404 when a page
> isn't found?

http://bugzilla.wikimedia.org/show_bug.cgi?id=2585

It was tried and seemed to cause problems for some users.


Re: Daniel Brandt Links to Google

jmerkey-3
In reply to this post by Rob Church
Rob Church wrote:

>On 08/05/07, Erik Moeller <[hidden email]> wrote:
>
>>I oppose this. We should not have articles invisible to search
>>engines. Either the page gets deleted, or it stays. None of this
>>monkey business.
>
>I concur, for a change. ;) Wikipedia is supposed to be an information
>resource, and search engines are the main means of finding such
>resources on the web. While I strongly sympathise with any victims of
>libel, and join the call to tighten up on it, I feel it's important to
>continue supporting indexing of all articles.
>
>I would suggest that a better idea would be to find some means of
>asking Google to update its caches and indexes for a particular page
>"right now" in these cases.
>
>Rob Church

Rob,

Need a solution that balances both concerns. How about this: if an
article is placed under WP:OFFICE, then it gets removed from the index.
This would be manageable.

Erik,

There is such a thing as a First Amendment right of "expressive
association". There may be a sound basis for reviewing the issues of
search engines and their artificial pay-for-rankings schemes. I mean, if
WP is free content, then why again should we allow Google to reorder
our information for paying customers by scraping every template and
corner of WP? There is no "higher morality" in allowing search engines
to decide what we publish (since they filter it anyway for a fee), nor
is it "more open".

Putting some controls there to get rid of more complaints might be worth
the tradeoff, particularly since the whole "Google is Divine and Open"
view (which it really is not) is illusory - just the way Google wants
all of us to believe.

Jeff


Re: Daniel Brandt Links to Google

Platonides
In reply to this post by Aryeh Gregor
Simetrical wrote:
> On 5/7/07, Thomas Dalton <[hidden email]> wrote:
>>> Last I tried, Google wouldn't remove 'deleted' pages from the index
>>> because they still 'existed'. :(
>> That's a good point. Why doesn't MediaWiki return a 404 when a page
>> isn't found?
>
> http://bugzilla.wikimedia.org/show_bug.cgi?id=2585
>
> It was tried and seemed to cause problems for some users.

"Several users have reported being persistently unable to access
nonexistent pages" is quite vague.
Do you know what was the specific problem? Did they get a blank page? A
niced 404? A MessageBox?



Re: Daniel Brandt Links to Google

Brion Vibber-3

Platonides wrote:

> Simetrical wrote:
>> http://bugzilla.wikimedia.org/show_bug.cgi?id=2585
>>
>> It was tried and seemed to cause problems for some users.
>
> "Several users have reported being persistently unable to access
> nonexistent pages" is quite vague.
> Do you know what the specific problem was? Did they get a blank page?
> A nice 404 page? A MessageBox?

We were unable to reproduce the problem at the time, so cannot say more
specifically than that it was disruptive and there were multiple complaints.

-- brion vibber (brion @ wikimedia.org)
