403 with content to Python?

403 with content to Python?

Andre Engels
Through a message on another list, I found that when one tries to
reach wikipedia (or at least wikipedia-en) specifying the User Agent
as "Python-urllib/1.17", the server gives a "403 Forbidden" response,
together with the content of the page.

Two questions:
1. Why is this User Agent getting this response? If I remember
correctly, this was installed in the early days of the pywikipediabot,
when Brion wanted to block it because it had a programming error
causing it to fetch each page twice (sometimes even more?). If that is
the actual reason, I see no reason why it should still be active years
afterward...
2. If this User Agent is really to be blocked, why do we still provide
the content of the page that is forbidden?

--
André Engels, [hidden email]

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: 403 with content to Python?

Daniel Kinzler
Andre Engels wrote:
> 1. Why is this User Agent getting this response? If I remember
> correctly, this was installed in the early days of the pywikipediabot,
> when Brion wanted to block it because it had a programming error
> causing it to fetch each page twice (sometimes even more?). If that is
> the actual reason, I see no reason why it should still be active years
> afterward...

The default UA strings of many popular libraries (Python, Perl, Java, PHP...)
are blocked from accessing Wikipedia.

The idea is to force people to provide a descriptive UA string for their
particular tool, so it can be blocked selectively when it breaks. Ideally, the
UA string should give some way of contacting the operator, or at least the author.

Good netizenship dictates: don't use default UA strings; use something unique
and descriptive. Always, not only when accessing Wikipedia.
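
For Python specifically, a minimal sketch of what this could look like with the
standard library's urllib.request. The tool name, operator, and contact address
below are hypothetical placeholders, not anything from this thread; the point is
only that the default "Python-urllib/x.y" string gets replaced by something that
identifies the tool and its operator.

```python
import urllib.request

# Hypothetical tool name and contact details; substitute your own.
USER_AGENT = "ExampleWikiTool/0.1 (operator: Jane Doe; contact: jane@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a descriptive User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("https://en.wikipedia.org/wiki/Foo")
    # Request stores header names in capitalized form ("User-agent").
    print(req.get_header("User-agent"))
    # To actually fetch the page: urllib.request.urlopen(req).read()
```

Passing the header on the Request object keeps the UA next to the call that
uses it; installing a custom opener would be an alternative if every request in
the program should carry it.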

As to why the content is served anyway: I don't know. It may even be a bug, or
it may be intentional. It would be interesting to hear about this.

-- daniel


Re: 403 with content to Python?

Brion Vibber-3
In reply to this post by Andre Engels
On 1/23/09 2:36 AM, Andre Engels wrote:
> Two questions:
> 1. Why is this User Agent getting this response? If I remember
> correctly, this was installed in the early days of the pywikipediabot,
> when Brion wanted to block it because it had a programming error
> causing it to fetch each page twice (sometimes even more?). If that is
> the actual reason, I see no reason why it should still be active years
> afterward...

This has nothing to do with pywikipediabot.

We too frequently encountered poorly-written bots and site-scrapers
which slammed the servers too hard and caused problems. Blocking default
UAs of common libraries cut these incidents down dramatically, and helps
encourage thoughtful bot writers to put specific information into their
user-agent string, making it possible to track them down more easily if
they are problematic.
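
To illustrate the idea only (this is not the actual Squid rule set, which is not
shown in this thread), a rough sketch of a prefix-based filter in Python. Only
"Python-urllib/" comes from the thread; the Perl and Java prefixes are
assumptions about what those libraries send by default, and treating an empty
UA as blocked is likewise an assumption.

```python
# Prefixes are illustrative assumptions; the real blocklist is not given here.
DEFAULT_UA_PREFIXES = (
    "Python-urllib/",  # mentioned in this thread
    "libwww-perl/",    # assumed default of Perl's LWP
    "Java/",           # assumed default of Java's URLConnection
)


def is_default_library_ua(user_agent):
    """Return True if the UA is empty or starts with a known library default."""
    if not user_agent or not user_agent.strip():
        return True  # assumption: a missing UA is treated like a default one
    ua = user_agent.strip().lower()
    return any(ua.startswith(prefix.lower()) for prefix in DEFAULT_UA_PREFIXES)


def status_for(user_agent):
    """Return the HTTP status such a filter would answer with: 403 or 200."""
    return 403 if is_default_library_ua(user_agent) else 200


if __name__ == "__main__":
    for ua in ("Python-urllib/1.17", "ExampleWikiTool/0.1 (jane@example.org)"):
        print(ua, "->", status_for(ua))
```

Matching on the start of the string rather than anywhere in it means a custom
UA that merely mentions a language would pass this particular sketch.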


> 2. If this User Agent is really to be blocked, why do we still provide
> the content of the page that is forbidden?

We don't; you get a big fat Wikimedia-customized error page with a
generic multilingual message, and this bit somewhere in the middle:

    <!-- Technical details of the error; shows all the time, with any language -->
    <div class="TechnicalStuff">
     <bdo dir="ltr">
      Request: GET http://en.wikipedia.org/wiki/Foo, from 69.17.48.227 via sq24.wikimedia.org (squid/2.6.STABLE21) to  ()<br/>
      Error: ERR_ACCESS_DENIED, errno [No Error] at Fri, 23 Jan 2009 17:59:46 GMT
     </bdo>
     <div id="AdditionalTechnicalStuff"></div>
    </div>
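
This probably explains the "403 with content" observation: an HTTP error
response can carry a body, and a client may hand that body back even though the
status is 403. A small sketch in Python 3 of how urllib.request exposes this.
The local stand-in server and its error text are assumptions made so the
example runs without touching Wikipedia.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ERROR_BODY = b"<html><body>Error: ERR_ACCESS_DENIED</body></html>"

# Bypass any system proxy so the local demo server is reached directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class DenyAll(BaseHTTPRequestHandler):
    """Stand-in server that answers every GET with 403 plus an HTML body."""

    def do_GET(self):
        self.send_response(403)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(ERROR_BODY)))
        self.end_headers()
        self.wfile.write(ERROR_BODY)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(url):
    """Return (status, body); an error status still yields the body sent."""
    try:
        with _OPENER.open(url, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object, so the error page is readable.
        return err.code, err.read()


def demo():
    server = HTTPServer(("127.0.0.1", 0), DenyAll)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch("http://127.0.0.1:%d/wiki/Foo" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = demo()
    print(status, body.decode("ascii"))
```

A script that ignores the exception's status code and only looks at the body
would see HTML and could mistake the error page for page content.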

-- brion


Re: 403 with content to Python?

Marco Schuster-2
On Fri, Jan 23, 2009 at 7:03 PM, Brion Vibber <[hidden email]> wrote:

> On 1/23/09 2:36 AM, Andre Engels wrote:
>> Two questions:
>> 1. Why is this User Agent getting this response? If I remember
>> correctly, this was installed in the early days of the pywikipediabot,
>> when Brion wanted to block it because it had a programming error
>> causing it to fetch each page twice (sometimes even more?). If that is
>> the actual reason, I see no reason why it should still be active years
>> afterward...
>
> This has nothing to do with pywikipediabot.
>
> We too frequently encountered poorly-written bots and site-scrapers
> which slammed the servers too hard and caused problems. Blocking default
> UAs of common libraries cut these incidents down dramatically, and helps
> encourage thoughtful bot writers to put specific information into their
> user-agent string, making it possible to track them down more easily if
> they are problematic.
>
Is there any list of those UAs or UA parts available?
I had this problem some time ago with my bot which used a custom UA
string and got access denied, so I changed its UA to Firefox as I had
no nerves to track down WHICH part of the UA triggered the filter.

Marco


Re: 403 with content to Python?

Platonides
Marco Schuster wrote:
> Is there any list of those UAs or UA parts available?
> I had this problem some time ago with my bot which used a custom UA
> string and got access denied, so I changed its UA to Firefox as I had
> no nerves to track down WHICH part of the UA triggered the filter.
>
> Marco

Perhaps they were blocking *your* bot?
Faking your user agent to match a browser makes sysadmins assume bad faith...



Re: 403 with content to Python?

Marco Schuster-2
On Sat, Jan 24, 2009 at 3:48 PM, Platonides <[hidden email]> wrote:

> Marco Schuster wrote:
>> Is there any list of those UAs or UA parts available?
>> I had this problem some time ago with my bot which used a custom UA
>> string and got access denied, so I changed its UA to Firefox as I had
>> no nerves to track down WHICH part of the UA triggered the filter.
>>
>> Marco
>
> Perhaps they were blocking *your* bot?
> Faking your user agent to match a browser makes sysadmins assume bad faith...
No, the bot had not been active before (and I'm pretty sure the UA hadn't been used before either).

Marco


Re: 403 with content to Python?

Aryeh Gregor
In reply to this post by Marco Schuster-2
On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster
<[hidden email]> wrote:
> Is there any list of those UAs or UA parts available?
> I had this problem some time ago with my bot which used a custom UA
> string and got access denied, so I changed its UA to Firefox as I had
> no nerves to track down WHICH part of the UA triggered the filter.

Just change it to something like "YourBotName, run by Marco Schuster
<[hidden email]>".  That will certainly avoid any filters, and
provide the desired info.

I don't know why the error page doesn't give this info already.  The
current message only confuses people and -- if they can figure out
it's UA-based -- tempts them to mimic browser UA strings.  That stands
a good chance of getting your IP address blocked if it's noticed (and
it's pretty easy to tell when a script is pretending to be a browser,
if you look at the whole HTTP request).

The error message is in SVN, but it's the same message provided for
all errors.  I don't know what sort of config would need to be done
to get a custom message for this error.


Re: 403 with content to Python?

Marco Schuster-2
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Jan 25, 2009 at 1:11 AM, Aryeh Gregor <[hidden email]> wrote:

> On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster
> <[hidden email]> wrote:
>> Is there any list of those UAs or UA parts available?
>> I had this problem some time ago with my bot which used a custom UA
>> string and got access denied, so I changed its UA to Firefox as I had
>> no nerves to track down WHICH part of the UA triggered the filter.
>
> Just change it to something like "YourBotName, run by Marco Schuster
> <[hidden email]>".  That will certainly avoid any filters, and
> provide the desired info.
I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered
the filters.

> I don't know why the error page doesn't give this info already.  The
> current message only confuses people and -- if they can figure out
> it's UA-based -- tempts them to mimic browser UA strings.
Anyone skilled enough to write a bot is skilled enough to find that out, IMO.
Anyway, the error message should also say which part of the UA is forbidden.

Marco
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: http://getfiregpg.org

iD8DBQFJe7C4W6S2GapJUuQRAvcgAJ9YY1N0ckE9DzqG21K45teAiG1QVQCfcGBJ
hFtOQisDPnYlLyXjTwKaTTI=
=iuTY
-----END PGP SIGNATURE-----


Re: 403 with content to Python?

Platonides
Simetrical wrote:
> Just change it to something like "YourBotName, run by Marco Schuster
> <[hidden email]>".  That will certainly avoid any filters, and
> provide the desired info.

The email should be in a From: header. Although I don't know if it's
logged or not.
In general, anyone responsible enough to set a From: header (with their
valid email) shouldn't get automatically blocked.


Marco Schuster wrote:
> I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered
> the filters.

Perhaps the mention of "php", although I'm not being blocked when using
that UA, so I can't test.



Re: 403 with content to Python?

Aryeh Gregor
On Sun, Jan 25, 2009 at 8:50 AM, Platonides <[hidden email]> wrote:
> The email should be in a From: header. Although I don't know if it's
> logged or not.
> In general, anyone responsible enough to set a From: header (with their
> valid email) shouldn't get automatically blocked.

A From: header?  In HTTP?  What standard specifies that header's
existence and semantics?  It's not at [[List of HTTP headers]].


Re: 403 with content to Python?

Marco Schuster-2
In reply to this post by Platonides
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Jan 25, 2009 at 2:50 PM, Platonides <[hidden email]> wrote:
> Marco Schuster wrote:
>> I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered
>> the filters.
>
> Perhaps the mention of "php", although I'm not being blocked when using
> that UA, so I can't test.

Yeah, I'm also not blocked anymore...nice to hear that. But again,
it'd be nice to see in an error message what part of the UA triggered
the filter and why this part is blocked.
Brion, do you have a list of blocked UA (parts)?

Marco
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: http://getfiregpg.org

iD4DBQFJfJOQW6S2GapJUuQRAiwgAJdXucmjZ4d9BToMAnK3uKuzq3ooAJ4mFGFZ
AeFuiPnC+cSzTuseHDtAUg==
=OwNP
-----END PGP SIGNATURE-----


Re: 403 with content to Python?

Platonides
In reply to this post by Aryeh Gregor
Aryeh Gregor wrote:
> On Sun, Jan 25, 2009 at 8:50 AM, Platonides <[hidden email]> wrote:
>> The email should be in a From: header. Although I don't know if it's
>> logged or not.
>> In general, anyone responsible enough to set a From: header (with their
>> valid email) shouldn't get automatically blocked.
>
> A From: header?  In HTTP?  What standard specifies that header's
> existence and semantics?  It's not at [[List of HTTP headers]].

I also thought it was a mistake when I first saw it in the HTTP
article on Wikipedia.

RFC 2616 (HTTP/1.1) section 14.22

   The From request-header field, if given, SHOULD contain an Internet
   e-mail address for the human user who controls the requesting user
   agent. The address SHOULD be machine-usable, as defined by "mailbox"
   in RFC 822 [9] as updated by RFC 1123 [8]:

       From   = "From" ":" mailbox

   An example is:

       From: [hidden email]

   This header field MAY be used for logging purposes and as a means for
   identifying the source of invalid or unwanted requests. It SHOULD NOT
   be used as an insecure form of access protection. The interpretation
   of this field is that the request is being performed on behalf of the
   person given, who accepts responsibility for the method performed. In
   particular, robot agents SHOULD include this header so that the
   person responsible for running the robot can be contacted if problems
   occur on the receiving end.

   The Internet e-mail address in this field MAY be separate from the
   Internet host which issued the request. For example, when a request
   is passed through a proxy the original issuer's address SHOULD be
   used.

   The client SHOULD NOT send the From header field without the user's
   approval, as it might conflict with the user's privacy interests or
   their site's security policy. It is strongly recommended that the
   user be able to disable, enable, and modify the value of this field
   at any time prior to a request.
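
For a bot written in Python, sending this header could look roughly like the
sketch below, using urllib.request from the standard library. The tool name and
e-mail address are hypothetical placeholders, and whether the receiving servers
log or act on the From header is not something this sketch can tell you.

```python
import urllib.request

# Hypothetical values; substitute your own tool name and mailbox.
BOT_USER_AGENT = "ExampleWikiTool/0.1"
OPERATOR_EMAIL = "operator@example.org"


def build_identified_request(url):
    """Return a Request carrying both a User-Agent and a From header."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": BOT_USER_AGENT, "From": OPERATOR_EMAIL},
    )


if __name__ == "__main__":
    req = build_identified_request("https://en.wikipedia.org/wiki/Foo")
    # header_items() lists the headers that would be sent with the request.
    for name, value in req.header_items():
        print("%s: %s" % (name, value))
```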



Re: 403 with content to Python?

Bugzilla from andrew@epstone.net
In reply to this post by Marco Schuster-2
On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster
<[hidden email]> wrote:

> Brion, do you have a list of blocked UA (parts)?

Squid configuration files are available at
http://noc.wikimedia.org/conf. It should be in there.

--
Andrew Garrett


Re: 403 with content to Python?

Aryeh Gregor
In reply to this post by Platonides
On Sun, Jan 25, 2009 at 4:42 PM, Platonides <[hidden email]> wrote:

> I also thought it was a mistake when I first saw it in the HTTP
> article on Wikipedia.
>
> RFC 2616 (HTTP/1.1) section 14.22
>
>   The From request-header field, if given, SHOULD contain an Internet
>   e-mail address for the human user who controls the requesting user
>   agent. The address SHOULD be machine-usable, as defined by "mailbox"
>   in RFC 822 [9] as updated by RFC 1123 [8]:
> ...

Well, since I doubt most people have ever heard of that, it's probably
not logged.


Re: 403 with content to Python?

Platonides
In reply to this post by Bugzilla from andrew@epstone.net
Andrew Garrett wrote:
> On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster
> <[hidden email]> wrote:
>
>> Brion, do you have a list of blocked UA (parts)?
>
> Squid configuration files are available at
> http://noc.wikimedia.org/conf. It should be in there.

Which of them are for the squids? I think the server config there is
just for the Apaches.

