Small Amendment to User-Agent Policy


Small Amendment to User-Agent Policy

Marcel Ruiz Forns
Hi wikitech-l,

After the discussion in analytics-l [1][2] and Phabricator [3], the
Analytics team added a small amendment [4] to Wikimedia's user-agent policy
[5] with the intention of improving the quality of WMF's pageview
statistics.

The amendment asks Wikimedia bot/framework maintainers to optionally add
the word *bot* (case insensitive) to their user-agents. With that, the
analytics jobs that process request data into pageview statistics will be
better able to identify traffic generated by bots, and thus to better
isolate traffic originating from humans (the corresponding code is
already in production [6]). The convention is optional because
modifications to the user-agent can be a breaking change.
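
For illustration, a user-agent following both the current policy and this
amendment could look like the one below (a minimal Python sketch; the bot
name, user page and email address are made up, not a prescribed format):

    # Hypothetical example of an amendment-compliant user-agent.
    import requests

    headers = {
        "User-Agent": (
            "ExampleCopyBot/1.2 "  # contains "bot", so pageview jobs can tag it
            "(https://meta.wikimedia.org/wiki/User:ExampleUser; "
            "example@example.org)"
        )
    }
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": "Main Page", "format": "json"},
        headers=headers,
    )
    print(response.json()["query"]["pages"])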

Targets of this convention are: bots/frameworks that can generate
Wikimedia pageviews [7] to Wikimedia sites and/or the API and are not for
in-situ human consumption. Not targets: bots/frameworks used to assist
in-situ human consumption, and bots/frameworks that are otherwise well
known and recognizable, like WordPress, Scrapy, etc. Note that many
editing bots also generate pageviews: for example, when copying content
from one page to another, the source page is requested and the
corresponding pageview is generated.

Cheers!

[1] https://lists.wikimedia.org/pipermail/analytics/2016-January/004858.html
[2]
https://lists.wikimedia.org/pipermail/analytics/2016-February/004882.html
[3] https://phabricator.wikimedia.org/T108599
[4]
https://meta.wikimedia.org/w/index.php?title=User-Agent_policy&type=revision&diff=15343269&oldid=14833024
[5] https://meta.wikimedia.org/wiki/User-Agent_policy
[6]
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L58
[7] https://meta.wikimedia.org/wiki/Research:Page_view

--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation

Re: Small Amendment to User-Agent Policy

John Mark Vandenberg
On Mon, Mar 21, 2016 at 12:37 PM, Marcel Ruiz Forns
<[hidden email]> wrote:

> Hi wikitech-l,
>
> After the discussion in analytics-l [1][2] and Phabricator [3], the
> Analytics team added a small amendment [4] to Wikimedia's user-agent policy
> [5] with the intention of improving the quality of WMF's pageview
> statistics.
>
> The amendment asks Wikimedia bot/framework maintainers to optionally add
> the word *bot* (case insensitive) to their user-agents. With that, the
> analytics jobs that process request data into pageview statistics will be
> better able to identify traffic generated by bots, and thus to better
> isolate traffic originating from humans (the corresponding code is
> already in production [6]). The convention is optional because
> modifications to the user-agent can be a breaking change.

As asked on the talk page over a month ago with no response...
https://meta.wikimedia.org/wiki/Talk:User-Agent_policy#bot

How does adding 'bot' help over and above including email addresses
and URLs in the User-Agent?
Are there significant cases of human-driven browsers including email
addresses and URLs in the User-Agent?

If not, I am struggling to understand how the addition of 'bot'
helps better isolate traffic originating from humans.

Or, is adding 'bot' an alternative to including email addresses and
URLs?  This will also introduce some false positives, as 'bot' is a
word and word-part with meanings beyond the English "robot" sense. See
https://en.wiktionary.org/wiki/bot ,
https://en.wiktionary.org/wiki/Special:Search/intitle:bot and
https://en.wiktionary.org/wiki/Talk:-bot#Fish_suffix

> Targets of this convention are: bots/frameworks that can generate
> Wikimedia pageviews [7] to Wikimedia sites and/or the API and are not for
> in-situ human consumption. Not targets: bots/frameworks used to assist
> in-situ human consumption, and bots/frameworks that are otherwise well
> known and recognizable, like WordPress, Scrapy, etc. Note that many
> editing bots also generate pageviews: for example, when copying content
> from one page to another, the source page is requested and the
> corresponding pageview is generated.

I appreciate this attempt to devise a clearer "target" for
when a client needs to follow this new convention from the analytics
team, as requested during the discussion on the analytics list.

Regarding "Wikimedia pageviews [7] to Wikimedia sites and/or API ..
[7] https://meta.wikimedia.org/wiki/Research:Page_view"

There is very little information at
https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
I can see) regarding what use of the API is considered to be a
**page** view.  For example, is it a page view when I ask the API for
metadata only of the last revision of a page -- i.e. the page/revision
text is not included in the response?
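
For concreteness, I mean a request along these lines (hypothetical title;
rvprop asks for ids, timestamp and user, but not content, so no page text
comes back):

    https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=ids|timestamp|user&titles=Example&format=json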

"in-situ human consumption" is an interesting formula.
"in situ human" strongly implies a human is directly accessing the
content that caused the page view.

But how much 'consumption' is required?  This was briefly discussed
during the analytics list discussion, and it would be good to bring
the wider audience into this discussion.

'Navigation popups'/Hovercards is clearly "in-situ human
consumption".

But what about gadgets like Twinkle's "unlink" feature and Cat-a-lot (on
Wikimedia Commons)?  They perform batch modifications to pages, and the
in-situ human does not see the pages fetched by the JavaScript.  Based
on your responses in the analytics mailing list discussion, and this new
terminology "in-situ human consumption", I believe these gadgets
would be considered subject to the bot user-agent policy.
It would be good to identify a list of gadgets that need to be
updated to comply with the new user-agent policy.

--
John Vandenberg


Re: Small Amendment to User-Agent Policy

Marcel Ruiz Forns
John, thanks for your ideas!


> As asked on the talk page over a month ago with no response...
> https://meta.wikimedia.org/wiki/Talk:User-Agent_policy#bot


The responses are there now. Sorry about that; I forgot to watch the page.

> How does adding 'bot' help over and above including email addresses
> and URLs in the User-Agent?
> Are there significant cases of human-driven browsers including email
> addresses and URLs in the User-Agent?
> If not, I am struggling to understand how the addition of 'bot'
> helps better isolate traffic originating from humans.


No, I don't think there are cases of humans with such user-agents :] I
understand, though, that the policy asks bot maintainers to add "some way
of contacting them" to the user-agent, and some examples are given:
"(e.g. a userpage on the local wiki, a userpage on a related wiki using
interwiki linking syntax, a URI for a relevant external website, or an
email address)". I assume that the example list is not exhaustive,
meaning maintainers may also use other forms of contact info. Also,
parsing the word 'bot' is less error-prone and cheaper than parsing
long, heterogeneous strings.
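
As a rough sketch of that difference (a Python sketch for illustration,
not our production Java code [6]; the patterns are examples only):

    import re

    def looks_like_bot(user_agent):
        # The check the amendment enables: one case-insensitive word.
        return re.search(r"bot", user_agent, re.IGNORECASE) is not None

    def has_contact_info(user_agent):
        # Approximating "contains a URL, interwiki link or email address"
        # is fuzzier and has to keep chasing new formats.
        pattern = r"https?://|www\.|\[\[\w+:|@"
        return re.search(pattern, user_agent) is not None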

> Or, is adding 'bot' an alternative to including email addresses and
> URLs?


Adding 'bot' is not intended to be an alternative to, or a replacement
for, the current policy at all. It is only intended to add an optional
way for bot maintainers to help us.

> This will also introduce some false positives, as 'bot' is a
> word and word-part with meanings beyond the English "robot" sense.


I think adding the word 'bot' to the user-agent of bot-like programs is
a widely adopted convention. In fact, the word 'bot' has long been parsed
and used to tag requests as bot-originated in our jobs that process
requests into pageview stats, because many external bots include it in
their user-agents. See:
http://www.useragentstring.com/pages/Crawlerlist/
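
For example, two well-known crawlers whose user-agents already match that
convention:

    Googlebot/2.1 (+http://www.google.com/bot.html)
    Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)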

> There is very little information at
> https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
> I can see) regarding what use of the API is considered to be a
> **page** view.  For example, is it a page view when I ask the API for
> metadata only of the last revision of a page -- i.e. the page/revision
> text is not included in the response?


You're right, and this is a very good question. I fear the only ways to
look into this are browsing the actual code in:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
or asking the Research team, who owns the definition.

> But how much 'consumption' is required?  This was briefly discussed
> during the analytics list discussion, and it would be good to bring
> the wider audience into this discussion.


Another good question. It is very difficult, though, to find the perfect
wording that makes it totally clear to every bot maintainer whether their
bot is a target of the convention or not. I guess it will come down to
the good sense and willingness to help of the bot maintainers to decide
this when a bot's behavior sits on the border between "in-situ human
consumption" and "not in-situ human consumption".

> It would be good to identify a list of gadgets that need to be
> updated to comply with the new user-agent policy.


I'd say all gadgets already comply with the amendment to the user-agent
policy, because the amendment is optional. Nevertheless, our next step is
reaching out to the main bot maintainers to present them with that
option.

Thanks again for the discussion!


--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation

Re: Small Amendment to User-Agent Policy

John Mark Vandenberg
On Tue, Mar 22, 2016 at 12:44 AM, Marcel Ruiz Forns
<[hidden email]> wrote:
> ...
> I think adding the word 'bot' to the user-agent of bot-like programs is
> a widely adopted convention. In fact, the word 'bot' has long been
> parsed and used to tag requests as bot-originated in our jobs that
> process requests into pageview stats, because many external bots include
> it in their user-agents. See:
> http://www.useragentstring.com/pages/Crawlerlist/

The algorithm has been imperfect for a long time.  How long and how
imperfect doesn't matter.  Analytics is all about making good use of
imperfect algorithms to provide reasonable approximations.

However, I expect the role of Analytics is to improve the definitions
and implementation over time, not to force a bad algorithm into policy.

Pywiki*bot* has the string 'bot' in its user-agent because it is part
of the product name.
However, not every use of Pywikibot is as a crawler, or even as a bot, in
any sensible definition of those concepts.
Pywikibot is a *user agent* that knows how to be a client of the
*MediaWiki API*.  It can be used for "in-situ human consumption" or
not.

It is no different from a web browser in how it *may* be used,
although of course typically the primary goal of using Pywikibot
instead of a Web browser is to reduce the amount of human consumption
and decision making needed to perform a task.  But that is no
different from gadgets written using the JavaScript libraries that run
in the Web browser.

It can function *exactly* like a web browser: reading a special:search
results page, viewing some of the pages in the search results, and
making edits to some of them.  Each page may be viewed by a real
human, who is making decisions throughout the entire process about
which pages to view and which pages to edit.

Or it can function *exactly* like a crawler, spider, bot, etc., with
zero human consumption.

Almost every script that is packaged with Pywikibot has an automatic
and a non-automatic mode of operation.
Should we change our user-agent to "Pywikihuman" when in non-automatic
mode of operation, so that it isn't considered to be a bot by
Analytics?

Using the string 'bot' in the user-agent may have been a useful
approximation for Analytics circa 2010, but it is bad policy, and
Analytics can and should do much better than that in 2016, now that API
usage is in focus.

>> There is very little information at
>> https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
>> I can see) regarding what use of the API is considered to be a
>> **page** view.  For example, is it a page view when I ask the API for
>> metadata only of the last revision of a page -- i.e. the page/revision
>> text is not included in the response?
>
> You're right, and this is a very good question. I fear the only ways to
> look into this are browsing the actual code in:
> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java

I am not very interested in the code, which is at best an attempt at
implementing the API page view definition.  I'd like to understand the
high-level goal.

However, having read that file and the accompanying test suite, it is
my understanding that there is no definition of an API page view:
all requests to api.php, except api.php usage by the Wikipedia App
(i.e. with user-agent "WikipediaApp", used by the iOS and Android
apps), are classified as *not a page view*.

fwiw, rather than reading the source, this test data file with
expected results is a simpler way to see the current status.

https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/test/resources/pageview_test_data.csv

> or asking the Research team, who owns the definition.

Could the Research team please publish their definition of API
(api.php) page views, as they do for web (index.php) page views?

Without this, it is hard to have a serious conversation about how
changing the user-agent policy might be helpful to achieve the goal of
better classifying API page views.

--
John Vandenberg


Re: Small Amendment to User-Agent Policy

Marcel Ruiz Forns
>
> The algorithm has been imperfect for a long time.  How long and how
> imperfect doesn't matter.  Analytics is all about making good use of
> imperfect algorithms to provide reasonable approximations.
> However, I expect the role of Analytics is to improve the definitions
> and implementation over time, not to force a bad algorithm into policy.


I don't think it is a bad algorithm. Using 'bot' in the user-agent is a
widely adopted convention, so analytics code needs to account for it
(even if it is an approximation). Because of that, Wikimedia bots with
the word 'bot' in their user-agents have been tagged as bots for a long
time now, and it seems to make sense to have a line in the user-agent
policy that refers to that fact.

> It is no different from a web browser in how it *may* be used,
> although of course typically the primary goal of using Pywikibot
> instead of a Web browser is to reduce the amount of human consumption
> and decision making needed to perform a task.


That is also Analytics' view on the subject. As you said, it is an
approximation that won't fit all cases. But in general, it makes sense
to make that approximation and tag such requests as non-human.
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation