Kill the bots

Kill the bots

Brian Keegan-3
Is there a way to retrieve a canonical list of bots on enwiki or elsewhere? I'm interested in omitting automated revisions (sorry Stuart!) for the purposes of building co-authorship networks. 

Grabbing everything under 'Category:All Wikipedia bots' misses some major ones like SmackBot, Cydebot, VIAFbot, Full-date unlinking bot, etc., because these bots have changed names but the redirect is not categorized, the account has been removed or deprecated, or a user appears to have removed the relevant bot categories from the page.

Can anyone advise me on how to kill all the bots in my data without having to resort to manual cleaning or hacky regex?


--
Brian C. Keegan, Ph.D.
Post-Doctoral Research Fellow, Lazer Lab
College of Social Sciences and Humanities, Northeastern University
Fellow, Institute for Quantitative Social Sciences, Harvard University
Affiliate, Berkman Center for Internet & Society, Harvard Law School

M: 617.803.6971
O: 617.373.7200
Skype: bckeegan

Re: Kill the bots

Andrew G. West
The username policy states that "*bot*" names are reserved for bots. Thus, such a regex shouldn't be too hacky, though I can't say whether some non-automated accounts slip through new user patrol. I believe the dumps include the 'user' table, and I know for sure one could get a full list via the API.

As a check on this, you could verify whether these accounts set the "bot" flag when they edit. -AW

--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Website: http://www.andrew-g-west.com
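
For reference, here is a minimal Python sketch of pulling the full flagged-bot list via the API, using the standard allusers module with augroup=bot. The requests dependency and the continuation handling are my own choices, so treat it as a starting point rather than a finished tool.

import requests

API = "https://en.wikipedia.org/w/api.php"

def flagged_bots():
    """Yield the names of all accounts currently in the 'bot' user group."""
    params = {
        "action": "query",
        "list": "allusers",
        "augroup": "bot",
        "aulimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for user in data["query"]["allusers"]:
            yield user["name"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # standard MediaWiki API continuation

bots = set(flagged_bots())
print(len(bots), "accounts currently carry the bot flag")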



Re: Kill the bots

Amir E. Aharoni

People whose last name is Abbot will be discriminated against.

And a true story: a prominent human Catalan Wikipedia editor whose username is PauCabot once skewed the results of an actual study.

So don't trust the user names alone.


Re: Kill the bots

lbenedix
In reply to this post by Andrew G. West
Here is a list of currently flagged bots:
https://en.wikipedia.org/w/index.php?title=Special:ListUsers&offset=&limit=2000&username=&group=bot

Another good place to look for bots is here:
https://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=Bots%2FRequests_for_approval&namespace=4

You should also have a look at these pages to find former bots:
https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_1
https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_2

And last but not least, there is the logging table, which you can access via Tool Labs:
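-- user-rights log entries whose parameters mention 'bot'
-- (catches both granting and removing the flag; may also match other groups containing 'bot')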
SELECT DISTINCT(log_title)
FROM logging
WHERE log_action = 'rights'
AND log_params LIKE '%bot%';

Lukas


Re: Kill the bots

Scott Hale
Very helpful, Lukas; I didn't know about the logging table.

In some recent work [1] I found many users that appeared to be bots but whose edits did not have the bot flag set. My approach was to exclude users who didn't have a break of more than 6 hours between edits over the entire month I was studying. I was only interested in users who had multiple edit sessions in the month, and so went with a straight threshold. A way to keep users with only one editing session would be to exclude users who have no break longer than X hours within an edit session lasting at least Y hours (e.g., a user who doesn't break for more than 6 hours in 5-6 days is probably not human).

Cheers,
Scott

[1] Multilinguals and Wikipedia Editing http://www.scotthale.net/pubs/?websci2014


-- 
Scott Hale
Oxford Internet Institute
University of Oxford
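
Here is a rough Python sketch of that session-gap filter. The 6-hour break and the 5-6 day span come from the description above; the input format (a list of edit timestamps per user) and the helper names are my own assumptions.

from datetime import timedelta

def looks_automated(timestamps,
                    min_break=timedelta(hours=6),
                    min_span=timedelta(days=5)):
    """True if an account edits for at least `min_span` without ever pausing
    longer than `min_break` between consecutive edits (a bot-like pattern)."""
    ts = sorted(timestamps)
    if len(ts) < 2 or ts[-1] - ts[0] < min_span:
        return False  # too little activity to judge either way
    longest_gap = max(b - a for a, b in zip(ts, ts[1:]))
    return longest_gap <= min_break

# edits_by_user maps user names to lists of datetime objects (one per edit).
def probable_humans(edits_by_user):
    return {user for user, ts in edits_by_user.items()
            if not looks_automated(ts)}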




Re: Kill the bots

R.Stuart Geiger
Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will get no mercy. :-)

But seriously, my tl;dr: instead of asking whether an account is or isn't a bot, ask whether a set of edits is or isn't automated.

Great responses so far: searching usernames for *bot will exclude non-bot users who were registered before the username policy change (although *Bot is a bit better), and the logging table is a great way to collect bot flags. However, Scott is right -- the bot flag (or *Bot username) doesn't signify a bot, it signifies a bureaucrat recognizing that a user account successfully went through the Bot Approval Group process. If I see an account with a bot flag, I can generally assume the edits that account makes are initiated by an automated software agent. This is especially the case in the main namespace. The inverse assumption is not nearly as easy: I can't assume that every edit made from an account *without* a bot flag was *not* an automated edit.

About unauthorized bots: yes, there are a relatively small number of Wikipedians who, on occasion, run fully-automated, continuously-operating bots without approval. Complicating this, if someone is going to take the time to build and run a bot, but isn't going to create a separate account for it, then it is likely that they are also using that account to do non-automated edits. Sometimes new bot developers will run an unauthorized bot under their own account during the initial stages of development, and only later in the process will they create a separate bot account and seek formal approval and flagging. It can get tricky when you exclude all the edits from an account for being automated based on a single suspicious set of edits.

Far more common are the many people who use automated batch tools like AutoWikiBrowser to support one-off tasks, like mass find-and-replace or category cleanup. Accounts powered by AWB are technically not bots, only because a human has to sit there and click "save" for every batch edit that is made. Some people will create a separate bot account for AWB work and get it approved and flagged, but many more will not bother. Then there are people using semi-automated, human-in-the-loop tools like Huggle to do vandal fighting. I find that the really hard question is whether you include or exclude these different kinds of 'cyborgs', because it really makes you think hard about what exactly you're measuring. Is someone who does a mass find-and-replace on all articles in a category a co-author of each article they edit? Is a vandal fighter patrolling the recent changes feed with Huggle a co-author of all the articles they edit when they revert vandalism and then move on to the next diff? What about somebody using rollback in the web browser? If so, what is it that makes these entities authors and ClueBot NG not an author?

When you think about it, user accounts are actually pretty remarkable in that they allow such a diverse set of uses and agents to be attributed to a single entity. So when it comes to identifying automation, I personally think it is better to shift the unit of analysis from the user account to the individual edit. A bot flag lets you assume all edits from an account are automated, but you can use a range of approaches to identifying sets of automated edits from non-flagged accounts. For instance, I have a set of regex SQL queries in the Query Library [1] which parse edit summaries for the traces that AWB, Huggle, Twinkle, rollback, etc. automatically leave by default. You can also use the edit session approach as Scott has suggested -- Aaron and I found a few unauthorized bots in our edit session study [2], and we were even using a more aggressive break, with no more than a 60-minute gap between edits. To catch short bursts of bulk edits, you could look at large numbers of edits made in a short period of time -- I'd say more than 7 main namespace edits a minute for 10 minutes would be a hard rate for even a very aggressive vandal fighter to maintain with Huggle.

I'll conclude by saying that different kinds of automated editing techniques are different ways of participating in and contributing to Wikipedia. To systematically exclude automated edits is to remove a very important, meaningful, and heterogeneous kind of activity from view. These activities constitute a core part of what Wikipedia is, particularly those forms of automation which the community has explicitly authorized and recognized. Now, we researchers inevitably have to selectively reveal and occlude -- a co-authorship network based on main namespace edits also excludes talk page discussions and conflict resolution, and this also constitutes a core part of what Wikipedia is. It isn't wrong per se to exclude automated edits, and it is certainly much worse to not recognize that they exist at all. However, I always appreciate seeing how the analysis would be different if bots were not excluded. The fact that there are these weird users which absolutely dominate a co-authorship network graph if you don't filter them out is pretty amazing, at least to me.

Best,
Stuart
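
For anyone who wants to try the per-edit approach, here is a rough Python sketch: flag edits whose summaries carry tool signatures, and flag accounts that sustain a very high edit rate. The regexes are illustrative guesses at common AWB/Huggle/Twinkle/rollback summary markers rather than the actual Query Library patterns, and the 7-edits-per-minute-for-10-minutes window follows the figure above.

import re
from collections import defaultdict
from datetime import timedelta

# Illustrative summary signatures only; real tool markers vary by version and over time.
TOOL_TRACES = re.compile(
    r"\[\[(?:WP|Project):(?:AWB|HG|TW|Huggle|Twinkle)[|\]]"   # AWB / Huggle / Twinkle tags
    r"|using \[\[Project:AWB"                                  # older AWB wording
    r"|^Reverted edits by \[\[Special:Contributions/",         # default rollback summary
    re.IGNORECASE,
)

def summary_is_tool_assisted(summary):
    return bool(summary and TOOL_TRACES.search(summary))

def burst_editors(edits, rate=7, window=timedelta(minutes=10)):
    """Accounts that make at least `rate` edits per minute sustained over `window`.
    `edits` is an iterable of (user, timestamp) pairs, ideally main-namespace only."""
    per_user = defaultdict(list)
    for user, ts in edits:
        per_user[user].append(ts)
    needed = int(rate * (window.total_seconds() / 60))  # edits required inside one window
    flagged = set()
    for user, ts in per_user.items():
        ts.sort()
        i = 0
        for j, t in enumerate(ts):
            while t - ts[i] > window:
                i += 1
            if j - i + 1 >= needed:
                flagged.add(user)
                break
    return flagged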




Re: Kill the bots

Oliver Keyes-4
Personally, I'm a big fan of Scott's method (and the associated paper, which I've been throwing about internally ;)). Stu's points are worth addressing, though, and I think his per-edit approach is probably the way to go.

TL;DR all the sensible things have already been said, I'm just +1ing them ;p.


--
Oliver Keyes
Research Analyst
Wikimedia Foundation


Re: Kill the bots

Brian Keegan-3
In reply to this post by R.Stuart Geiger
How does one cite emails in ACM proceedings format? :)


Re: Kill the bots

Taha Yasseri
From the ACM Authors Guide (http://www.acm.org/sigs/publications/sigguide-v2.2sp):

"Private communications should be acknowledged, not referenced (e.g. "[Robertson, personal communication]")."

Although this one was not quite private ;-)

.t




Re: Kill the bots

WereSpielChequers-2
In reply to this post by R.Stuart Geiger
If your bot is only running automated reports in its own userspace then it doesn't need a bot flag. But it probably won't be a very active bot, so it may not be a problem for your stats.

On the English-language Wikipedia you are going to be fairly close if you exclude all accounts which currently have a bot flag, this list of former bots (I occasionally maintain it so that the list of editors by edit count works; as of a couple of weeks ago, when I last checked, I believe it to be a comprehensive list of retired bots with 6,000 or more edits), and perhaps the individual with a very high edit count who has in the past been blocked for running unauthorised bots on his user account. (I won't name that account on-list, but since it also contains a large number of manual edits, the true answer is that you can't get an exact divide between bots and non-bots by classifying every account as either a bot or a human.)

If you are minded to treat all accounts containing the syllable "bot" as bots, then you might want to tweak that to count anyone on these two lists as human even if their name includes "bot". I check those lists occasionally to make sure that the accounts included really are human editors rather than bots.
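
Putting these suggestions together, here is a rough Python sketch of a combined account-level filter: currently flagged bots, plus a hand-maintained former-bots list, plus a crude name match, minus a whitelist of human editors with "bot" in their names. The file names are placeholders, not the specific lists mentioned above.

import re

# Crude name heuristic: catches names ending in "bot", but also Abbot/Cabot-style
# human names, hence the whitelist below.
BOT_NAME = re.compile(r"bot$", re.IGNORECASE)

def load_names(path):
    """One username per line; blank lines ignored."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Placeholder file names -- substitute your own exports of the lists discussed above.
flagged_bots = load_names("currently_flagged_bots.txt")     # e.g. Special:ListUsers, group=bot
former_bots = load_names("former_bots.txt")                 # retired/renamed bot accounts
human_whitelist = load_names("humans_with_bot_names.txt")   # e.g. PauCabot and friends

def is_probably_bot(username):
    if username in human_whitelist:
        return False
    return (username in flagged_bots
            or username in former_bots
            or bool(BOT_NAME.search(username)))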



Re: Kill the bots

Oliver Keyes-4
That would cover most of them, but runs into the problem of "you're only including the unauthorised bots written poorly enough that we've caught the operator" ;). It seems like this would be a useful topic for some piece of method-comparing research, if anyone is looking for paper ideas.


On 19 May 2014 03:30, WereSpielChequers <[hidden email]> wrote:
If your bot is only running automated reports in its own userspace then it doesn't need a bot flag. But it probably wont be a very active bot so may not be a problem for your stats

On the English language wikipedia you are going to be fairly close if you exclude all accounts which currently have a bot flag, this list of former bots (I occasionally maintain this in order for the list of editors by edit count to work, as of a couple of weeks ago when I last checked I believe it to be a comprehensive list of retired bots with 6,000 or more edits), and perhaps the individual with a very high edit count who has in the past been blocked for running unauthorised bots on his user account. (I won't name that account on list, but since it also contains a large number of manual edits, the true answer is that you can't get an exact divide between bots and non bots by classifying every account as either a bot or a human).

If you are minded to treat all accounts containing the word syllable bot as bots, then you might want to tweak that to count anyone on these two lists as human even if their name includes bot. I check those lists occasionally and make sure that the only bots included are human editors.


On 18 May 2014 20:33, R.Stuart Geiger <[hidden email]> wrote:
Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will get no mercy. :-)

But seriously, my tl;dr: instead of asking if an account is or isn't a bot, ask if a set of edits are or are not automated

Great responses so far: searching usernames for *bot will exclude non-bot users who were registered before the username policy change (although *Bot is a bit better), and the logging table is a great way to collect bot flags. However, Scott is right -- the bot flag (or *Bot username) doesn't signify a bot, it signifies a bureaucrat recognizing that a user account successfully went through the Bot Approval Group process. If I see an account with a bot flag, I can generally assume the edits that account makes are initiated by an automated software agent. This is especially the case in the main namespace. The inverse assumption is not nearly as easy: I can't assume that every edit made from an account *without* a bot flag was *not* an automated edit.

About unauthorized bots: yes, there are a relatively small number of Wikipedians who, on occasion, run fully-automated, continuously-operating bots without approval. Complicating this, if someone is going to take the time to build and run a bot, but isn't going to create a separate account for it, then it is likely that they are also using that account to do non-automated edits. Sometimes new bot developers will run an unauthorized bot under their own account during the initial stages of development, and only later in the process will they create a separate bot account and seek formal approval and flagging. It can get tricky when you exclude all the edits from an account for being automated based on a single suspicious set of edits.

More commonly, there are many more people who use automated batch tools like AutoWikiBrowser to support one-off tasks, like mass find-and-replace or category cleanup. Accounts powered by AWB are technically not bots, only because a human has to sit there and click "save" for every batch edit that is made. Some people will create a separate bot account for AWB work and get it approved and flagged, but many more will not bother. Then there are people using semi-automated, human-in-the-loop tools like Huggle to do vandal fighting. I find that the really hard question is whether you include or exclude these different kinds of 'cyborgs', because it really makes you think hard about what exactly you're measuring. Is someone who does a mass find-and-replace on all articles in a category a co-author of each article they edit? Is a vandal fighter patrolling the recent changes feed with Huggle a co-author of all the articles they edit when they revert vandalism and then move on to the next diff? What about somebody using rollback in the web browser? If so, what is it that makes these entities authors and ClueBot NG not an author?

When you think about it, user accounts are actually pretty remarkable in that they allow such a diverse set of uses and agents to be attributed to a single entity. So when it comes to identifying automation, I personally think it is better to shift the unit of analysis from the user account to the individual edit. A bot flag lets you assume all edits from an account are automated, but you can use a range of approaches to identifying sets of automated edits from non-flagged accounts. Then I have a set of regex SQL queries in the Query Library [1] which parses edit summaries for the traces that AWB, Huggle, Twinkle, rollback, etc. automatically leave by default. You can also use the edit session approach like Scott has suggested -- Aaron and I found a few unauthorized bots in our edit session study [2], and we were even using a more aggressive break, with no more than a 60 minute gap between edits. To catch short bursts of bulk edits, you could look at large numbers of edits made in a short period of time -- I'd say more than 7 main namespace edits a minute for 10 minutes would be a hard rate for even a very aggressive vandal fighter to maintain with Huggle.

I'll conclude by saying that different kinds of automated editing techniques are different ways of participating in and contributing to Wikipedia. To systematically exclude automated edits is to remove a very important, meaningful, and heterogeneous kind of activity from view. These activities constitute a core part of what Wikipedia is, particularly those forms of automation which the community has explicitly authorized and recognized. Now, we researchers inevitably have to selectively reveal and occlude -- a co-authorship network based on main namespace edits also excludes talk page discussions and conflict resolution, and this also constitutes a core part of what Wikipedia is. It isn't wrong per se to exclude automated edits, and it is certainly much worse to not recognize that they exist at all. However, I always appreciate seeing how the analysis would be different if bots were not excluded. The fact that there are these weird users which absolutely dominate a co-authorship network graph if you don't filter them out is pretty amazing, at least to me.

Best,
Stuart



On Sun, May 18, 2014 at 10:08 AM, Scott Hale <[hidden email]> wrote:
Very helpful, Lukas, I didn't know about the logging table.

In some recent work [1] I found many users that appeared to be bots but whose edits did not have the bot flag set. My approach was to exclude users who didn't have a break of more than 6 hours between edits over the entire month I was studying. I was interested in the users who had multiple edit sessions in the month and so went with a straight threshold. A way to keep users with only one editing session would be to exclude users who have no break longer than X hours in an edit session lasting at least Y hours (e.g., a user who doesn't break for more than 6 hours in 5-6 days is probably not human).
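
To make that concrete, a small Python sketch of this kind of gap-based filter might look like the following; the (user, timestamp) input format with 'YYYYMMDDHHMMSS' timestamps is an assumption for the example, and the 6-hour threshold is the one described above:

from collections import defaultdict
from datetime import datetime

def likely_automated(edits, max_gap_hours=6):
    """Return accounts whose longest pause between consecutive edits in the
    observation window stays below `max_gap_hours` -- accounts that never stop
    editing for long, which very few humans manage over a whole month.
    `edits` is an iterable of (user, timestamp) pairs, timestamps given as
    'YYYYMMDDHHMMSS' strings (an assumed input format)."""
    per_user = defaultdict(list)
    for user, ts in edits:
        per_user[user].append(datetime.strptime(ts, "%Y%m%d%H%M%S"))
    flagged = set()
    for user, stamps in per_user.items():
        if len(stamps) < 2:
            continue  # a single edit says nothing about gaps
        stamps.sort()
        max_gap = max((later - earlier).total_seconds() / 3600
                      for earlier, later in zip(stamps, stamps[1:]))
        if max_gap < max_gap_hours:
            flagged.add(user)
    return flagged

The refinement for accounts with a single long session would just add a second condition, e.g. only flagging accounts whose first and last observed edits are at least Y hours apart.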

Cheers,
Scott

[1] Multilinguals and Wikipedia Editing http://www.scotthale.net/pubs/?websci2014


-- 
Scott Hale
Oxford Internet Institute
University of Oxford



On Sun, May 18, 2014 at 5:45 PM, Lukas Benedix <[hidden email]> wrote:
Here is a list of currently flagged bots:
https://en.wikipedia.org/w/index.php?title=Special:ListUsers&offset=&limit=2000&username=&group=bot
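
If you would rather pull that list programmatically, the same group membership is available from the API; a minimal Python sketch (assuming the English Wikipedia endpoint and the requests library) might be:

import requests

API = "https://en.wikipedia.org/w/api.php"

def current_bots():
    """Yield user names currently in the 'bot' group on enwiki, following
    API continuation to get past the per-request limit."""
    params = {
        "action": "query",
        "list": "allusers",
        "augroup": "bot",
        "aulimit": "max",
        "format": "json",
        "continue": "",
    }
    while True:
        data = requests.get(API, params=params).json()
        for user in data["query"]["allusers"]:
            yield user["name"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume from where the last batch ended

print(sorted(current_bots()))

Note that, like Special:ListUsers, this only captures accounts that are currently flagged; bots whose flag has since been removed need the approaches below.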

Another good place to look for bots is here:
https://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=Bots%2FRequests_for_approval&namespace=4

You should also have a look at these pages to find former bots:
https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_1
https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_2

And last but not least, there is the logging table, which you can access via Tool Labs:
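-- every account whose user-rights log entries ever mention 'bot' (also catches bots whose flag was later removed)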
SELECT DISTINCT(log_title)
FROM logging
WHERE log_action = 'rights'
AND log_params LIKE '%bot%';

Lukas





--
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Federico Leva (Nemo)
In reply to this post by Brian Keegan-3
Brian Keegan, 18/05/2014 18:10:
> Is there a way to retrieve a canonical list of bots on enwiki or elsewhere?

A Bots.csv list exists. https://meta.wikimedia.org/wiki/Wikistat_csv
In general: please edit
https://meta.wikimedia.org/wiki/Research:Identifying_bot_accounts

Nemo

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Brian Keegan-3
Thanks for all the references and excellent advice so far!

I've looked into the Hale Anti-Bot Method™, but because I've sampled my corpus on articles (based on category co-membership), grouping those revisions by user gives these semi-automated users more "normal"-looking distributions, since their other contributions are censored. In other words, I see only a fraction of these users' contributions, and thus the time intervals I observe are spaced farther apart (more typical) than they actually are. It's not feasible for me to get 100k+ users' histories just for the purposes of cleaning up ~6k articles' histories.

Another thought I had: because many semi-automated tools such as Twinkle and AWB leave parenthetical annotations in their revision comments, would this be a relatively inexpensive way to filter out revisions rather than users? Some caveats I'd like to get domain experts' feedback on are below; I'm not expecting settled research, just input from others' experiences munging the data.

1. Is the inclusion of this markup in revision comments optional? My concern is that some users may enable or disable it, so I may end up biasing inclusion based on users' preferences.
2. How have these flags or markup changed over time? My concern is that Twinkle/AWB/etc. may have started/stopped including flags or changed what they included over time.
3. Are there other API queries or data elsewhere I could use to identify (semi-)automated revisions?





--
Brian C. Keegan, Ph.D.
Post-Doctoral Research Fellow, Lazer Lab
College of Social Sciences and Humanities, Northeastern University
Fellow, Institute for Quantitative Social Sciences, Harvard University
Affiliate, Berkman Center for Internet & Society, Harvard Law School

M: 617.803.6971
O: 617.373.7200
Skype: bckeegan

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Aaron Halfaker-2
> Another thought I had: because many semi-automated tools such as Twinkle and AWB leave parenthetical annotations in their revision comments

See Stuart's comments above, and also the queries he linked to: https://wiki.toolserver.org/view/MySQL_queries#Automated_tool_and_bot_edits  It would be nice if we could get these queries in version control and share them.

Maybe there is potential for building a hand-curated list of bot user_ids in version control as well.  

-Aaron





_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Ann Samoilenko
> the Hale Anti-Bot Method™
That's a good one. =)

> I'm a big fan of Scott's method
I second that. Again, great paper, Scott!






--
-----------------------------------------
Kind regards,
Ann Samoilenko, MSc

Oxford Internet Institute
University of Oxford

Adventures can change your life
 
e-mail: [hidden email]
Skype: ann.samoilenko

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Oliver Keyes-4
In reply to this post by Brian Keegan-3



On 19 May 2014 08:17, Brian Keegan <[hidden email]> wrote:

> Another thought I had: because many semi-automated tools such as Twinkle and AWB leave parenthetical annotations in their revision comments, would this be a relatively inexpensive way to filter out revisions rather than users? Some caveats I'd like to get domain experts' feedback on are below; I'm not expecting settled research, just input from others' experiences munging the data.
>
> 1. Is the inclusion of this markup in revision comments optional? My concern is that some users may enable or disable it, so I may end up biasing inclusion based on users' preferences.

With some tools it is; specifically, I think Twinkle makes the [[WP:TW|TW]] postfix in edits optional.[0]

> 2. How have these flags or markup changed over time? My concern is that Twinkle/AWB/etc. may have started/stopped including flags or changed what they included over time.

I believe so. This is not so much a problem of tool development (in some cases, e.g. Twinkle, it will be, because you've got a tool there that's been operating for ages, but some are quite new); it's more overall changes to the composition of the bot/semi-automated assistance ecosystem. Tools come to exist and are used and die and are replaced.

> 3. Are there other API queries or data elsewhere I could use to identify (semi-)automated revisions?

I'm happy to grab you the full histories of the relevant users/articles in a TSV if you want; hit me up offlist (the same goes to any other non-WMF researchers asking for non-PII; if you don't have labs/toolserver access and need data, ask us!)


[0] see https://en.wikipedia.org/wiki/Wikipedia:TWPREFS
--
Oliver Keyes
Research Analyst
Wikimedia Foundation



_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Scott Hale
Thanks all for the comments on my paper, and even more thanks to everyone sharing these super helpful ideas on filtering bots: this is why I love the Wikipedia research committee.

I think Oliver is definitely right that this would be a useful topic for some piece of method-comparing research, if anyone is looking for paper ideas. "Citation goldmine" as one friend called it, I think.

This won't address edit logs to date, but do we know if most bots and automated tools use the API to make edits? If so, would it be feasible to add a flag to each edit indicating whether it came through the API or not? This won't stop determined users, but it might be a nice way to distinguish cyborg edits from those made manually by the same user for many of the standard tools going forward.

The closest thing I found in the bug tracker is [1], but it doesn't address the issue of 'what is a bot', which this thread has clearly shown is quite complex. An API-edit vs. non-API-edit distinction might be a way forward, unless there are automated tools/bots that don't use the API.




Cheers,
Scott

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Oliver Keyes-4
I think a lot of them use the API, but I don't know off the top of my head if it's all of them. If only we knew somebody who has spent the last 3 months staring into the Cthulhian nightmare of our request logs and could look this up...

More seriously; drop me a note off-list so that I can try to work out precisely what you need me to find out, and I'll write a quick-and-dirty parser of our sampled logs to drag the answer kicking and screaming into the light.

(sorry, it's annual review season. That always gets me blithe.)






--
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Oliver Keyes-4
Actually, belay that, I have a pretty good idea. I'll fire the log parser up now.





--
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Kill the bots

Oliver Keyes-4

Okay. Methodology:

*Take the last 5 days of request logs;
*Filter them down to text/html requests as a heuristic for non-API requests;
*Run them through the UA parser we use;
*Exclude spiders and things which reported valid browsers;
*Aggregate the user agents left;
*???
*Profit

It looks like there are a relatively small number of bots that browse/interact via the web. Ones I can identify include WPCleaner[0], which is semi-automated; something I can't find through WP or Google called "DigitalsmithsBot" (could be internal, could be external); and Hoo Bot (run by User:Hoo man). My biggest concern is DotNetWikiBot, which is a general framework that could be masking multiple underlying bots and has ~7.4m requests through the web interface in that time period.

Obvious caveat is obvious; the edits from these tools may actually come through the API, but they're choosing to request content through the web interface for some weird reason. I don't know enough about the software behind each bot to comment on that. I can try explicitly looking for web-based edit attempts, but there would be far fewer observations in which the bots might appear, because the underlying dataset is sampled at a 1:1000 rate.

[0] https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation
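
For what it's worth, a rough Python sketch of that kind of user-agent filter is below. The tab-separated column positions and the browser/spider heuristics are placeholders rather than the actual layout of the sampled request logs, so treat it purely as an outline:

from collections import Counter

# Placeholder column positions -- the real sampled-log schema differs,
# so point these at whatever fields your copy of the logs actually has.
CONTENT_TYPE_COL, USER_AGENT_COL = 10, 13

# Crude stand-ins for a real UA parser: known crawler names and browser prefixes.
SPIDER_HINTS = ("Googlebot", "bingbot", "Baiduspider", "YandexBot", "Slurp")
BROWSER_HINTS = ("Mozilla/", "Opera/")

def suspect_agents(log_path, top=50):
    """Count user agents on text/html requests that look like neither a
    declared crawler nor an ordinary browser -- candidates for bots that
    browse/interact via the web interface rather than the API."""
    counts = Counter()
    with open(log_path) as fh:
        for line in fh:
            row = line.rstrip("\n").split("\t")
            if len(row) <= USER_AGENT_COL:
                continue  # malformed or truncated line
            if "text/html" not in row[CONTENT_TYPE_COL]:
                continue  # heuristic for non-API traffic
            ua = row[USER_AGENT_COL]
            if any(hint in ua for hint in SPIDER_HINTS):
                continue  # declared crawlers
            if any(ua.startswith(hint) for hint in BROWSER_HINTS):
                continue  # looks like a normal browser
            counts[ua] += 1
    return counts.most_common(top)

Using the content type as a proxy for "not the API" mirrors the heuristic above; a stricter version would look at the request path instead.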





--
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l