Kill the bots


Re: Kill the bots

Scott Hale
Thank you, Oliver,

This is really interesting, and it lends some credibility to the idea that tracking API vs. non-API edits could address the bot problem in part, though it would definitely miss some bots. Thank you very much for taking the time to check this and share the results. Does anyone think it would be worth filing a feature request in the issue tracker for logging API vs. non-API edits?

I'm also very interested in continuing to hear suggestions on how everyone is identifying bots in existing data (or whether people think this is unnecessary). Apologies if I slightly hijacked the thread by asking about API/non-API bots.
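For a baseline, accounts formally flagged in the "bot" group are easy to pull from the public API and match against usernames in a dump. Here is a minimal Python sketch (the standard allusers module on English Wikipedia); of course, unflagged bots -- the hard case this thread is about -- won't appear in it:

# Minimal sketch: list accounts in the formal "bot" user group via the
# public MediaWiki API. Unflagged or ad-hoc bots will not show up here.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def bot_accounts():
    params = {
        "action": "query",
        "list": "allusers",
        "augroup": "bot",
        "aulimit": "max",
        "format": "json",
    }
    bots = []
    while True:
        url = API + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        bots.extend(u["name"] for u in data["query"]["allusers"])
        if "continue" not in data:  # no further pages of results
            return bots
        params.update(data["continue"])  # carry the continuation token

print(len(bot_accounts()), "flagged bot accounts")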

Best wishes,
Scott



On Wed, May 21, 2014 at 11:06 PM, Oliver Keyes <[hidden email]> wrote:

Okay. Methodology:

* Take the last 5 days of request logs;
* Filter them down to text/html requests as a heuristic for non-API requests;
* Run them through the UA parser we use;
* Exclude spiders and things which reported valid browsers;
* Aggregate the user agents left;
* ???
* Profit
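
In rough Python, the pipeline amounts to something like the sketch below; the log layout and field positions here are invented, and the open-source ua-parser library stands in for the parser we actually use:

# Rough sketch of the pipeline above. The TSV layout and field positions
# are invented; ua-parser stands in for the internal UA parser.
from collections import Counter
from ua_parser import user_agent_parser  # pip install ua-parser

counts = Counter()
with open("sampled-1000.tsv") as logfile:  # hypothetical sampled request log
    for line in logfile:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 14:
            continue
        content_type, ua_string = fields[10], fields[13]  # invented positions
        if not content_type.startswith("text/html"):
            continue  # heuristic: text/html ~= non-API request
        parsed = user_agent_parser.Parse(ua_string)
        if parsed["device"]["family"] == "Spider":
            continue  # exclude known spiders
        if parsed["user_agent"]["family"] != "Other":
            continue  # exclude things reporting a valid browser
        counts[ua_string] += 1

for ua, n in counts.most_common(20):
    print(n, ua)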

It looks like there are a relatively small number of bots that browse/interact via the web. Ones I can identify include WPCleaner [0], which is semi-automated; something called "DigitalsmithsBot" that I can't find through WP or Google (could be internal, could be external); and Hoo Bot (run by User:Hoo man). My biggest concern is DotNetWikiBot, which is a general framework that could be masking multiple underlying bots and has ~7.4m requests through the web interface in that time period.

Obvious caveat is obvious: the edits from these tools may actually come through the API, with the tools merely choosing to request content through the web interface for some weird reason. I don't know enough about the software behind each bot to comment on that. I can try explicitly looking for web-based edit attempts, but there would be far fewer observations in which the bots might appear, because the underlying dataset is sampled at a 1:1000 rate.

[0] https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation


On 20 May 2014 07:50, Oliver Keyes <[hidden email]> wrote:
Actually, belay that; I have a pretty good idea. I'll fire up the log parser now.


On 20 May 2014 01:21, Oliver Keyes <[hidden email]> wrote:
I think a lot of them use the API, but I don't know off the top of my head if it's all of them. If only we knew somebody who has spent the last 3 months staring into the Cthulhian nightmare of our request logs and could look this up...

More seriously: drop me a note off-list so that I can work out precisely what you need me to find, and I'll write a quick-and-dirty parser of our sampled logs to drag the answer kicking and screaming into the light.

(sorry, it's annual review season. That always gets me blithe.)


On 19 May 2014 13:03, Scott Hale <[hidden email]> wrote:
Thanks all for the comments on my paper, and even more thanks to everyone sharing these super helpful ideas on filtering bots: this is why I love the Wikipedia research committee.

I think Oliver is definitely right that "this would be a useful topic for some piece of method-comparing research, if anyone is looking for paper ideas." A "citation goldmine", as one friend called it, I think.

This won't address edit logs to date, but do we know if most bots and automated tools use the API to make edits? If so, would it be feasible to add a flag to each edit indicating whether it came through the API? This wouldn't stop determined users, but it might be a nice way to distinguish cyborg edits from edits made manually by the same user, for many of the standard tools, going forward.
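
To make that concrete: if each revision carried such a flag, splitting a user's tool-assisted edits from their manual ones would be a few lines of Python. Everything in this sketch (the file, the columns, the api_edit field itself) is hypothetical -- it is what the proposed flag would enable, not an existing MediaWiki field:

# Toy sketch over a hypothetical per-revision export with an api_edit flag.
import csv
from collections import Counter

api_edits, manual_edits = Counter(), Counter()
with open("revisions.csv") as f:      # hypothetical export
    for row in csv.DictReader(f):     # columns: rev_id, user, api_edit
        if row["api_edit"] == "1":
            api_edits[row["user"]] += 1    # likely tool-assisted ("cyborg")
        else:
            manual_edits[row["user"]] += 1  # likely manual

for user in sorted(api_edits):
    print(user, api_edits[user], "via API,", manual_edits[user], "manual")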

The closest thing I found in the bug tracker is [1], but it doesn't address the issue of 'what is a bot', which this thread has clearly shown is quite complex. An API-edit vs. non-API-edit flag might be a way forward, unless there are automated tools/bots that don't use the API.

1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181


Cheers,
Scott








--
Oliver Keyes
Research Analyst
Wikimedia Foundation





--
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
[hidden email]


Re: Kill the bots

Stuart A. Yeates
In reply to this post by Oliver Keyes
Another useful trick for spotting bots is to look for requests coming
from the IP address blocks assigned to AWS (and other cloud
providers). If a request is coming from a server farm, something is
sitting between Wikipedia and the human; whether that thing is classed
as a bot boils down largely to semantics.
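
A quick Python sketch of that check, using the ip-ranges.json feed AWS publishes and the standard library's ipaddress module (other providers publish similar lists):

# Sketch: test whether an address falls inside AWS's published ranges.
import ipaddress
import json
import urllib.request

RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

with urllib.request.urlopen(RANGES_URL) as resp:
    prefixes = json.load(resp)["prefixes"]

aws_networks = [ipaddress.ip_network(p["ip_prefix"]) for p in prefixes]

def looks_like_aws(ip_string):
    """True if the address sits inside any published AWS prefix."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip in net for net in aws_networks)

print(looks_like_aws("203.0.113.7"))  # documentation range, so False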

cheers
stuart

On Thu, May 22, 2014 at 10:06 AM, Oliver Keyes <[hidden email]> wrote:

> [snip]

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l