Brian J Mingus

Potthast, Stein, Gerling (2008). Automatic Vandalism Detection in Wikipedia.

Abstract. We present results of a new approach to detect destructive article revisions, so-called vandalism, in Wikipedia. Vandalism detection is a one-class classification problem, where vandalism edits are the target to be identified among all revisions. Interestingly, vandalism detection has not been addressed in the Information Retrieval literature by now. In this paper we discuss the characteristics of vandalism as humans recognize it and develop features to render vandalism detection as a machine learning task. We compiled a large number of vandalism edits in a corpus, which allows for the comparison of existing and new detection approaches. Using logistic regression we achieve 83% precision at 77% recall with our model. Compared to the rule-based methods that are currently applied in Wikipedia, our approach increases the F-Measure performance by 49% while being faster at the same time.
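For anyone who wants to put the precision and recall figures from the abstract on a single scale, here is a small sketch of the F-measure arithmetic. It assumes the balanced F1 variant (beta = 1); the abstract does not say which weighting the authors used, and the function name `f_measure` is my own.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was retrieved correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted in the abstract: 83% precision at 77% recall.
print(round(f_measure(0.83, 0.77), 3))  # about 0.799, assuming F1
```

This says nothing about the rule-based bots' own scores; the 49% improvement is the paper's claim and is not recomputed here.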

Open the PDF, scan to page 667. This bot outperforms MartinBot, T-850 Robotic Assistant, WerdnaAntiVandalBot, Xenophon, ClueBot, CounterVandalismBot, PkgBot, MiszaBot, and AntiVandalBot. It outperforms the best of those (AntiVandalBot) by a very wide margin.

So why are you wasting the ISP's time and the police's time when the best passive technology routes have not been explored? Using machine learning, you pit the vandals against themselves. Every time they perform a particular kind of vandalism, it can never be performed again because the bot will recognize it.
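To make the "trained on previous vandalism" idea concrete, here is a rough sketch of a logistic regression classifier in plain Python. Everything in it is an assumption for illustration: the three features (uppercase ratio, longest repeated-character run, punctuation ratio), the toy training edits, and the names `features`, `train`, and `predict` are mine. It is not the feature set or the corpus from the paper, and a real bot would need far more data.

```python
import math


def features(text):
    """Map edit text to [bias, uppercase ratio, longest char run, punctuation ratio]."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    longest = run = 0
    prev = None
    for c in text:
        run = run + 1 if c == prev else 1
        longest = max(longest, run)
        prev = c
    punct = sum(not c.isalnum() and not c.isspace() for c in text) / max(len(text), 1)
    # Cap the run length at 10 so every feature stays in [0, 1].
    return [1.0, upper, min(longest, 10) / 10.0, punct]


def predict(weights, text):
    """Estimated probability that the edit is vandalism."""
    z = sum(w * x for w, x in zip(weights, features(text)))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=2000, lr=0.5):
    """Fit weights by batch gradient descent on the logistic loss."""
    rows = [(features(text), label) for text, label in examples]
    weights = [0.0] * 4
    for _ in range(epochs):
        grad = [0.0] * 4
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(weights, x))))
            for i in range(4):
                grad[i] += (p - y) * x[i]
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Hypothetical, hand-written training edits: 1 = vandalism, 0 = regular edit.
TOY_EDITS = [
    ("THIS PAGE IS STUPID LOL", 1),
    ("aaaaaaaaaaaaaaaa", 1),
    ("!!!!!!!! u r dumb !!!!!!!!", 1),
    ("WIKIPEDIA SUCKSSSSSSS", 1),
    ("Added a citation for the 2008 census figures.", 0),
    ("Fixed a typo in the introduction.", 0),
    ("The river flows through three provinces.", 0),
    ("Expanded the section on early history.", 0),
]

weights = train(TOY_EDITS)
print(predict(weights, "YOU ALL SUCK!!!!!!!") > 0.5)
print(predict(weights, "Corrected the spelling of a place name.") > 0.5)
```

The point of the sketch is only that each reverted edit becomes another labeled example, so retraining folds new vandalism patterns back into the model.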


On Mon, Dec 29, 2008 at 4:15 PM, Brian
By the way, I ask those questions having read the bot's user page. It is apparently quite effective, indicating to me that this user causes minimal disruption.

On Mon, Dec 29, 2008 at 4:11 PM, Brian
What percentage of his page moves were not picked up automatically by a bot?

What percentage of this user's vandalism is not picked up by a bot?

Why is the ISP responsible for what he dumps into Wikipedia, rather than Wikipedia, as it allows itself to be a dumping ground? The Viacom/YouTube lawsuit demonstrates that this is a legal grey area; thus, I see little ground on which to punish the entire IP range of the ISP.

Why are machine learning bots that are trained on previous vandalism in order to detect new vandalism not being used? They have been developed. Why is the Foundation not funding their further development?

I believe the direction of this thread has been all wrong.


On Mon, Dec 29, 2008 at 4:07 PM, Soxred93
The problem with that is that many articles we have would not be
found in any dictionary.


On Dec 29, 2008, at 6:02 PM, Ian Woollard wrote:

> On 29/12/2008, Joe Szilagyi <[hidden email]> wrote:
>> Allow blocking on a more granular level, if we know his ISP, and lock
>> out moves and redirects for the whole damn ISPs, and specifically
>> point the finger back in the block message: Blocked because of
>> JarlaxleArtemis/Grawp with a nice shiny link to his long-term abuse
>> page.
> It probably wouldn't work because of proxies and people that would
> emulate/help him.
> Still, ideas that would affect fewer people rather than more like that
> are almost certainly IMO the way to go; for example restricting the
> range of characters and checking that the move title consists of words
> in a dictionary before permitting non admins or users with a small
> number of edits to complete a move might be desirable.
>> - Joe
> --
> -Ian Woollard
> We live in an imperfectly imperfect world. Life in a perfectly
> imperfect world would be much better.
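
For reference, the move check Ian describes above (restricted character range plus a dictionary lookup, applied only to non-admins and low-edit-count users) could be sketched roughly as follows. The allowed character set, the 500-edit threshold, and the names `move_allowed` and `MIN_TRUSTED_EDITS` are all assumptions made for illustration; none of this reflects an existing MediaWiki feature.

```python
import re

# Hypothetical whitelist: ASCII letters, digits, spaces, and a little punctuation.
ALLOWED_TITLE = re.compile(r"[A-Za-z0-9 ,.'()\-]+")
# Hypothetical cutoff for "users with a small number of edits".
MIN_TRUSTED_EDITS = 500


def move_allowed(title, edit_count, is_admin, dictionary):
    """Return True if the page move to `title` should go through unreviewed."""
    # Admins and established users bypass the check entirely.
    if is_admin or edit_count >= MIN_TRUSTED_EDITS:
        return True
    # Reject titles that use characters outside the allowed range.
    if not ALLOWED_TITLE.fullmatch(title):
        return False
    # Require at least one word, and every word to be in the dictionary.
    words = re.findall(r"[A-Za-z]+", title)
    return bool(words) and all(w.lower() in dictionary for w in words)


words = {"history", "of", "the", "bicycle"}
print(move_allowed("History of the bicycle", 3, False, words))
print(move_allowed("Zanzibar", 3, False, words))
```

The second call returns False for a perfectly legitimate title, which is exactly Soxred93's objection: many article names are proper nouns that no dictionary contains, so a check like this would need a fallback such as queuing the move for review rather than refusing it.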
