Smart machine-learning based anti-spam system (I wish!)

Smart machine-learning based anti-spam system (I wish!)

Daniel Friesen-4
I've had a good idea for an anti-spam system for a while.
Blocks, captchas, and local filters: all the tricks we've been using end
up not working well enough to easily deal with the spam on a lot of wikis.

I know this because I've been continually dealing with the spam on a small
dead wiki. SimpleAntiSpam, AntiBot, captchas, TorBlock, AbuseFilter...
Time after time I expand my filters more and more. But inevitably a few  
days later spam not covered by my filters comes through and I have to do  
it again.

I ended up having to deal with it more today, and then started writing out
the details I've had in mind for a while for a machine-learning-based
anti-spam system.

https://www.mediawiki.org/wiki/User:Dantman/Anti-spam_system

Of course, while I have the whole idea for the UI, the backend, how to
handle the service, etc., I haven't done the actual machine-learning
work before.
Also, naturally, just like Gareth, OAuth, and other things, this is just
another one of my ideas that I don't have the time and resources to do and
wish I had the financial backing to work on.

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Smart machine-learning based anti-spam system (I wish!)

Chris Steipp
Hi Daniel,

A lot of your ideas are covered by
http://en.wikipedia.org/wiki/Wikipedia:STiki. Andrew has done a lot of
great research; if you haven't read his papers yet, they might be a
good intro to the types of machine learning approaches that have been
used.

That being said, I would love to have some system that is constantly
learning from the edits that are flagged as spam, that we can query
with new edits from AbuseFilter to get a score of how likely it is
that this new edit is spam. If you get around to working on your
system, it would be great to work out some way to interface.
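
To make the idea of a constantly learning scorer concrete, here is a
minimal sketch in Python. Everything in it is an assumption made for
illustration: the class OnlineSpamScorer and its learn/score methods are
hypothetical names, nothing here is an AbuseFilter API, and naive Bayes
is used only because it is easy to update one edit at a time.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


class OnlineSpamScorer:
    """Hypothetical incremental naive Bayes scorer (illustration only)."""

    def __init__(self):
        # Per-class token counts and per-class document counts.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    @staticmethod
    def tokenize(text):
        return TOKEN_RE.findall(text.lower())

    def learn(self, text, is_spam):
        """Update the model with one edit that a human flagged or cleared."""
        self.doc_counts[is_spam] += 1
        self.word_counts[is_spam].update(self.tokenize(text))

    def score(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        total_docs = sum(self.doc_counts.values())
        if total_docs == 0:
            return 0.5  # nothing learned yet, so no opinion
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        tokens = self.tokenize(text)
        log_post = {}
        for label in (True, False):
            # Laplace-smoothed class prior.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            # Extra +1 reserves probability mass for unseen tokens.
            denom = total_words + len(vocab) + 1
            for tok in tokens:
                log_p += math.log((self.word_counts[label][tok] + 1) / denom)
            log_post[label] = log_p
        # Normalise in log space to avoid underflow on long edits.
        top = max(log_post.values())
        spam = math.exp(log_post[True] - top)
        ham = math.exp(log_post[False] - top)
        return spam / (spam + ham)
```

A filter could call score() on the added text of a new edit and compare
the result with a threshold, while every edit later flagged or cleared
by a human is fed back through learn().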




Re: Smart machine-learning based anti-spam system (I wish!)

Tim Starling-2
In reply to this post by Daniel Friesen-4
On 17/08/12 04:16, Daniel Friesen wrote:
> Of course. While I have the whole idea for the ui, backend stuff, how
> to handle the service, etc... I haven't done the actual
> machine-learning stuff before.

I would think that the actual machine learning stuff would be the hard
part. I stopped using Thunderbird's Bayesian spam tagging feature
years ago, when it started sorting emails from smart people in with
the spam. The computer thought that the smart people were using long
words with a similar frequency to the random dictionary words that
padded out the spam messages.

I haven't worked with machine learning either, but I'm guessing it's
not as simple as feeding a pre-tagged data set into a stock Bayesian
filter library.

-- Tim Starling



Re: Smart machine-learning based anti-spam system (I wish!)

Daniel Friesen-4
In reply to this post by Chris Steipp
Yeah, STiki and, more importantly, ClueBot NG are what I mean when I say
"outside of Wikimedia (who already have bots for this)".

I looked into them a bit and planned to ask to look at some of the code if
I went ahead with the project.
--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]

On Thu, 16 Aug 2012 14:59:56 -0700, Chris Steipp <[hidden email]>  
wrote:

> Hi Daniel,
>
> A lot of your ideas are covered by
> http://en.wikipedia.org/wiki/Wikipedia:STiki. Andrew has done a lot of
> great research, if you haven't read his papers yet that might be a
> good intro to the type of machine learning approaches that have been
> used.
>
> That being said, I would love to have some system that is constantly
> learning from the edits that are flagged as spam, that we can query
> with new edits from AbuseFilter to get a score of how likely it is
> that this new edit is spam. If you get around to working on your
> system, it would be great to work out some way to interface.


Re: Smart machine-learning based anti-spam system (I wish!)

Daniel Friesen-4
In reply to this post by Tim Starling-2
On Thu, 16 Aug 2012 16:50:27 -0700, Tim Starling <[hidden email]>  
wrote:

> On 17/08/12 04:16, Daniel Friesen wrote:
>> Of course. While I have the whole idea for the ui, backend stuff, how
>> to handle the service, etc... I haven't done the actual
>> machine-learning stuff before.
>
> I would think that the actual machine learning stuff would be the hard
> part. I stopped using Thunderbird's Bayesian spam tagging feature
> years ago, when it started sorting emails from smart people in with
> the spam. The computer thought that the smart people were using long
> words with a similar frequency to the random dictionary words that
> padded out the spam messages.
>
> I haven't worked with machine learning either, but I'm guessing it's
> not as simple as feeding a pre-tagged data set into a stock Bayesian
> filter library.
>
> -- Tim Starling

Yeah, Bayesian filtering is probably too old to use. ClueBot NG appears to
be using an Artificial Neural Network [ANN] implementation to do its spam
testing.
From the documentation [ClueBot NG] it sounds like one of the trickier parts
is understanding the WikiText well enough to extract the words needed and
whatnot out of it.

[ANN] https://en.wikipedia.org/wiki/Artificial_neural_network
[ClueBot NG] https://en.wikipedia.org/wiki/User:ClueBot_NG
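
The word-extraction step mentioned above can be sketched roughly as
follows. This is not how ClueBot NG does it; it is a deliberately crude,
regex-based illustration, and the function name extract_features and the
choice of features (plain words plus external link domains) are
assumptions. Regexes cannot handle every WikiText construct, so a real
system would more likely want a proper parser.

```python
import re

TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")
TAG_RE = re.compile(r"<[^>]+>")
INTERNAL_LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")
EXTERNAL_LINK_RE = re.compile(r"\[(https?://[^\s\]]+)(?:\s+([^\]]*))?\]")
BARE_URL_RE = re.compile(r"https?://\S+")
DOMAIN_RE = re.compile(r"https?://([^\s/\]|<>}]+)")
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def extract_features(wikitext):
    """Return (words, domains) found in a chunk of WikiText.

    words:   lowercase words from the visible text, markup removed
    domains: host names of every external URL, in order of appearance
    """
    # Collect link targets first; spam is often recognisable by its links.
    domains = [d.lower() for d in DOMAIN_RE.findall(wikitext)]

    text = wikitext
    # Strip templates from the inside out so nested ones disappear too.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # [[Target|label]] -> label, [[Target]] -> Target
    text = INTERNAL_LINK_RE.sub(r"\1", text)
    # [http://example.com label] -> label (or nothing if unlabeled)
    text = EXTERNAL_LINK_RE.sub(lambda m: m.group(2) or " ", text)
    text = BARE_URL_RE.sub(" ", text)
    # Bold/italic quote runs such as '''word''' carry no content.
    text = re.sub(r"'{2,}", "", text)

    words = WORD_RE.findall(text.lower())
    return words, domains
```

The resulting word list is what a classifier would be trained on, and
the domain list can be treated as a separate set of features.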

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


Re: Smart machine-learning based anti-spam system (I wish!)

Platonides

Note that before training any intelligent system, be it Bayesian,
neural networks, or other, you need a good corpus of good and bad
edits to train with...
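
As a rough illustration of what such a corpus might look like in code,
the sketch below stores labeled edits, holds some back for evaluation,
and measures how often good edits get flagged. The names LabeledEdit,
stratified_split and false_positive_rate are hypothetical, and the 20%
holdout is an arbitrary assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledEdit:
    """One corpus entry: the edit text and a human-assigned label."""
    text: str
    is_spam: bool


def stratified_split(edits, holdout_fraction=0.2, seed=0):
    """Split a corpus into (train, holdout), keeping the spam/good ratio."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    train, holdout = [], []
    for label in (True, False):
        group = [e for e in edits if e.is_spam == label]
        rng.shuffle(group)
        n_holdout = round(len(group) * holdout_fraction)
        holdout.extend(group[:n_holdout])
        train.extend(group[n_holdout:])
    return train, holdout


def false_positive_rate(predict, edits):
    """Fraction of good edits that predict(text) wrongly flags as spam."""
    good = [e for e in edits if not e.is_spam]
    if not good:
        return 0.0
    flagged = sum(1 for e in good if predict(e.text))
    return flagged / len(good)
```

Evaluating on held-out edits that the model never saw during training is
what exposes problems such as legitimate edits being sorted in with the
spam.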

