
Bayesian filtering

aldennis
Newbie
Posts: 4
Registered: 01-08-2007

Bayesian filtering

Surely Bayesian filtering is not an appropriate tool for an ISP to be using for spam detection?

I've had the new spam protection on for a couple of days now and here are my results so far:

Detected spam: 510; missed spam: 35 (93.6% of spam caught)
Genuine mail: 73; false positives: 4 (94.8% of genuine mail delivered)
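
For anyone wanting to check the arithmetic, here is a quick sketch of how those two percentages fall out of the counts. It assumes the "Genuine mail 73" figure does not include the 4 false positives, which is the reading that reproduces the 94.8% shown; the variable names are my own.

```python
# Counts copied from the figures above.
detected_spam = 510
missed_spam = 35
genuine_mail = 73
false_positives = 4

# Share of all spam that the filter caught.
spam_caught = detected_spam / (detected_spam + missed_spam)

# Share of all genuine mail that was delivered normally
# (assumption: the 73 genuine mails exclude the 4 false positives).
genuine_delivered = genuine_mail / (genuine_mail + false_positives)

print(f"Spam caught: {spam_caught:.1%}")
print(f"Genuine mail delivered: {genuine_delivered:.1%}")
print(f"Genuine mail wrongly flagged: {1 - genuine_delivered:.1%}")
```

Put the other way round, roughly 1 genuine mail in 19 was flagged as spam, which is the number that worries me.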

This is hardly spectacular. It's early days, so I'll give it time to settle down, but the thing that's bothering me at the moment (and the reason I took the time to check) is the number of false positives.

I've read the Wiki link (http://en.wikipedia.org/wiki/Bayesian_filtering) helpfully posted in the spam protection guide, and I can see how Bayesian filtering can work for an individual training a spam filter with missed spam and false positives, but I can't see how this works for an ISP with a large number of users training it. OK, I can accept that at the extremes it's easy to tell spam from genuine mail, but surely the grey area in the middle can't work for multiple users? Surely one user's "useful" commercial mailing is another user's spam? How can one filter for many users tell the difference?
3 REPLIES
Community Veteran
Posts: 4,729
Registered: 04-04-2007

Bayesian filtering

Bayesian mail filters can also be polluted by mailing lists, groups and other e-mail subscription services.
This happens when a user signs up to a mailing list, gets bored, and instead of unsubscribing simply marks all the mail from the list as spam and sends it off to the Bayesian filter.

The result is that other users find that mail from the list is rejected as spam.
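
To make the pollution effect concrete, here is a toy sketch in Python. It is not the ISP's filter: the `SharedFilter` class, the training messages and the counts are all made up for illustration, and the scoring is a bare naive Bayes with add-one smoothing and even prior odds.

```python
import math
from collections import Counter


class SharedFilter:
    """Toy naive Bayes filter with one corpus shared by every user."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.messages[label] += 1
        # Count each word at most once per message.
        self.tokens[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        # Start from even odds and add one log-likelihood ratio per word.
        log_odds = 0.0
        for word in set(text.lower().split()):
            # Add-one smoothing so unseen words never give a zero.
            p_spam = (self.tokens["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][word] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# Hypothetical training data, invented for this example.
shared = SharedFilter()
for _ in range(10):
    shared.train("cheap pills special offer", "spam")
for _ in range(5):
    shared.train("gardening club weekly newsletter", "ham")
    shared.train("meeting tomorrow lunch", "ham")

list_mail = "weekly gardening club newsletter"
before = shared.spam_probability(list_mail)

# Bored subscribers report the list mail as spam instead of unsubscribing.
for _ in range(30):
    shared.train("gardening club weekly newsletter", "spam")

after = shared.spam_probability(list_mail)
print(f"before reports: {before:.4f}, after reports: {after:.4f}")
```

With these invented numbers the list mail starts well below a 0.5 spam score and ends above it, so every other subscriber sharing the corpus would see it flagged.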

Chilly
MrToast
Grafter
Posts: 550
Registered: 31-07-2007

Bayesian filtering

I agree: I would expect Bayesian filtering to be of limited use when the same 'corpus' is applied to the email of many users.


I think a better description of the problem than the Wikipedia entry is Paul Graham's essay A Plan for Spam from 2002.

If you want to set up your own filtering I'd recommend taking a look at POPFile, which I've used to great effect, achieving over 99% accuracy. It's nowhere near as good as a 'clean' email box, but if you are getting more than 100 spam emails a week it's a great help in sorting through the mess.
Mand
Grafter
Posts: 5,560
Thanks: 1
Registered: 05-04-2007

Bayesian filtering

Hi there,

The possible 'pollution' is the reason that we currently train the spam filter manually (picking 100 or so mails from both the 'spam' and 'notspam' mailboxes each day and using these to train the filter).

By the same token, as this process is manual it can take a few days for spam to start being tagged as such after you've forwarded it to us.
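
As a rough illustration of why a capped, reviewed daily pick limits pollution, here is a sketch in Python. It is not our actual tooling: the picking described above is done by hand, so the random shuffle and the `approve` callback below are stand-ins for a person looking through the mailbox, and every name and message in the example is hypothetical.

```python
import random


def pick_training_batch(reported, limit=100, approve=lambda msg: True, seed=None):
    """Return up to `limit` approved messages from one mailbox, in random order."""
    rng = random.Random(seed)
    shuffled = list(reported)
    rng.shuffle(shuffled)
    batch = []
    for msg in shuffled:
        if len(batch) == limit:
            break
        # `approve` stands in for the human check on each candidate mail.
        if approve(msg):
            batch.append(msg)
    return batch


# Hypothetical 'spam' mailbox: 400 forwarded list mails and 200 real spam.
spam_box = ["gardening club newsletter"] * 400 + ["cheap pills offer"] * 200

# The reviewer skips anything that is clearly a subscribed mailing list.
batch = pick_training_batch(
    spam_box,
    limit=100,
    approve=lambda msg: "newsletter" not in msg,
    seed=1,
)
print(len(batch), "mails selected for training")
```

However many list mails get forwarded, at most `limit` messages reach the filter each day and none of them is the mailing list, which is the trade-off against the delay mentioned above.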

Thunderbird also has a pretty good filter that I use in conjunction with our spam filtering, and I get very few spam emails in my inbox now.