Technical Note : Setting up a FortiMail Baseline Bayesian database - FortiMail 4.0

Adrian_Buckley_FTNT · ‎02-09-2010

Purpose

The purpose of this article is to give a reliable way to initially configure a Bayesian database. This applies to FortiMail 4.0.

Bayesian databases are the last line of defense against spam and one of the few ways to gain additional fractions of a percent in spam detection once the catch rate is above 99%. Before a Bayesian database can be used, the FortiMail requires that it has been trained with a minimum number of emails, as shown in the following table:

Database	Spam	Not Spam
Global database	100	200
Domain/Personal database	100	200

To reach these levels the database must be trained, and the FortiMail has mechanisms to do this automatically. If the database does not reach these minimum levels, the system will not use it for email scanning.
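The threshold rule above can be expressed as a small check. This is only an illustrative sketch: the class and function names are hypothetical and not part of any FortiMail interface, and it assumes that a counter exactly equal to the minimum counts as sufficient.

```python
from dataclasses import dataclass

# Minimum trained-message counts taken from the table in this article.
MIN_SPAM = 100
MIN_NOT_SPAM = 200


@dataclass
class TrainingCounts:
    """Hypothetical container for the counters of one Bayesian database."""
    spam: int
    not_spam: int


def database_is_usable(counts: TrainingCounts) -> bool:
    """Return True once both counters have reached the documented minimums."""
    return counts.spam >= MIN_SPAM and counts.not_spam >= MIN_NOT_SPAM


def messages_still_needed(counts: TrainingCounts) -> dict:
    """Return how many more messages of each kind are needed (never negative)."""
    return {
        "spam": max(0, MIN_SPAM - counts.spam),
        "not_spam": max(0, MIN_NOT_SPAM - counts.not_spam),
    }
```

For example, a database trained with 40 spam and 250 non-spam messages would not yet be used for scanning, because the spam counter is still below its minimum even though the non-spam counter is well above it.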

These counters continue to grow even without explicit training, because Bayesian scanning itself refines the database, a process often known as "learning".

Expectations, Requirements

Set up a trained baseline Bayesian database.

Configuration

Enable antispam options that will not cause false positives. It is important that the initial data the database is trained with is as accurate as possible. The options that are least likely to cause false detections are:

FortiGuard Antispam
DNSBL
SURBL
Heuristic
Image Scan - but only if it does not cause false positives in email
Deep Header (Black IP scanning) - but only if it does not cause false positives in email
Bayesian Scan, with "Use other techniques for auto training" enabled

[Screenshot for FortiMail 4.0: antispam profile]

It is important to use a spam action of Quarantine and Archiving while training is in progress. This allows review of the emails that are put into the spam and not-spam portions of the database. This review is necessary because some emails are specifically designed to corrupt Bayesian databases.

Some emails contain several paragraphs or pages of text from a news article, a novel, or random words at the bottom. Bayesian scanning does not ignore this extra text; it goes into the database, which then impairs the database's ability to distinguish spam from legitimate email.
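The effect of such padding can be illustrated with a toy word-count classifier. The sketch below is a generic naive Bayes scorer with add-one smoothing, not FortiMail's actual algorithm, and all sample messages are made up. It trains the spam side once with clean spam and once with the same spam padded with ordinary business wording, then scores a legitimate message against both.

```python
import math
from collections import Counter


def train(messages):
    """Count how often each word appears across a list of message strings."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts


def spam_log_odds(text, spam_counts, ham_counts):
    """Naive Bayes log-odds with add-one smoothing; higher means more spam-like."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    score = 0.0
    for word in text.lower().split():
        p_spam = (spam_counts[word] + 1) / spam_total
        p_ham = (ham_counts[word] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score


# Made-up training data for illustration only.
ham = ["meeting agenda for the quarterly report", "the quarterly report is attached"]
spam_clean = ["cheap pills buy now", "buy cheap watches now"]
padding = "the quarterly report meeting agenda"  # innocent text appended to the spam
spam_poisoned = [message + " " + padding for message in spam_clean]

legitimate = "quarterly report meeting"
ham_counts = train(ham)
clean_score = spam_log_odds(legitimate, train(spam_clean), ham_counts)
poisoned_score = spam_log_odds(legitimate, train(spam_poisoned), ham_counts)

# The padded training set pushes the legitimate message toward the spam side.
print(clean_score < poisoned_score)
```

Because the padding words now also appear in the spam counts, they stop being good evidence of legitimate email, which is why reviewing the quarantined and archived samples during training matters.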

Once the database has been trained, it is important to make regular backups when it is working well. In this way if any issues arise due to the email it has been trained with, the database can be cleared out and restored to a working status without completely restarting the training process.

Automatic training should not be left running indefinitely. After the minimum levels for operation have been reached, it is important to turn off automatic training and begin targeted training of the database. This can be done in two ways:

1. From the GUI, using email samples that were incorrectly identified by the Bayesian scan
2. Via email

Failure to disable automatic training will result in Bayesian scanning functioning incorrectly and may cause mail flow issues.

If email training is being used, it is also very important to use the training addresses properly. If a sample is sent to the wrong training address, the database will not be updated correctly. This could degrade detection, allowing spam through and producing false positives.

is-spam – correct a false-negative
is-not-spam – correct a false-positive
learn-is-spam – new, never-before-seen piece of spam
learn-is-not-spam – new, never-before-seen good email
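Since sending a sample to the wrong address harms the database, it can help to make the choice explicit. The following sketch maps the four cases listed above to their address names. The function name, its parameters, and the `local-part@domain` address form are assumptions made for illustration; `example.com` is a placeholder domain.

```python
# Address names as listed in this article.
# Keys are (message_is_spam, correcting_a_verdict).
TRAINING_LOCAL_PARTS = {
    (True, True): "is-spam",              # spam that was delivered (false negative)
    (False, True): "is-not-spam",         # good email flagged as spam (false positive)
    (True, False): "learn-is-spam",       # new spam sample, never seen before
    (False, False): "learn-is-not-spam",  # new good email, never seen before
}


def training_address(message_is_spam: bool, correcting_a_verdict: bool, domain: str) -> str:
    """Build the training address for a sample (hypothetical helper).

    correcting_a_verdict is True when the sample was already scanned and
    classified wrongly, and False when it is a new sample.
    """
    local_part = TRAINING_LOCAL_PARTS[(message_is_spam, correcting_a_verdict)]
    return f"{local_part}@{domain}"


# A spam message that reached a user's inbox is a false negative.
print(training_address(True, True, "example.com"))
```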