Building a Spam Filter

Can you build your own Spam filter?

What I’m going to discuss here is the benefit of using Machine Learning to improve or automate email handling. The fairly big elephant in the room is just how much effort you’re willing to put in.

For those of us with Gmail from Google or a similar service, the filtering is already pretty good, but there is still room for improvement. To deliver it, though, you’d probably need a G Suite (Business) account to expand on the security already offered, so that you can access the admin console and add your own mail server or model to the mail flow.

Email Security

For the purposes of this article, we’re going to consider spam filtering as the primary target.

Again, for those of you with a managed service (free or otherwise), this is probably already pretty good, but it is still worth poking about to see what you could add, or how you might do this using your own mail service.

Read more about Email Security here: The Holy Trinity of Email Security

Using R

I’m working on this using the R language. For those of you who are not familiar with it, R is a language written by and for statisticians.

I’m not going to give you the full code for this, because I don’t want to find myself trying to support other people with my code; I think I’d very quickly fall outside of my knowledge zone…

I’m going to give extracts and point to useful stuff, but mostly, this is an expanded thought experiment.

Cleaning the Data

We also want to think about cleaning the data, or ‘data scrubbing’.

Classification

If we consider spam, then we might expect a binary choice (yes or no, 1 or 0). We can download some raw data from the legacy section of the SpamAssassin website here: Apache SpamAssassin. The readme describes the corpus:

  - spam: 500 spam messages, all received from non-spam-trap sources.
  - easy_ham: 2500 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any spammish
    signatures (like HTML etc).
  - hard_ham: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.
  - easy_ham_2: 1400 non-spam messages.  A more recent addition to the set.
  - spam_2: 1397 spam messages.  Again, more recent.
Total count: 6047 messages, with about a 31% spam ratio.

SpamAssassin Readme
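
To get the data locally, a minimal sketch in R (the exact archive name below is an assumption; check the public corpus page for the current file listing):

url <- "https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2"
download.file(url, destfile = "spam.tar.bz2")  # fetch the archive
untar("spam.tar.bz2", exdir = "downloads")     # unpack to a local folder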

However, using a binary decision alone, we’d struggle to find good patterns, so we’re going to need to think about Bayesian classification.
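
The idea, roughly, is Bayes’ rule applied per word: how likely is a message to be spam, given that it contains a particular word? A minimal sketch (the function name and example rates are mine, not from any package):

# p.word.spam: fraction of spam messages containing the word
# p.word.ham:  fraction of ham messages containing the word
# p.spam:      prior probability that any message is spam
bayes.word <- function(p.word.spam, p.word.ham, p.spam = 0.5) {
  p.ham <- 1 - p.spam
  (p.word.spam * p.spam) / (p.word.spam * p.spam + p.word.ham * p.ham)
}

bayes.word(0.40, 0.02)  # a word in 40% of spam but only 2% of ham scores ~0.95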

Extracting Data

Don’t forget to split up your data into training and production data sets.
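
A quick sketch of that split (the 70/30 ratio is an arbitrary choice, and spam.path is the corpus folder defined in the extract below):

set.seed(42)  # make the split reproducible
all.files <- dir(spam.path, full.names = TRUE)
train.idx <- sample(seq_along(all.files), size = floor(0.7 * length(all.files)))
train.files <- all.files[train.idx]   # ~70% for training
test.files  <- all.files[-train.idx]  # ~30% held back for testing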

Initially, we need to extract useful information from the downloaded data. To do that, we extract the message content rather than the message headers; we can handle the headers in a different way.

Using R, we might first load the file sources, identify where the body text begins, extract that text, and then apply some logic to it.

<snip...>

spam.path  <- "/Users/onemoredavid/downloads/spam"
spam2.path <- "/Users/onemoredavid/downloads/spam2"
spam3.path <- "/Users/onemoredavid/downloads/spam3"

get.spam.message <- function(path) {
  file.in <- file(path, open = "rt", encoding = "latin1")
  text <- readLines(file.in)
  close(file.in)
  # The body starts after the first blank line, which ends the headers.
  spam.message <- text[seq(which(text == "")[1] + 1, length(text), by = 1)]
  return(paste(spam.message, collapse = "\n"))
}

<snip...>

Now we’ve opened the files, we need to think about things like building a matrix of bad words. Using a TDM (Term Document Matrix), we can take a vector of email messages and return a matrix of term counts per message. One of the things to watch for here is that if the model were to identify common words as being a sign of spam (or not spam), we’d probably find a pattern where none should be found…

The tm package in R gives us some options to use:

stopwords=TRUE
removePunctuation
removeNumbers

These will remove common English ‘stop words’, punctuation, and numbers from the input.

We can also consider setting minimum word counts, so that a word only enters the TDM if it is seen more than X times.
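
Putting that together, a sketch of building the TDM (spam.messages is assumed to be a character vector built with the extraction function above; newer versions of tm may want bounds = list(global = c(2, Inf)) in place of minDocFreq):

library(tm)

corpus <- Corpus(VectorSource(spam.messages))
control <- list(stopwords = TRUE,        # drop common English stop words
                removePunctuation = TRUE,
                removeNumbers = TRUE,
                minDocFreq = 2)          # ignore words seen fewer than twice
tdm <- TermDocumentMatrix(corpus, control)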

If you are working through this, you’re going to notice that when you first run the model you get loads of junk results; for example, the word ‘html‘ will appear more than any other word. I’m keen to hear from anyone reading this about how they plan to (or did) deal with it.

To give you a couple of considerations: we could have used some other basic commands before we started looking at the data, i.e. cleaned the input files before we took them as input. For a simple training or studying exercise, that would work fine, but if you need to do this on the fly, with live data, you probably don’t want to modify the original mail, nor do you want to add the clean-up steps to your workflow if you plan on handling large volumes of mail.

I think the winning path is to look for correlation among messages…
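
One hedged way to look for that correlation, building on the TDM above: compute how widely each term occurs across messages, not just how often it appears in total.

tdm.matrix <- as.matrix(tdm)
term.df <- data.frame(term = rownames(tdm.matrix),
                      frequency = rowSums(tdm.matrix),        # total appearances
                      occurrence = rowMeans(tdm.matrix > 0),  # share of messages containing the term
                      stringsAsFactors = FALSE)
head(term.df[order(-term.df$occurrence), ])  # terms shared most widely across messages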

The training set

We need to review both known good and known bad emails. Luckily, the data linked to above includes numerous examples of both, and in most cases it has already been sorted for you. The stages above can be repeated on both the spam and the non-spam data.

Typically, we want to give the training set the ‘definitely spam‘ and the ‘definitely not spam‘ data rather than the ‘maybe spam‘ data, as during training we initially want the model to find patterns based on known states.

We should aim for a training set that consists of equal numbers of good and bad messages (spam and not spam), so that the prior chance of any new mail being good or bad is 1-in-2.

If we go back to the challenge of the word ‘html’, we might find that both good and bad emails contain it. We might therefore take this as evidence that the word ‘html’ is neither a good nor a bad classifier for spam.
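
A hypothetical scoring function in that spirit (the function name and the small floor value c for unseen words are my assumptions, not a published API):

classify.message <- function(msg.terms, term.df, prior = 0.5, c = 1e-6) {
  probs <- term.df$occurrence[match(msg.terms, term.df$term)]
  probs[is.na(probs)] <- c     # unseen words get a small floor probability
  prior * prod(probs)          # naive independence assumption across words
}

Score a new message against both a spam term table and a ham term table and pick whichever score is larger; a word like ‘html’ that occurs equally in both tables then cancels itself out. (For long messages you’d sum log probabilities instead, to avoid underflow.)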

The rest of the data

After running the test data through your model, you’ll notice that the model starts to find some interesting patterns. We can continue to tune this, but the next step will be to try it on new data (the rest of the emails).
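
A sketch of that step, assuming test.messages is a list of tokenised messages, test.labels their true classes, and spam.terms / ham.terms the per-class term tables built during training:

spam.scores <- sapply(test.messages, classify.message, term.df = spam.terms)
ham.scores  <- sapply(test.messages, classify.message, term.df = ham.terms)
predicted   <- ifelse(spam.scores > ham.scores, "spam", "ham")
table(actual = test.labels, predicted = predicted)  # a simple confusion matrix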

One of the risks with all training data is that the sample (the training set) is finite, so the space of potential patterns is exhausted really quickly. The more data you throw at the model during training, the better the model will get.

It is common to find that your model gives a really great score for correctly predicting the spam-iness of a message based on training data; you might tune it and get a better score still, then you’ll throw real data at it and the model will fall apart. This is common, and there are some interesting ways we might think about tackling it, but I’m not going into that here.

Using Natural Language Processing (NLP)

So, what we’re seeing here is that when we know the basic building blocks of spam, we can detect it with some accuracy and with some ease. I’m doing it on a little old MacBook using internal processing power alone (that said, I’m only using the test data I linked to earlier; I’ve not tried it on actual live email).

The problem with this method is that it assumes that the evil spam empire will always use the same patterns. Unfortunately, the evil spam empire is a little cleverer than that. Indeed, spam isn’t even the biggest risk we face when considering email.

Yet whilst the risk from spam is low, it is still really interesting, and clearly the rewards are still enough that we all receive spam every single day, so someone must be falling for it!

The next answer would be to look at Natural Language Processing. I’d love to dive into this, but I am definitely not at a point in my journey where I can explain how we’d tackle it, and furthermore, my initial testing suggests that my lab (mostly one of a couple of laptops) is not cut out for it. I’m open to hearing how others are working with this; please feel free to drop links to blogs or other resources that treat NLP as a far more powerful way of identifying spam.