ML Classifiers for Email Spam Filtering


In today’s fast-paced society, everyone is busy juggling day-to-day activities. We want to finish our work quickly and with undivided focus.


Our inboxes get flooded with unnecessary emails that eat up our online real estate. Some are unwanted commercial bulk emails, called spam; others come from seemingly legitimate websites and are called phishing emails.

I have been wondering how these annoying emails can be stopped, so I researched the different ways in which machine learning can help us detect them and keep them out of our inbox.

I am going to list some of the feature extraction and classification techniques we can use to separate important/useful emails from spam/junk emails.


Classification in Machine Learning

Classification is the process of sorting a given set of data into categories. Think of a grain sorting machine that separates bad/disfigured grains from ripe ones.


Decision Tree Classifier

We are going to explore the Decision Tree Classifier.

Some aspects of the Decision Tree Classifier are listed below.

  • Decision Trees (DT) can be used both for classification and regression.

  • The advantage of decision trees is that they require very little data preparation.

  • They do not require feature scaling or centering at all.

  • They are also the fundamental components of Random Forests, one of the most powerful ML algorithms.

  • Unlike Random Forests and Neural Networks (which do black-box modeling), Decision Trees are white-box models, which means that the inner workings of these models are clearly understood.

  • In the case of classification, the data is segregated based on a series of questions.

  • Any new data point is passed down the tree by answering these questions and is assigned the class of the leaf node it reaches.
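To make the "series of questions" idea concrete, here is a minimal sketch of a decision tree in plain Python. It is only an illustration under my own assumptions: the toy data is made up, the two features are hypothetical, and Gini impurity is used as the splitting criterion (one common choice, not something prescribed above).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the question 'feature f <= t?' with the lowest weighted Gini."""
    best = None  # (score, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # this question does not separate anything
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Return a leaf (a class label) or a node (feature, threshold, left, right)."""
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )

def predict(tree, row):
    """Answer the questions from the root down until a leaf is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Made-up toy data: [non-standard punctuation count, uppercase/lowercase ratio]
X = [[0, 0.05], [1, 0.10], [0, 0.08], [9, 1.50], [12, 3.80], [7, 0.90]]
y = ["ham", "ham", "ham", "spam", "spam", "spam"]

tree = build_tree(X, y)
print(predict(tree, [10, 2.0]))
print(predict(tree, [0, 0.07]))
```

Because the tree is just nested questions, we can print it and read exactly why an email was labelled the way it was, which is the white-box property mentioned above.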

To classify a newly arrived email as spam or not, we can train the machine with the features below.

Using message body & subject

  • Frequency of non-standard punctuation

  • Ratio of uppercase to lowercase letters in the text

  • Recipient age, sex and country

  • Recipient replied: This boolean value indicates whether the recipient replied to the message.
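As a rough sketch, the two text-based features in this list could be computed like this. The function name, the feature names and the set of characters I treat as "standard" punctuation are all my own assumptions for illustration.

```python
import string

# Assumption: these characters count as "standard" punctuation; anything else
# in string.punctuation (for example "!", "$", ">") is treated as non-standard.
STANDARD_PUNCTUATION = set(".,?'\"-:;()")

def extract_text_features(subject, body):
    """Return simple text features for one email (hypothetical feature names)."""
    text = f"{subject}\n{body}"
    non_standard = sum(
        1 for ch in text
        if ch in string.punctuation and ch not in STANDARD_PUNCTUATION
    )
    upper = sum(1 for ch in text if ch.isupper())
    lower = sum(1 for ch in text if ch.islower())
    return {
        # Share of characters that are non-standard punctuation
        "non_standard_punct_freq": non_standard / max(len(text), 1),
        # max(..., 1) avoids division by zero for all-uppercase text
        "upper_to_lower_ratio": upper / max(lower, 1),
    }

spam_like = extract_text_features("WIN $$$ NOW!!!", "CLICK HERE >>> FREE prize!!!")
ham_like = extract_text_features("Meeting notes", "Hi team, the notes are attached.")
print(spam_like)
print(ham_like)
```

The resulting numbers can then be fed to a classifier as one row of features per email.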

Using Sender Account Features

  • Sender Country: Recording the user’s IP address at every login helps in finding the distribution of countries users log in from, which can be compared with the country stated on their profile.

  • Sender IPs: Spammers generally log in from a smaller number of unique IP addresses than ham (legitimate) users.

  • Sender & Recipient Age: The age on spammer profiles tends to be higher, which reflects targeting of older users, who are more likely to be financially stable.

  • Sender Birthday: Spam users may have birthdays early in the month and birth months early in the year.
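Here is a small sketch of how the login-based sender features above could be aggregated. The log format, the sender names and the IP addresses (taken from the reserved documentation ranges) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical login log: (sender_id, ip_address, country_derived_from_ip)
login_log = [
    ("alice", "203.0.113.5", "US"),
    ("alice", "198.51.100.7", "US"),
    ("alice", "192.0.2.44", "CA"),
    ("bulk_sender", "203.0.113.99", "US"),
    ("bulk_sender", "203.0.113.99", "US"),
    ("bulk_sender", "203.0.113.99", "US"),
]

def sender_features(log):
    """Aggregate per-sender login statistics from (sender, ip, country) records."""
    ips = defaultdict(set)
    countries = defaultdict(set)
    logins = defaultdict(int)
    for sender, ip, country in log:
        ips[sender].add(ip)
        countries[sender].add(country)
        logins[sender] += 1
    return {
        sender: {
            "logins": logins[sender],
            "unique_ips": len(ips[sender]),
            "ip_countries": sorted(countries[sender]),
        }
        for sender in logins
    }

features = sender_features(login_log)
print(features["alice"])
print(features["bulk_sender"])
```

The "unique_ips" count is the Sender IPs feature, and "ip_countries" can be compared against the country on the sender's profile.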

Based on the features above, the plot shows a linear separation of spam and ham emails obtained using an SVM (Support Vector Machine).
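For readers who want to see what a linear SVM does under the hood, below is a minimal sketch that fits one with sub-gradient descent on the regularized hinge loss. This is not the code behind my plot; the toy data, the learning rate, the regularization strength and the number of epochs are all assumptions picked for illustration.

```python
def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=1000):
    """Fit weights w and bias b with per-sample sub-gradient descent on the hinge loss.

    Labels must be +1 (spam) or -1 (ham).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Misclassified or inside the margin: push toward the correct side
                w = [wi + lr * (label * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * label
            else:
                # Correct with enough margin: only apply the L2 shrinkage
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

def predict(w, b, x):
    """Label a point by which side of the line w.x + b = 0 it falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "spam" if score >= 0 else "ham"

# Made-up toy data: [non-standard punctuation count, uppercase/lowercase ratio]
X = [[0, 0.05], [1, 0.10], [0, 0.08], [9, 1.50], [12, 3.80], [7, 0.90]]
y = [-1, -1, -1, 1, 1, 1]  # -1 = ham, +1 = spam

w, b = train_linear_svm(X, y)
print(predict(w, b, [10, 2.0]))
print(predict(w, b, [0, 0.07]))
```

The learned w and b describe the straight line that separates the two groups, which is what a linear SVM plot visualizes.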



This is a sneak peek into my understanding of how classification in ML can be used to solve common day-to-day problems. Feel free to let me know your feedback/comments.
