Spam emails, also known as junk mail, are sent to a large number of users at once and often contain scams, phishing content, or cryptic messages. Spam emails are sometimes sent manually by a human, but most often they are sent using a bot. Examples of spam emails include fake ads, chain emails, and impersonation attempts. A particularly well-disguised spam email may still land in your inbox, and clicking anything in it can be dangerous. It’s important to take extra precautions to protect your device and sensitive information.
As technology evolves, detecting spam email remains a challenging task because spam itself keeps changing. Spam is quite different from other types of security threats. It may at first appear to be an annoying message rather than a threat, but its effect is immediate. Spammers also frequently adopt new techniques. Organizations that provide email services want to minimize spam as much as possible to avoid any damage to their end customers.
In this post, we show how straightforward it is to build an email spam detector using Amazon SageMaker. The built-in BlazingText algorithm offers optimized implementations of Word2vec and text classification algorithms. Word2vec is useful for various natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is essential for applications like web searches, information retrieval, ranking, and document classification.
This post demonstrates how you can set up an email spam detector and filter spam emails using SageMaker. Let’s see how a spam detector typically works, as shown in the following diagram.
Emails are sent through a spam detector. An email is sent to the spam folder if the spam detector detects it as spam. Otherwise, it’s sent to the customer’s inbox.
We walk you through the following steps to set up our spam detector model:
Before diving into this use case, complete the following prerequisites:
Download the email_dataset.csv from GitHub and upload the file to the S3 bucket.
The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload that file to the respective channel.
To perform the data load, complete the following steps:
We can see our dataset is balanced.
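One way to check the balance yourself is to count how often each label occurs. The following is a minimal sketch, not the notebook's own code: it assumes the labels have already been read into a Python list, and the function names and the 10% tolerance are hypothetical choices for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows that carry each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_balanced(labels, tolerance=0.1):
    """True when every class share is within `tolerance` of an even split."""
    shares = class_shares(labels)
    even_share = 1 / len(shares)
    return all(abs(share - even_share) <= tolerance for share in shares.values())


# Hypothetical labels: 0 = ham, 1 = spam.
sample_labels = [0, 1, 1, 0, 0, 1]
print(class_shares(sample_labels))
print(is_balanced(sample_labels))
```

If one class dominates, consider resampling or collecting more examples of the minority class before training.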
The BlazingText algorithm expects the data in the following format:
__label__<label> <text>
Here’s an example:
__label__0 “This is HAM”
__label__1 “This is SPAM”
For more information, see Training and Validation Data Format for the BlazingText Algorithm.
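To make the format concrete, here is a small sketch that turns a label and a raw message into one BlazingText training line. The function name `to_blazingtext_line` is hypothetical, and lowercasing and whitespace tokenization are simplifying assumptions rather than the preprocessing used in the notebook.

```python
def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> <space-separated tokens>'."""
    # split() also collapses embedded newlines, so each example stays on one line.
    tokens = text.lower().split()
    return f"__label__{label} " + " ".join(tokens)


# Labels follow the example above: 0 = ham, 1 = spam.
rows = [(0, "This is HAM"), (1, "This is SPAM")]
lines = [to_blazingtext_line(label, text) for label, text in rows]
print("\n".join(lines))
```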
You now run the data preparation step in the notebook.
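Data preparation usually ends with splitting the formatted lines into a training set and a validation set. The sketch below shows one way to do that with the standard library only. The 80/20 split, the fixed seed, and the function name are assumptions for illustration, not values taken from the notebook.

```python
import random


def split_train_validation(lines, validation_fraction=0.2, seed=42):
    """Shuffle the formatted lines and split them into (train, validation)."""
    shuffled = list(lines)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical, already formatted examples.
examples = [f"__label__{i % 2} message number {i}" for i in range(10)]
train_lines, validation_lines = split_train_validation(examples)
print(len(train_lines), len(validation_lines))
```

Each resulting list can then be written to its own text file, one example per line, and uploaded to the matching channel in Amazon S3.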
To train the model, complete the following steps in the notebook:
BlazingText has both unsupervised and supervised learning modes. Our use case is text classification, which is supervised learning.
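As a rough sketch of what supervised training can look like with the SageMaker Python SDK, the following separates the hyperparameters from the training call. The hyperparameter values, the `ml.c5.xlarge` instance type, and both function names are illustrative assumptions, not the settings used in the notebook. `train_classifier` needs AWS credentials, an IAM role, and the `sagemaker` package, so it is defined here but not called.

```python
def supervised_hyperparameters(epochs=10, word_ngrams=2):
    """Illustrative BlazingText settings for text classification."""
    return {
        "mode": "supervised",  # text classification rather than Word2vec
        "epochs": epochs,
        "min_count": 2,
        "learning_rate": 0.05,
        "vector_dim": 10,
        "early_stopping": True,
        "patience": 4,
        "min_epochs": 5,
        "word_ngrams": word_ngrams,
    }


def train_classifier(role, region, train_s3_uri, validation_s3_uri, output_s3_uri):
    """Start a training job; all arguments are supplied by the caller."""
    # Imported lazily so the rest of this sketch runs without the SDK installed.
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    estimator = Estimator(
        image_uri=image_uris.retrieve(framework="blazingtext", region=region),
        role=role,
        instance_count=1,
        instance_type="ml.c5.xlarge",  # assumed instance type
        output_path=output_s3_uri,
    )
    estimator.set_hyperparameters(**supervised_hyperparameters())
    estimator.fit(
        {
            "train": TrainingInput(train_s3_uri, content_type="text/plain"),
            "validation": TrainingInput(validation_s3_uri, content_type="text/plain"),
        }
    )
    return estimator


print(supervised_hyperparameters()["mode"])
```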
In this step, we deploy the trained model as an endpoint. Choose your preferred instance
Let’s provide an example of three email messages that we want to get predictions for:
Tokenize the email message and specify the payload to use when calling the REST API.
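The following sketch shows one way to build such a payload. It uses a simple regular-expression tokenizer as a stand-in, so the tokens may differ from those produced by the tokenizer in the notebook. The sample messages and function names are hypothetical. The `instances` key is the JSON request format that BlazingText text classification endpoints accept.

```python
import json
import re


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_payload(messages):
    """Wrap space-separated token strings in the 'instances' request format."""
    return {"instances": [" ".join(tokenize(message)) for message in messages]}


# Hypothetical messages for illustration.
messages = ["Win a FREE prize!", "Are we still meeting at noon?"]
payload = build_payload(messages)
print(json.dumps(payload))
```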
Now we can predict the email classification for each email. Call the predict method of the text classifier, passing the tokenized sentence instances (payload) into the data argument.
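The endpoint returns JSON with one entry per instance, each holding a `label` list and a `prob` list. The sketch below parses such a response into readable results. The sample response body is made up for illustration, and the mapping of `0` to ham and `1` to spam follows the labeling example earlier in this post.

```python
import json

LABEL_NAMES = {"0": "ham", "1": "spam"}


def parse_predictions(response_body):
    """Convert a BlazingText JSON response into (class name, probability) pairs."""
    results = []
    for prediction in json.loads(response_body):
        # The top label arrives as '__label__<label>'; strip the prefix.
        label = prediction["label"][0].replace("__label__", "")
        results.append((LABEL_NAMES.get(label, label), prediction["prob"][0]))
    return results


# Made-up response body, not real model output.
sample_response = (
    '[{"label": ["__label__1"], "prob": [0.98]},'
    ' {"label": ["__label__0"], "prob": [0.91]}]'
)
print(parse_predictions(sample_response))
```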
Finally, you can delete the endpoint to avoid any unexpected cost.
Also, delete the data file from the S3 bucket.
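Both cleanup steps can be sketched as a single helper. It assumes `predictor` is the object returned when the model was deployed with the SageMaker Python SDK and `s3_client` is a boto3 S3 client. The function name is hypothetical, and the bucket and key are placeholders to replace with your own values.

```python
def clean_up(predictor, s3_client, bucket, key):
    """Delete the inference endpoint, then the uploaded data file."""
    predictor.delete_endpoint()  # stops charges for the hosted endpoint
    s3_client.delete_object(Bucket=bucket, Key=key)


# Example call (requires AWS credentials and a deployed predictor):
#   import boto3
#   clean_up(predictor, boto3.client("s3"), "<your-bucket>", "email_dataset.csv")
```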
In this post, we walked you through the steps to create an email spam detector using the SageMaker BlazingText algorithm. With the BlazingText algorithm, you can scale to large datasets. BlazingText is used for textual analysis and text classification problems, and has both unsupervised and supervised learning modes. You can use the algorithm for use cases like customer sentiment analysis and text classification.
To learn more about the BlazingText algorithm, check out BlazingText algorithm.
Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.