Characterization of Spamming Strategies

With the development and popularization of the Internet, spam has become one of the biggest sources of unwanted traffic. Because of its highly negative impact, researchers from different Computer Science areas have worked on detecting and mitigating spam and its more dangerous facet, phishing. A notable characteristic of the spam problem is the spam arms race, in which spammers and anti-spammers both evolve, each trying to defeat the other's efforts. Continuous monitoring and measurement of spam traffic characteristics is therefore necessary. From an operational viewpoint, ISPs and organizations that fight spam must identify spam, determine its origin, and determine how spammers abuse the network infrastructure to remain anonymous. Because the collected traffic logs are usually huge, we employ data mining techniques to extract useful information from them. The spam arms race also makes spam filtering and characterization a challenging and interesting problem both for the data mining field and for Adversarial Information Retrieval. Our main motivation in this project is to employ data mining (DM) techniques to deal with these challenges and to design new DM algorithms that deal with the evolutionary aspect of spamming.

Efficient tools are required to help solve those problems and to support the establishment of strategies to detect, mitigate and accommodate spam traffic. In other words, ISPs and other spam fighters need to be aware of the most current spamming strategies, which correspond to the various techniques spammers employ to maximize the effectiveness of their attacks, reduce the probability of their messages being blocked by spam filters, and prevent their activities from being identified and tracked. In this work, we present Spam Miner, an online, scalable system capable of measuring, monitoring and characterizing spam traffic over the Internet. It processes spam traffic, extracts relevant characteristics and isolates the traffic associated with different abuses. Data mining techniques are then employed to unveil spamming strategies in real time.

Clustering Spam Messages into Spam Campaigns

To deeply understand how spammers abuse network resources and obfuscate their messages, an aggregated analysis of spam messages is not enough. Because of that, Spam Miner groups spam messages into spam campaigns, despite the obfuscations spammers apply to each message to evade spam filters. Determination of spam campaigns is important to understand how spammers act, for many reasons:

  • It isolates the traffic associated with each abuse (a spam campaign); working at the individual message level yields a fragmented view of behavior, while analyzing aggregated numbers yields averages that may not be representative

  • It creates new dimensions that can be analyzed and correlated (such as abuse duration and volume)

  • Campaigns provide a summarization criterion that reduces the amount of data to be analyzed (campaigns << messages!)
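
As an illustration of the last two points, the sketch below collapses a stream of messages into one record per campaign, with volume and duration as the new dimensions. It is only a minimal example: the `(campaign_id, timestamp)` input format, the function name `summarize_campaigns` and the sample values are assumptions made for illustration, not part of Spam Miner.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def summarize_campaigns(messages):
    """Collapse (campaign_id, timestamp) pairs into one summary per campaign.

    The input format is hypothetical; it assumes each message has
    already been assigned to a campaign.
    """
    timestamps = defaultdict(list)
    for campaign_id, ts in messages:
        timestamps[campaign_id].append(ts)
    return {
        campaign_id: {
            "volume": len(ts_list),                   # number of messages
            "duration": max(ts_list) - min(ts_list),  # first to last message
        }
        for campaign_id, ts_list in timestamps.items()
    }


# Made-up sample data: five messages belonging to two campaigns.
sample = [
    ("c1", datetime(2024, 1, 1, 10, 0)),
    ("c1", datetime(2024, 1, 1, 11, 0)),
    ("c1", datetime(2024, 1, 1, 12, 0)),
    ("c2", datetime(2024, 1, 2, 9, 0)),
    ("c2", datetime(2024, 1, 2, 9, 30)),
]
summary = summarize_campaigns(sample)
```

Later analysis steps can then work on the two campaign records instead of the five messages.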

Mining Evolving Campaign Generation Patterns

We treat spam campaign identification as a data mining clustering problem. Spam Miner implements a spam campaign identification technique based on a Frequent-Pattern Tree (see figures and the animation below), which naturally captures the invariants in message content and detects campaigns whose messages differ only in obfuscated fragments. After that, we characterize these campaigns both in terms of content obfuscation and exploitation of network resources. Our ultimate goal is to detect, characterize and identify spamming trends and evolutionary patterns that will support the development of anti-spam techniques.

Animation showing the detection of a spam campaign
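
The idea of grouping messages through a Frequent-Pattern Tree can be sketched as follows. This is a simplified illustration under stated assumptions, not Spam Miner's actual algorithm: each message is assumed to be already reduced to a set of feature strings, features below a `min_support` threshold are treated as obfuscation and dropped, and a campaign is taken to be the set of messages that share the same path of frequent features. The feature names, the threshold and the sample messages are all made up.

```python
from collections import Counter


class Node:
    """One feature in the tree; the root has feature None."""

    def __init__(self, feature=None):
        self.feature = feature
        self.children = {}
        self.messages = []  # ids of messages whose frequent path ends here


def build_fp_tree(messages, min_support=2):
    """messages: dict mapping message id -> set of feature strings."""
    support = Counter(f for feats in messages.values() for f in feats)
    root = Node()
    for msg_id, feats in messages.items():
        # Rare features (likely obfuscated fragments) are dropped.
        frequent = [f for f in feats if support[f] >= min_support]
        # Most frequent features first, so invariants sit near the root.
        frequent.sort(key=lambda f: (-support[f], f))
        node = root
        for f in frequent:
            if f not in node.children:
                node.children[f] = Node(f)
            node = node.children[f]
        node.messages.append(msg_id)
    return root


def find_campaigns(root):
    """Return {feature path: [message ids]} for nodes where messages end."""
    found = {}
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        if node.messages and path:  # messages with no frequent feature are skipped
            found[path] = sorted(node.messages)
        for feature, child in node.children.items():
            stack.append((child, path + (feature,)))
    return found


# Made-up messages: subjects are obfuscated, the other features are invariant.
messages = {
    "m1": {"lang=en", "layout=html", "url=pills.example", "subj=ch3ap"},
    "m2": {"lang=en", "layout=html", "url=pills.example", "subj=che4p"},
    "m3": {"lang=en", "layout=html", "url=pills.example", "subj=c-heap"},
    "m4": {"lang=en", "layout=text", "url=bank.example", "subj=verify 01"},
    "m5": {"lang=en", "layout=text", "url=bank.example", "subj=verify 02"},
}
result = find_campaigns(build_fp_tree(messages))
```

In this toy input every subject line is unique, so the subjects fall below the support threshold and the five messages collapse into two campaigns identified by their shared language, layout and URL.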

After spam campaigns are identified, we apply association rule mining algorithms to find co-occurrences of campaign attributes that unveil different spamming strategies. In particular, we found strong relations between the origin of the spam and how it abused the network, and also between operating systems and types of abuse. Our system may be used as an early warning system for spam, notifying administrators as soon as a spam campaign is identified and allowing them to update their defenses earlier, be it by identifying the addresses of new spam generators or by identifying a new pattern used to cloak spam from spam filters, among other possibilities. This site presents a functional prototype of the system. Here you can learn about our approach to online, incremental spam campaign detection, check general statistics about the campaigns being identified, and find our publications and the project members.
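
To make the association rule step concrete, the sketch below mines one-to-one rules of the form "attribute A implies attribute B" over a list of campaigns, keeping those that pass support and confidence thresholds. It is a minimal example and not the algorithm used in Spam Miner: the function name `mine_rules`, the thresholds, and the attribute values (`origin=X`, `os=A`, and so on) are placeholders invented for illustration, and the toy data does not reflect any measured result.

```python
from collections import Counter
from itertools import permutations


def mine_rules(campaigns, min_support=0.3, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) tuples.

    campaigns: list of sets of attribute strings, one set per campaign.
    Only rules with a single antecedent and a single consequent are mined.
    """
    n = len(campaigns)
    item_count = Counter()
    pair_count = Counter()
    for attrs in campaigns:
        items = sorted(set(attrs))
        item_count.update(items)
        # Every ordered pair (a, b) with a != b that co-occurs in this campaign.
        pair_count.update(permutations(items, 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n               # fraction of campaigns with a and b
        confidence = count / item_count[a]  # fraction of a-campaigns that have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Placeholder campaign attributes (hypothetical values, not real findings).
toy_campaigns = [
    {"origin=X", "abuse=open_proxy", "os=A"},
    {"origin=X", "abuse=open_proxy", "os=A"},
    {"origin=X", "abuse=open_proxy", "os=B"},
    {"origin=Y", "abuse=open_relay", "os=B"},
    {"origin=Y", "abuse=open_relay", "os=B"},
]
rules = mine_rules(toy_campaigns)
```

On the toy data, the rule from `origin=X` to `abuse=open_proxy` passes both thresholds, while the rule from `origin=X` to `os=A` is rejected because only two of the three `origin=X` campaigns have `os=A`.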

Ongoing Work

Currently, we are working on:

  • Mining spammers' evolution patterns

  • Comparing our data mining-based campaign identification technique with others available in the literature

  • Mining fundamental characteristics that distinguish spam campaigns from phishing campaigns

  • Inserting legitimate messages into our FP-Tree data structure to check whether it correctly differentiates spam from non-spam, thus opening space for the development of anti-spam technologies