On the effects of de-obfuscation on spam detection accuracy

Bibliographic Details
Main Authors: M. E. Rafiq, A. Newaz, Marsono, Muhammad Nadzir, Gebali, Fayez
Format: Book Section
Published: Penerbit UTM 2007
Online Access: http://eprints.utm.my/id/eprint/13680/
Institution: Universiti Teknologi Malaysia
Description
Summary: Spam accounts for approximately two-thirds of the e-mail traffic over the Internet [4] and is fast becoming a major problem for IT users and network administrators. Spam costs billions in lost productivity [13] and causes more problems than the mere annoyance of delayed and lost non-spam e-mails. Naive Bayes classification has been widely used for spam detection, and several variations have been proposed [19], [1], [5]. In e-mail content classification (as in other supervised-learning techniques), the accuracy of spam detection depends on the frequency of spam features observed during training. Spam continuously evolves to circumvent detection systems and is becoming much more sophisticated [6]. Spammers obfuscate well-known spam features in different ways to circumvent spam detection [12]. Obfuscating spam features (even by substituting a character with a visually similar one) reduces the frequency and size of features observed during learning. Hence, if obfuscated spam features are de-obfuscated before detection, the accuracy of spam detection should increase. This chapter demonstrates this claim experimentally using real spam e-mails.
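
The frequency argument in the summary can be illustrated with a minimal Python sketch. It is not the chapter's method: the look-alike map, the function names, and the four toy messages are all assumptions made for illustration. It only shows how undoing single-character substitutions merges scattered variants back into one feature, so that a frequency-based learner such as Naive Bayes sees higher counts per feature.

```python
from collections import Counter

# Hypothetical look-alike map (an assumption, not taken from the chapter).
# A real de-obfuscator would need care with legitimate digits and symbols.
LOOKALIKES = str.maketrans({"0": "o", "3": "e", "1": "i", "4": "a", "@": "a", "$": "s"})


def deobfuscate(token: str) -> str:
    """Lower-case a token and undo single-character look-alike substitutions."""
    return token.lower().translate(LOOKALIKES)


def feature_counts(messages, normalise=False):
    """Count whitespace-separated tokens, optionally de-obfuscating them first."""
    counts = Counter()
    for message in messages:
        for token in message.split():
            counts[deobfuscate(token) if normalise else token.lower()] += 1
    return counts


# Toy training messages (invented for illustration only).
training_spam = ["free money", "fr3e m0ney", "fr33 money", "free m0n3y"]

raw = feature_counts(training_spam)                    # obfuscated variants counted separately
clean = feature_counts(training_spam, normalise=True)  # variants merged into one feature

print(len(raw), raw["free"], raw["money"])
print(len(clean), clean["free"], clean["money"])
```

Without normalisation the two intended features are split across six distinct tokens, each seen at most twice; with normalisation both features are seen four times, which is the effect on training frequency that the summary describes.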