Learning to classify e-mail

In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. Firstly, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail f...

Bibliographic Details
Main Authors:	KOPRINSKA, Irena; POON, Josiah; CLARK, James; CHAN, Jason Yuk Hin
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2007
Subjects:	e-mail classification into folders; spam e-mail filtering; random forest; co-training; machine learning; Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/7703 https://ink.library.smu.edu.sg/context/sis_research/article/8706/viewcontent/Learning_to_classify_e_mail.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-8706
record_format	dspace
spelling	sg-smu-ink.sis_research-8706 2023-01-10T03:06:35Z 2007-05-01T07:00:00Z text application/pdf info:doi/10.1016/j.ins.2006.12.005 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	e-mail classification into folders; spam e-mail filtering; random forest; co-training; machine learning; Databases and Information Systems
description	In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. Firstly, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks as it runs fast on large and high-dimensional databases, is easy to tune and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naive Bayes. We introduce a new accurate feature selector with linear time complexity. Secondly, we examine the applicability of the semi-supervised co-training paradigm for spam e-mail filtering by employing random forests, support vector machines, decision trees and naive Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples to an accuracy rate only 5% lower than that of a classifier trained on all labelled examples. We investigate the performance of co-training with one natural feature split and show that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits. (C) 2006 Elsevier Inc. All rights reserved.
format	text
author	KOPRINSKA, Irena; POON, Josiah; CLARK, James; CHAN, Jason Yuk Hin
author_sort	KOPRINSKA, Irena
title	Learning to classify e-mail
publisher	Institutional Knowledge at Singapore Management University
publishDate	2007
url	https://ink.library.smu.edu.sg/sis_research/7703 https://ink.library.smu.edu.sg/context/sis_research/article/8706/viewcontent/Learning_to_classify_e_mail.pdf
_version_	1770576417440923648

Similar Items