SMS SPAM DETECTION USING CONTENT AND NON CONTENT FEATURES

Spam SMS is unsolicited SMS sent in bulk that targets known and unknown mobile phone users. Spam SMS may take various forms, including commercial information, bogus contests, and other messages generally intended to invite a response from the receiver...

Bibliographic Details
Main Author:	IMANSYAH - NIM: 23514083, RAKHMAN
Format:	Theses
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/30177
Institution:	Institut Teknologi Bandung

id	id-itb.:30177
spelling	id-itb.:30177 2018-03-19T11:32:20Z SMS SPAM DETECTION USING CONTENT AND NON CONTENT FEATURES IMANSYAH - NIM: 23514083, RAKHMAN Indonesia Theses INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/30177 text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Spam SMS is unsolicited SMS sent in bulk that targets known and unknown mobile phone users. Spam SMS may take various forms, including commercial information, bogus contests, and other messages generally intended to invite a response from the receiver. It is becoming increasingly difficult to distinguish spam SMS from non-spam.

Much research on SMS spam classification using content features has been conducted, with good results. Research on classifying spam SMS using non-content features, by contrast, is still infrequent. This study examines the simultaneous use of content and non-content features for SMS spam detection with the most commonly used classifiers: Naive Bayes, Support Vector Machine (SVM), and K-Nearest Neighbour (K-NN).

The study has five phases: data sorting, labeling, preprocessing, feature extraction, and SMS classification.

Data sorting separates usable data from the rest. Data labeling determines whether each SMS in the dataset is spam or non-spam, with reference to terms commonly used in spam SMS. Preprocessing is applied to both content and non-content features. Content preprocessing includes case-folding, normalization, slang term replacement, and stopword removal. Non-content preprocessing includes defining a scale for each feature (Time, Length, and Region of sender).

The content features are extracted from the text using Bag-of-Words, then selected by word frequency with a minimum of 50 occurrences. The non-content features are selected from attributes that have distinct values.

The experimental scenarios are a baseline search with the three predefined algorithms using content features, classification using non-content features, and classification using both content and non-content features simultaneously. The scenarios also include experiments on handling the class imbalance problem for content and non-content features, on changing the feature scaling, and on incomplete values in the regional feature.

The experimental results show that each non-content feature affects classification results differently depending on whether it is combined with content features on its own, as one of two non-content features, or as one of three. With Naive Bayes, the combination of the Time and Category features gives better results than the baseline when used together with content features. With K-NN, the best classification performance comes from content features combined with the Category feature. With SVM, the Time feature affects classification performance when used together with content features. Of the three classification algorithms, SVM shows the greatest improvement in performance. The classification performance measures used are accuracy, precision, F1-score, and spam and non-spam precision.
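The preprocessing and feature-extraction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the slang and stopword tables, the sample messages, the function names, and the pre-scaled non-content values (standing in for Time, Length, and Region) are all hypothetical, and the minimum word frequency is lowered from 50 to 2 so that the toy data yields a non-empty vocabulary.

```python
import re
from collections import Counter

# Hypothetical lookup tables; the thesis's actual slang and stopword lists are not given.
SLANG = {"u": "you", "gratis": "free"}
STOPWORDS = {"the", "a", "to", "is", "you"}


def preprocess(text):
    """Case-fold, normalize, replace slang terms, and drop stopwords."""
    text = text.lower()                       # case-folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # normalization: keep letters, digits, whitespace
    tokens = [SLANG.get(tok, tok) for tok in text.split()]  # slang replacement
    return [tok for tok in tokens if tok not in STOPWORDS]  # stopword removal


def build_vocabulary(token_lists, min_count):
    """Keep the words whose total number of occurrences is at least min_count."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return sorted(word for word, n in counts.items() if n >= min_count)


def to_vector(tokens, vocabulary, non_content):
    """Bag-of-Words counts followed by the already-scaled non-content values."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary] + list(non_content)


# Invented messages, each paired with assumed (time, length, region) values in [0, 1].
messages = [
    ("WIN a FREE prize!!! Reply now", (0.9, 0.2, 0.5)),
    ("u get gratis prize, reply now", (0.8, 0.2, 0.5)),
    ("See you at the meeting", (0.4, 0.1, 0.1)),
]
token_lists = [preprocess(text) for text, _ in messages]
vocabulary = build_vocabulary(token_lists, min_count=2)  # the abstract uses 50
vectors = [
    to_vector(tokens, vocabulary, extra)
    for tokens, (_, extra) in zip(token_lists, messages)
]
```

Each resulting vector concatenates word counts with the non-content values, so a single classifier such as Naive Bayes, SVM, or K-NN could be trained on content and non-content features together.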
format	Theses
author	IMANSYAH - NIM: 23514083, RAKHMAN
title	SMS SPAM DETECTION USING CONTENT AND NON CONTENT FEATURES
url	https://digilib.itb.ac.id/gdl/view/30177

Similar Items