SMS SPAM DETECTION USING CONTENT AND NON CONTENT FEATURES
Spam SMS is unsolicited SMS sent in bulk to known and unknown mobile phone users. It can take various forms, including commercial information, bogus contests, and other messages generally intended to invite a response from the receiver...
Main Author: | IMANSYAH - NIM: 23514083, RAKHMAN |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/30177 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:30177 |
---|---|
spelling |
id-itb.:30177 2018-03-19T11:32:20Z SMS SPAM DETECTION USING CONTENT AND NON CONTENT FEATURES IMANSYAH - NIM: 23514083, RAKHMAN Indonesia Theses INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/30177 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Spam SMS is unsolicited SMS sent in bulk to known and unknown mobile phone users. It can take various forms, including commercial information, bogus contests, and other messages generally intended to invite a response from the receiver. It is becoming increasingly difficult to distinguish spam SMS from non-spam.

Much research on SMS spam classification using content features has been conducted, with good results. Research on spam SMS classification using non-content features, by contrast, remains scarce. This study examines the simultaneous use of content and non-content features for SMS spam detection with three widely used classifiers: Naive Bayes, Support Vector Machine (SVM), and K-Nearest Neighbour (K-NN).

The study has five phases: data sorting, labeling, preprocessing, feature extraction, and SMS classification.

Data sorting separates usable data from the rest. Data labeling determines whether each SMS in the dataset is spam or non-spam, with reference to terms often used in spam SMS. Preprocessing is applied to both content and non-content features. Content preprocessing consists of case-folding, normalization, slang term replacement, and stopword removal. Non-content preprocessing consists of defining a scale for each feature (Time, Length, and Region of sender).

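The content preprocessing steps named above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the `SLANG` and `STOPWORDS` tables are hypothetical placeholders (the abstract does not give the actual dictionaries), and "normalization" is assumed here to mean stripping punctuation.

```python
import re

# Hypothetical lookup tables; the actual slang dictionary and stopword
# list used in the thesis are not given in the abstract.
SLANG = {"u": "you", "ur": "your", "pls": "please", "txt": "text"}
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your"}

def preprocess(message: str) -> list[str]:
    text = message.lower()                        # case-folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # normalization (assumed: drop punctuation)
    tokens = text.split()                         # whitespace tokenization
    tokens = [SLANG.get(t, t) for t in tokens]    # slang term replacement
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("Pls TXT ur answer to 12345 NOW!!!"))
# prints ['please', 'text', 'answer', '12345', 'now']
```

The order matters in this sketch: slang is expanded before stopwords are removed, so an expansion such as "ur" to "your" is still filtered out.
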
Content features are extracted from the text using Bag-of-Words and then selected by word frequency, with a minimum of 50 occurrences. Non-content features are selected from attributes that have distinct values.

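As a rough illustration of Bag-of-Words extraction with a frequency threshold, the sketch below builds a vocabulary and maps messages to count vectors. It assumes the threshold of 50 applies to total occurrences across the corpus (the abstract does not say whether it counts occurrences or messages); the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def build_vocabulary(tokenized_messages, min_count=50):
    """Keep words whose total corpus frequency is at least min_count."""
    counts = Counter(tok for msg in tokenized_messages for tok in msg)
    return sorted(w for w, c in counts.items() if c >= min_count)

def to_bow(tokens, vocabulary):
    """Map one tokenized message to a vector of word counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

# Toy corpus; min_count is lowered to 2 only so the example is not empty.
corpus = [["free", "prize", "call"], ["call", "me", "later"], ["free", "call", "now"]]
vocab = build_vocabulary(corpus, min_count=2)
vectors = [to_bow(msg, vocab) for msg in corpus]
print(vocab)    # ['call', 'free']
print(vectors)  # [[1, 1], [1, 0], [1, 1]]
```
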
The experimental scenarios are: a baseline search using the three predefined algorithms with content features only, classification using non-content features only, and classification using content and non-content features together. The scenarios also include experiments on handling class imbalance in the content and non-content features, changing the feature scaling, and handling incomplete values in the Region feature.

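One way to picture the combined scenario is to append scaled non-content values to each Bag-of-Words vector. The sketch below assumes min-max scaling to [0, 1] and uses made-up values for sending hour and message length; the abstract does not specify the scales the thesis actually defines, and the Region feature is left out because its encoding is not described.

```python
def min_max_scale(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical data: one row per message.
bow = [[1, 1], [1, 0], [0, 1]]   # content features (word counts)
hours = [2, 14, 23]              # Time: hour the SMS was sent
lengths = [160, 40, 100]         # Length: characters in the SMS

scaled_hours = min_max_scale(hours)
scaled_lengths = min_max_scale(lengths)

# Concatenate content and non-content features into one vector per message.
combined = [b + [h, l] for b, h, l in zip(bow, scaled_hours, scaled_lengths)]
print(combined[0])  # [1, 1, 0.0, 1.0]
```

Scaling matters most for distance-based classifiers such as K-NN, where an unscaled length in characters would otherwise dominate the word counts.
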
The experimental results show that each non-content feature affects classification differently depending on whether it is combined with the content features on its own, as one of two non-content features, or as one of three. With Naive Bayes, the combination of the Time and Category features improves on the baseline when used together with content features. With K-NN, the best classification performance comes from content features combined with the Category feature. With SVM, the Time feature affects classification performance when used together with content features. Of the three classifiers, SVM shows the largest performance improvement. The performance measures used are accuracy, precision, F1-score, and spam and non-spam precision. |
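For reference, the performance measures listed in the abstract can be computed from true and predicted labels as in the sketch below. The labels are invented for illustration and are not results from the thesis; recall is computed only because the F1-score is defined as the harmonic mean of precision and recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels, for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

acc = accuracy(y_true, y_pred)
spam_scores = per_class_metrics(y_true, y_pred, "spam")
ham_scores = per_class_metrics(y_true, y_pred, "ham")
print(acc, spam_scores, ham_scores)
```

Reporting precision separately for the spam and non-spam classes, as the abstract does, helps when the classes are imbalanced, since overall accuracy can look good even if one class is handled poorly.
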
format |
Theses |
author |
IMANSYAH - NIM: 23514083, RAKHMAN |
title |
SMS SPAM DETECTION USING CONTENT AND NON CONTENT FEATURES |
url |
https://digilib.itb.ac.id/gdl/view/30177 |
_version_ |
1822267351561666560 |