SMS SPAM DETECTION USING CONTENT AND NON CONTENT FEATURES

Spam SMS is unsolicited SMS sent in bulk that targets known and unknown mobile phone users. Spam SMS may take various forms, including commercial information, bogus contests, and other messages generally intended to invite a response from the receiver...

Bibliographic Details
Main Author: IMANSYAH - NIM: 23514083, RAKHMAN
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/30177
Institution: Institut Teknologi Bandung
id id-itb.:30177
spelling id-itb.:30177 2018-03-19T11:32:20Z
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Spam SMS is unsolicited SMS sent in bulk that targets known and unknown mobile phone users. Spam SMS may take various forms, including commercial information, bogus contests, and other messages generally intended to invite a response from the receiver. It is becoming increasingly difficult to distinguish spam SMS from non-spam.

Much research on SMS spam classification using content features has been conducted with good results. Meanwhile, research on classifying spam SMS using non-content features is still infrequent. This study examines the use of content and non-content features together for SMS spam detection with three widely used classifiers: Naive Bayes, Support Vector Machine (SVM), and K-Nearest Neighbour (K-NN).

The study has five phases: data sorting, labeling, preprocessing, feature extraction, and SMS classification.

Data sorting separates usable data from the rest. Data labeling determines whether each SMS in the dataset is spam or non-spam, with reference to terms often used in spam SMS. Preprocessing is applied to both content and non-content features. For content, it comprises case-folding, normalization, slang term replacement, and stopword removal. For non-content features, it defines a scale for each feature (Time, Length, and Region of sender).

The content features are extracted from the text using Bag-of-Words and then selected by word frequency, with a minimum of 50 occurrences. The non-content features are selected from attributes that have distinct values.

The experimental scenarios are a baseline search using the three predefined algorithms with content features, classification using non-content features, and classification using both content and non-content features together. The scenarios also include experiments on handling the imbalance problem in content and non-content features, changing the feature scaling, and handling incomplete values in the regional features.

The experimental results show that each non-content feature has a different effect on classification results depending on whether it is combined with the content features on its own, as one of two non-content features, or as one of three. With Naive Bayes, the combination of the Time and Category features gives better results than the baseline when used with content features. With K-NN, the best classification performance comes from the content features combined with the Category feature. With SVM, the Time feature affects classification performance when used with content features. Of the three classification algorithms, SVM shows the largest performance improvement. The classification performance measures used are accuracy, precision, F1-score, and spam and non-spam precision.
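
The content pipeline described in the abstract (case-folding, normalization, slang replacement, stopword removal, Bag-of-Words with a minimum word frequency, then concatenation with non-content features) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the `SLANG` and `STOPWORDS` tables, the English sample messages, the regular expression used for normalization, and the already-scaled Time, Length, and Region values are all hypothetical, and the demo uses a minimum count of 2 rather than 50 so that the toy data yields a vocabulary.

```python
import re
from collections import Counter

# Hypothetical lookup tables; the thesis's actual slang dictionary and
# stopword list are not given in the abstract.
SLANG = {"u": "you", "gr8": "great"}
STOPWORDS = {"the", "a", "to", "is"}


def preprocess(text):
    """Case-fold, normalize, replace slang terms, and remove stopwords."""
    text = text.lower()                        # case-folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # normalization (assumed rule)
    tokens = [SLANG.get(t, t) for t in text.split()]
    return [t for t in tokens if t not in STOPWORDS]


def build_vocabulary(token_lists, min_count=50):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return sorted(w for w, c in counts.items() if c >= min_count)


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def combine(content_vector, time_scaled, length_scaled, region_scaled):
    """Append already-scaled non-content features to the content vector."""
    return content_vector + [time_scaled, length_scaled, region_scaled]


# Toy usage with hypothetical messages and non-content values.
messages = ["WIN a FREE prize, call now!", "Are u free to call?"]
token_lists = [preprocess(m) for m in messages]
vocabulary = build_vocabulary(token_lists, min_count=2)
features = combine(bag_of_words(token_lists[0], vocabulary), 0.5, 0.2, 1.0)
print(vocabulary, features)
```

The combined vector could then be passed to any of the three classifiers named in the abstract; which scaling and which feature combination work best is exactly what the experimental scenarios compare.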