Spam analysis and detection on microblog

Micro-blogging platforms such as Twitter and Weibo are popular platforms for information collection and dissemination. Due to ease of posting content and the huge social graph, micro-blogging platforms have attracted the attention of not only legitimate users but also spammers. Increasing activities...

Bibliographic Details
Main Author: Sedhai Surendra
Other Authors: Sun Aixin
Format: Theses and Dissertations
Language: English
Published: 2017
Subjects:
Online Access: http://hdl.handle.net/10356/69594
Institution: Nanyang Technological University
id sg-ntu-dr.10356-69594
record_format dspace
spelling sg-ntu-dr.10356-69594 2023-03-04T00:51:44Z Spam analysis and detection on microblog Sedhai Surendra Sun Aixin School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences Micro-blogging platforms such as Twitter and Weibo are popular platforms for information collection and dissemination. Due to ease of posting content and the huge social graph, micro-blogging platforms have attracted the attention of not only legitimate users but also spammers. Increasing activities of spammers on micro-blogging platforms make spam a serious problem that also affects user experience. Due to the quick propagation of content in micro-blogging services, it is highly desirable to detect spam in real time to minimize its impact. Hence, a real-time spam detection technique that leverages fine-grained information is essential. As Twitter is one of the most popular micro-blogging platforms, in this dissertation we used a Twitter dataset to study spam on microblogs. Spam detection on Twitter is an active area of research. Most studies focus on identifying spammers. Spam issues on Twitter are tackled by blocking spammers detected by the system. A user account that mistakenly grants permission to a malicious third-party application may get blocked because of posts made by the application on behalf of the user. Similarly, compromised microblog accounts may get blocked because of the tweets posted by hijackers. Further, spammers are generally detected only after posting many spam tweets. To address the issue of spam on Twitter, tweet-level spam detection is necessary. Tweet-level spam detection is inherently a challenging task, as a tweet is short and noisy text. Spam tweets heavily exploit hashtags to promote the tweets to a wider audience. Hence, we focused on hashtag-oriented spam detection by collecting tweets using trending hashtags as queries. 
Further, we propose an effective way of labeling tweets to generate a dataset for this task. To the best of our knowledge, there was no benchmark dataset; hence, we present HSpam14, a public dataset consisting of 14 million tweets labeled with the proposed labeling technique. We conducted a detailed tweet-level analysis based on hashtags and tweet content, and a user-level analysis based on user profiles. A detailed understanding of spam tweets and legitimate tweets, which are also known as ham tweets, is utilized to design a spam tweet detection system. Unlabeled tweets are easy to obtain, and they can be utilized to improve system performance and to deal with evolving spamming activities. Hence, we propose a semi-supervised real-time spam detection system to effectively identify spam tweets. Spam in a microblog causes problems for several functionalities such as search, recommendation, and text analysis. As a case study, we analyze the effect of spam tweets on hashtag recommendation using the HSpam14 dataset. We observed that features and methods that are effective for a spam tweet collection may not be effective for legitimate tweets. Our study shows that experiments conducted on a spammy dataset give misleading results. Hence, it is crucial to perform spam filtering before conducting any analysis on Twitter. In a nutshell, this dissertation elucidates the effectiveness of tweet-level spam detection and different aspects of spam and ham tweets, and also paves the way for further research on fine-grained spam detection on microblogs. Doctor of Philosophy (SCE) 2017-02-28T01:27:47Z 2017-02-28T01:27:47Z 2017 Thesis Sedhai Surendra. (2017). Spam analysis and detection on microblog. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/69594 10.32657/10356/69594 en 152 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences
spellingShingle DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences
Sedhai Surendra
Spam analysis and detection on microblog
description Micro-blogging platforms such as Twitter and Weibo are popular platforms for information collection and dissemination. Due to ease of posting content and the huge social graph, micro-blogging platforms have attracted the attention of not only legitimate users but also spammers. Increasing activities of spammers on micro-blogging platforms make spam a serious problem that also affects user experience. Due to the quick propagation of content in micro-blogging services, it is highly desirable to detect spam in real time to minimize its impact. Hence, a real-time spam detection technique that leverages fine-grained information is essential. As Twitter is one of the most popular micro-blogging platforms, in this dissertation we used a Twitter dataset to study spam on microblogs. Spam detection on Twitter is an active area of research. Most studies focus on identifying spammers. Spam issues on Twitter are tackled by blocking spammers detected by the system. A user account that mistakenly grants permission to a malicious third-party application may get blocked because of posts made by the application on behalf of the user. Similarly, compromised microblog accounts may get blocked because of the tweets posted by hijackers. Further, spammers are generally detected only after posting many spam tweets. To address the issue of spam on Twitter, tweet-level spam detection is necessary. Tweet-level spam detection is inherently a challenging task, as a tweet is short and noisy text. Spam tweets heavily exploit hashtags to promote the tweets to a wider audience. Hence, we focused on hashtag-oriented spam detection by collecting tweets using trending hashtags as queries. Further, we propose an effective way of labeling tweets to generate a dataset for this task. To the best of our knowledge, there was no benchmark dataset; hence, we present HSpam14, a public dataset consisting of 14 million tweets labeled with the proposed labeling technique. 
We conducted a detailed tweet-level analysis based on hashtags and tweet content, and a user-level analysis based on user profiles. A detailed understanding of spam tweets and legitimate tweets, which are also known as ham tweets, is utilized to design a spam tweet detection system. Unlabeled tweets are easy to obtain, and they can be utilized to improve system performance and to deal with evolving spamming activities. Hence, we propose a semi-supervised real-time spam detection system to effectively identify spam tweets. Spam in a microblog causes problems for several functionalities such as search, recommendation, and text analysis. As a case study, we analyze the effect of spam tweets on hashtag recommendation using the HSpam14 dataset. We observed that features and methods that are effective for a spam tweet collection may not be effective for legitimate tweets. Our study shows that experiments conducted on a spammy dataset give misleading results. Hence, it is crucial to perform spam filtering before conducting any analysis on Twitter. In a nutshell, this dissertation elucidates the effectiveness of tweet-level spam detection and different aspects of spam and ham tweets, and also paves the way for further research on fine-grained spam detection on microblogs.
author2 Sun Aixin
author_facet Sun Aixin
Sedhai Surendra
format Theses and Dissertations
author Sedhai Surendra
author_sort Sedhai Surendra
title Spam analysis and detection on microblog
title_short Spam analysis and detection on microblog
title_full Spam analysis and detection on microblog
title_fullStr Spam analysis and detection on microblog
title_full_unstemmed Spam analysis and detection on microblog
title_sort spam analysis and detection on microblog
publishDate 2017
url http://hdl.handle.net/10356/69594
_version_ 1759855154574655488