Spam analysis and detection on microblog

Micro-blogging platforms such as Twitter and Weibo are popular platforms for information collection and dissemination. Due to ease of posting content and the huge social graph, micro-blogging platforms have attracted the attention of not only legitimate users but also spammers. Increasing activities...

Bibliographic Details
Main Author:	Sedhai Surendra
Other Authors:	Sun Aixin
Format:	Theses and Dissertations
Language:	English
Published:	2017
Subjects:	DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences
Online Access:	http://hdl.handle.net/10356/69594
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-69594
record_format	dspace
spelling	sg-ntu-dr.10356-69594 2023-03-04T00:51:44Z Spam analysis and detection on microblog Sedhai Surendra Sun Aixin School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences Doctor of Philosophy (SCE) 2017-02-28T01:27:47Z 2017-02-28T01:27:47Z 2017 Thesis Sedhai Surendra. (2017). Spam analysis and detection on microblog. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/69594 10.32657/10356/69594 en 152 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences
spellingShingle	DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences Sedhai Surendra Spam analysis and detection on microblog
description	Micro-blogging platforms such as Twitter and Weibo are popular platforms for information collection and dissemination. Owing to the ease of posting content and the huge social graph, micro-blogging platforms have attracted the attention of not only legitimate users but also spammers. The increasing activity of spammers on micro-blogging platforms makes spam a serious problem that also affects user experience. Because content propagates quickly in micro-blogging services, it is highly desirable to detect spam in real time to minimize its impact. Hence, a real-time spam detection technique that leverages fine-grained information is essential. As Twitter is one of the most popular micro-blogging platforms, in this dissertation we use a Twitter dataset to study spam on microblogs. Spam detection on Twitter is an active area of research. Most studies focus on identifying spammers, and spam on Twitter is tackled by blocking the spammers that the system detects. A user account that mistakenly grants permission to a malicious third-party application may get blocked because of posts the application makes on the user's behalf. Similarly, compromised microblog accounts may get blocked because of tweets posted by hijackers. Further, spammers are generally detected only after posting many spam tweets. To address the issue of spam on Twitter, tweet-level spam detection is therefore necessary. Tweet-level spam detection is inherently challenging because a tweet is a short and noisy text. Spam tweets heavily exploit hashtags to promote themselves to a wider audience. Hence, we focus on hashtag-oriented spam detection, collecting tweets by using trending hashtags as queries. Further, we propose an effective way of labeling tweets to generate a dataset for this task. To the best of our knowledge, no benchmark dataset existed; hence we present HSpam14, a public dataset consisting of 14 million tweets labeled with the proposed labeling technique. We carry out a detailed tweet-level analysis based on hashtags and tweet content, and a user-level analysis based on user profiles. The resulting understanding of spam tweets and legitimate tweets, which are also known as ham tweets, is used to design a spam tweet detection system. Unlabeled tweets are easy to obtain, and they can be used to improve system performance and to deal with evolving spamming activities. Hence, we propose a semi-supervised real-time spam detection system to identify spam tweets effectively. Spam in a microblog causes problems for several functionalities such as search, recommendation, and text analysis. As a case study, we analyze the effect of spam tweets on hashtag recommendation using the HSpam14 dataset. We observe that features and methods that are effective on a spam tweet collection may not be effective on legitimate tweets. Our study shows that an experiment conducted on a spammy dataset gives misleading results. Hence, it is crucial to perform spam filtering before conducting any analysis on Twitter. In a nutshell, this dissertation elucidates the effectiveness of tweet-level spam detection and different aspects of spam and ham tweets, and it paves the way for further research on fine-grained spam detection on microblogs.
author2	Sun Aixin
author_facet	Sun Aixin Sedhai Surendra
format	Theses and Dissertations
author	Sedhai Surendra
author_sort	Sedhai Surendra
title	Spam analysis and detection on microblog
title_short	Spam analysis and detection on microblog
title_full	Spam analysis and detection on microblog
title_fullStr	Spam analysis and detection on microblog
title_full_unstemmed	Spam analysis and detection on microblog
title_sort	spam analysis and detection on microblog
publishDate	2017
url	http://hdl.handle.net/10356/69594
_version_	1759855154574655488

Similar Items