Near duplicate detection on tweets

Social media has been increasingly adopted as a mode of communication throughout the world, causing the amount of data to grow at an alarming rate and raising concerns over the management and analysis of big data. As a result, data and business analytics have become increasingly popular as they seek to...

Bibliographic Details
Main Author:	Ng, Alvin Keng Hian
Other Authors:	Yeo Chai Kiat
Format:	Final Year Project
Language:	English
Published:	2016
Subjects:	DRNTU::Engineering
Online Access:	http://hdl.handle.net/10356/66410
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-66410
record_format	dspace
spelling	sg-ntu-dr.10356-66410 2023-03-03T20:23:08Z Near duplicate detection on tweets Ng, Alvin Keng Hian Yeo Chai Kiat School of Computer Engineering DRNTU::Engineering Bachelor of Engineering (Computer Science) 2016-04-05T05:33:52Z 2016-04-05T05:33:52Z 2016 Final Year Project (FYP) http://hdl.handle.net/10356/66410 en Nanyang Technological University 72 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering
spellingShingle	DRNTU::Engineering Ng, Alvin Keng Hian Near duplicate detection on tweets
description	Social media has been increasingly adopted as a mode of communication throughout the world, causing the amount of data to grow at an alarming rate and raising concerns over the management and analysis of big data. As a result, data and business analytics have become increasingly popular as a means of studying user behaviour, including popular topics and popular users as well as malicious users with bad intentions. Twitter is used as a tool that gives users the latest insights into incidents and news across the globe in the shortest time, which has also triggered considerable interest in social journalism. It also includes a URL shortening tool, and each tweet is limited to a maximum of 140 characters. This allows Twitter to track user behaviour and protect users from being targeted by malicious attackers. Big data also enables Twitter to generate additional revenue by providing data analytics services to large organizations that are interested in the types of information or trends popular among Twitter users. In this project, Python is the language of choice because libraries such as NLTK are readily available, making it possible to analyse near duplicates and detect spam within a short period of time. Several forms of testing were conducted to identify potential performance issues and memory leaks in the code. Overall, the objectives of the project were accomplished on time. However, because many algorithms for near duplicate detection were available at the time of writing, only some of the popular ones, specifically those suited to data streaming, were implemented. Various spam detection tools were examined to identify the characteristics that cause a tweet to be classified as spam. Near duplicate detection and spam detection are related: spam detection identifies bots, while near duplicate detection measures the degree of similarity or difference between tweets, helping to ensure that no spam escapes the spam detection process.
author2	Yeo Chai Kiat
author_facet	Yeo Chai Kiat Ng, Alvin Keng Hian
format	Final Year Project
author	Ng, Alvin Keng Hian
author_sort	Ng, Alvin Keng Hian
title	Near duplicate detection on tweets
publishDate	2016
url	http://hdl.handle.net/10356/66410
_version_	1759854839101128704
