Hadoop on data analytics
Main Author: | Gee, Denny Jee King.
---|---
Other Authors: | Lee Bu Sung (School of Computer Engineering)
Format: | Final Year Project (FYP), Bachelor of Engineering (Computer Science), 54 p.
Language: | English
Published: | 2012
Subjects: | DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access: | http://hdl.handle.net/10356/48540
Institution: | Nanyang Technological University
Description:
Twitter is a micro-blogging application that lets users share and convey ideas effectively through social connections. The data sets Twitter generates can easily exceed millions of tweets per day, and mining information from such a massive amount of data can be computationally infeasible on a single machine. Hence, we adopted the MapReduce framework provided by Apache Hadoop.

Our model first pre-processed the tweets by tokenizing them into individual words and filtering out stop words and punctuation.
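To make this stage concrete, here is a minimal sketch of a Hadoop mapper for this kind of pre-processing. It is our illustration rather than the project's code: the stop-word list and the tokenization rule are assumptions, and the real pipeline would also key each word by the time interval of its tweet.

```java
import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative pre-processing mapper: tokenize a tweet, drop punctuation and
// stop words, and emit (word, 1) pairs; a summing reducer then yields the
// per-word tweet frequencies used by the later DF-IDF jobs.
public class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Set<String> STOP_WORDS =
        Set.of("a", "an", "the", "is", "to", "and", "of", "in"); // assumed list
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text tweet, Context context)
            throws IOException, InterruptedException {
        // Lower-case the tweet, replace punctuation with spaces, split on whitespace.
        for (String token : tweet.toString().toLowerCase()
                                 .replaceAll("[^a-z0-9\\s]", " ")
                                 .split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```

Chained with a summing reducer, jobs of this shape produce the interval-level frequency counts that the next step turns into time-series signals.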
Next, we grouped the remaining words into their respective time intervals, together with their tweet frequency distributions, and constructed a time-series signal for each word from its Document Frequency – Inverse Document Frequency (DF-IDF) vector by chaining a sequence of MapReduce jobs on the Hadoop framework. After this transformation, we computed the autocorrelation of each word signal and filtered out trivial words of little importance using a threshold of 0.1. We further calculated the entropy of each word signal to measure its randomness, so that words with low IDF values and low entropy (H < 0.25) were likewise removed, leaving only words whose time series contain burst features.

The remaining words were then sorted by their autocorrelation coefficients using Hadoop's partitioned sorting mechanism, and a percentile cut-off selected how many words to pass on to event detection. The selected words were mapped onto a cross-correlation matrix over bi-word combinations and represented as an adjacency graph, which we partitioned by modularity to cluster words of similar relevance and features into reconstructed events. The reconstructed events were then evaluated on their relevance to corresponding real-life events.
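The two filtering tests can be sketched directly. The snippet below is our illustration, not the report's code: it assumes a lag-1 autocorrelation coefficient and a Shannon entropy normalised to [0, 1], and it omits the accompanying low-IDF test, so the report's exact definitions may differ.

```java
// Illustrative filter for one word's DF-IDF time series: keep the word only
// if its lag-1 autocorrelation exceeds 0.1 and its normalised entropy is at
// least 0.25 (the method removes low-IDF, low-entropy words as non-bursty).
public final class SignalFilter {

    // Lag-1 autocorrelation coefficient of the signal (assumed definition).
    static double autocorrelation(double[] s) {
        double mean = 0;
        for (double v : s) mean += v;
        mean /= s.length;
        double num = 0, den = 0;
        for (int t = 0; t < s.length; t++) {
            if (t + 1 < s.length) num += (s[t] - mean) * (s[t + 1] - mean);
            den += (s[t] - mean) * (s[t] - mean);
        }
        return den == 0 ? 0 : num / den;
    }

    // Shannon entropy of the signal treated as a probability distribution,
    // divided by log2(n) so the result lies in [0, 1] (assumed normalisation).
    static double entropy(double[] s) {
        if (s.length < 2) return 0;
        double sum = 0;
        for (double v : s) sum += v;
        if (sum == 0) return 0;
        double h = 0;
        for (double v : s) {
            double p = v / sum;
            if (p > 0) h -= p * Math.log(p) / Math.log(2);
        }
        return h / (Math.log(s.length) / Math.log(2));
    }

    static boolean keep(double[] signal) {
        return autocorrelation(signal) > 0.1 && entropy(signal) >= 0.25;
    }
}
```

In the actual pipeline these tests run inside MapReduce jobs over every word signal, and the survivors become the candidates for event detection.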
The computation on our Hadoop cluster gave remarkable results in terms of efficiency and data size. The cluster achieved a 75% reduction in computation time, and at the same time the MapReduce architecture of Hadoop reduced the data size by close to 99% through term indexing.
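A size reduction of this order is plausible because term indexing replaces repeated word strings with compact integer IDs, so each distinct term is stored only once. A minimal sketch of such a dictionary, as our illustration of the idea rather than the report's actual scheme:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative term dictionary: each distinct word gets a compact integer ID,
// so downstream jobs shuffle small integers instead of raw strings.
public final class TermIndex {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> terms = new ArrayList<>();

    // Return the existing ID for a term, or assign the next unused one.
    public int idOf(String term) {
        return ids.computeIfAbsent(term, t -> {
            terms.add(t);
            return terms.size() - 1;
        });
    }

    // Recover the original term from its ID, e.g. when reporting events.
    public String termOf(int id) {
        return terms.get(id);
    }
}
```

Since a tweet corpus repeats a relatively small vocabulary millions of times, storing each term once and passing around IDs is consistent with the compression figure reported above.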