Hadoop on data analytics

Twitter is a micro-blogging service that lets users share and convey ideas through their social connections. The data sets it generates can easily exceed millions of tweets per day, and mining information from...


Bibliographic Details
Main Author: Gee, Denny Jee King.
Other Authors: Lee Bu Sung
Format: Final Year Project
Language: English
Published: 2012
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access:http://hdl.handle.net/10356/48540
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-48540
record_format dspace
spelling sg-ntu-dr.10356-48540 2023-03-03T20:38:15Z Hadoop on data analytics Gee, Denny Jee King. Lee Bu Sung School of Computer Engineering DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing Twitter is a micro-blogging service that lets users share and convey ideas through their social connections. The data sets it generates can easily exceed millions of tweets per day, and mining information from such a massive amount of data can be computationally infeasible. Hence, we adopted the MapReduce framework provided by Apache Hadoop. Our model first pre-processed the tweets by tokenizing them into individual words and filtering away stop words and punctuation. Next, we grouped the remaining words into their respective time intervals, together with their tweet frequency distributions, and constructed a time series signal for each word from its Document Frequency-Inverse Document Frequency (DF-IDF) vector by chaining a sequence of MapReduce jobs on the Hadoop framework. After this transformation, we computed the autocorrelation of each word signal and filtered out trivial words using a threshold of 0.1. We further calculated the entropy of each word signal to measure its randomness, so that words with low IDF values and low entropy (H < 0.25) were also removed, keeping only words whose time series contain burst features. The remaining words were then sorted by their autocorrelation coefficients using Hadoop's partition sorting mechanism, and a percentile cut-off was applied to select the number of words carried forward for event detection. The selected words were mapped onto a cross-correlation matrix built from bi-word combinations and represented as an adjacency graph, which we partitioned by modularity to cluster words of similar relevance and features and so reconstruct events. The detected events were then evaluated on their relevance to corresponding real-life events. The computation on our Hadoop cluster gave strong results in both efficiency and data compression: we observed a 75% reduction in computation time, while the MapReduce architecture of Hadoop reduced the data size by close to 99% through term indexing. Bachelor of Engineering (Computer Science) 2012-04-26T01:28:04Z 2012-04-26T01:28:04Z 2012 2012 Final Year Project (FYP) http://hdl.handle.net/10356/48540 en Nanyang Technological University 54 p. application/pdf
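The pre-processing stage described in the abstract (tokenizing tweets, discarding stop words and punctuation, and counting word occurrences per time interval) maps naturally onto a single MapReduce pass. The sketch below is a minimal illustration in the Hadoop Streaming style rather than the project's actual code; the tab-separated input format, the 60-minute interval width, and the tiny stop-word list are all assumptions.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming-style sketch of the tweet pre-processing step.

Assumptions (not taken from the project itself): input lines look like
"<unix_timestamp>\t<tweet text>", intervals are 60 minutes wide, and the
stop-word list is a small hard-coded sample.
"""
import re
import sys

INTERVAL_SECONDS = 60 * 60                      # assumed interval width
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "rt"}
TOKEN_RE = re.compile(r"[a-z']+")               # drops punctuation while tokenizing


def mapper(stream=sys.stdin, out=sys.stdout):
    """Emit "word,interval<TAB>1" for every non-stop-word token."""
    for line in stream:
        try:
            timestamp, text = line.rstrip("\n").split("\t", 1)
            interval = int(timestamp) // INTERVAL_SECONDS
        except ValueError:
            continue                            # skip malformed records
        for word in TOKEN_RE.findall(text.lower()):
            if word not in STOP_WORDS:
                out.write(f"{word},{interval}\t1\n")


def reducer(stream=sys.stdin, out=sys.stdout):
    """Sum the counts for each (word, interval) key.

    Hadoop Streaming delivers keys to the reducer in sorted order, so a
    running total per key is enough.
    """
    current_key, total = None, 0
    for line in stream:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                out.write(f"{current_key}\t{total}\n")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        out.write(f"{current_key}\t{total}\n")


if __name__ == "__main__":
    # Hadoop Streaming runs this file twice, once per phase. Locally:
    #   cat tweets.tsv | python3 preprocess.py map | sort | python3 preprocess.py reduce
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```

The local pipeline in the final comment mimics the shuffle-and-sort that Hadoop performs between the map and reduce phases, which is a convenient way to test the logic before submitting a job.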
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
spellingShingle DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Gee, Denny Jee King.
Hadoop on data analytics
description Twitter is a micro-blogging service that lets users share and convey ideas through their social connections. The data sets it generates can easily exceed millions of tweets per day, and mining information from such a massive amount of data can be computationally infeasible. Hence, we adopted the MapReduce framework provided by Apache Hadoop. Our model first pre-processed the tweets by tokenizing them into individual words and filtering away stop words and punctuation. Next, we grouped the remaining words into their respective time intervals, together with their tweet frequency distributions, and constructed a time series signal for each word from its Document Frequency-Inverse Document Frequency (DF-IDF) vector by chaining a sequence of MapReduce jobs on the Hadoop framework. After this transformation, we computed the autocorrelation of each word signal and filtered out trivial words using a threshold of 0.1. We further calculated the entropy of each word signal to measure its randomness, so that words with low IDF values and low entropy (H < 0.25) were also removed, keeping only words whose time series contain burst features. The remaining words were then sorted by their autocorrelation coefficients using Hadoop's partition sorting mechanism, and a percentile cut-off was applied to select the number of words carried forward for event detection. The selected words were mapped onto a cross-correlation matrix built from bi-word combinations and represented as an adjacency graph, which we partitioned by modularity to cluster words of similar relevance and features and so reconstruct events. The detected events were then evaluated on their relevance to corresponding real-life events. The computation on our Hadoop cluster gave strong results in both efficiency and data compression: we observed a 75% reduction in computation time, while the MapReduce architecture of Hadoop reduced the data size by close to 99% through term indexing.
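The filtering stage combines three per-word quantities: a DF-IDF time series, its autocorrelation, and the entropy of the signal, with the thresholds 0.1 and H < 0.25 quoted from the abstract. The abstract does not give the exact formulas, so the sketch below is only one plausible reading: the DF-IDF definition, the use of lag-1 autocorrelation, and the entropy normalization in `df_idf_signal`, `lag1_autocorrelation`, and `normalized_entropy` are assumptions, not the project's implementation.

```python
import numpy as np


def df_idf_signal(word_counts, tweets_per_interval, intervals_with_word, n_intervals):
    """One plausible DF-IDF time series for a word (exact formula is assumed).

    word_counts[t]         -- tweets containing the word in interval t
    tweets_per_interval[t] -- total tweets observed in interval t
    intervals_with_word    -- number of intervals in which the word appears
    The DF part is the per-interval document frequency; the IDF part is a
    single log factor shared across the whole series.
    """
    df = word_counts / np.maximum(tweets_per_interval, 1)
    idf = np.log(n_intervals / max(intervals_with_word, 1))
    return df * idf


def lag1_autocorrelation(signal):
    """Lag-1 sample autocorrelation coefficient of a 1-D signal."""
    x = signal - signal.mean()
    denom = float(np.dot(x, x))
    return float(np.dot(x[:-1], x[1:])) / denom if denom > 0 else 0.0


def normalized_entropy(signal):
    """Shannon entropy of the signal treated as a distribution over intervals,
    divided by log2(number of intervals) so H lies in [0, 1]. The normalization
    is an assumption; the abstract only states the H < 0.25 cut-off."""
    total = float(signal.sum())
    if total <= 0 or len(signal) < 2:
        return 0.0
    p = signal / total
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(len(signal)))


def keep_word(signal, autocorr_threshold=0.1, entropy_threshold=0.25):
    """Literal reading of the abstract's filters: drop a word if its
    autocorrelation falls below 0.1 or its entropy falls below 0.25."""
    return (lag1_autocorrelation(signal) >= autocorr_threshold
            and normalized_entropy(signal) >= entropy_threshold)


if __name__ == "__main__":
    # A word that bursts in the middle of a ten-interval window passes both filters.
    counts = np.array([1, 0, 2, 1, 40, 55, 38, 2, 1, 0], dtype=float)
    totals = np.full(10, 100.0)
    signal = df_idf_signal(counts, totals, intervals_with_word=8, n_intervals=10)
    print(lag1_autocorrelation(signal), normalized_entropy(signal), keep_word(signal))
```

In the described pipeline these per-word scores would be produced by further MapReduce jobs and the surviving words sorted and cut at a percentile; the snippet above only illustrates the scoring and thresholding for a single word signal.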
author2 Lee Bu Sung
author_facet Lee Bu Sung
Gee, Denny Jee King.
format Final Year Project
author Gee, Denny Jee King.
author_sort Gee, Denny Jee King.
title Hadoop on data analytics
title_short Hadoop on data analytics
title_full Hadoop on data analytics
title_fullStr Hadoop on data analytics
title_full_unstemmed Hadoop on data analytics
title_sort hadoop on data analytics
publishDate 2012
url http://hdl.handle.net/10356/48540
_version_ 1759858229143142400