High performance data processing systems in Clouds

Bibliographic Details
Main Author: Lai, Qi Rong
Other Authors: He Bingsheng
Format: Final Year Project
Language: English
Published: 2015
Subjects:
Online Access: http://hdl.handle.net/10356/63058
Institution: Nanyang Technological University
Description
Summary: A web crawler can surf the web and traverse the hyperlinks it encounters. Because large amounts of data are linked together, traversing these links allows data to be gathered from one webpage to the next. Crawling and collecting Portable Document Format (PDF) files is the main task in this project. PDF content is unstructured data. The collected PDFs are processed and compiled into a more visually friendly image form, the word cloud. A word cloud is built from the words in a PDF, and each word's font size is scaled according to its frequency in the PDF to represent its importance. A word cloud is said to represent the content of a PDF, since the most frequent words account for the larger portion of it. Steganography is used to conceal the word-frequency information in the generated word cloud image. This method produces an image that looks the same to the human eye but has information embedded in it. The information embedded in the word clouds can then be retrieved and compiled to compute the similarity between different word clouds.
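
The pipeline described in the summary (word frequencies, frequency-scaled font sizes, hidden frequency data, similarity between word clouds) can be illustrated with a minimal Python sketch. It is not the project's implementation: the record does not say which steganographic scheme or similarity measure was used, so least-significant-bit (LSB) embedding and cosine similarity are assumptions chosen for illustration. The function names, the 10 to 60 point font range, the 4-byte length header, and the plain byte array standing in for image pixel data are likewise hypothetical, and the input is assumed to be text already extracted from a PDF.

```python
import json
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic words in text already extracted from a PDF."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def font_sizes(freqs, min_pt=10, max_pt=60):
    """Linearly scale each word's count to a font size (range is an assumption)."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {w: min_pt + (c - lo) * (max_pt - min_pt) / span
            for w, c in freqs.items()}


def embed_lsb(pixels, payload):
    """Hide payload bytes in the least significant bit of each pixel byte.

    Assumed scheme: a 4-byte big-endian length header precedes the payload.
    """
    data = len(payload).to_bytes(4, "big") + payload
    bits = [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # each byte changes by at most 1
    return out


def extract_lsb(pixels):
    """Recover the payload written by embed_lsb."""
    def read(start_bit, nbytes):
        value = bytearray()
        for b in range(nbytes):
            byte = 0
            for i in range(8):
                byte = (byte << 1) | (pixels[start_bit + b * 8 + i] & 1)
            value.append(byte)
        return bytes(value)

    length = int.from_bytes(read(0, 4), "big")
    return read(32, length)


def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency mappings."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    freqs = word_frequencies("cloud data cloud crawler data cloud")
    print(font_sizes(freqs))
    pixels = bytearray(range(256)) * 4  # stand-in for raw image pixel bytes
    stego = embed_lsb(pixels, json.dumps(freqs).encode())
    recovered = Counter(json.loads(extract_lsb(stego)))
    print(cosine_similarity(freqs, recovered))
```

Because only the lowest bit of each byte is altered, the carrier changes by at most one intensity level per byte, which is consistent with the summary's claim that the result looks the same to the human eye. A real image would need a lossless format such as PNG for the hidden bits to survive saving.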