Integration and classification of documents from multiple sources

Knowledge lies in various sources and can be found in different formats and shapes. One of the greatest sources of knowledge we rely on in our daily lives is none other than the internet. There, information is mostly encoded in unstructured text or documents. In addition, knowledge extraction from...

Bibliographic Details
Main Author: Kho, William.
Other Authors: Mao Kezhi
Format: Final Year Project
Language: English
Published: 2013
Subjects: DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
Online Access: http://hdl.handle.net/10356/54429
Institution: Nanyang Technological University
id sg-ntu-dr.10356-54429
record_format dspace
spelling sg-ntu-dr.10356-54429 2023-07-07T17:01:46Z Integration and classification of documents from multiple sources Kho, William. Mao Kezhi School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems Knowledge lies in various sources and can be found in different formats and shapes. One of the greatest sources of knowledge we rely on in our daily lives is none other than the internet. There, information is mostly encoded in unstructured text or documents. In addition, knowledge extraction from text/documents these days relies on manual entry, which is often time-consuming and laborious. To solve this problem, a human-like intelligent agent that is capable of reasoning and decision making is built. The objective of this project is to integrate web documents from multiple sources and classify them using the LSA (Latent Semantic Analysis) technique. A number of websites returned by a Google query go through several processing steps: text parsing, HTML tag removal, TF-IDF term weighting and normalization, and cosine similarity grouping through SVD (Singular Value Decomposition). The system is built as a Java application and is able to filter and group closely related documents by building a vector space model. Bachelor of Engineering 2013-06-20T03:32:14Z 2013-06-20T03:32:14Z 2013 2013 Final Year Project (FYP) http://hdl.handle.net/10356/54429 en Nanyang Technological University 62 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
spellingShingle DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
Kho, William.
Integration and classification of documents from multiple sources
description Knowledge lies in various sources and can be found in different formats and shapes. One of the greatest sources of knowledge we rely on in our daily lives is none other than the internet. There, information is mostly encoded in unstructured text or documents. In addition, knowledge extraction from text/documents these days relies on manual entry, which is often time-consuming and laborious. To solve this problem, a human-like intelligent agent that is capable of reasoning and decision making is built. The objective of this project is to integrate web documents from multiple sources and classify them using the LSA (Latent Semantic Analysis) technique. A number of websites returned by a Google query go through several processing steps: text parsing, HTML tag removal, TF-IDF term weighting and normalization, and cosine similarity grouping through SVD (Singular Value Decomposition). The system is built as a Java application and is able to filter and group closely related documents by building a vector space model.
author2 Mao Kezhi
author_facet Mao Kezhi
Kho, William.
format Final Year Project
author Kho, William.
author_sort Kho, William.
title Integration and classification of documents from multiple sources
title_short Integration and classification of documents from multiple sources
title_full Integration and classification of documents from multiple sources
title_fullStr Integration and classification of documents from multiple sources
title_full_unstemmed Integration and classification of documents from multiple sources
title_sort integration and classification of documents from multiple sources
publishDate 2013
url http://hdl.handle.net/10356/54429
_version_ 1772827872490160128
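
Illustrative sketch (not taken from the project itself): the description in this record names a pipeline of HTML tag removal, TF-IDF term weighting, normalization, and cosine similarity grouping. The Java fragment below shows one plausible shape for those steps using only the standard library. The class name TfIdfSketch, the regex-based tag stripping, the tokenizer, and the log(N/df) form of IDF are all assumptions made for illustration; the record does not say which variants the author used. The SVD step that LSA applies before comparing documents is deliberately omitted here, so similarities are computed in the full term space rather than in a reduced latent space.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper class; all names here are assumptions, not the project's API.
public class TfIdfSketch {

    // Strip anything that looks like an HTML tag, lowercase, and split on non-letters.
    public static List<String> tokenize(String html) {
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase();
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Return one L2-normalized TF-IDF row per document over a shared, sorted vocabulary.
    public static double[][] tfIdf(List<String> docs) {
        List<List<String>> tokenized = new ArrayList<>();
        TreeSet<String> vocab = new TreeSet<>();
        for (String d : docs) {
            List<String> toks = tokenize(d);
            tokenized.add(toks);
            vocab.addAll(toks);
        }
        Map<String, Integer> index = new HashMap<>();
        for (String term : vocab) {
            index.put(term, index.size());
        }

        int n = docs.size();
        int m = index.size();
        double[][] w = new double[n][m];
        int[] df = new int[m]; // number of documents containing each term
        for (int d = 0; d < n; d++) {
            for (String t : tokenized.get(d)) {
                int j = index.get(t);
                if (w[d][j] == 0.0) {
                    df[j]++;
                }
                w[d][j] += 1.0; // raw term frequency
            }
        }
        for (int d = 0; d < n; d++) {
            double norm = 0.0;
            for (int j = 0; j < m; j++) {
                w[d][j] *= Math.log((double) n / df[j]); // assumed IDF variant: log(N / df)
                norm += w[d][j] * w[d][j];
            }
            norm = Math.sqrt(norm);
            if (norm > 0.0) {
                for (int j = 0; j < m; j++) {
                    w[d][j] /= norm;
                }
            }
        }
        return w;
    }

    // Cosine similarity between two vectors of equal length; 0 if either is all zeros.
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "<p>Latent semantic analysis of text</p>",
            "<div>Semantic analysis of documents</div>",
            "<p>Football match results</p>");
        double[][] w = tfIdf(docs);
        System.out.println("sim(doc0, doc1) = " + cosine(w[0], w[1]));
        System.out.println("sim(doc0, doc2) = " + cosine(w[0], w[2]));
    }
}
```

In a full LSA pipeline the term-document matrix would first be factored with SVD and truncated to a small number of dimensions, and the cosine comparison would then run on the reduced document vectors. A linear algebra library would normally supply that factorization.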