Multi-layered knowledge representation for semantic searches

The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data. Big Data has created a problem: there is more data than useful information that can be mined from it. A typical source of overflowing data is the World Wide Web, commonly known as t...

Bibliographic Details
Main Author:	Goh, Xuan Kai
Other Authors:	School of Computer Engineering
Format:	Final Year Project
Language:	English
Published:	2014
Subjects:	DRNTU::Engineering::Computer science and engineering::Information systems
Online Access:	http://hdl.handle.net/10356/59021
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-59021
record_format	dspace
spelling	sg-ntu-dr.10356-59021 2023-03-03T20:29:45Z Multi-layered knowledge representation for semantic searches Goh, Xuan Kai School of Computer Engineering Kim Jung-Jae DRNTU::Engineering::Computer science and engineering::Information systems The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data. Big Data has created a problem: there is more data than useful information that can be mined from it. A typical source of overflowing data is the World Wide Web, commonly known as the Internet. The objective of this project is to develop a system capable of extracting key concepts from the websites of different organizations and identifying similarities between the extracted concepts. These key concepts and similarity measures can subsequently be used to form a Knowledge Representation System that provides useful information to its users. This report details the implementation of the system developed in this project. The first section covers the configuration of an open-source web crawler, Apache Nutch, which performs a breadth-first crawl to extract text from different websites. This text is processed to extract the most frequent terms, which are fed into an ontology to identify the key concepts. The next section discusses the implementation of the preprocessing engine that converts the Wikipedia dump into a Wikipedia category tree. This category tree serves as the ontology for this project and is stored in an embedded database, because it is queried intensively during the tree traversal that forms part of the key concept identification process. The last section elaborates on the algorithm used to traverse the Wikipedia category tree, the density-based measures used to determine the most representative key concepts, and the Jaccard coefficients used to compute the degree of similarity between two sets of key concepts. The system is tested with a dummy data set consisting of terms from the List of Academic Disciplines. The results of this test demonstrate that ancestral similarities can be identified from two different sets of terms. These results are reinforced when actual terms extracted from the crawled webpages produce the same results when fed into the system. Bachelor of Engineering (Computer Engineering) 2014-04-21T05:40:40Z 2014-04-21T05:40:40Z 2014 2014 Final Year Project (FYP) http://hdl.handle.net/10356/59021 en Nanyang Technological University 54 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Information systems
spellingShingle	DRNTU::Engineering::Computer science and engineering::Information systems Goh, Xuan Kai Multi-layered knowledge representation for semantic searches
description	The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data. Big Data has created a problem: there is more data than useful information that can be mined from it. A typical source of overflowing data is the World Wide Web, commonly known as the Internet. The objective of this project is to develop a system capable of extracting key concepts from the websites of different organizations and identifying similarities between the extracted concepts. These key concepts and similarity measures can subsequently be used to form a Knowledge Representation System that provides useful information to its users. This report details the implementation of the system developed in this project. The first section covers the configuration of an open-source web crawler, Apache Nutch, which performs a breadth-first crawl to extract text from different websites. This text is processed to extract the most frequent terms, which are fed into an ontology to identify the key concepts. The next section discusses the implementation of the preprocessing engine that converts the Wikipedia dump into a Wikipedia category tree. This category tree serves as the ontology for this project and is stored in an embedded database, because it is queried intensively during the tree traversal that forms part of the key concept identification process. The last section elaborates on the algorithm used to traverse the Wikipedia category tree, the density-based measures used to determine the most representative key concepts, and the Jaccard coefficients used to compute the degree of similarity between two sets of key concepts. The system is tested with a dummy data set consisting of terms from the List of Academic Disciplines. The results of this test demonstrate that ancestral similarities can be identified from two different sets of terms. These results are reinforced when actual terms extracted from the crawled webpages produce the same results when fed into the system.
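The description mentions traversing a category tree and comparing two sets of key concepts with Jaccard coefficients. The following is a minimal sketch of that idea, not the project's implementation: the toy `PARENTS` map, the category names, and the helper names `ancestors`, `concept_set`, and `jaccard` are all assumptions made for illustration, and the density-based selection of representative concepts is omitted.

```python
from collections import deque

# Hypothetical toy category tree (child -> parent categories).
# The names are illustrative only; they are not the project's data.
PARENTS = {
    "Machine learning": ["Artificial intelligence"],
    "Artificial intelligence": ["Computer science"],
    "Databases": ["Computer science"],
    "Computer science": ["Science"],
    "Organic chemistry": ["Chemistry"],
    "Chemistry": ["Science"],
}


def ancestors(term, parents=PARENTS):
    """Collect every category reachable by walking upward from `term`."""
    seen = set()
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def concept_set(terms, parents=PARENTS):
    """Union of the ancestor categories of all given terms."""
    result = set()
    for term in terms:
        result |= ancestors(term, parents)
    return result


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Two hypothetical websites, each represented by its extracted terms.
site_a = concept_set(["Machine learning", "Databases"])
site_b = concept_set(["Organic chemistry"])
similarity = jaccard(site_a, site_b)
```

In this toy tree the two sites share only the ancestor "Science" out of four distinct ancestor categories, so the coefficient is 0.25. A real category graph can contain cycles and multiple parents, which the `seen` set is meant to tolerate.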
author2	School of Computer Engineering
author_facet	School of Computer Engineering Goh, Xuan Kai
format	Final Year Project
author	Goh, Xuan Kai
author_sort	Goh, Xuan Kai
title	Multi-layered knowledge representation for semantic searches
title_short	Multi-layered knowledge representation for semantic searches
title_full	Multi-layered knowledge representation for semantic searches
title_fullStr	Multi-layered knowledge representation for semantic searches
title_full_unstemmed	Multi-layered knowledge representation for semantic searches
title_sort	multi-layered knowledge representation for semantic searches
publishDate	2014
url	http://hdl.handle.net/10356/59021
_version_	1759857419035344896
