Multi-layered knowledge representation for semantic searches

The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data. Big Data has created a problem: far more data is available than useful information that can be mined from it. A typical source of this data overflow is the World Wide Web, commonly known as t...

Bibliographic Details
Main Author: Goh, Xuan Kai
Other Authors: School of Computer Engineering
Format: Final Year Project
Language: English
Published: 2014
Subjects: DRNTU::Engineering::Computer science and engineering::Information systems
Online Access: http://hdl.handle.net/10356/59021
Institution: Nanyang Technological University
id sg-ntu-dr.10356-59021
record_format dspace
spelling sg-ntu-dr.10356-59021 2023-03-03T20:29:45Z Multi-layered knowledge representation for semantic searches Goh, Xuan Kai School of Computer Engineering Kim Jung-Jae DRNTU::Engineering::Computer science and engineering::Information systems Bachelor of Engineering (Computer Engineering) 2014-04-21T05:40:40Z 2014-04-21T05:40:40Z 2014 2014 Final Year Project (FYP) http://hdl.handle.net/10356/59021 en Nanyang Technological University 54 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Information systems
description The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data. Big Data has created a problem: far more data is available than useful information that can be mined from it. A typical source of this data overflow is the World Wide Web, often loosely referred to as the Internet. The objective of this project is to develop a system capable of extracting key concepts from the websites of different organizations and of identifying similarities between the extracted concepts. These key concepts and similarity measures can subsequently be used to form a Knowledge Representation System that provides useful information to its users. This report details the implementation of the system developed in this project. The first section covers the configuration of an open-source web crawler, Apache Nutch, which performs a breadth-first crawl to extract text from different websites. This text is processed to extract the most frequent terms, which are fed into an ontology to identify the key concepts. The next module discusses the implementation of the preprocessing engine that converts the Wikipedia dump into a Wikipedia Category Tree. The category tree serves as the ontology for this project and is stored in an embedded database, because it is queried intensively during the category tree traversal that forms part of the key concept identification process. The last section elaborates on the algorithm used to traverse the Wikipedia Category Tree, the density-based measures used to determine the most representative key concepts, and the Jaccard coefficients used to compute the degree of similarity between two sets of key concepts. The system is tested with a dummy data set consisting of terms from the List of Academic Disciplines.
The results of this test demonstrate that ancestral similarities can be identified from two different sets of terms. These results are reinforced when actual terms extracted from the crawled webpages produce the same results when fed into the system.
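
The abstract names a category tree traversal and a Jaccard coefficient but gives no implementation detail, so the following is only a minimal sketch of how those two steps could fit together. It assumes the traversal walks upward from each term to collect ancestor categories and that the Jaccard coefficient is computed over the two resulting ancestor sets. The toy `PARENTS` graph, the function names, and the sample term lists are hypothetical and are not taken from the project; the density-based measures mentioned in the abstract are not modelled here.

```python
from collections import deque

# Hypothetical toy category graph: child category -> parent categories.
# Illustrative only; the project stores a tree built from the Wikipedia dump.
PARENTS = {
    "Machine learning": ["Artificial intelligence"],
    "Computer vision": ["Artificial intelligence"],
    "Artificial intelligence": ["Computer science"],
    "Databases": ["Computer science"],
    "Computer science": ["Academic disciplines"],
    "Organic chemistry": ["Chemistry"],
    "Chemistry": ["Academic disciplines"],
}


def ancestors(term, parents, max_depth=None):
    """Collect every category reachable by walking upward from `term`.

    Uses a breadth-first walk; `max_depth` optionally limits how many
    levels above the term are visited.
    """
    seen = set()
    queue = deque([(term, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


def ancestor_set(terms, parents, max_depth=None):
    """Union of the ancestor categories of all terms in `terms`."""
    result = set()
    for term in terms:
        result |= ancestors(term, parents, max_depth)
    return result


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B|; taken as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Two hypothetical term sets, standing in for terms from two websites.
site_a = ["Machine learning", "Databases"]
site_b = ["Computer vision", "Organic chemistry"]

anc_a = ancestor_set(site_a, PARENTS)
anc_b = ancestor_set(site_b, PARENTS)

print(sorted(anc_a & anc_b))   # shared ancestor categories
print(jaccard(anc_a, anc_b))   # similarity between the two ancestor sets
```

In this toy graph the two term sets share no terms, yet their ancestor sets overlap, which illustrates in a small way the kind of ancestral similarity the abstract reports.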
author2 School of Computer Engineering
format Final Year Project
author Goh, Xuan Kai
title Multi-layered knowledge representation for semantic searches
publishDate 2014
url http://hdl.handle.net/10356/59021
_version_ 1759857419035344896