Multi-layered knowledge representation for semantic searches
The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data. Big Data has created a problem: there is more data than useful information that can be mined from it. A typical source of overflowing data is the World Wide Web, commonly known as t...
Main Author: | Goh, Xuan Kai |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | 2014 |
Subjects: | DRNTU::Engineering::Computer science and engineering::Information systems |
Online Access: | http://hdl.handle.net/10356/59021 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-59021 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-59021 2023-03-03T20:29:45Z Multi-layered knowledge representation for semantic searches Goh, Xuan Kai School of Computer Engineering Kim Jung-Jae DRNTU::Engineering::Computer science and engineering::Information systems Bachelor of Engineering (Computer Engineering) 2014-04-21T05:40:40Z 2014-04-21T05:40:40Z 2014 2014 Final Year Project (FYP) http://hdl.handle.net/10356/59021 en Nanyang Technological University 54 p. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering::Computer science and engineering::Information systems |
spellingShingle |
DRNTU::Engineering::Computer science and engineering::Information systems Goh, Xuan Kai Multi-layered knowledge representation for semantic searches |
description |
The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data. Big Data has created a problem: there is more data than useful information that can be mined from it. A typical source of overflowing data is the World Wide Web, often referred to loosely as the Internet.
The objective of this project is to develop a system capable of extracting key concepts from the websites of different organizations and identifying similarities between the extracted concepts. These key concepts and similarity measures can subsequently be used to form a Knowledge Representation System that provides useful information to its users.
This report details the implementation of the system developed in this project. The first section covers the configuration of an open-source web crawler, Apache Nutch, which performs a breadth-first crawl to extract text from different websites. This text is processed to extract the most frequent terms, which are looked up in an ontology to identify the key concepts.
The next module discusses the implementation of the preprocessing engine that converts the Wikipedia dump into a Wikipedia category tree. This category tree serves as the ontology for this project and is stored in an embedded database to support intensive querying.
The embedded database is queried intensively during the category tree traversal, which is part of the key concept identification process. The last section elaborates on the algorithm used to traverse the Wikipedia category tree, the density-based measures used to determine the most representative key concepts, and the Jaccard coefficients used to compute the degree of similarity between two sets of key concepts.
The system is tested with a dummy data set consisting of terms from the List of Academic Disciplines. The results of this test demonstrate that ancestral similarities can be identified from two different sets of terms. These results are reinforced when actual terms extracted from the crawled webpages produce the same results when fed into the system. |
author2 |
School of Computer Engineering |
author_facet |
School of Computer Engineering Goh, Xuan Kai |
format |
Final Year Project |
author |
Goh, Xuan Kai |
author_sort |
Goh, Xuan Kai |
title |
Multi-layered knowledge representation for semantic searches |
title_short |
Multi-layered knowledge representation for semantic searches |
title_full |
Multi-layered knowledge representation for semantic searches |
title_fullStr |
Multi-layered knowledge representation for semantic searches |
title_full_unstemmed |
Multi-layered knowledge representation for semantic searches |
title_sort |
multi-layered knowledge representation for semantic searches |
publishDate |
2014 |
url |
http://hdl.handle.net/10356/59021 |
_version_ |
1759857419035344896 |
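
Illustrative sketch (not taken from the report): the description above mentions extracting frequent terms, walking up a category tree, and comparing two sets of key concepts with the Jaccard coefficient, J(A, B) = |A ∩ B| / |A ∪ B|. The Python below shows one way these steps could fit together. Everything in it is an assumption made for illustration: the function names, the `max_depth` cutoff, and the toy `PARENTS` mapping standing in for the Wikipedia category tree are hypothetical, and the report's density-based ranking of concepts is not reproduced because the record does not specify it.

```python
import re
from collections import Counter, deque


def top_terms(text, k=5):
    """Return the k most frequent lowercase word tokens in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [term for term, _ in Counter(tokens).most_common(k)]


def ancestors(term, parents, max_depth=3):
    """Collect categories reachable from term by walking up at most max_depth levels."""
    seen = set()
    queue = deque([(term, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb beyond the depth limit
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


def concept_set(terms, parents, max_depth=3):
    """Union of the ancestor categories of every term."""
    result = set()
    for term in terms:
        result |= ancestors(term, parents, max_depth)
    return result


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy category graph (child -> parent categories).
PARENTS = {
    "algebra": ["mathematics"],
    "geometry": ["mathematics"],
    "mechanics": ["physics"],
    "optics": ["physics"],
    "mathematics": ["formal sciences"],
    "physics": ["natural sciences"],
}

terms_a = ["algebra", "mechanics"]
terms_b = ["geometry", "optics"]

direct_similarity = jaccard(terms_a, terms_b)
ancestral_similarity = jaccard(
    concept_set(terms_a, PARENTS), concept_set(terms_b, PARENTS)
)
```

In this toy example the two term lists share no terms, so their direct Jaccard coefficient is 0, yet their ancestor sets coincide, so the ancestral coefficient is 1. This is the kind of ancestral similarity between different term sets that the description refers to, shown here only on made-up data.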