Multi-layered knowledge representation for semantic searches

Bibliographic Details
Main Author: Goh, Xuan Kai
Other Authors: School of Computer Engineering
Format: Final Year Project
Language: English
Published: 2014
Subjects:
Online Access: http://hdl.handle.net/10356/59021
Institution: Nanyang Technological University
Description
Summary: The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data, in which the volume of data far exceeds the useful information that can be mined from it. A typical source of overflowing data is the World Wide Web, commonly known as the Internet. The objective of this project is to develop a system capable of extracting key concepts from the websites of different organizations and identifying similarities between the extracted concepts. These key concepts and similarity measures can subsequently be used to form a Knowledge Representation System capable of providing useful information to its users. This report details the implementation of the system developed in this project. The first section covers the configuration of an open-source web crawler, Apache Nutch, which performs a breadth-first crawl to extract text from different websites. This text is processed to extract the most frequent terms, which are fed into an ontology to identify the key concepts. The next section discusses the implementation of the preprocessing engine required to convert the Wikipedia dump into a Wikipedia Category Tree. Because this category tree serves as the ontology for this project, it is stored in an embedded database to support the intensive querying performed during tree traversal, which forms part of the key concept identification process. The last section elaborates on the algorithm used to traverse the Wikipedia Category Tree, the density-based measures used to determine the most representative key concepts, and the Jaccard coefficients used to compute the degree of similarity between two sets of key concepts. The system is tested with a dummy data set consisting of terms from the List of Academic Disciplines. The results of this test demonstrated that ancestral similarities can be identified between two different sets of terms. These results were reinforced when actual terms extracted from the crawled webpages produced the same results when fed into the system.
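The Jaccard coefficient mentioned in the summary is a standard set-similarity measure, |A ∩ B| / |A ∪ B|. A minimal sketch of how it could be computed over two sets of key concepts is shown below; the function name and the example concept sets are illustrative, not taken from the project itself:

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        # Two empty sets share nothing; define similarity as 0.0 here.
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical key-concept sets extracted from two organizations' websites
concepts_a = {"Computer science", "Mathematics", "Statistics"}
concepts_b = {"Mathematics", "Statistics", "Physics"}
print(jaccard(concepts_a, concepts_b))  # 2 shared concepts out of 4 total -> 0.5
```

A coefficient of 1.0 would indicate identical concept sets, while 0.0 indicates no overlap at all.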