Multi-layered knowledge representation for semantic searches

Bibliographic Details
Main Author: Goh, Xuan Kai
Other Authors: School of Computer Engineering
Format: Final Year Project
Language: English
Published: 2014
Subjects:
Online Access: http://hdl.handle.net/10356/59021
Institution: Nanyang Technological University
Description
Summary: The explosion of digital data in the information age has given rise to a new phenomenon known as Big Data, in which the volume of data far exceeds the useful information that can be mined from it. A typical source of overflowing data is the World Wide Web, commonly known as the Internet. The objective of this project is to develop a system capable of extracting key concepts from the websites of different organizations and identifying similarities between the extracted concepts. These key concepts and similarity measures can subsequently be used to form a Knowledge Representation System capable of providing useful information to its users. This report details the implementation of the system developed in this project. The first section covers the configuration of an open-source web crawler, Apache Nutch, which performs a breadth-first crawl to extract text from different websites. This text is processed to extract the most frequent terms, which are fed into an ontology to identify the key concepts. The next section discusses the implementation of the preprocessing engine required to convert the Wikipedia dump into a Wikipedia Category Tree. Because this category tree serves as the ontology for this project, it is stored in an embedded database to support the intensive querying performed during tree traversal, which forms part of the key concept identification process. The last section elaborates on the algorithm used to traverse the Wikipedia Category Tree, the density-based measures used to determine the most representative key concepts, and the Jaccard coefficients used to compute the degree of similarity between two sets of key concepts. The system is tested with a dummy data set consisting of terms from the List of Academic Disciplines. The results of this test demonstrated that ancestral similarities can be identified between two different sets of terms. These results were reinforced when actual terms extracted from the crawled webpages produced the same results when fed into the system.
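The Jaccard coefficient mentioned in the summary is a standard set-similarity measure, |A ∩ B| / |A ∪ B|. A minimal sketch of how it could be computed over two sets of key concepts is shown below; the function name and the example concept sets are illustrative, not taken from the project itself:

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        # Two empty sets share nothing; define similarity as 0.0 here.
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical key-concept sets extracted from two organizations' websites
concepts_a = {"Computer science", "Mathematics", "Statistics"}
concepts_b = {"Mathematics", "Statistics", "Physics"}
print(jaccard(concepts_a, concepts_b))  # 2 shared concepts out of 4 total -> 0.5
```

A coefficient of 1.0 would indicate identical concept sets, while 0.0 indicates no overlap at all.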