Automatic taxonomy construction from textual documents

The explosion of unstructured text data makes it difficult to find the information we are interested in. To provide effective access to information, it is important to organize unstructured data in a structured and meaningful manner. Taxonomies, which serve as the backbone for structured knowledge, a...

Bibliographic Details
Main Author:	Luu, Anh Tuan
Other Authors:	Ng See Kiong
Format:	Theses and Dissertations
Language:	English
Published:	2017
Subjects:	DRNTU::Engineering::Computer science and engineering
Online Access:	http://hdl.handle.net/10356/69985
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-69985
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering
description	The explosion of unstructured text data makes it difficult to find the information we are interested in. To provide effective access to information, it is important to organize unstructured data in a structured and meaningful manner. Taxonomies, which serve as the backbone for structured knowledge, are useful for many NLP applications such as question answering and document clustering, because they organize domain knowledge into a hierarchy of ‘is-a’ relations between terms. An increasing number of public hand-crafted taxonomies, such as WordNet and Freebase, are now available. In practice, however, it is more effective to use taxonomies created specifically for the domain of interest than to re-use existing taxonomies created for other tasks or domains. As such, we often face the challenge of creating a brand new taxonomy for a specific domain from scratch.

In this thesis, we propose an effective framework for automatic domain-specific taxonomy construction from textual documents. The framework consists of three steps: domain term extraction, taxonomic relation identification and taxonomy induction. Domain term extraction extracts the relevant domain terms from a given text collection of a specific domain. Taxonomic relation identification identifies the taxonomic relations (i.e. ‘is-a’ relations) among domain terms. Taxonomy induction constructs the taxonomy structure from the identified taxonomic relations. We use a big data approach that combines linguistic, statistical and deep learning methods to address the challenges in these steps.

The main contributions of our research are summarized as follows:
- We proposed a Web-based method to extract domain terms from a given text collection. Building on that, we proposed a method that uses the contextual information of the terms in syntactic structures to detect taxonomic relations across sentence boundaries. In addition, we proposed a novel graph-based algorithm to organize the extracted taxonomic relations into an optimal taxonomy tree. The experimental results show that the proposed method complements previous linguistic pattern matching methods and significantly improves recall and F-measure.
- We studied two important aspects that can greatly affect the performance of taxonomy construction methods. The first is the trustworthiness of individual source texts, which is important for filtering out incorrect relations from unreliable sources. The second is the collective evidence from synonyms and contrastive terms: synonyms provide additional support for taxonomic relation identification, while contrastive terms may contradict it. We proposed an approach to incorporate these features into taxonomy construction, which can improve F-measure by up to 4%-10%.
- We proposed a time-aware approach to extract temporal information and integrate it into the process of identifying taxonomic relations, by employing a timestamp contribution function to measure the evidence scores of source texts at a particular time. Experimental results show that our proposed approach outperforms the state-of-the-art methods on F-measure by up to 7%-20%. Furthermore, the proposed approach can incrementally and continuously update the taxonomy by adding fresh relations from new data and removing outdated relations, using a proposed information decay function. It thus avoids rebuilding the whole structure from scratch for every update and keeps the taxonomy up to date with the latest information trends.
- We proposed a novel unsupervised approach to construct taxonomies based on word embedding clustering. It uses three word embedding measures to identify the semantic relationships between terms and their hypernyms: semantic clusters, taxonomic centroids and relative distances from the root. Our proposed approach significantly outperforms the state-of-the-art methods in terms of recall and F-measure.
- We proposed an approach to learn word embeddings for taxonomic relations based on the contextual words between the hypernym and hyponym, using a dynamic weighting neural network. Our proposed approach significantly outperforms the state-of-the-art methods by 9% to 13% in terms of accuracy on both general and specific domain datasets.
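The three-step framework named in the description (domain term extraction, taxonomic relation identification, taxonomy induction) can be illustrated with a minimal Python sketch. This is not the thesis's method: the record gives no implementation details, so each step below is a deliberately simple stand-in. Term extraction filters a hypothetical candidate list instead of using a Web-based method, relation identification uses a plain "X is a Y" linguistic pattern, and induction is a greedy single-parent assignment rather than the proposed graph-based algorithm. All function names, the toy corpus and the candidate list are assumptions made for illustration.

```python
import re
from collections import Counter


def extract_terms(docs, candidates):
    """Step 1 (stand-in): keep the candidate terms that occur in the corpus."""
    text = " ".join(docs).lower()
    return sorted(t for t in candidates if re.search(rf"\b{re.escape(t)}\b", text))


def identify_relations(docs, terms):
    """Step 2 (stand-in): count 'X is a/an Y' matches as (hyponym, hypernym) evidence."""
    evidence = Counter()
    for doc in docs:
        text = doc.lower()
        for hypo in terms:
            for hyper in terms:
                if hypo == hyper:
                    continue
                pattern = rf"\b{re.escape(hypo)} is an? {re.escape(hyper)}\b"
                hits = len(re.findall(pattern, text))
                if hits:
                    evidence[(hypo, hyper)] += hits
    return evidence


def induce_taxonomy(evidence):
    """Step 3 (stand-in): greedily give each term one parent, skipping cycle-forming edges."""
    parent = {}
    # Strongest evidence first; ties are broken alphabetically for determinism.
    for (hypo, hyper), _ in sorted(evidence.items(), key=lambda kv: (-kv[1], kv[0])):
        if hypo in parent:
            continue  # this term already has a hypernym
        node = hyper
        while node in parent and node != hypo:
            node = parent[node]  # walk up from the proposed hypernym
        if node == hypo:
            continue  # the proposed hypernym already sits below this term
        parent[hypo] = hyper
    return parent


# Hypothetical toy corpus and candidate terms, not taken from the thesis.
DOCS = [
    "A dog is a mammal. A cat is a mammal.",
    "A mammal is an animal. A dog is a mammal.",
]
CANDIDATES = ["dog", "cat", "mammal", "animal", "car"]

terms = extract_terms(DOCS, CANDIDATES)
evidence = identify_relations(DOCS, terms)
taxonomy = induce_taxonomy(evidence)  # maps each term to its chosen hypernym
```

The sketch keeps the three steps as separate functions so that any one of them could be swapped for a stronger method, for example replacing the pattern matcher with an embedding-based classifier, without touching the others.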
author2	Ng See Kiong
author_facet	Ng See Kiong Luu, Anh Tuan
format	Theses and Dissertations
author	Luu, Anh Tuan
author_sort	Luu, Anh Tuan
title	Automatic taxonomy construction from textual documents
publishDate	2017
url	http://hdl.handle.net/10356/69985
_version_	1759855259661893632
spelling	sg-ntu-dr.10356-69985 2023-03-04T00:52:05Z Automatic taxonomy construction from textual documents Luu, Anh Tuan Ng See Kiong Hui Siu Cheung School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering Doctor of Philosophy (SCE) 2017-04-07T01:38:53Z 2017-04-07T01:38:53Z 2017 Thesis Luu, A. T. (2017). Automatic taxonomy construction from textual documents. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/69985 10.32657/10356/69985 en 143 p. application/pdf

Similar Items