Concept-based embeddings for natural language processing

Bibliographic Details
Main Author: Ma, Yukun
Other Authors: Erik Cambria
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access: http://hdl.handle.net/10356/75838
Institution: Nanyang Technological University
Description:
Concepts are critical semantic units that capture the high-level knowledge of human language. As a way to go beyond word-level analysis, representing and leveraging concept-level information is an important addition to existing natural language processing (NLP) systems. More specifically, concepts are critical for understanding people's opinions. For example, in online reviews people express opinions towards particular entities, such as products or sentiment aspects, and these entities are mentions of concepts rather than just words. Compared with words, mentions of abstract concepts may be compound phrases (either consecutive or non-consecutive), which are likely to form a large vocabulary. Furthermore, semantic properties (e.g., relations or attributes) may be attached to the concepts, which further increases their dimensionality. In short, using concepts faces the curse of dimensionality. On the other hand, information from only a single level does not suffice for a thorough understanding of human language, so a meaningful representation is required to encode the correlations and dependencies between abstract concepts and words.

In this thesis, we therefore focus on effectively leveraging and integrating concept-level and word-level information by projecting concepts and words into a lower-dimensional space while retaining the most critical semantics. In the broader context of opinion understanding systems, we investigate the use of the fused embeddings for several core NLP tasks: named entity detection and classification, automatic speech recognition reranking, and targeted sentiment analysis.

We first propose a novel method to inject entity-based information into a word embedding space. The word embeddings are learned from a set of named entity features instead of merely contextual words. We demonstrate that the new word embeddings are a better feature representation for detecting and classifying named entities in streams of telephone conversations.

Apart from learning input feature embeddings, we then explore encoding entity types (i.e., concept categories) in a label embedding space. Our label embeddings mainly leverage two types of information: label hierarchy and label prototypes. Since the label embeddings are computed prior to training, they add no computational overhead at run-time. We evaluate the resulting label embeddings on multiple large-scale datasets built for the task of fine-grained named entity typing; compared with state-of-the-art methods, our label embedding method achieves superior performance.
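As a rough illustration of the pre-computed label embedding idea, the sketch below (Python) derives a label vector from two sources: prototype words associated with the label, and the label's parent in a type hierarchy. Everything here, from the toy word vectors to the `label_embedding` function, is a hypothetical stand-in for the thesis's actual construction.

```python
import numpy as np

# Toy pre-trained word vectors (in practice: word2vec/GloVe lookups).
WORD_VECS = {
    "actor":    np.array([0.9, 0.1, 0.0]),
    "singer":   np.array([0.8, 0.2, 0.1]),
    "city":     np.array([0.0, 0.9, 0.3]),
    "person":   np.array([0.7, 0.2, 0.0]),
    "location": np.array([0.1, 0.8, 0.2]),
}

# Hypothetical fine-grained type hierarchy: child label -> parent label.
HIERARCHY = {"/person/artist": "/person", "/location/city": "/location"}

# Hypothetical prototype words observed with each label.
PROTOTYPES = {
    "/person":        ["person"],
    "/person/artist": ["actor", "singer"],
    "/location":      ["location"],
    "/location/city": ["city"],
}

def label_embedding(label, alpha=0.5):
    """Blend a prototype-based vector with the parent label's vector so
    that hierarchically related labels end up close in the embedding
    space. Computed once, before training."""
    proto = np.mean([WORD_VECS[w] for w in PROTOTYPES[label]], axis=0)
    parent = HIERARCHY.get(label)
    if parent is None:
        return proto
    parent_vec = np.mean([WORD_VECS[w] for w in PROTOTYPES[parent]], axis=0)
    return alpha * proto + (1 - alpha) * parent_vec

# All label vectors are available before the typing model is trained.
label_vecs = {lbl: label_embedding(lbl) for lbl in PROTOTYPES}
print(label_vecs["/person/artist"])
```

Because the vectors exist before training starts, the typing model can consume them like any fixed lookup table, which is why the run-time complexity is unchanged.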
Next, we demonstrate that a binary embedding of named entities can help rerank speech-to-text hypotheses. Named entities are encoded using a Restricted Boltzmann Machine (RBM) and used as prior knowledge in a discriminative reranking model. We also extend the training of the RBM to work with speech recognition hypotheses.

Finally, we investigate the use of embeddings of commonsense concepts for targeted sentiment analysis. This task is also entity-centered: given a target entity in a sentence, the goal is to resolve the correct aspect categories and the corresponding sentiment polarity of the target. We propose a new computational structure for Long Short-Term Memory (LSTM) networks that more effectively incorporates embeddings of commonsense knowledge.

In summary, this thesis proposes novel solutions for representing and leveraging concept-level and word-level information across a series of NLP tasks that are key to understanding people's opinions.
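To make the commonsense-LSTM contribution concrete, here is a minimal sketch of one way concept embeddings can enter an LSTM cell, namely as an extra additive input to each gate. The weight names and the concept term are assumptions for illustration; the thesis's actual computational structure may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_with_concepts(x, c_emb, h_prev, cell_prev, W, U, V, b):
    """One LSTM step in which a commonsense concept embedding c_emb
    contributes an extra term to every gate (a sketch of the general
    idea, not the exact equations from the thesis). W, U, V, b are
    dicts keyed by gate name: 'i', 'f', 'o', 'g'."""
    gates = {}
    for k in ("i", "f", "o", "g"):
        pre = W[k] @ x + U[k] @ h_prev + V[k] @ c_emb + b[k]
        gates[k] = np.tanh(pre) if k == "g" else sigmoid(pre)
    cell = gates["f"] * cell_prev + gates["i"] * gates["g"]
    h = gates["o"] * np.tanh(cell)
    return h, cell

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
dx, dc, dh = 4, 3, 5  # input, concept, and hidden dimensions
W = {k: rng.standard_normal((dh, dx)) for k in "ifog"}
U = {k: rng.standard_normal((dh, dh)) for k in "ifog"}
V = {k: rng.standard_normal((dh, dc)) for k in "ifog"}
b = {k: np.zeros(dh) for k in "ifog"}
h, cell = lstm_step_with_concepts(
    rng.standard_normal(dx), rng.standard_normal(dc),
    np.zeros(dh), np.zeros(dh), W, U, V, b)
print(h.shape)  # (5,)
```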
Affiliations: School of Computer Science and Engineering; Rolls-Royce@NTU Corporate Lab
Degree: Doctor of Philosophy (SCE)
Citation: Ma, Y. (2018). Concept-based embeddings for natural language processing. Doctoral thesis, Nanyang Technological University, Singapore.
DOI: 10.32657/10356/75838
Physical Description: 128 p., application/pdf
Deposited: 2018-06-19