From knowledge augmentation to multi-tasking: towards human-like dialogue systems

Bibliographic Details
Main Author: Yang, Tianji
Other Authors: Erik Cambria
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering
Online Access:https://hdl.handle.net/10356/168465
id sg-ntu-dr.10356-168465
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
spellingShingle Engineering::Computer science and engineering
Yang, Tianji
From knowledge augmentation to multi-tasking: towards human-like dialogue systems
description The goal of building dialogue agents that can converse with humans naturally has been a long-standing dream of researchers since the early days of artificial intelligence. The well-known Turing Test proposed to judge the ultimate validity of an artificial intelligence agent by the indistinguishability of its dialogues from humans'. It should come as no surprise that human-level dialogue systems are very challenging to build. While early efforts on rule-based systems found limited success, the emergence of deep learning enabled great advances on this topic. The works covered in this thesis originated in an era when data-driven, deep-learning-based dialogue systems were beginning to take off. Dialogue systems trained on message-response pairs found in social media began to show the ability to conduct natural conversations, but they remained limited in many ways, such as lacking knowledge grounding, multimodality, and multi-utility. In this thesis, we focus on methods that address the issues sustaining the gap between artificial conversational agents and human-level interlocutors. These methods were inspired by general state-of-the-art AI methodologies, but they also target the characteristics specific to dialogue systems.

First, we expand the variety of information that dialogue systems can depend on. In its simplest and most common form, a dialogue consists of responses and their preceding textual context. This representation, however, falls short of real-world human conversation, which often depends on other modalities and on specific knowledge bases. To condition dialogues on more modalities, we explore dialogue generation augmented by the audio representation of the input. We design an auxiliary response classification task to learn an audio representation suited to our dialogue generation objective, and we use word-level modality fusion to integrate audio features into the sequence-to-sequence (Seq2Seq) learning framework. Our model can generate appropriate responses corresponding to the emotion and emphasis expressed in the audio.

Commonsense knowledge must also be integrated into a dialogue system effectively for it to respond to human utterances in an interesting and engaging way. In the first attempt to integrate a large commonsense knowledge base into end-to-end conversational models, we propose a model that jointly takes into account the context and its related commonsense knowledge when selecting an appropriate response. We demonstrate that the knowledge-augmented models are superior to their knowledge-free counterparts.

While the two directions above ground dialogues on new kinds of information, they are not the only challenges that dialogue systems face. Traditionally, the goal of building intelligent dialogue systems has largely been pursued along two separate tracks: task-oriented dialogue systems, which perform task-specific functions, and open-domain dialogue systems, which focus on non-goal-oriented chitchat. The two dialogue modes can potentially be interwoven seamlessly in the same conversation, as a friendly human assistant does with ease. This thesis therefore also covers our effort to fuse the two dialogue modes in multi-turn dialogues. We build a new dataset, FusedChat, whose conversation sessions contain exchanges from both dialogue modes with inter-mode contextual dependency. We propose two baseline models for this task and analyze their accuracy.
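As an illustration of the word-level modality fusion described above, here is a minimal PyTorch sketch, not the thesis code: an utterance-level audio vector is concatenated to every word embedding before the Seq2Seq encoder, and an auxiliary classification head over the audio features stands in for the response classification task. All module names, dimensions, and the number of response classes are illustrative assumptions.

```python
# Minimal sketch of word-level audio-text fusion (illustrative assumptions only).
import torch
import torch.nn as nn

class AudioFusedEncoder(nn.Module):
    def __init__(self, vocab_size=10000, word_dim=256, audio_dim=128,
                 hidden_dim=512, num_response_classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # Word-level fusion: every token embedding is concatenated with the
        # (broadcast) audio representation of the whole input utterance.
        self.rnn = nn.LSTM(word_dim + audio_dim, hidden_dim, batch_first=True)
        # Auxiliary task: predict a response class from the audio features,
        # shaping the audio representation for the generation objective.
        self.aux_head = nn.Linear(audio_dim, num_response_classes)

    def forward(self, token_ids, audio_feats):
        # token_ids: (batch, seq_len); audio_feats: (batch, audio_dim)
        words = self.embed(token_ids)                             # (B, T, word_dim)
        audio = audio_feats.unsqueeze(1).expand(-1, words.size(1), -1)
        fused = torch.cat([words, audio], dim=-1)                 # (B, T, word+audio)
        enc_out, state = self.rnn(fused)   # `state` would seed a Seq2Seq decoder
        aux_logits = self.aux_head(audio_feats)
        return enc_out, state, aux_logits

enc = AudioFusedEncoder()
out, state, aux = enc(torch.randint(0, 10000, (2, 7)), torch.randn(2, 128))
```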
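Similarly, jointly scoring a candidate response against both the context and retrieved commonsense assertions (e.g., ConceptNet-style triples rendered as token sequences) can be sketched as below. The separate GRU encoders and the additive fusion of the two matching scores are simplifying assumptions, not the architecture proposed in the thesis.

```python
# Hedged sketch of knowledge-augmented response selection.
import torch
import torch.nn as nn

class KnowledgeAugmentedScorer(nn.Module):
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.context_enc = nn.GRU(dim, dim, batch_first=True)
        self.knowledge_enc = nn.GRU(dim, dim, batch_first=True)
        self.response_enc = nn.GRU(dim, dim, batch_first=True)

    def encode(self, rnn, ids):
        _, h = rnn(self.embed(ids))      # final hidden state: (1, B, dim)
        return h.squeeze(0)              # (B, dim)

    def forward(self, context_ids, knowledge_ids, response_ids):
        # knowledge_ids: retrieved commonsense assertions, tokenized and
        # concatenated per example (an assumption for this sketch).
        c = self.encode(self.context_enc, context_ids)
        k = self.encode(self.knowledge_enc, knowledge_ids)
        r = self.encode(self.response_enc, response_ids)
        # Match the candidate against context and knowledge; the selected
        # response maximizes the summed score.
        return (c * r).sum(-1) + (k * r).sum(-1)   # (B,)

scorer = KnowledgeAugmentedScorer()
ids = lambda n: torch.randint(0, 10000, (2, n))
scores = scorer(ids(12), ids(20), ids(8))
```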
Last but not least, we address the computational-efficiency issues faced by large-scale retrieval-based dialogue systems. Strong retrieval-based dialogue systems built on a large natural candidate set can produce diverse and controllable responses, but a large candidate set can be computationally costly. We propose methods that support a fast and accurate response retrieval system. To boost accuracy, we adopt a knowledge distillation approach in which a very strong yet computationally expensive joint encoding model facilitates the training of our encoders. We then boost retrieval speed by adopting a learning-based candidate screening method that further reduces inference time. We demonstrate that our model performs strongly in terms of the trade-off between retrieval accuracy and speed.

In summary, this thesis systematically demonstrates our efforts to innovate on dialogue systems. Through our experiments, we found that with new designs built upon general state-of-the-art NLP methodologies, dialogue systems can be made faster, multimodal, capable of multiple utilities, and grounded on useful external information. We believe the research questions we focused on are important steps toward ultimately raising automated dialogue agents to the human level. The main contribution of the works covered in this thesis lies in their (to a certain degree) initiating effect on directions that researchers have continued to pursue to this day. With our work on dialogue systems spanning the last four years, and state-of-the-art NLP models evolving quickly year by year, we note that the models used in some of our earlier works (e.g., LSTMs) cannot compete with the state-of-the-art models available today (e.g., GPT-4). In such cases, we briefly and systematically discuss follow-up works (the current state of the art) that stemmed from the methodologies shown in our work, especially those based on recent advances in large language models.
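The two efficiency ideas just described can be sketched as follows, under stated assumptions: the distillation loss matches a bi-encoder student's candidate scores to those of an expensive joint (cross-) encoder teacher via a softened KL objective, and a cheap screening pass keeps only the top-k candidates before full scoring. The function names, shapes, and the specific KL-based formulation are assumptions of this sketch, not the thesis implementation.

```python
# Hedged sketch: (1) distilling a bi-encoder retriever from a joint encoder,
# (2) learning-based candidate screening before full scoring.
import torch
import torch.nn.functional as F

def distillation_loss(student_ctx, student_cands, teacher_scores, T=2.0):
    # student_ctx: (B, dim); student_cands: (B, N, dim);
    # teacher_scores: (B, N) produced by the expensive joint encoder.
    student_scores = torch.einsum("bd,bnd->bn", student_ctx, student_cands)
    # Soften both score distributions and match them (standard KD objective).
    return F.kl_div(
        F.log_softmax(student_scores / T, dim=-1),
        F.softmax(teacher_scores / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def screen_then_rank(ctx_vec, cand_vecs, cheap_scores, k=100):
    # cheap_scores: (N,) from a lightweight screening model; keep the top-k
    # candidates, then run full dot-product scoring only on the survivors.
    k = min(k, cheap_scores.numel())
    keep = torch.topk(cheap_scores, k).indices
    full_scores = cand_vecs[keep] @ ctx_vec            # (k,)
    return keep[full_scores.argmax()]                  # best candidate index

loss = distillation_loss(torch.randn(4, 64), torch.randn(4, 16, 64),
                         torch.randn(4, 16))
best = screen_then_rank(torch.randn(64), torch.randn(500, 64), torch.randn(500))
```

The screening step trades a small amount of accuracy for a large reduction in the number of full scoring operations, which is the accuracy-speed trade-off the abstract refers to.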
author2 Erik Cambria
author_facet Erik Cambria
Yang, Tianji
format Thesis-Doctor of Philosophy
author Yang, Tianji
author_sort Yang, Tianji
title From knowledge augmentation to multi-tasking: towards human-like dialogue systems
title_short From knowledge augmentation to multi-tasking: towards human-like dialogue systems
title_full From knowledge augmentation to multi-tasking: towards human-like dialogue systems
title_fullStr From knowledge augmentation to multi-tasking: towards human-like dialogue systems
title_full_unstemmed From knowledge augmentation to multi-tasking: towards human-like dialogue systems
title_sort from knowledge augmentation to multi-tasking: towards human-like dialogue systems
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/168465
_version_ 1772825341297950720
spelling sg-ntu-dr.10356-168465 2023-07-04T01:52:13Z From knowledge augmentation to multi-tasking: towards human-like dialogue systems Yang, Tianji Erik Cambria School of Computer Science and Engineering Computational Intelligence Lab cambria@ntu.edu.sg Engineering::Computer science and engineering Doctor of Philosophy 2023-06-28T06:12:58Z 2023-06-28T06:12:58Z 2023 Thesis-Doctor of Philosophy Yang, T. (2023). From knowledge augmentation to multi-tasking: towards human-like dialogue systems. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/168465 https://hdl.handle.net/10356/168465 10.32657/10356/168465 en 04SBP000598C130 10.21979/N9/QWEBOS This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University