From knowledge augmentation to multi-tasking: towards human-like dialogue systems

Bibliographic Details
Main Author: Yang, Tianji
Other Authors: Erik Cambria
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access: https://hdl.handle.net/10356/168465
Institution: Nanyang Technological University
Description
Summary: The goal of building dialogue agents that can converse with humans naturally has been a long-standing dream of researchers since the early days of artificial intelligence. The well-known Turing Test proposed to judge the ultimate validity of an artificial intelligence agent by the indistinguishability of its dialogues from humans'. It should come as no surprise that human-level dialogue systems are very challenging to build. While early efforts on rule-based systems found limited success, the emergence of deep learning enabled great advances on this topic. The works covered in this thesis originated in an era when data-driven, deep-learning-based dialogue systems were beginning to take off. Dialogue systems trained on message-response pairs mined from social media began to show the ability to conduct natural conversations, but they were limited in many ways, such as lacking knowledge grounding, multimodality, and multi-utility.

In this thesis, we focus on methods that address the issues that have maintained the gap between artificial conversational agents and human-level interlocutors. These methods were proposed and evaluated in ways inspired by general state-of-the-art AI methodologies, while also targeting the specific characteristics of dialogue systems.

First, we expand the variety of information that dialogue systems can depend on. In its simplest and most common form, a dialogue consists of responses and their preceding textual context. This representation, however, falls short of real-world human conversation, which often depends on other modalities and specific knowledge bases. To condition dialogues on more modalities, we explore dialogue generation augmented by the audio representation of the input. We design an auxiliary response-classification task to learn an audio representation suitable for our dialogue generation objective.
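The auxiliary-task idea above amounts to a multi-task objective: the shared audio encoder is trained on the generation loss plus a weighted response-classification loss. The sketch below is illustrative only (function names, the weighting scheme, and the default weight are assumptions, not the thesis's actual code):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(logits, target_idx):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target_idx])


def multitask_loss(gen_loss, cls_logits, cls_target, aux_weight=0.5):
    """Generation loss plus a weighted auxiliary classification loss.

    In training, gradients from both terms flow into the shared audio
    encoder, pushing it toward representations useful for generation.
    """
    return gen_loss + aux_weight * cross_entropy(cls_logits, cls_target)
```

With `aux_weight = 0` this reduces to plain generation training, which gives a natural ablation for measuring what the auxiliary task contributes.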
We use word-level modality fusion to integrate audio features into the Sequence-to-Sequence learning framework. Our model can generate appropriate responses corresponding to the emotion and emphasis expressed in the audio.

Commonsense knowledge must also be integrated into a dialogue system effectively for it to respond to human utterances in an interesting and engaging way. In the first attempt to integrate a large commonsense knowledge base into end-to-end conversational models, we propose a model that jointly considers the context and its related commonsense knowledge when selecting an appropriate response. We demonstrate that the knowledge-augmented models are superior to their knowledge-free counterparts.

While the two directions above ground dialogues on various new kinds of information, they are not the only challenges that dialogue systems face. Traditionally, the goal of building intelligent dialogue systems has largely been pursued separately for two utilities: task-oriented dialogue systems, which perform task-specific functions, and open-domain dialogue systems, which focus on non-goal-oriented chitchat. The two dialogue modes can potentially be intertwined seamlessly in the same conversation, as a friendly human assistant does with ease. This thesis also covers our effort to fuse the two dialogue modes in multi-turn dialogues. We build a new dataset, FusedChat, whose conversation sessions contain exchanges from both dialogue modes with inter-mode contextual dependency. We propose two baseline models for this task and analyze their accuracy.

Last but not least, we address the computational-efficiency issue faced by large-scale retrieval-based dialogue systems. Strong retrieval-based dialogue systems built on a large set of natural candidates can produce diverse and controllable responses.
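The word-level modality fusion mentioned earlier can be sketched roughly as follows: each token's embedding is concatenated with its time-aligned audio feature vector before the sequence enters the Seq2Seq encoder. This is an illustrative outline, not the thesis's actual implementation, and the function name and alignment assumption (one audio vector per token) are ours:

```python
def fuse_word_level(word_embeddings, audio_features):
    """Concatenate each word embedding with its aligned audio features.

    word_embeddings: per-token embedding vectors (lists of floats)
    audio_features:  per-token audio feature vectors, aligned 1:1
    Returns the fused sequence that a Seq2Seq encoder would consume.
    """
    assert len(word_embeddings) == len(audio_features), "sequences must align"
    return [w + a for w, a in zip(word_embeddings, audio_features)]
```

Concatenation at every timestep (rather than a single utterance-level audio vector) lets the encoder react to emphasis on specific words.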
However, a large candidate set can be computationally costly. We propose methods that support a fast and accurate response-retrieval system. To boost accuracy, we adopt a knowledge-distillation approach in which a very strong yet computationally expensive joint encoding model facilitates the training of our encoders. We then boost retrieval speed by adopting a learning-based candidate-screening method that further reduces inference time. We demonstrate that our model achieves a strong trade-off between retrieval accuracy and speed.

In summary, this thesis systematically demonstrates our efforts to innovate on dialogue systems. Through our experiments, we found that with new designs built upon general state-of-the-art NLP methodologies, dialogue systems can be made faster, multimodal, capable of multiple utilities, and grounded in useful external information. We believe the research questions we focused on are important aspects of ultimately raising automated dialogue agents to the human level. The main contribution of the works covered in this thesis lies in their initiating effect (to a certain degree) on directions that researchers have continued to pursue to this day. With our work on dialogue systems spanning the last four years, and state-of-the-art NLP models evolving rapidly year by year, we note that the models used in some of our earlier works (e.g., LSTMs) cannot compete with the state-of-the-art models available today (e.g., GPT-4). In such cases, we briefly and systematically describe the follow-up works (the current state of the art) that stemmed from the methodologies shown in our work, especially those based on recent advances in large language models.
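The knowledge-distillation idea described for retrieval (a cheap, independently encoding student trained to mimic an expensive joint-encoding teacher) can be sketched as matching the teacher's score distribution over a candidate set with a KL-divergence loss. All names here are illustrative assumptions, not the thesis's actual code:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of candidate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over one batch of response candidates.

    Minimizing this pushes the fast student encoder to reproduce the
    expensive joint encoder's relative ranking of candidates, so the
    student can be used alone at inference time.
    """
    p = softmax(teacher_scores)  # soft targets from the joint encoder
    q = softmax(student_scores)  # distribution from the fast encoders
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The loss is zero when the student reproduces the teacher's scores exactly and grows as their rankings diverge, which is what makes it usable as a training signal.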