Deep learning for video-grounded dialogue systems

Bibliographic Details
Main Author: LE, Hung
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects:
Online Access:https://ink.library.smu.edu.sg/etd_coll/388
https://ink.library.smu.edu.sg/context/etd_coll/article/1386/viewcontent/SMU_Dissertation__2_.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.etd_coll-1386
record_format dspace
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic deep learning
neural networks
dialogue systems
video-grounded
response generation
dialogue state tracking
Databases and Information Systems
Graphics and Human Computer Interfaces
description In recent years, we have witnessed significant progress in building systems with artificial intelligence. However, despite advancements in machine learning and deep learning, we are still far from achieving autonomous agents that can perceive multi-dimensional information from the surrounding world and converse with humans in natural language. Towards this goal, this thesis is dedicated to building intelligent systems for the task of video-grounded dialogue. Specifically, in a video-grounded dialogue, a system is required to hold a multi-turn conversation with humans about the content of a video. Given an input video, a dialogue history, and a question about the video, the system has to understand the contextual information of the dialogue, extract relevant information from the video, and construct a dialogue response that is both contextually relevant and video-grounded. Compared to related research domains in computer vision and natural language processing, the video-grounded dialogue task raises challenging requirements, including: (1) language reasoning over multiple turns: the ability to understand contextual information from dialogues, which often contain linguistic dependencies from turn to turn; (2) visual reasoning in spatio-temporal space: the ability to extract information from videos, which contain both spatial and temporal variations that characterize object appearance and actions; and (3) language generation: the ability to acquire natural language and generate responses with both contextually relevant and video-grounded information. Towards building an intelligent system for the video-grounded dialogue task, we introduced a neural model, the Multimodal Transformer Network (MTN), that can be trained in an end-to-end manner to reason over both dialogue and video inputs and decode a natural language response. The architecture was tested on the established Audio-Visual Scene-Aware Dialogue (AVSD) benchmark and achieved superior performance compared to other neural-based systems. Despite this success, we found that MTN is not specifically designed for scenarios that require sophisticated visual or language reasoning. To further improve the visual reasoning capability of models, we introduced BiST, a Bidirectional Spatio-Temporal Reasoning approach that can extract relevant visual cues from videos in both the spatial and temporal dimensions. This approach achieved consistent results in both quantitative and qualitative evaluations. However, our findings show that in many scenarios, systems failed to learn the contextual information of the dialogue, which may lead to incorrect or incoherent system responses. To address this limitation, we focused our attention on the language reasoning capability of models. We proposed PDC, a path-based reasoning approach for dialogue context. PDC requires systems to learn to extract a traversal path among the turns in the dialogue context. Our findings demonstrate the performance gains of this approach compared to sequential or graph-based learning approaches. To combine both visual and language reasoning, we adopted compositionality to encode questions as sequential reasoning programs. Each program is parameterized by entities and actions, which are used to extract more refined features from the video inputs. We denoted this approach as the Video-grounded Neural Module Network (VGNMN).
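
As an illustration of the bidirectional spatio-temporal reasoning idea described above, the sketch below applies text-conditioned attention separately along the temporal and spatial axes of video features and then fuses the two views. This is a minimal sketch under assumed tensor shapes, not the thesis's BiST implementation; the module name, the mean-pooling used to form each view, and the single-vector question summary are illustrative choices.

# Minimal sketch (PyTorch) of text-conditioned attention over video features
# along both the temporal and spatial axes, loosely in the spirit of BiST.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalSTAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def attend(self, query, keys):
        # query: (B, D); keys: (B, N, D) -> attention-weighted sum over N
        scores = torch.einsum("bd,bnd->bn", query, keys) / keys.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)
        return torch.einsum("bn,bnd->bd", weights, keys)

    def forward(self, text, video):
        # text: (B, D) question summary; video: (B, T, S, D) frame-by-region features
        # temporal view: pool regions per frame, attend over the T frames
        temporal = self.attend(text, video.mean(dim=2))      # (B, D)
        # spatial view: pool frames per region, attend over the S regions
        spatial = self.attend(text, video.mean(dim=1))       # (B, D)
        return self.fuse(torch.cat([temporal, spatial], dim=-1))

# Example with random features: 2 dialogues, 8 frames, 16 regions, 256-dim features
out = BidirectionalSTAttention(256)(torch.randn(2, 256), torch.randn(2, 8, 16, 256))
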
From experiments with VGNMN, we found not only potential performance gains in automatic metrics but also improved interpretability through the learned reasoning programs. In video-grounded dialogue research, we found a major obstacle that hindered our progress: the limitation of data. Very limited video-grounded dialogue data are available, and developing a new benchmark involves costly and time-consuming manual annotation efforts. This data limitation essentially prevents a system from acquiring sufficient natural language understanding. We therefore proposed to make use of pretrained language models, such as GPT, to leverage the linguistic dependencies they have learned from large-scale text data. In another work, we adopted causality to augment the current data with counterfactual samples that support model training. Our findings show that both pretrained systems and data augmentation are effective strategies to alleviate the data limitation. To facilitate further research in this field, we developed DVD, a Diagnostic Video-grounded Dialogue benchmark. We built DVD as a diagnostic and synthetic benchmark to fairly evaluate systems by visual and textual complexity. We tested several baselines, from simple heuristic models to complex neural networks, and found that all models fall short in different aspects, from multi-turn textual references to visual object tracking. Our findings suggest that current approaches still perform poorly on DVD and that future approaches should incorporate multi-step and multi-modal reasoning capabilities. In view of the above findings, we developed a new sub-task within video-grounded dialogue systems. We introduced the Multimodal Dialogue State Tracking (MM-DST) task, which requires a system to maintain a recurring memory, or state, of all visual objects that are mentioned in the dialogue context. At each dialogue turn, dialogue utterances may introduce new visual objects or new object attributes, and a dialogue system is required to update the states of these objects. We leveraged techniques from task-oriented dialogue research, introduced a new baseline, and discussed our findings. Finally, we concluded the dissertation with a summary of our contributions and a discussion of potential future directions in video-grounded dialogue research.
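
To make the MM-DST formulation above concrete, the following minimal sketch represents the multimodal dialogue state as a mapping from visual object identifiers to attribute dictionaries and merges per-turn predictions into it. It is illustrative only, not the baseline proposed in the thesis; the object identifiers and attribute names are hypothetical.

# Minimal sketch of a multimodal dialogue state that accumulates the visual
# objects and attributes mentioned across dialogue turns (MM-DST-style).
from dataclasses import dataclass, field

@dataclass
class MultimodalDialogueState:
    # object_id -> {attribute name: value}, e.g. {"cube_3": {"color": "red"}}
    objects: dict = field(default_factory=dict)

    def update(self, turn_predictions: dict) -> None:
        # Merge per-turn predictions (new objects or new attributes) into the state.
        for obj_id, attributes in turn_predictions.items():
            self.objects.setdefault(obj_id, {}).update(attributes)

# Example: two turns of a conversation about a video (hypothetical objects)
state = MultimodalDialogueState()
state.update({"cube_3": {"color": "red"}})                    # turn 1 mentions a red cube
state.update({"cube_3": {"action": "sliding"},                # turn 2 adds an action ...
              "sphere_1": {"color": "blue"}})                 # ... and a new blue sphere
print(state.objects)
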
format text
author LE, Hung
author_facet LE, Hung
author_sort LE, Hung
title Deep learning for video-grounded dialogue systems
title_short Deep learning for video-grounded dialogue systems
title_full Deep learning for video-grounded dialogue systems
title_fullStr Deep learning for video-grounded dialogue systems
title_full_unstemmed Deep learning for video-grounded dialogue systems
title_sort deep learning for video-grounded dialogue systems
publisher Institutional Knowledge at Singapore Management University
publishDate 2022
url https://ink.library.smu.edu.sg/etd_coll/388
https://ink.library.smu.edu.sg/context/etd_coll/article/1386/viewcontent/SMU_Dissertation__2_.pdf
_version_ 1770567682703228928