Towards robust and efficient multimodal representation learning and fusion

Bibliographic Details
Main Author: Guo, Xiaobao
Other Authors: Kong, Wai-Kin Adams
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2025
Online Access:https://hdl.handle.net/10356/182226
Institution: Nanyang Technological University
Description
Summary: In the past few years, multimodal learning has made significant progress. The goal of multimodal learning is to build models that can relate and process data from multiple modalities. One challenge is to learn useful representations efficiently given the heterogeneity of the data. Another is how to fuse information from two or more modalities into a prediction that remains robust when some modalities are missing. To narrow these research gaps, this dissertation develops effective and efficient network modules for both unimodal learning and crossmodal fusion, and aims to improve the robustness of the fused features for different downstream tasks.

In multimodal representation learning, both complementary crossmodal fusion and effective unimodal representations are crucial. Some prior works directly modulate the features of one modality with those of another. Although this can align multimodal features, it ignores the refinement of both unimodal and crossmodal representations, which is important for multimodal fusion. This dissertation introduces the Unimodal and Crossmodal Refinement Network (UCRN) to enhance both unimodal and crossmodal representations. A unimodal refinement module iteratively updates modality-specific representations with transformer-based attention layers followed by self-quality improvement layers. The refined unimodal representations are then projected into a common latent space and further tuned by a crossmodal refinement module. Results on multiple benchmark datasets show improved performance and robustness against missing modalities and noisy data in multimodal sequence fusion.

Beyond refining representations for better fusion, it is also important to reduce overfitting during learning. Because modalities differ in predictive power, the modality gap can lead to overfitting and undermine fusion performance. This dissertation therefore proposes regularized expressive representation distillation (RERD) to improve unimodal and crossmodal representations. A multimodal Sinkhorn distance regularizer improves crossmodal optimization and reduces the modality gap before fusion, and multi-head distillation encoders with iterative updates refine the unimodal representations. Evaluations on a range of benchmark datasets show that RERD outperforms current baselines and is an effective method for deep multimodal fusion on sequence datasets.

To further improve the robustness of multimodal representations against noisy inputs, we study robustness in the context of multimodal contrastive learning (MCL), since contrastive learning is effective at discriminating coexisting semantic features (positives) from irrelevant ones (negatives) in multimodal signals. To address weaknesses in MCL, this dissertation presents Pace-adaptive and Noise-resistant Noise-Contrastive Estimation (PN-NCE), a novel self-supervised method for multimodal fusion that adaptively optimizes the similarity between positive and negative pairs and improves robustness against noisy inputs during training. By integrating an estimator that measures modality invariance, PN-NCE achieves consistent performance improvements across various multimodal tasks and datasets and results comparable to supervised learning approaches.
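The summary describes UCRN only at a high level: self-attention refinement of each modality, projection into a shared latent space, then crossmodal refinement. The PyTorch sketch below is not the dissertation's code; it only illustrates that general pattern, and the layer names, sizes, and the gated residual stand-in for the "self-quality improvement" layers are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class UnimodalRefiner(nn.Module):
    """Refine one modality's sequence with self-attention plus a gated residual
    update (a hypothetical stand-in for the 'self-quality improvement' layers)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, D)
        refined, _ = self.attn(x, x, x)                    # self-attention update
        g = self.gate(refined)                             # per-feature gate in [0, 1]
        return self.norm(x + g * refined)                  # gated residual refinement

class CrossmodalRefiner(nn.Module):
    """Project two modalities into a shared space and let each attend to the other."""
    def __init__(self, dim_a: int, dim_b: int, dim: int = 128, heads: int = 4):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim)
        self.proj_b = nn.Linear(dim_b, dim)
        self.a_from_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        a, b = self.proj_a(a), self.proj_b(b)
        a_ref, _ = self.a_from_b(a, b, b)   # modality A queries modality B
        b_ref, _ = self.b_from_a(b, a, a)   # modality B queries modality A
        return a + a_ref, b + b_ref

# Toy usage: a 16-step audio sequence (64-dim) and a 16-step visual sequence (32-dim).
audio, vision = torch.randn(2, 16, 64), torch.randn(2, 16, 32)
audio, vision = UnimodalRefiner(64)(audio), UnimodalRefiner(32)(vision)
fused_a, fused_v = CrossmodalRefiner(64, 32)(audio, vision)
```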
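The RERD part of the summary names a multimodal Sinkhorn distance regularizer without giving its exact form. Below is a minimal, generic log-domain Sinkhorn sketch (entropic-regularized optimal transport with uniform marginals) of the kind such a regularizer could build on; the squared-Euclidean cost, epsilon, iteration count, and weighting are assumptions, not the dissertation's implementation.

```python
import math
import torch

def sinkhorn_distance(x: torch.Tensor, y: torch.Tensor,
                      eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """Entropic-regularized OT cost between point clouds x: (n, d) and y: (m, d),
    with uniform marginals, computed via log-domain Sinkhorn iterations."""
    cost = torch.cdist(x, y, p=2) ** 2                  # (n, m) squared Euclidean cost
    n, m = cost.shape
    log_a = torch.full((n,), -math.log(n))              # uniform source marginal (log)
    log_b = torch.full((m,), -math.log(m))              # uniform target marginal (log)
    f, g = torch.zeros(n), torch.zeros(m)               # dual potentials
    for _ in range(n_iters):
        f = eps * (log_a - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps)   # transport plan (n, m)
    return (plan * cost).sum()

# As a regularizer, the distance between two modalities' embeddings would simply be
# weighted and added to the task loss (embeddings and weight are illustrative):
audio_emb, text_emb = torch.randn(20, 128), torch.randn(30, 128)
reg = 0.1 * sinkhorn_distance(audio_emb, text_emb)
```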
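Likewise, the abstract does not spell out PN-NCE's pace-adaptive or noise-resistant terms, so the sketch below shows only the standard symmetric InfoNCE-style multimodal contrastive loss that such methods extend: matched clips from two modalities are positives and every other pairing in the batch is a negative. The temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def multimodal_info_nce(z_a: torch.Tensor, z_b: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """z_a, z_b: (B, D) embeddings of the same B samples from two modalities.
    Matched pairs (the diagonal) are positives; all other pairings are negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature        # (B, B) scaled cosine similarities
    targets = torch.arange(z_a.size(0))         # positive index for each row
    # Symmetric loss: modality A retrieves B and modality B retrieves A.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings for an 8-clip batch:
loss = multimodal_info_nce(torch.randn(8, 256), torch.randn(8, 256))
```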
To gain more insight into effective and reliable multimodal learning in practical applications, we apply the proposed ideas to audio-visual deception detection in videos. Deception detection in conversations is a challenging yet important task with pivotal applications in various fields. The first challenge is the scarcity of high-quality datasets for deception detection research. This dissertation therefore introduces DOLOS, a large gameshow deception detection dataset with rich multimodal annotations, comprising 1,675 audio-visually annotated video clips of 213 subjects, and benchmarks deception detection approaches on it. We further propose Parameter-Efficient Crossmodal Learning (PECL), which combines a Uniform Temporal Adapter with a Plug-in Audio-Visual Fusion module to improve performance with fewer trainable parameters and to exploit multi-task learning for better deception detection. Unlike the refinement modules in UCRN and RERD, the Uniform Temporal Adapter is lightweight and plug-and-play.

In summary, this dissertation focuses on efficient and robust multimodal learning and fusion. To achieve these goals, different methods and modules are proposed to enhance the performance of fused features for downstream tasks. Experimental results on benchmark datasets and real-world applications demonstrate the effectiveness of the proposed methods compared with state-of-the-art approaches.
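The abstract describes the Uniform Temporal Adapter only as lightweight and plug-and-play. The sketch below is not the PECL implementation; it illustrates the generic parameter-efficient adapter idea under those assumptions: a small residual bottleneck with a depthwise temporal convolution inserted after a frozen backbone block, so only the adapter's few parameters are trained. All names and sizes here are hypothetical.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Small residual bottleneck with a depthwise temporal convolution, meant to be
    plugged in after a frozen backbone block; only these parameters are trained."""
    def __init__(self, dim: int, bottleneck: int = 32, kernel: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)                        # compress features
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=kernel // 2, groups=bottleneck)  # mix along time
        self.up = nn.Linear(bottleneck, dim)                          # expand back
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:               # x: (B, T, D)
        h = self.act(self.down(x))                                    # (B, T, bottleneck)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)          # Conv1d wants (B, C, T)
        return x + self.up(self.act(h))                               # residual: plug-and-play

# Freeze a backbone block and train only the adapter (illustrative):
backbone = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = TemporalAdapter(256)
frames = torch.randn(4, 16, 256)   # 4 clips, 16 time steps, 256-dim features
out = adapter(backbone(frames))
```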