Data efficient deep multimodal learning

Bibliographic Details
Main Author: Shen, Meng
Other Authors: Deepu Rajan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2025
Subjects:
Online Access:https://hdl.handle.net/10356/182346
Institution: Nanyang Technological University
Description
Summary: Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets spanning multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. First, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities and thereby produce unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that utilize both weak and strong modalities fairly. Building on this warm-start active learning work, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, using uni-modal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term that maximizes modality alignment, leading to improved model performance with the same amount of data compared to existing methods. Extending our focus from data selection to the use of training data, we address the challenge of hallucinations in large vision-language models, where models generate content that is inconsistent with the input images. We investigate the relationship between hallucinations and the visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference cost. The contributions of this thesis offer significant advances in data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.
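
As a rough illustration of the final contribution, the sketch below shows one way per-token training weights could be derived from visual dependence in PyTorch. It is a minimal sketch under stated assumptions, not the thesis's method: the abstract does not specify how visual dependence is measured, so here it is scored as the drop in a token's log-probability when the image is removed from the context, and the names visual_dependence_weights and weighted_token_loss, along with the 1 + alpha * dependence weighting rule, are hypothetical.

```python
import torch
import torch.nn.functional as F

def visual_dependence_weights(logits_with_image, logits_without_image, target_ids, alpha=1.0):
    # Score each target token by how much the image raises its log-probability.
    # This scoring rule is an illustrative assumption, not the thesis's definition.
    logp_with = F.log_softmax(logits_with_image, dim=-1)
    logp_without = F.log_softmax(logits_without_image, dim=-1)
    tok_with = logp_with.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    tok_without = logp_without.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    dependence = (tok_with - tok_without).clamp(min=0.0)  # >= 0 when the image helps
    return (1.0 + alpha * dependence).detach()            # up-weight visually grounded tokens

def weighted_token_loss(logits_with_image, logits_without_image, target_ids, alpha=1.0):
    # Standard next-token cross-entropy, re-weighted per token by visual dependence.
    weights = visual_dependence_weights(logits_with_image, logits_without_image, target_ids, alpha)
    ce = F.cross_entropy(
        logits_with_image.flatten(0, 1),   # (batch*seq, vocab)
        target_ids.flatten(),              # (batch*seq,)
        reduction="none",
    ).view_as(target_ids)
    return (weights * ce).sum() / weights.sum()

# Toy usage with random tensors standing in for a vision-language model's caption logits.
batch, seq, vocab = 2, 5, 100
logits_img = torch.randn(batch, seq, vocab)
logits_no_img = torch.randn(batch, seq, vocab)
targets = torch.randint(0, vocab, (batch, seq))
print(weighted_token_loss(logits_img, logits_no_img, targets).item())
```

Because the weights are detached and applied only during training, such a re-weighting would add no inference-time cost, consistent with the claim in the abstract.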