Unified generative and discriminative training for multi-modal Large Language Models

In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation. This paper addresses these challenges by proposing a unified approach that integrates the strengths of both paradigms. Considering interleaved image-text sequences as the general format of input samples, we introduce a structure-induced training strategy that imposes semantic relationships between input samples and the MLLM’s hidden state. This approach enhances the MLLM’s ability to capture global semantics and distinguish fine-grained semantics. By leveraging dynamic sequence alignment within the Dynamic Time Warping framework and integrating a novel kernel for fine-grained semantic differentiation, our method effectively balances generative and discriminative tasks. Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art results in multiple generative tasks, especially those requiring cognitive and discrimination abilities. Additionally, our method surpasses discriminative benchmarks in interleaved and fine-grained retrieval tasks. By employing a retrieval-augmented generation strategy, our approach further enhances performance in some generative tasks within one model, offering a promising direction for future research in vision-language modeling.
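
A minimal sketch of the alignment machinery the abstract describes: classic Dynamic Time Warping over a pairwise similarity kernel, matching one sequence of vectors (say, per-segment embeddings of an interleaved image-text input) against another (say, the MLLM's hidden states). Plain cosine similarity stands in for the paper's novel fine-grained kernel, and all names here are illustrative, not the authors' code.

```python
# DTW alignment of two vector sequences over a cosine-similarity kernel.
# This is the textbook algorithm, not the paper's implementation; the
# paper swaps in its own kernel for fine-grained semantic differentiation.
import numpy as np

def cosine_kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (m, d) and b (n, d)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def dtw_align(seq_a: np.ndarray, seq_b: np.ndarray):
    """Return (total cost, warping path); per-step cost = 1 - similarity."""
    cost = 1.0 - cosine_kernel(seq_a, seq_b)
    m, n = cost.shape
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # match
                acc[i - 1, j],      # advance in seq_a only
                acc[i, j - 1],      # advance in seq_b only
            )
    # Backtrack the optimal warping path from (m, n) to (1, 1).
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[m, n], path[::-1]

# Toy usage: align an 8-step input sequence with a 10-step hidden sequence.
rng = np.random.default_rng(0)
total, path = dtw_align(rng.normal(size=(8, 16)), rng.normal(size=(10, 16)))
print(f"alignment cost: {total:.3f}, path length: {len(path)}")
```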

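For the discriminative side, the abstract names CLIP as the exemplar. As a reference point only, below is the standard symmetric contrastive (InfoNCE) loss over pooled image and text representations; the paper's structure-induced objective constrains semantic relations between whole interleaved sequences and goes beyond this baseline.

```python
# CLIP-style symmetric InfoNCE loss: matched (i, i) image-text pairs are
# positives, everything else in the batch is a negative. A baseline
# sketch only, not the paper's structure-induced objective.
import numpy as np

def info_nce(img_h: np.ndarray, txt_h: np.ndarray, tau: float = 0.07) -> float:
    img = img_h / np.linalg.norm(img_h, axis=1, keepdims=True)
    txt = txt_h / np.linalg.norm(txt_h, axis=1, keepdims=True)
    logits = img @ txt.T / tau          # (B, B) similarity matrix
    labels = np.arange(len(logits))     # positives sit on the diagonal

    def xent(l: np.ndarray) -> float:
        """Mean cross-entropy of each row against its diagonal entry."""
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy batch of 4 pooled image/text states of width 16.
rng = np.random.default_rng(0)
print(f"loss: {info_nce(rng.normal(size=(4, 16)), rng.normal(size=(4, 16))):.3f}")
```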

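Finally, the retrieval-augmented generation strategy can be read as a retrieve-then-generate loop in which a single model supplies both the retrieval embeddings and the answer. In this sketch, embed, generate, and the toy corpus are hypothetical placeholders standing in for the MLLM's pooled hidden states and its decoding loop, not the paper's implementation.

```python
# RAG "within one model": the same (here faked) model embeds queries and
# documents for retrieval, then generates conditioned on what it retrieved.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedder: hash-seeded noise instead of pooled MLLM states."""
    h = np.random.default_rng(abs(hash(text)) % (2**32)).normal(size=16)
    return h / np.linalg.norm(h)

def generate(prompt: str) -> str:
    """Placeholder generator: a real system decodes with the same MLLM."""
    return f"<answer conditioned on: {prompt}>"

corpus = ["caption of image A", "caption of image B", "caption of image C"]
corpus_vecs = np.stack([embed(c) for c in corpus])

def rag_answer(query: str, k: int = 2) -> str:
    q = embed(query)
    top = np.argsort(corpus_vecs @ q)[::-1][:k]   # k nearest captions
    context = " | ".join(corpus[i] for i in top)  # retrieved evidence
    return generate(f"context: {context} question: {query}")

print(rag_answer("What object is in image B?"))
```
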
Bibliographic Details
Main Authors: CHOW, Wei; LI, Juncheng; PAN, Kaihang; YU, Qifan; FEI, Hao; GE, Zhiqi; YANG, Shuai; TANG, Siliang; ZHANG, Hanwang; SUN, Qianru
Format: text (application/pdf)
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Machine learning; Generative training; Multimodal Large Language Models; Semantics extraction; Artificial Intelligence and Robotics; Computer Sciences
Online Access:https://ink.library.smu.edu.sg/sis_research/9743
https://ink.library.smu.edu.sg/context/sis_research/article/10743/viewcontent/NeurIPS_2024_Sugar.pdf
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems (InK@SMU)
Published Date: 2024-12-01
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)