Unified generative and discriminative training for multi-modal Large Language Models

In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative trainin...

Bibliographic Details
Main Authors:	CHOW, Wei; LI, Juncheng; PAN, Kaihang; YU, Qifan; FEI, Hao; GE, Zhiqi; YANG, Shuai; TENG, Siliang; ZHANG, Hanwang; SUN, Qianru
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Machine learning; Generative training; Multimodal Large Language Models; Semantics extraction; Artificial Intelligence and Robotics; Computer Sciences
Online Access:	https://ink.library.smu.edu.sg/sis_research/9743 ; https://ink.library.smu.edu.sg/context/sis_research/article/10743/viewcontent/NeurIPS_2024_Sugar.pdf
Institution:	Singapore Management University

Similar Items

Towards unified multimodal editing with enhanced knowledge collaboration
by: PAN, Kaihang, et al.
Published: (2024)

Genixer : Empowering multimodal Large Language Models as a powerful data generator
by: ZHAO, Henry Hengyuan, et al.
Published: (2024)

SELF-SUPERVISED MODELING FOR MULTI-MODAL UNDERSTANDING
by: YUE XIANGHU
Published: (2024)

Multi modal video analysis with LLM for descriptive emotion and expression annotation
by: Fan, Yupei
Published: (2024)

Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection
by: CAO, Rui, et al.
Published: (2023)

MULTIMODAL INSTRUCTION IN INITIAL TEACHER TRAINING: PROSPECTS AND CHALLENGES
by: Trần, Thị Hiếu Thủy
Published: (2019)

MVGamba : Unify 3D content generation as state space sequence modeling
by: YI, Xuanyu, et al.
Published: (2024)

MERMAID: A dataset and framework for multimodal meme semantic understanding
by: TOH, Shaun, et al.
Published: (2023)

LEVERAGING MULTIMODAL INFORMATION IN SEMANTICS AND SENTICS ANALYSIS OF USER-GENERATED CONTENT
by: RAJIV RATN SHAH
Published: (2019)

EmpathyEar : An open-source avatar multimodal empathetic chatbot
by: FEI, Hao, et al.
Published: (2024)

CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding
by: LI, Dong, et al.
Published: (2024)

Modality-aware discriminative fusion network for integrated analysis of brain imaging genomics
by: SHENG, Xiaoqi, et al.
Published: (2024)

The semantic priming project
by: Hutchison, K.A., et al.
Published: (2016)

Fully-Synthesizable Unified True Random Number Generator and Cryptographic Core
by: SACHIN TANEJA, et al.
Published: (2021)

A column generation based approach for the Train Network Design Optimization problem
by: Jin, J.G., et al.
Published: (2014)

A semantic foundation for TCOZ in unifying theories of programming
by: Qin, S., et al.
Published: (2013)

Fusing pairwise modalities for emotion recognition in conversations
by: Fan, Chunxiao, et al.
Published: (2024)

Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition
by: LI, Bobo, et al.
Published: (2023)

Edge detection guide network for semantic segmentation of remote-sensing images
by: Jin, Jianhui, et al.
Published: (2023)

A hybrid approach for detecting prerequisite relations in multi-modal food recipes
by: PAN, Liangming, et al.
Published: (2020)

Multi-modal mixed reality human computer interfaces
by: ZHOU ZHIYING
Published: (2019)

Hyperbolic Visual Embedding Learning for Zero-Shot Recognition
by: Shaoteng Liu, et al.
Published: (2020)

A Cardan's discriminant approach to predicting currency crashes
by: Koh, S.K., et al.
Published: (2013)

Attribute-based Image Retrieval: Towards Bridging the Semantic and Intention Gaps
by: ZHANG HANWANG
Published: (2014)

Haptic voice recognition: Augmenting speech modality with touch events for efficient speech recognition
by: Sim, K.C.
Published: (2013)

LLMs-as-instructors : Learning from errors toward automating model improvement
by: YING, Jiahao, et al.
Published: (2024)

Retrieval augmented recipe generation
by: LIU, Guoshan, et al.
Published: (2025)

Performance analysis of Llama 2 among other LLMs
by: HUANG, Donghao, et al.
Published: (2024)

Efficient cross-modal video retrieval with meta-optimized frames
by: HAN, Ning, et al.
Published: (2024)

Sparse representation classifier steered discriminative projection
by: Yang, J., et al.
Published: (2014)

REPRESENTATION LEARNING OF DATA WITH MULTIPLE MODALITIES WITH APPLICATIONS TO VISUAL QUESTION ANSWERING
by: ILIEVSKI ILIJA
Published: (2018)

Matryoshka Peek: Toward Learning Fine-Grained, Robust, Discriminative Features for Product Search
by: Zaw lin Kyaw, et al.
Published: (2020)

BND*-DDQN: learn to steer autonomously through deep reinforcement learning
by: Wu, Keyu, et al.
Published: (2022)

Enhancing event-based semantics in the ontology of picture books 2
by: Ang, Karen S.
Published: (2012)

Face recognition using recursive fisher linear discriminant
by: Xiang, C., et al.
Published: (2014)

Spatial pedagogy: Mapping meanings in the use of classroom space
by: Lim, F.V., et al.
Published: (2014)

Towards robust and efficient multimodal representation learning and fusion
by: Guo, Xiaobao
Published: (2025)

Combining relations for information extraction from free text
by: Maslennikov, M., et al.
Published: (2013)

LOVA3 : Learning to visual question answering, asking and assessment
by: ZHAO, Henry Hengyuan, et al.
Published: (2024)

Integrated coherent Raman scattering and multiphoton microscopy for label-free imaging of the dentin in the tooth
by: Wang, Z., et al.
Published: (2014)