Large multimodal models for visual reasoning
This paper introduces a novel framework for enhancing visual spatial reasoning by leveraging the strengths of Large Language Models (LLMs) and Vision-Language Models (VLMs). We propose two complementary methods: LLMGuide and LLMVerify. LLMGuide uses the LLM to generate detailed step-by-step instruct...
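The abstract is cut off, so the sketch below is only an illustration of the two methods as named: a hypothetical `llm_guide` in which the LLM drafts step-by-step instructions for the VLM (as the abstract states), and a hypothetical `llm_verify` whose check-then-retry loop is an assumption inferred from its name. The generic `llm` and `vlm` callables stand in for real model APIs; none of this is the thesis's actual implementation.

```python
# Illustrative sketch only: `llm` and `vlm` are generic callables standing in
# for real model APIs; prompts and control flow are assumptions.

def llm_guide(llm, vlm, image, question: str) -> str:
    """LLMGuide (per the abstract): the LLM generates detailed step-by-step
    instructions, which the VLM then follows to answer the visual question."""
    steps = llm(
        "Write step-by-step instructions for answering this "
        f"visual spatial reasoning question: {question}"
    )
    return vlm(image, f"Follow these steps:\n{steps}\n\nQuestion: {question}")

def llm_verify(llm, vlm, image, question: str) -> str:
    """LLMVerify (assumed from the name): the VLM answers first, then the LLM
    checks the answer for consistency and triggers one retry if the check fails."""
    answer = vlm(image, question)
    verdict = llm(
        f"Question: {question}\nProposed answer: {answer}\n"
        "Is this answer logically consistent with the question? Reply YES or NO."
    )
    if verdict.strip().upper().startswith("NO"):
        answer = vlm(image, f"{question}\n(Your earlier answer was rejected; reconsider.)")
    return answer
```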
| Main Author: | Duong, Ngoc Yen |
|---|---|
| Other Authors: | Luu Anh Tuan |
| Format: | Final Year Project |
| Language: | English |
| Published: | Nanyang Technological University, 2024 |
| Online Access: | https://hdl.handle.net/10356/181503 |
| Institution: | Nanyang Technological University |
Similar Items
- T-SciQ: Teaching multimodal Chain-of-Thought reasoning via large language model signals for science question answering
  by: WANG, Lei, et al.
  Published: (2024)
- LOVA3: Learning to visual question answering, asking and assessment
  by: ZHAO, Henry Hengyuan, et al.
  Published: (2024)
- Multimodal few-shot classification without attribute embedding
  by: Chang, Jun Qing, et al.
  Published: (2024)
- A multimodal approach to automatic personality recognition on Filipino social media data
  by: Secuya, Alfonso C.
  Published: (2021)
- M2Lens: Visualizing and explaining multimodal models for sentiment analysis
  by: WANG, Xingbo, et al.
  Published: (2022)