Large multimodal models for visual reasoning

Bibliographic Details
Main Author: Duong, Ngoc Yen
Other Authors: Luu Anh Tuan
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/181503
Institution: Nanyang Technological University
Description
Summary: This paper introduces a novel framework for enhancing visual spatial reasoning by leveraging the strengths of Large Language Models (LLMs) and Vision-Language Models (VLMs). We propose two complementary methods: LLMGuide and LLMVerify. LLMGuide uses the LLM to generate detailed step-by-step instructions, guiding the VLM to focus on key spatial elements within images. The LLM then combines its reasoning with the VLM’s output to produce a final, well-reasoned answer. LLMVerify, on the other hand, prompts the VLM to consider multiple perspectives on a problem, with the LLM verifying and aggregating responses to ensure consistency and accuracy. Both methods were tested on benchmarks including Visual Spatial Reasoning (VSR), EmbSpatial-Bench, CLEVR, and SEEDBench, achieving up to an 11% improvement in accuracy over traditional approaches. These results demonstrate how LLMGuide and LLMVerify enable more precise and robust spatial reasoning by harnessing the complementary strengths of LLMs and VLMs.
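
As a rough illustration of the two pipelines described in the summary, the minimal sketch below shows how an LLM and a VLM might be orchestrated for LLMGuide (the LLM drafts instructions, the VLM follows them, and the LLM combines the reasoning) and LLMVerify (the VLM answers from several perspectives and the LLM verifies and aggregates). The `llm` and `vlm` callables, the prompt wording, and the aggregation step are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of the LLMGuide / LLMVerify flows described in the summary.
# llm() stands in for any text-only language model call and vlm() for any
# image+text model call; prompts and aggregation here are assumptions.

def llm_guide(question: str, image, llm, vlm) -> str:
    """LLMGuide: the LLM drafts step-by-step instructions, the VLM answers
    while following them, and the LLM reasons over both outputs."""
    steps = llm(
        f"Write step-by-step instructions for locating the spatial elements "
        f"needed to answer: {question}"
    )
    vlm_answer = vlm(
        image,
        f"Follow these steps and answer the question.\n"
        f"Steps: {steps}\nQuestion: {question}",
    )
    return llm(
        f"Question: {question}\nInstructions: {steps}\n"
        f"VLM observation: {vlm_answer}\n"
        f"Combine the reasoning above and give the final answer."
    )

def llm_verify(question: str, image, llm, vlm, n_views: int = 3) -> str:
    """LLMVerify: the VLM answers the question from several rephrased
    perspectives, and the LLM checks the responses for consistency and
    aggregates them into one answer."""
    perspectives = [
        llm(
            f"Rephrase this spatial question from a different perspective "
            f"(variant {i + 1}): {question}"
        )
        for i in range(n_views)
    ]
    candidate_answers = [vlm(image, p) for p in perspectives]
    return llm(
        f"Original question: {question}\n"
        f"Candidate answers: {candidate_answers}\n"
        f"Verify these answers for consistency and return the most reliable one."
    )
```

In practice, `llm` and `vlm` would wrap concrete model APIs; the sketch only mirrors the guide-then-combine and multi-perspective-then-verify structure the summary describes.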