Large multimodal models for visual reasoning

Bibliographic Details
Main Author: Duong, Ngoc Yen
Other Authors: Luu Anh Tuan
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/181503
Institution: Nanyang Technological University
Description
Summary: This paper introduces a novel framework for enhancing visual spatial reasoning by leveraging the strengths of Large Language Models (LLMs) and Vision-Language Models (VLMs). We propose two complementary methods: LLMGuide and LLMVerify. LLMGuide uses the LLM to generate detailed step-by-step instructions, guiding the VLM to focus on key spatial elements within images. The LLM then combines its reasoning with the VLM’s output to produce a final, well-reasoned answer. LLMVerify, on the other hand, prompts the VLM to consider multiple perspectives on a problem, with the LLM verifying and aggregating responses to ensure consistency and accuracy. Both methods were tested on benchmarks including Visual Spatial Reasoning (VSR), EmbSpatial-Bench, CLEVR, and SEEDBench, achieving up to an 11% improvement in accuracy over traditional approaches. These results demonstrate how LLMGuide and LLMVerify enable more precise and robust spatial reasoning by harnessing the complementary strengths of LLMs and VLMs.
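
As a rough illustration of the two pipelines described in the summary, the minimal sketch below shows how an LLM and a VLM might be orchestrated for LLMGuide (the LLM drafts instructions, the VLM follows them, and the LLM combines the reasoning) and LLMVerify (the VLM answers from several perspectives and the LLM verifies and aggregates). The `llm` and `vlm` callables, the prompt wording, and the aggregation step are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of the LLMGuide / LLMVerify flows described in the summary.
# llm() stands in for any text-only language model call and vlm() for any
# image+text model call; prompts and aggregation here are assumptions.

def llm_guide(question: str, image, llm, vlm) -> str:
    """LLMGuide: the LLM drafts step-by-step instructions, the VLM answers
    while following them, and the LLM reasons over both outputs."""
    steps = llm(
        f"Write step-by-step instructions for locating the spatial elements "
        f"needed to answer: {question}"
    )
    vlm_answer = vlm(
        image,
        f"Follow these steps and answer the question.\n"
        f"Steps: {steps}\nQuestion: {question}",
    )
    return llm(
        f"Question: {question}\nInstructions: {steps}\n"
        f"VLM observation: {vlm_answer}\n"
        f"Combine the reasoning above and give the final answer."
    )

def llm_verify(question: str, image, llm, vlm, n_views: int = 3) -> str:
    """LLMVerify: the VLM answers the question from several rephrased
    perspectives, and the LLM checks the responses for consistency and
    aggregates them into one answer."""
    perspectives = [
        llm(
            f"Rephrase this spatial question from a different perspective "
            f"(variant {i + 1}): {question}"
        )
        for i in range(n_views)
    ]
    candidate_answers = [vlm(image, p) for p in perspectives]
    return llm(
        f"Original question: {question}\n"
        f"Candidate answers: {candidate_answers}\n"
        f"Verify these answers for consistency and return the most reliable one."
    )
```

In practice, `llm` and `vlm` would wrap concrete model APIs; the sketch only mirrors the guide-then-combine and multi-perspective-then-verify structure the summary describes.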