Large multimodal models for visual reasoning
Main Author:
Other Authors:
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Subjects:
Online Access: https://hdl.handle.net/10356/181503
Institution: Nanyang Technological University
Summary: This paper introduces a novel framework for enhancing visual spatial reasoning by leveraging the strengths of Large Language Models (LLMs) and Vision-Language Models (VLMs). We propose two complementary methods: LLMGuide and LLMVerify. LLMGuide uses the LLM to generate detailed step-by-step instructions, guiding the VLM to focus on key spatial elements within images. The LLM then combines its reasoning with the VLM's output to produce a final, well-reasoned answer. LLMVerify, on the other hand, prompts the VLM to consider multiple perspectives on a problem, with the LLM verifying and aggregating responses to ensure consistency and accuracy. Both methods were tested on benchmarks including Visual Spatial Reasoning (VSR), EmbSpatial-Bench, CLEVR, and SEEDBench, achieving up to 11% improvement in accuracy over traditional approaches. These results demonstrate how LLMGuide and LLMVerify enable more precise and robust spatial reasoning by harnessing the complementary strengths of LLMs and VLMs.
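The summary describes the two pipelines only in prose. The sketch below is a minimal, hypothetical illustration of how LLMGuide and LLMVerify could be wired together; it is not the paper's implementation. The `llm(prompt)` and `vlm(image, prompt)` callables, the function names, the prompt wording, and the three "perspectives" are all assumptions introduced for illustration.

```python
# Hypothetical sketch of the LLMGuide and LLMVerify pipelines described in the summary.
# Assumed interfaces (not from the paper's code):
#   llm(prompt: str) -> str          -- text-only large language model
#   vlm(image, prompt: str) -> str   -- vision-language model that accepts an image
from typing import Any, Callable


def llm_guide(image: Any, question: str,
              llm: Callable[[str], str],
              vlm: Callable[[Any, str], str]) -> str:
    """LLMGuide: the LLM drafts step-by-step instructions pointing the VLM at
    the key spatial elements, then fuses the VLM's observation with its own
    reasoning into a final answer."""
    instructions = llm(
        "Write step-by-step instructions for inspecting the image to answer "
        f"this spatial question: {question}"
    )
    observation = vlm(
        image,
        f"Follow these instructions and report what you observe:\n{instructions}",
    )
    return llm(
        f"Question: {question}\nVLM observation: {observation}\n"
        "Combine this observation with your own reasoning and give the final answer."
    )


def llm_verify(image: Any, question: str,
               llm: Callable[[str], str],
               vlm: Callable[[Any, str], str]) -> str:
    """LLMVerify: query the VLM from several perspectives, then have the LLM
    check the candidate answers for consistency and aggregate them."""
    # Illustrative perspectives; the actual prompting strategy is not specified here.
    perspectives = [
        "Describe the overall scene layout, then answer.",
        "Focus only on the objects mentioned in the question, then answer.",
        "Answer, then double-check by restating the spatial relation you relied on.",
    ]
    candidates = [vlm(image, f"{p}\nQuestion: {question}") for p in perspectives]
    return llm(
        f"Question: {question}\nCandidate answers:\n- " + "\n- ".join(candidates) +
        "\nVerify these answers for consistency and return the single most reliable answer."
    )
```

In this sketch both methods keep the VLM as the only component that sees the image, while the LLM contributes instruction generation (LLMGuide) or verification and aggregation (LLMVerify), mirroring the division of labor described in the summary.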