Large multimodal models for visual reasoning
This paper introduces a novel framework for enhancing visual spatial reasoning by leveraging the strengths of Large Language Models (LLMs) and Vision-Language Models (VLMs). We propose two complementary methods: LLMGuide and LLMVerify. LLMGuide uses the LLM to generate detailed step-by-step instructions, guiding the VLM to focus on key spatial elements within images. The LLM then combines its reasoning with the VLM's output to produce a final, well-reasoned answer. LLMVerify, on the other hand, prompts the VLM to consider multiple perspectives on a problem, with the LLM verifying and aggregating responses to ensure consistency and accuracy. Both methods were tested on benchmarks including Visual Spatial Reasoning (VSR), EmbSpatial-Bench, CLEVR, and SEEDBench, achieving up to an 11% improvement in accuracy over traditional approaches. These results demonstrate how LLMGuide and LLMVerify enable more precise and robust spatial reasoning by harnessing the complementary strengths of LLMs and VLMs.
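The abstract describes LLMGuide and LLMVerify only at a high level. As a rough illustration of how such an LLM-VLM loop could be wired, here is a minimal Python sketch; the `llm()` and `vlm()` helpers, the prompt wording, and the aggregation step are assumptions made for illustration, not the project's actual implementation.

```python
# Hedged sketch of the two pipelines named in the abstract. llm() and vlm()
# are hypothetical stand-ins for any text-only LLM call and any image+text
# VLM call; the prompts and aggregation logic below are illustrative
# assumptions, not the thesis's actual code.

def llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a text-only LLM and return its reply."""
    raise NotImplementedError

def vlm(image, prompt: str) -> str:
    """Placeholder: send `image` plus `prompt` to a VLM and return its reply."""
    raise NotImplementedError

def llm_guide(image, question: str) -> str:
    """LLMGuide: LLM writes inspection steps, VLM follows them, LLM concludes."""
    # 1. The LLM drafts step-by-step instructions pointing the VLM at the
    #    key spatial elements of the image.
    steps = llm(
        f"Question about an image: {question}\n"
        "List numbered steps a vision model should follow, focusing on the "
        "spatial relations needed to answer."
    )
    # 2. The VLM answers while following those instructions.
    observation = vlm(image, f"{steps}\n\nNow answer: {question}")
    # 3. The LLM combines its own reasoning with the VLM's output into a
    #    final, well-reasoned answer.
    return llm(
        f"Question: {question}\nVision model's findings: {observation}\n"
        "Reason step by step and give the final answer."
    )

def llm_verify(image, question: str, n_views: int = 3) -> str:
    """LLMVerify: VLM answers from several perspectives, LLM verifies them."""
    # 1. Prompt the VLM to consider the problem from multiple perspectives.
    candidates = [
        vlm(image, f"Perspective {i + 1} of {n_views}: {question} "
                   "Explain your view of the scene, then answer.")
        for i in range(n_views)
    ]
    # 2. The LLM checks the candidate answers for consistency and
    #    aggregates them into a single final answer.
    return llm(
        f"Question: {question}\nCandidate answers:\n"
        + "\n".join(f"- {c}" for c in candidates)
        + "\nVerify these for consistency and state the single best answer."
    )
```

Under these assumptions the two sketches are complementary: LLMGuide front-loads the LLM's reasoning as instructions for a single VLM pass, while LLMVerify spends extra VLM calls and uses the LLM as a verifier and aggregator.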
Main Author: | Duong, Ngoc Yen |
---|---|
Other Authors: | Luu Anh Tuan |
School: | College of Computing and Data Science |
Format: | Final Year Project (FYP) |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science; Natural language processing; Multimodal learning |
Degree: | Bachelor's degree |
Project Code: | SCSE23-1074 |
Online Access: | https://hdl.handle.net/10356/181503 |
Institution: | Nanyang Technological University |
Citation: | Duong, N. Y. (2024). Large multimodal models for visual reasoning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181503 |