Large multimodal models for visual reasoning
This paper introduces a novel framework for enhancing visual spatial reasoning by leveraging the strengths of Large Language Models (LLMs) and Vision-Language Models (VLMs). We propose two complementary methods: LLMGuide and LLMVerify. LLMGuide uses the LLM to generate detailed step-by-step instructions, guiding the VLM to focus on key spatial elements within images. The LLM then combines its reasoning with the VLM's output to produce a final, well-reasoned answer. LLMVerify, on the other hand, prompts the VLM to consider multiple perspectives on a problem, with the LLM verifying and aggregating responses to ensure consistency and accuracy. Both methods were tested on benchmarks including Visual Spatial Reasoning (VSR), EmbSpatial-Bench, CLEVR, and SEEDBench, achieving up to an 11% improvement in accuracy over traditional approaches. These results demonstrate how LLMGuide and LLMVerify enable more precise and robust spatial reasoning by harnessing the complementary strengths of LLMs and VLMs.
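The abstract describes LLMGuide and LLMVerify only at a high level. As a rough illustration of how such an LLM-VLM loop could be wired, here is a minimal Python sketch; the `llm()` and `vlm()` helpers, the prompt wording, and the aggregation step are assumptions made for illustration, not the project's actual implementation.

```python
# Hedged sketch of the two pipelines named in the abstract. llm() and vlm()
# are hypothetical stand-ins for any text-only LLM call and any image+text
# VLM call; the prompts and aggregation logic below are illustrative
# assumptions, not the thesis's actual code.

def llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a text-only LLM and return its reply."""
    raise NotImplementedError

def vlm(image, prompt: str) -> str:
    """Placeholder: send `image` plus `prompt` to a VLM and return its reply."""
    raise NotImplementedError

def llm_guide(image, question: str) -> str:
    """LLMGuide: LLM writes inspection steps, VLM follows them, LLM concludes."""
    # 1. The LLM drafts step-by-step instructions pointing the VLM at the
    #    key spatial elements of the image.
    steps = llm(
        f"Question about an image: {question}\n"
        "List numbered steps a vision model should follow, focusing on the "
        "spatial relations needed to answer."
    )
    # 2. The VLM answers while following those instructions.
    observation = vlm(image, f"{steps}\n\nNow answer: {question}")
    # 3. The LLM combines its own reasoning with the VLM's output into a
    #    final, well-reasoned answer.
    return llm(
        f"Question: {question}\nVision model's findings: {observation}\n"
        "Reason step by step and give the final answer."
    )

def llm_verify(image, question: str, n_views: int = 3) -> str:
    """LLMVerify: VLM answers from several perspectives, LLM verifies them."""
    # 1. Prompt the VLM to consider the problem from multiple perspectives.
    candidates = [
        vlm(image, f"Perspective {i + 1} of {n_views}: {question} "
                   "Explain your view of the scene, then answer.")
        for i in range(n_views)
    ]
    # 2. The LLM checks the candidate answers for consistency and
    #    aggregates them into a single final answer.
    return llm(
        f"Question: {question}\nCandidate answers:\n"
        + "\n".join(f"- {c}" for c in candidates)
        + "\nVerify these for consistency and state the single best answer."
    )
```

Under these assumptions the two sketches are complementary: LLMGuide front-loads the LLM's reasoning as instructions for a single VLM pass, while LLMVerify spends extra VLM calls and uses the LLM as a verifier and aggregator.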
Main Author: | Duong, Ngoc Yen |
---|---|
Other Authors: | Luu Anh Tuan |
School: | College of Computing and Data Science |
Format: | Final Year Project (FYP) |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science; Natural language processing; Multimodal learning |
Degree: | Bachelor's degree |
Project Code: | SCSE23-1074 |
Online Access: | https://hdl.handle.net/10356/181503 |
Institution: | Nanyang Technological University |
Citation: | Duong, N. Y. (2024). Large multimodal models for visual reasoning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181503 |