Large multimodal models for visual reasoning

This paper introduces a novel framework for enhancing visual spatial reasoning by leveraging the strengths of Large Language Models (LLMs) and Vision-Language Models (VLMs). We propose two complementary methods: LLMGuide and LLMVerify. LLMGuide uses the LLM to generate detailed step-by-step instructions, guiding the VLM to focus on key spatial elements within images. The LLM then combines its reasoning with the VLM's output to produce a final, well-reasoned answer. LLMVerify, on the other hand, prompts the VLM to consider multiple perspectives on a problem, with the LLM verifying and aggregating responses to ensure consistency and accuracy. Both methods were tested on benchmarks including Visual Spatial Reasoning (VSR), EmbSpatial-Bench, CLEVR, and SEEDBench, achieving up to 11% improvement in accuracy over traditional approaches. These results demonstrate how LLMGuide and LLMVerify enable more precise and robust spatial reasoning by harnessing the complementary strengths of LLMs and VLMs.
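
The abstract describes two orchestration pipelines between an LLM and a VLM. The sketch below is only an illustration of how such pipelines could be wired up, not the project's actual implementation; the wrappers call_llm and call_vlm, the prompt wording, and the default of three perspectives are all assumptions made for this example.

# Minimal Python sketch of the two pipelines described in the abstract.
# call_llm and call_vlm are hypothetical stand-ins for real model APIs.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper: send a text prompt to an LLM, return its reply."""
    raise NotImplementedError("plug in a real LLM client here")

def call_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical wrapper: send an image plus a text prompt to a VLM."""
    raise NotImplementedError("plug in a real VLM client here")

def llm_guide(image_path: str, question: str) -> str:
    """LLMGuide: the LLM writes step-by-step instructions, the VLM follows
    them on the image, and the LLM fuses both into a final answer."""
    instructions = call_llm(
        "Write step-by-step instructions telling a vision model which "
        f"spatial elements of the image to examine for: {question}"
    )
    observations = call_vlm(image_path, instructions)
    return call_llm(
        f"Question: {question}\nVisual observations: {observations}\n"
        "Reason step by step and state the final answer."
    )

def llm_verify(image_path: str, question: str, n_perspectives: int = 3) -> str:
    """LLMVerify: the VLM answers from several perspectives and the LLM
    verifies and aggregates the candidates for consistency."""
    candidates = [
        call_vlm(image_path, f"From perspective {i + 1}: {question}")
        for i in range(n_perspectives)
    ]
    return call_llm(
        f"Question: {question}\nCandidate answers: {candidates}\n"
        "Check the candidates for consistency and return the most reliable answer."
    )

How the thesis actually phrases the guiding prompts and aggregates the perspectives is specified in the full text at the handle below.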

Bibliographic Details
Main Author: Duong, Ngoc Yen
Other Authors: Luu Anh Tuan
School: College of Computing and Data Science
Format: Final Year Project (FYP)
Degree: Bachelor's degree
Project Code: SCSE23-1074
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science; Natural language processing; Multimodal learning
Citation: Duong, N. Y. (2024). Large multimodal models for visual reasoning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181503
Online Access: https://hdl.handle.net/10356/181503
Institution: Nanyang Technological University