VISUAL QUESTION ANSWERING REASONING SYNTHETIC DATA GENERATION USING LARGE VISION LANGUAGE MODEL

Bibliographic Details
Main Author: Amadeus Irawan, Patrick
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/86165
Institution: Institut Teknologi Bandung
id id-itb.:86165
timestamp 2024-09-15T05:27:35Z
keywords Synthetic data generation, VQA reasoning, LVLM, LLaVA, prompt
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description In Visual Question Answering (VQA), building systems that produce rational and reliable outputs requires a substantial amount of data with reasoning aspects. However, the large resources needed to create VQA reasoning data have driven the exploration of more efficient data creation methods. This thesis explores the use of Large Vision Language Models (LVLMs) to generate high-quality synthetic VQA reasoning data more efficiently. Experiments combined three variants of the LLaVA model with three prompting techniques: the first used a single naive instruction, the second ensembled the outputs of several more complex instructions, and the third used naive instructions complemented by annotations of object locations within the images. The synthetic data was evaluated in terms of quality and structural similarity to human-generated data. Data generation with the developed system was up to 19.8 times more time-efficient, with only a 4% decrease in quality compared to human-created data. The findings highlight the potential of LVLMs, combined with appropriate prompting techniques, to produce high-quality VQA reasoning data.
format Final Project
author Amadeus Irawan, Patrick
title VISUAL QUESTION ANSWERING REASONING SYNTHETIC DATA GENERATION USING LARGE VISION LANGUAGE MODEL
url https://digilib.itb.ac.id/gdl/view/86165