VISUAL QUESTION ANSWERING REASONING SYNTHETIC DATA GENERATION USING LARGE VISION LANGUAGE MODEL
In the realm of Visual Question Answering (VQA), a substantial amount of data with reasoning aspects is required to ensure the development of systems capable of generating rational and reliable outputs. However, the large resources needed to create VQA reasoning data have driven the exploration o...
Main Author: | Amadeus Irawan, Patrick |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/86165 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:86165 |
---|---|
spelling |
id-itb.:86165 2024-09-15T05:27:35Z Amadeus Irawan, Patrick. Keywords: Synthetic data generation, VQA reasoning, LVLM, LLaVA, prompt. |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
In Visual Question Answering (VQA), building systems that produce rational and reliable outputs requires a substantial amount of data that includes reasoning. However, creating VQA reasoning data is resource-intensive, which has driven the exploration of more efficient data creation methods. This thesis explores the use of Large Vision Language Models (LVLMs) to generate high-quality synthetic VQA reasoning data more efficiently.
Experiments were conducted by combining three variants of the LLaVA model with three prompting techniques. The first used a single naive instruction, the second ensembled the outputs of several more complex instructions, and the third used naive instructions complemented by object location annotations within the images. The synthetic data was evaluated on quality and on structural similarity to human-generated data.
Data generation using the developed system was up to 19.8 times more time-efficient, with only a 4% decrease in quality compared to human-created data. The findings highlight the potential of LVLMs, paired with appropriate prompting techniques, to produce high-quality VQA reasoning data. |
format |
Final Project |
author |
Amadeus Irawan, Patrick |
title |
VISUAL QUESTION ANSWERING REASONING SYNTHETIC DATA GENERATION USING LARGE VISION LANGUAGE MODEL |
url |
https://digilib.itb.ac.id/gdl/view/86165 |
_version_ |
1822283344694476800 |
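The experimental design summarized in the description (three LLaVA variants combined with three prompting techniques) can be pictured as a small grid of runs. The sketch below is illustrative only: the model names are placeholders because the record does not name the variants, the technique labels are shorthand for the three approaches in the abstract, and it assumes every variant was paired with every technique, which the record does not state explicitly. The helper `time_at_speedup` is likewise hypothetical and only shows the arithmetic behind the reported "up to 19.8 times" figure.

```python
from itertools import product

# Placeholder names: the record does not say which LLaVA variants were used.
MODEL_VARIANTS = ["llava-variant-1", "llava-variant-2", "llava-variant-3"]

# Shorthand labels for the three prompting techniques in the abstract.
PROMPT_TECHNIQUES = [
    "naive_single_instruction",
    "ensemble_of_complex_instructions",
    "naive_with_object_locations",
]


def experiment_grid(models, techniques):
    """Return one run configuration per (model, technique) pair."""
    return [{"model": m, "technique": t} for m, t in product(models, techniques)]


def time_at_speedup(baseline_hours, speedup):
    """Hours needed when a process is `speedup` times faster than the baseline."""
    return baseline_hours / speedup


if __name__ == "__main__":
    grid = experiment_grid(MODEL_VARIANTS, PROMPT_TECHNIQUES)
    print(f"{len(grid)} run configurations")
    for run in grid:
        print(run["model"], "+", run["technique"])

    # Hypothetical baseline of 100 hours of manual annotation, scaled by
    # the best-case speed-up reported in the abstract.
    print(f"{time_at_speedup(100.0, 19.8):.2f} hours at a 19.8x speed-up")
```

Under the full-pairing assumption the grid has nine configurations. The 100-hour baseline is an invented input used only to show the calculation, not a figure from the thesis.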