VISUAL QUESTION ANSWERING REASONING SYNTHETIC DATA GENERATION USING LARGE VISION LANGUAGE MODEL
Main Author:
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/86165
Institution: Institut Teknologi Bandung
Summary: In the realm of Visual Question Answering (VQA), a substantial amount of data with reasoning aspects is required to ensure the development of systems capable of generating rational and reliable outputs. However, the large resources needed to create VQA reasoning data have driven the exploration of more efficient data creation methods. This thesis explores the use of Large Vision Language Models (LVLM) to generate high-quality synthetic VQA reasoning data more efficiently.

Experiments were conducted by combining three variants of the LLaVA model with three different prompting techniques. The first approach used a single naive instruction; the second applied an ensembling technique to the outputs of several more complex instructions; and the third used naive instructions complemented by object-location annotations within the images. The synthetic data was evaluated for quality and for structural similarity to human-generated data.

The data generation process using the developed system was up to 19.8 times more time-efficient, with only a 4% decrease in quality compared to human-created data. These findings highlight the potential of leveraging LVLMs with appropriate prompting techniques to produce high-quality VQA reasoning data.
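The three prompting variants described above can be sketched as follows. This is an illustrative sketch, not code from the thesis: the prompt wording, the majority-vote ensembling rule, and all function and parameter names are assumptions for the sake of the example.

```python
# Hypothetical sketch of the three prompt-construction variants from the
# abstract: (1) a single naive instruction, (2) ensembling over outputs of
# several more complex instructions, (3) the naive instruction plus
# object-location annotations. Names and wording are illustrative only.
from collections import Counter


def naive_prompt(question: str) -> str:
    """Variant 1: one naive instruction asking for an answer with reasoning."""
    return (
        "Answer the question about the image and explain your reasoning "
        f"step by step.\nQuestion: {question}"
    )


def ensemble_answer(candidates: list[str]) -> str:
    """Variant 2: combine the outputs produced by several more complex
    instructions. Majority voting is one possible ensembling rule; the
    thesis does not specify which rule was used."""
    return Counter(candidates).most_common(1)[0][0]


def grounded_prompt(question: str, boxes: dict[str, tuple[int, int, int, int]]) -> str:
    """Variant 3: the naive instruction plus object-location annotations
    (here, hypothetical name -> (x1, y1, x2, y2) bounding boxes)."""
    annotations = "; ".join(f"{name} at {xyxy}" for name, xyxy in boxes.items())
    return naive_prompt(question) + f"\nObject locations: {annotations}"


if __name__ == "__main__":
    q = "What is the man holding?"
    print(naive_prompt(q))
    print(ensemble_answer(["umbrella", "umbrella", "cane"]))
    print(grounded_prompt(q, {"man": (10, 20, 110, 220), "umbrella": (60, 5, 140, 90)}))
```

Each function returns the text that would be sent to the LVLM alongside the image; in the ensembling variant, the model is queried once per complex instruction and the candidate outputs are merged afterwards.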