VISUAL QUESTION ANSWERING REASONING SYNTHETIC DATA GENERATION USING LARGE VISION LANGUAGE MODEL

Bibliographic Details
Main Author: Amadeus Irawan, Patrick
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/86165
Institution: Institut Teknologi Bandung
Description
Summary: In the realm of Visual Question Answering (VQA), a substantial amount of data with reasoning aspects is required to ensure the development of systems capable of generating rational and reliable outputs. However, the large resources needed to create VQA reasoning data have driven the exploration of more efficient data creation methods. This thesis explores the use of Large Vision Language Models (LVLMs) to generate high-quality synthetic VQA reasoning data more efficiently. Experiments were conducted by combining three variants of the LLaVA model with three different prompting techniques. The first approach used a single naive instruction; the second ensembled the outputs of several more complex instructions; and the third used a naive instruction complemented by object-location annotations within the images. The synthetic data was evaluated for quality and for structural similarity to human-generated data. The data generation process using the developed system was up to 19.8 times more time-efficient, with only a 4% decrease in quality compared to human-created data. These findings highlight the potential of leveraging LVLMs with appropriate prompting techniques to produce high-quality VQA reasoning data.
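The three prompting approaches in the abstract can be sketched as prompt-construction functions. This is a minimal illustrative sketch, not the thesis's actual implementation: all prompt wording, function names, and the bounding-box format are assumptions made for illustration.

```python
# Sketch of the three prompting strategies described in the abstract.
# All wording and names here are illustrative assumptions.

def naive_prompt(question: str) -> str:
    """Approach 1: a single naive instruction asking for an answer with reasoning."""
    return (
        f"Look at the image and answer the question: {question}\n"
        "Explain your reasoning step by step before giving the final answer."
    )

def ensemble_prompts(question: str) -> list[str]:
    """Approach 2: several more complex instructions whose outputs are later ensembled."""
    templates = [
        "Describe the relevant objects in the image, then answer: {q}",
        "List the visual evidence needed to answer '{q}', then conclude.",
        "Answer '{q}', justifying each step with what is visible in the image.",
    ]
    return [t.format(q=question) for t in templates]

def grounded_prompt(question: str, boxes: dict[str, tuple[int, int, int, int]]) -> str:
    """Approach 3: naive instruction plus object-location annotations (here, xyxy boxes)."""
    locations = "; ".join(f"{name} at {xyxy}" for name, xyxy in boxes.items())
    return (
        f"Object locations: {locations}.\n"
        f"Using these annotations, answer with reasoning: {question}"
    )

# Example usage with a hypothetical question and annotations.
q = "What is the man holding?"
print(naive_prompt(q))
print(len(ensemble_prompts(q)))  # three instruction variants to ensemble
print(grounded_prompt(q, {"man": (40, 30, 200, 380), "umbrella": (120, 10, 260, 150)}))
```

Each prompt would then be sent to an LLaVA variant together with the image; the ensembling approach additionally requires a step that merges or selects among the multiple outputs, which this sketch omits.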