Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites

Large language models (LLMs) have shown remarkable performance in natural language processing (NLP) tasks. To comprehend and execute diverse human instructions over image data, instruction-tuned large vision-language models (LVLMs) have been introduced. However, LVLMs may suffer from different types of object hallucinations. Nevertheless, LVLMs are evaluated for coarse-grained object hallucinations only (i.e., generated objects non-existent in the input image). The fine-grained object attributes and behaviors non-existent in the image may still be generated but not measured by the current evaluation methods. In this paper, we thus focus on reducing fine-grained hallucinations of LVLMs. We propose ReCaption, a framework that consists of two components: rewriting captions using ChatGPT and fine-tuning the instruction-tuned LVLMs on the rewritten captions. We also propose a fine-grained probing-based evaluation method named Fine-Grained Object Hallucination Evaluation (FGHE). Our experiment results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLM options and improves their text generation quality. The code can be found at https://github.com/Anonymousanoy/FOHE.
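The record gives only this high-level description, so the sketch below is a minimal, hypothetical illustration of the first ReCaption component (rewriting captions with ChatGPT) and of packaging the rewrites into an instruction-tuning file; the prompt wording, the gpt-3.5-turbo model choice, and the JSON schema are assumptions for illustration, not details taken from the paper.

"""
Hypothetical sketch of a ReCaption-style caption-rewriting step.
Assumptions (not from the paper): the rewriting prompt, the chat model,
and the output JSON schema are illustrative placeholders only.
Requires: pip install openai, and OPENAI_API_KEY set in the environment.
"""
import json
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the following image caption so that every object, its attributes, "
    "and its behaviors are stated explicitly and unambiguously. Do not introduce "
    "details that are absent from the original caption.\n\nCaption: {caption}"
)

def rewrite_caption(caption: str, model: str = "gpt-3.5-turbo") -> str:
    """Return a rewritten caption produced by the chat model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(caption=caption)}],
    )
    return response.choices[0].message.content.strip()

def build_finetune_file(pairs, out_path: str = "recaption_train.json") -> None:
    """pairs: iterable of (image_path, original_caption) tuples.
    Writes image/instruction/output records in a generic format that a
    LLaVA- or MiniGPT-4-style fine-tuning script could be adapted to read."""
    records = [
        {
            "image": image_path,
            "instruction": "Describe the image in detail.",
            "output": rewrite_caption(caption),
        }
        for image_path, caption in pairs
    ]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    # Toy usage with a single made-up image/caption pair.
    build_finetune_file([("images/0001.jpg", "A man rides a brown horse on the beach.")])

The second ReCaption component, fine-tuning the instruction-tuned LVLM on the rewritten captions, would then be run with the chosen LVLM's own training script over a file like recaption_train.json.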

Bibliographic Details
Main Authors: WANG, Lei; HE, Jiabang; LI, Shenshen; LIU, Ning; LIM, Ee-peng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Hallucination Mitigation; Large Vision-Language Models; Artificial Intelligence and Robotics; Databases and Information Systems
Online Access:https://ink.library.smu.edu.sg/sis_research/8750
https://ink.library.smu.edu.sg/context/sis_research/article/9753/viewcontent/MitigatingFine_GrainedHallucination_av.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-9753
record_format dspace
spelling sg-smu-ink.sis_research-9753 2024-05-03T07:00:47Z
Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites
WANG, Lei; HE, Jiabang; LI, Shenshen; LIU, Ning; LIM, Ee-peng
Large language models (LLMs) have shown remarkable performance in natural language processing (NLP) tasks. To comprehend and execute diverse human instructions over image data, instruction-tuned large vision-language models (LVLMs) have been introduced. However, LVLMs may suffer from different types of object hallucinations. Nevertheless, LVLMs are evaluated for coarse-grained object hallucinations only (i.e., generated objects non-existent in the input image). The fine-grained object attributes and behaviors non-existent in the image may still be generated but not measured by the current evaluation methods. In this paper, we thus focus on reducing fine-grained hallucinations of LVLMs. We propose ReCaption, a framework that consists of two components: rewriting captions using ChatGPT and fine-tuning the instruction-tuned LVLMs on the rewritten captions. We also propose a fine-grained probing-based evaluation method named Fine-Grained Object Hallucination Evaluation (FGHE). Our experiment results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLM options and improves their text generation quality. The code can be found at https://github.com/Anonymousanoy/FOHE.
2024-02-01T08:00:00Z text application/pdf
https://ink.library.smu.edu.sg/sis_research/8750
info:doi/10.1007/978-3-031-53302-0_3
https://ink.library.smu.edu.sg/context/sis_research/article/9753/viewcontent/MitigatingFine_GrainedHallucination_av.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Hallucination Mitigation; Large Vision-Language Models; Artificial Intelligence and Robotics; Databases and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Hallucination Mitigation
Large Vision-Language Models
Artificial Intelligence and Robotics
Databases and Information Systems
spellingShingle Hallucination Mitigation
Large Vision-Language Models
Artificial Intelligence and Robotics
Databases and Information Systems
WANG, Lei
HE, Jiabang
LI, Shenshen
LIU, Ning
LIM, Ee-peng
Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites
description Large language models (LLMs) have shown remarkable performance in natural language processing (NLP) tasks. To comprehend and execute diverse human instructions over image data, instruction-tuned large vision-language models (LVLMs) have been introduced. However, LVLMs may suffer from different types of object hallucinations. Nevertheless, LVLMs are evaluated for coarse-grained object hallucinations only (i.e., generated objects non-existent in the input image). The fine-grained object attributes and behaviors non-existent in the image may still be generated but not measured by the current evaluation methods. In this paper, we thus focus on reducing fine-grained hallucinations of LVLMs. We propose ReCaption, a framework that consists of two components: rewriting captions using ChatGPT and fine-tuning the instruction-tuned LVLMs on the rewritten captions. We also propose a fine-grained probing-based evaluation method named Fine-Grained Object Hallucination Evaluation (FGHE). Our experiment results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLM options and improves their text generation quality. The code can be found at https://github.com/Anonymousanoy/FOHE.
format text
author WANG, Lei
HE, Jiabang
LI, Shenshen
LIU, Ning
LIM, Ee-peng
author_facet WANG, Lei
HE, Jiabang
LI, Shenshen
LIU, Ning
LIM, Ee-peng
author_sort WANG, Lei
title Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites
title_short Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites
title_full Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites
title_fullStr Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites
title_full_unstemmed Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites
title_sort mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/8750
https://ink.library.smu.edu.sg/context/sis_research/article/9753/viewcontent/MitigatingFine_GrainedHallucination_av.pdf
_version_ 1814047501159759872