Evaluating vision-language models long-chain reasoning ability with multiple ground truths

With the recent advancements in vision-language models, many researchers have started to evaluate their various zero-shot capabilities to answer questions given a video input. However, there has been no standardised, “best practice” method to evaluate the quality of a model’s open-ended answer given a question and multiple ground truths. We reviewed some current methods, which include using n-gram-based metrics and using an LLM (Large Language Model) as a judge. While n-gram-based metrics scored some models’ answers on par with a human’s answer, these scores do not correlate highly with human preference when used to rank the models from best to worst; the highest-scoring models were found to have only a 0.21 Spearman correlation with human preference. We also designed prompts to get an LLM to judge which model’s answer is better given multiple reference answers, through (1) head-to-head comparison, which was found to have some consistency with human preference, and (2) ranking all possible answers, which was found to have higher correlation than n-gram-based metrics. We offer the perspective that while additional ground truths are useful for traditional (n-gram-based) metrics, given a sophisticated LLM, one ground truth might be sufficient to judge the quality of a model’s answer, especially as the capabilities of such language models continue to advance rapidly.
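
The comparison in the abstract hinges on scoring each model's answer against several ground truths with an n-gram metric and then correlating the resulting model ranking with human preference. The following is a minimal Python sketch of that procedure, assuming NLTK's BLEU and SciPy's Spearman correlation; the model names, answers, and ratings below are hypothetical illustrations, not the project's data.

# Minimal sketch of the evaluation loop described above. All model names,
# answers, and human ratings are hypothetical illustrations.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Multiple ground-truth answers for one hypothetical video question.
references = [
    "the man slips because the floor is wet".split(),
    "he falls since someone spilled water on the floor".split(),
]

# Hypothetical open-ended answers from four models.
model_answers = {
    "model_a": "the man falls because the wet floor is slippery",
    "model_b": "a person is walking around a room",
    "model_c": "he slips on water that was spilled on the floor",
    "model_d": "the video shows a man indoors",
}

# Score each answer against all references with smoothed BLEU.
smooth = SmoothingFunction().method1
bleu_scores = {
    name: sentence_bleu(references, answer.split(), smoothing_function=smooth)
    for name, answer in model_answers.items()
}

# Hypothetical mean human preference ratings for the same four models.
human_ratings = {"model_a": 4.2, "model_b": 1.8, "model_c": 4.5, "model_d": 2.1}

# Spearman correlation between the metric's ranking and the human ranking;
# the report finds this correlation is low (around 0.21) for n-gram metrics.
order = sorted(model_answers)
rho, _ = spearmanr([bleu_scores[m] for m in order],
                   [human_ratings[m] for m in order])
print(f"Spearman correlation with human preference: {rho:.2f}")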

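The abstract also describes prompting an LLM to judge model answers head-to-head against multiple reference answers. The project's actual prompts are not reproduced in this record; the sketch below is a hypothetical illustration of how such a head-to-head judging prompt could be assembled.

# Hypothetical head-to-head judging prompt in the spirit of the approach
# described above; not the wording used in the project itself.
JUDGE_TEMPLATE = """You are given a question about a video, {k} human-written
reference answers, and two candidate answers, A and B.

Question: {question}
Reference answers:
{references}
Answer A: {answer_a}
Answer B: {answer_b}

Decide which candidate answer better matches the reference answers.
Reply with exactly one of: "A", "B", or "Tie"."""

def build_judge_prompt(question, references, answer_a, answer_b):
    """Fill the template with one question, its references, and two answers."""
    refs = "\n".join(f"- {r}" for r in references)
    return JUDGE_TEMPLATE.format(k=len(references), question=question,
                                 references=refs, answer_a=answer_a,
                                 answer_b=answer_b)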

Bibliographic Details
Main Author: Setiadharma, Christopher Arif
Other Authors: Liu Ziwei (ziwei.liu@ntu.edu.sg)
School: School of Computer Science and Engineering
Format: Final Year Project (FYP)
Degree: Bachelor's degree
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science
Project Code: SCSE23-0243
Citation: Setiadharma, C. A. (2024). Evaluating vision-language models long-chain reasoning ability with multiple ground truths. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175186
Online Access: https://hdl.handle.net/10356/175186