Benchmarking foundation models with language-model-as-an-examiner

Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model’s ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets,...

Bibliographic Details
Main Authors:	BAI, Yushi; YING, Jiahao; CAO, Yixin; LV, Xin; HE, Yuze; WANG, Xiaozhi; YU, Jifan; ZENG, Kaisheng; XIAO, Yijia; LYU, Haozhe; ZHANG, Jiayin; LI, Juanzi; HOU, Lei
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2023
Subjects:	Databases and Information Systems; Programming Languages and Compilers
Online Access:	https://ink.library.smu.edu.sg/sis_research/8392 https://ink.library.smu.edu.sg/context/sis_research/article/9395/viewcontent/2306.04181.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-9395
record_format	dspace
spelling	sg-smu-ink.sis_research-9395 2024-01-09T03:53:59Z 2023-12-01T08:00:00Z text application/pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Databases and Information Systems; Programming Languages and Compilers
description	Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model’s ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for a broad acquisition, and raise follow-up questions to engage in a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result as it aligns closely with human annotations. (3) We additionally propose a decentralized Peer-examination method to address the biases in a single examiner. Our data and benchmarking results are available at: http://lmexam.xlore.cn.
format	text
author	BAI, Yushi; YING, Jiahao; CAO, Yixin; LV, Xin; HE, Yuze; WANG, Xiaozhi; YU, Jifan; ZENG, Kaisheng; XIAO, Yijia; LYU, Haozhe; ZHANG, Jiayin; LI, Juanzi; HOU, Lei
author_sort	BAI, Yushi
title	Benchmarking foundation models with language-model-as-an-examiner
publisher	Institutional Knowledge at Singapore Management University
publishDate	2023
url	https://ink.library.smu.edu.sg/sis_research/8392 https://ink.library.smu.edu.sg/context/sis_research/article/9395/viewcontent/2306.04181.pdf
_version_	1787590767807561728
