Answering patterns in SBA items: students, GPT3.5, and Gemini

Bibliographic Details
Main Authors: Ng, Olivia, Phua, Dong Haur, Chu, Jowe, Wilding, Lucy V. E., Mogali, Sreenivasulu Reddy, Cleland, Jennifer
Other Authors: Lee Kong Chian School of Medicine (LKCMedicine)
Format: Article
Language: English
Published: 2025
Subjects:
Online Access: https://hdl.handle.net/10356/181959
Institution: Nanyang Technological University
Description
Summary: While large language models (LLMs) are often used to generate and answer exam questions, limited work has compared their performance across multiple iterations using item statistics. This study aims to fill that gap by investigating the answering patterns of LLMs on single-best answer (SBA) questions and comparing their performance to that of students. Forty-one SBA questions for first-year medical students were administered to the most easily accessible, free-to-use GPT3.5 and Gemini across 100 iterations. Both LLMs exhibited more repetitive and clustered answering patterns than students, which can be problematic because repeatedly selecting the same incorrect option compounds errors. Distractor analysis revealed that students performed better at managing multiple options in the SBA format. We found that these free-to-use LLMs are inferior to well-trained students or specialists in handling technical questions. We also highlight concerns about LLMs' contextual interpretation of these items and the need for human oversight in the medical education assessment process.