A comprehensive evaluation of large language models on legal judgment prediction

Large language models (LLMs) have demonstrated great potential for domain-specific applications, such as the legal domain. However, recent disputes over GPT-4’s law evaluation raise questions concerning their performance in real-world legal tasks. To systematically investigate their competency in law, we design practical baseline solutions based on LLMs and test them on the task of legal judgment prediction. In our solutions, LLMs can work alone to answer open questions, or coordinate with an information retrieval (IR) system to learn from similar cases or to solve simplified multi-choice questions. We show that similar cases and multi-choice options, namely label candidates, included in prompts can help LLMs recall domain knowledge that is critical for expert legal reasoning. We additionally present an intriguing paradox wherein an IR system alone surpasses the performance of LLM+IR, owing to the limited gains weaker LLMs acquire from powerful IR systems; in such cases, the role of the LLM becomes redundant. Our evaluation pipeline can be easily extended to other tasks to facilitate evaluations in other domains. Code is available at https://github.com/srhthu/LM-CompEval-Legal


Bibliographic Details
Main Authors: SHUI, Ruihao, CAO, Yixin, WANG, Xiang, CHUA, Tat-Seng
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects: Databases and Information Systems; Programming Languages and Compilers
Online Access: https://ink.library.smu.edu.sg/sis_research/8396
https://ink.library.smu.edu.sg/context/sis_research/article/9399/viewcontent/2310.11761.pdf
DOI: 10.18653/v1/2023.findings-emnlp.490
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Institution: Singapore Management University