大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties

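Although the Mandarin spoken across Chinese-speaking regions originated from the same Early Modern Chinese, differing sociolinguistic environments have gradually shaped it into regionally distinctive varieties. Over the past three to four decades, research on Mandarin varieties has achieved a great deal, but the ability of large language models (LLMs) to process different Mandarin varieties remains to be verified. This study therefore takes the lexicons of four Mandarin varieties (Mainland China, Singapore, Hong Kong, and Taiwan Mandarin) as its object and assesses the ability of current mainstream LLMs to process them. The study has two components: (1) the ability of LLMs to identify Mandarin varieties, and (2) the ability of LLMs to understand and generate vocabulary from different Mandarin varieties. Analysis and discussion of the collected data reveal that: (1) LLMs tend to misclassify one Mandarin variety as another, particularly Singapore Mandarin as Mainland China Mandarin; (2) ...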

Bibliographic Details
Main Author:	方乔 Fang, Qiao
Other Authors:	Lin Jingxia
Format:	Final Year Project
Language:	Chinese
Published:	Nanyang Technological University 2025
Subjects:	Arts and Humanities; Varieties of Mandarin (汉语变体); Lexical difference (词汇差异); Large language model (大语言模型)
Online Access:	https://hdl.handle.net/10356/183019
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-183019
record_format	dspace
spelling	sg-ntu-dr.10356-183019 2025-03-17T05:06:28Z 大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties 方乔 Fang, Qiao Lin Jingxia School of Humanities JingxiaLin@ntu.edu.sg Arts and Humanities; Varieties of Mandarin (汉语变体); Lexical difference (词汇差异); Large language model (大语言模型) Bachelor's degree 2025-03-17T05:06:28Z 2025-03-17T05:06:28Z 2025 Final Year Project (FYP) 方乔 Fang, Q. (2025). 大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/183019 zh SoH24028 application/pdf Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	Chinese
topic	Arts and Humanities; Varieties of Mandarin (汉语变体); Lexical difference (词汇差异); Large language model (大语言模型)
description	Although the Mandarin spoken across Chinese-speaking regions originated from the same Early Modern Chinese, differing sociolinguistic environments have gradually shaped it into regionally distinctive varieties. Over the past three to four decades, research on Mandarin varieties has achieved a great deal, but the ability of large language models (LLMs) to process different Mandarin varieties remains to be verified. This study therefore takes the lexicons of four Mandarin varieties (Mainland China, Singapore, Hong Kong, and Taiwan Mandarin) as its object and assesses the ability of current mainstream LLMs to process them. The study has two components: (1) the ability of LLMs to identify Mandarin varieties, and (2) the ability of LLMs to understand and generate vocabulary from different Mandarin varieties. Analysis and discussion of the collected data reveal that (1) LLMs tend to misclassify one Mandarin variety as another, particularly Singapore Mandarin as Mainland China Mandarin; and (2) when interpreting the meaning of a word in Singapore Mandarin, LLMs tend to generate senses that go beyond those documented for that word in the existing literature. Finally, through LLM-based analysis of natural language, the study aims to provide empirical support for building language models across Mandarin varieties and for optimizing cross-variety natural language processing, and to serve as a reference for future research on the vocabulary of Mandarin varieties.
author2	Lin Jingxia
author_facet	Lin Jingxia 方乔 Fang, Qiao
format	Final Year Project
author	方乔 Fang, Qiao
author_sort	方乔 Fang, Qiao
title	大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties
publisher	Nanyang Technological University
publishDate	2025
url	https://hdl.handle.net/10356/183019
_version_	1827070718130520064

Similar Items