大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties

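Although the Mandarin spoken across Chinese-speaking regions derives from the same early Modern Chinese, differing sociolinguistic environments have gradually produced regionally distinctive varieties. Research on Mandarin varieties has made considerable progress over the past three to four decades, but the ability of large language models (LLMs) to handle different varieties has yet to be tested. This study therefore takes the vocabulary of four Mandarin varieties (Mainland China Putonghua, Singapore Mandarin, Hong Kong Mandarin, and Taiwan Mandarin) as its object and examines how well current mainstream LLMs handle them. The study has two parts: (1) the ability of LLMs to identify Mandarin varieties, and (2) the ability of LLMs to understand and generate vocabulary from different varieties. Analysis and discussion of the collected data show that: (1) LLMs tend to identify one variety as another, in particular identifying Singapore Mandarin as Mainland China Putonghua; (2)...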

Bibliographic Details
Main Author: 方乔 Fang, Qiao
Other Authors: Lin Jingxia
Format: Final Year Project
Language: Chinese
Published: Nanyang Technological University 2025
Subjects: Arts and Humanities; Varieties of Mandarin; Lexical difference; Large language model
Online Access: https://hdl.handle.net/10356/183019
Institution: Nanyang Technological University
id sg-ntu-dr.10356-183019
record_format dspace
spelling sg-ntu-dr.10356-183019 2025-03-17T05:06:28Z
School of Humanities
JingxiaLin@ntu.edu.sg
Bachelor's degree
Final Year Project (FYP)
方乔 Fang, Q. (2025). 大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/183019
zh
SoH24028
application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language Chinese
topic Arts and Humanities
Varieties of Mandarin
Lexical difference
Large language model
description Although the Mandarin spoken across Chinese-speaking regions derives from the same early Modern Chinese, differing sociolinguistic environments have gradually produced regionally distinctive varieties. Research on Mandarin varieties has made considerable progress over the past three to four decades, but the ability of large language models (LLMs) to handle different varieties has yet to be tested. This study therefore takes the vocabulary of four Mandarin varieties (Mainland China Putonghua, Singapore Mandarin, Hong Kong Mandarin, and Taiwan Mandarin) as its object and examines how well current mainstream LLMs handle them. The study has two parts: (1) the ability of LLMs to identify Mandarin varieties, and (2) the ability of LLMs to understand and generate vocabulary from different varieties. Analysis and discussion of the collected data show that: (1) LLMs tend to identify one variety as another, in particular identifying Singapore Mandarin as Mainland China Putonghua; (2) when interpreting what a word means in Singapore Mandarin, LLMs often generate senses that go beyond those recorded for that word in the existing literature on Singapore Mandarin. Finally, by using LLMs to analyse natural language, the study aims to provide empirical support for building language models across Mandarin varieties and for optimizing cross-variety natural language processing, and to offer a reference for future research on the vocabulary of Mandarin varieties.
author2 Lin Jingxia
format Final Year Project
author 方乔 Fang, Qiao
title 大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties
publisher Nanyang Technological University
publishDate 2025
url https://hdl.handle.net/10356/183019
_version_ 1827070718130520064