大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties
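Although the Mandarin spoken across Chinese-speaking regions derives from the same Early Modern Chinese, different sociolinguistic environments have gradually shaped it into regionally distinctive varieties. Over the past three to four decades, research on Mandarin varieties has produced many results, but the ability of large language models (LLMs) to handle different Mandarin varieties remains to be examined. This study therefore takes the vocabulary of four Mandarin varieties (Mainland China Putonghua, Singapore Mandarin, Hong Kong Mandarin, and Taiwan Mandarin) as its object and examines how well current mainstream LLMs handle Mandarin varieties. The study has two parts: (1) the ability of LLMs to identify Mandarin varieties, and (2) the ability of LLMs to understand and generate vocabulary from different Mandarin varieties. After further analysis and discussion of the collected data, the study finds that: (1) LLMs tend to identify one Mandarin variety as another, in particular Singapore Mandarin as Mainland China Putonghua; (2)...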
Main Author: | 方乔 Fang, Qiao |
---|---|
Other Authors: | Lin Jingxia |
Format: | Final Year Project |
Language: | Chinese |
Published: | Nanyang Technological University, 2025 |
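Subjects: | Arts and Humanities; 汉语变体; 词汇差异; 大语言模型; Varieties of Mandarin; Lexical difference; Large language model |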
Online Access: | https://hdl.handle.net/10356/183019 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-183019 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-183019 2025-03-17T05:06:28Z 大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties 方乔 Fang, Qiao Lin Jingxia School of Humanities JingxiaLin@ntu.edu.sg Arts and Humanities 汉语变体 词汇差异 大语言模型 Varieties of Mandarin Lexical difference Large language model Bachelor's degree 2025-03-17T05:06:28Z 2025-03-17T05:06:28Z 2025 Final Year Project (FYP) 方乔 Fang, Q. (2025). 大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/183019 zh SoH24028 application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
Chinese |
topic |
Arts and Humanities 汉语变体 词汇差异 大语言模型 Varieties of Mandarin Lexical difference Large language model |
description |
尽管所有华语区的汉语源于同一种早期现代汉语,但受不同社会语言环境的影响,逐渐演变出具有地区特色的变体。近三四十年来,汉语变体研究已取得诸多成果,但大语言模型处理不同汉语变体的能力仍有待考证。因此,本研究以四个汉语变体(中国大陆普通话、新加坡华语、香港国语以及台湾国语)的词汇为对象,考察了当前主流大语言模型处理汉语变体的能力。
本研究分为两大部分:(1)大语言模型对汉语变体的识别能力、以及(2)大语言模型对不同汉语变体词汇的理解和生成能力。通过对搜集的数据进行进一步的分析与讨论后,本研究发现:(1)大语言模型有将某汉语变体识别为另一汉语变体的倾向,尤其是将新加坡华语识别为中国大陆普通话;(2)大语言模型在理解词汇在新加坡华语中的含义时,往往会生成出超出现有文献对于该词汇在新加坡华语中的释义。
最后,本研究希望通过大语言模型对自然语言的分析,为跨汉语变体的语言模型建构和跨变体自然语言处理优化提供实证支持,并为未来汉语变体词汇的研究提供参考。
Although the Mandarin Chinese spoken in the various Chinese-speaking regions originated from the same Early Modern Chinese, it has been influenced by different sociolinguistic environments and has gradually evolved into regionally distinctive varieties. Over the past three to four decades, much progress has been made in the study of Chinese varieties, but the ability of large language models (LLMs) to process different Chinese varieties remains to be tested. This study therefore takes the lexicons of four Chinese varieties (Mainland, Singaporean, Hong Kong, and Taiwanese Mandarin) as case-study material to assess the ability of mainstream LLMs to process varieties of Chinese.
This study consists of two components: (1) the ability of LLMs to identify different Chinese varieties, and (2) the ability of LLMs to understand and generate the vocabulary of different Chinese varieties. Through further analysis and discussion of the collected data, this study finds that: (1) the LLMs tend to misclassify one Chinese variety as another, in particular Singaporean Mandarin as Mainland Mandarin; (2) when interpreting the meaning of a word in Singaporean Mandarin, the LLMs tend to generate senses that go beyond those documented for that word in the existing literature.
Finally, this study aims to provide empirical support for building language models across Chinese varieties and for optimizing cross-variety natural language processing, and to offer a reference for future studies of the lexicons of Chinese varieties. |
author2 |
Lin Jingxia |
author_facet |
Lin Jingxia 方乔 Fang, Qiao |
format |
Final Year Project |
author |
方乔 Fang, Qiao |
title |
大语言模型对汉语变体的自动识别 = An evaluation of large language models for identifying Mandarin varieties |
publisher |
Nanyang Technological University |
publishDate |
2025 |
url |
https://hdl.handle.net/10356/183019 |
_version_ |
1827070718130520064 |