Chinese idiom understanding with transformer-based pretrained language models

In this dissertation, I study the understanding of Chinese idioms using transformer-based pretrained language models. By "understanding", I confine the topics to word embedding learning, contextualized word representation learning, multiple-choice cloze-test reading comprehension, and conditional text generation.


Bibliographic Details
Main Author: TAN, Minghuan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects:
Online Access:https://ink.library.smu.edu.sg/etd_coll/410
https://ink.library.smu.edu.sg/context/etd_coll/article/1408/viewcontent/GPIS_AY2017_PhD_Minghuan_TAN.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.etd_coll-1408
record_format dspace
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic natural language processing
multiword expressions
Chinese idioms
Databases and Information Systems
East Asian Languages and Societies
spellingShingle natural language processing
multiword expressions
Chinese idioms
Databases and Information Systems
East Asian Languages and Societies
TAN, Minghuan
Chinese idiom understanding with transformer-based pretrained language models
description In this dissertation, I study the understanding of Chinese idioms using transformer-based pretrained language models. By "understanding", I confine the topics to word embedding learning, contextualized word representation learning, multiple-choice cloze-test reading comprehension, and conditional text generation. Chinese idioms are fixed phrases with special meanings, usually derived from an ancient story. The meanings of these idioms are often not directly related to their component characters, which makes them harder to model than standard phrases, whose meanings are compositional. We initiate the work by studying idiom representations derived from pretrained language models, in particular BERT. We adopt probing-based methods to investigate to what extent BERT can encode an idiom's meaning. We design two probing tasks to test whether idiom encodings from pretrained language models can be used to (1) classify the usage of a potential idiomatic expression as either idiomatic or literal and (2) identify idiom paraphrases. We then propose a BERT-based method to better learn Chinese idiom embeddings and evaluate them on our newly constructed dataset of Chinese idiom synonyms and antonyms. We further study Chinese idiom prediction based on context. We first propose a BERT-based dual embedding model for the Chinese idiom prediction task: given a context with a missing Chinese idiom and a set of candidate idioms, the model needs to find the correct idiom to fill in the blank. Our method is based on the observation that part of an idiom's meaning comes from a long-range context that contains topical information, while part comes from a local context that encodes more of its syntactic usage.
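The dual-embedding matching described above can be sketched roughly as follows. This is a minimal illustration, not the dissertation's implementation: the toy vectors, dimensions, and function names (`score_candidates`, `mean_pool`) are all assumptions, and mean pooling stands in for whatever context-pooling mechanism the model actually uses.

```python
# Sketch: each candidate idiom carries two embeddings, one matched against
# the hidden state at the blank (local, syntactic) and one matched against
# a pooled context representation (long-range, topical).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_pool(vectors):
    """Average the hidden states of all context tokens."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def score_candidates(h_blank, h_context, candidates):
    """Score = blank-side match + context-pooling match, per candidate."""
    h_pooled = mean_pool(h_context)
    return [
        dot(c["blank_emb"], h_blank) + dot(c["context_emb"], h_pooled)
        for c in candidates
    ]

# Toy inputs (illustrative numbers only).
candidates = [
    {"idiom": "画蛇添足", "blank_emb": [0.9, 0.1], "context_emb": [0.8, 0.2]},
    {"idiom": "守株待兔", "blank_emb": [0.1, 0.9], "context_emb": [0.2, 0.8]},
]
h_blank = [1.0, 0.0]                  # hidden state at the blank position
h_context = [[0.6, 0.4], [1.0, 0.0]]  # hidden states of the context tokens

scores = score_candidates(h_blank, h_context, candidates)
best = candidates[max(range(len(scores)), key=scores.__getitem__)]["idiom"]
```

Using two separate embeddings per idiom lets the blank-side and context-side matches specialize, which is the intuition behind the dual-embedding design.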
We use BERT to process the contextual words and match the embedding of each candidate idiom against both the hidden representation corresponding to the blank in the context and, through context pooling, the hidden representations of all the tokens in the context. We also propose to use two separate idiom embeddings for the two kinds of matching. Experiments on ChID, a recently released Chinese idiom cloze-test dataset, show that our proposed method outperforms the existing state of the art. Ablation experiments also show that both context pooling and dual embedding contribute to the performance improvement. Observing some limitations of existing work, we further propose a two-stage model: in the first stage, we retrain a Chinese BERT model by masking out idioms from a large Chinese corpus with wide idiom coverage; in the second stage, we fine-tune the retrained, idiom-oriented BERT on a specific idiom recommendation dataset. We evaluate this method on the ChID dataset and find that it achieves state-of-the-art performance. Ablation studies show that both stages of training are critical for the performance gain. We also propose a new task called Chengyu-oriented text polishing, based on the hypothesis that using Chengyu properly usually enhances the elegance and conciseness of Chinese text. We formulate the task as a context-dependent text generation problem and construct a dataset with 1.5 million automatically generated instances for training and 4K human-annotated examples for evaluation. The study offers solid baselines built with the latest pretrained encoder-decoder transformer models. We conclude by summarizing the contributions of this thesis and pointing out potential future directions related to Chinese idiom understanding, namely sentiment analysis with idioms and explaining Chinese Chengyu recommendation models.
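The first stage of the two-stage model, masking out idioms from a large corpus, might prepare its training data along these lines. This is a hedged sketch under stated assumptions: the tiny idiom lexicon, the `[MASK]` token convention (BERT's standard mask token, one per character), and the function name `mask_idioms` are illustrative, not taken from the dissertation.

```python
# Sketch: replace every occurrence of a known idiom with per-character
# [MASK] tokens, so the retrained model must recover the whole idiom
# from its surrounding context.

KNOWN_IDIOMS = {"画蛇添足", "守株待兔"}  # stand-in for a wide-coverage idiom lexicon
MASK = "[MASK]"

def mask_idioms(sentence, idioms=KNOWN_IDIOMS):
    """Return the sentence with each known idiom fully masked out."""
    for idiom in idioms:
        # Chinese idioms are typically four characters, so this emits
        # one mask token per character of the matched idiom.
        sentence = sentence.replace(idiom, MASK * len(idiom))
    return sentence

masked = mask_idioms("他这样做简直是画蛇添足。")
```

A corpus preprocessed this way gives the masked-language-model objective a consistent supply of idiom-shaped blanks, which is what makes the second-stage fine-tuning on the recommendation dataset effective.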
format text
author TAN, Minghuan
author_facet TAN, Minghuan
author_sort TAN, Minghuan
title Chinese idiom understanding with transformer-based pretrained language models
title_short Chinese idiom understanding with transformer-based pretrained language models
title_full Chinese idiom understanding with transformer-based pretrained language models
title_fullStr Chinese idiom understanding with transformer-based pretrained language models
title_full_unstemmed Chinese idiom understanding with transformer-based pretrained language models
title_sort chinese idiom understanding with transformer-based pretrained language models
publisher Institutional Knowledge at Singapore Management University
publishDate 2022
url https://ink.library.smu.edu.sg/etd_coll/410
https://ink.library.smu.edu.sg/context/etd_coll/article/1408/viewcontent/GPIS_AY2017_PhD_Minghuan_TAN.pdf
_version_ 1770567691681136640
spelling sg-smu-ink.etd_coll-1408 2022-07-20T09:04:02Z Chinese idiom understanding with transformer-based pretrained language models TAN, Minghuan
2022-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/etd_coll/410 https://ink.library.smu.edu.sg/context/etd_coll/article/1408/viewcontent/GPIS_AY2017_PhD_Minghuan_TAN.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Dissertations and Theses Collection (Open Access) eng Institutional Knowledge at Singapore Management University natural language processing multiword expressions Chinese idioms Databases and Information Systems East Asian Languages and Societies