Exploiting long context using joint distance and occurrence information for language modeling


Bibliographic Details
Main Author: Chong, Tze Yuang
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access: http://hdl.handle.net/10356/75876
Institution: Nanyang Technological University
id sg-ntu-dr.10356-75876
record_format dspace
spelling sg-ntu-dr.10356-758762023-03-04T00:47:22Z Exploiting long context using joint distance and occurrence information for language modeling Chong, Tze Yuang Chng Eng Siong School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering Doctor of Philosophy (SCE) 2018-07-04T12:01:48Z 2018-07-04T12:01:48Z 2018 Thesis Chong, T. Y. (2018). Exploiting long context using joint distance and occurrence information for language modeling. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/75876 10.32657/10356/75876 en 121 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
description This thesis investigates an approach to exploiting long context based on distance and occurrence information. By modeling the joint event of distance and occurrence, the approach incorporates their inter-dependencies into the model, so that information captured from the long context can be used more effectively. This addresses a shortcoming of conventional language modeling approaches, which tend to neglect these inter-dependencies. Based on the proposed approach, a novel language model, referred to as the term-distance term-occurrence (TDTO) model, is formulated. The TDTO model estimates probabilities from term-distance (TD) and term-occurrence (TO) events, which correspond to the distances and occurrences of words in the context. By expressing the TDTO model within a log-linear interpolation framework, the contributions of the TD and TO components to the final estimate can be tuned. Specifically, since TD events, i.e. positions, within a long context are likely to be rare or unseen, the weight of the TD component can be tuned down to alleviate the data scarcity problem. Through a series of experiments, the TDTO model has been shown to exploit the long context to reduce language model perplexity. On the BLLIP Wall Street Journal (WSJ) and Switchboard-1 (SWB) corpora, perplexity reductions of up to 11.2% and 6.5% were obtained with context lengths of seven and eight, respectively. In addition, the TDTO model outperformed other conventional models used to exploit the long context, such as the distant-bigram, trigger and bag-of-words (BOW) models, consistently showing lower perplexities. The applicability of the TDTO model has been examined on several tasks: speech recognition, text classification and word prediction. The TDTO model improved the baseline performance on all the considered tasks.
Furthermore, this thesis proposes a neural network implementation of the TDTO model, aimed at providing a better smoothing mechanism for TDTO modeling. The resulting model, referred to as the neural-network-based TDTO (NN-TDTO) model, has been empirically shown to outperform the baseline TDTO model in both perplexity and speech recognition accuracy. On the WSJ corpus, the NN-TDTO model yielded up to 9.2% lower perplexity than the TDTO model. On the Aurora-4 speech recognition task, the NN-TDTO model obtained up to 12.9% relative reduction in word error rate.
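The log-linear interpolation framework the abstract refers to can be sketched as follows: each component model contributes a probability raised to a tunable weight, and the product is renormalized. The TD/TO probability tables, vocabulary and weights below are illustrative placeholders only, not values or code from the thesis.

```python
import math

def log_linear_interpolate(component_probs, weights):
    """Combine per-word probabilities from several component models
    via log-linear interpolation: P(w) is proportional to
    prod_i P_i(w) ** lambda_i, renormalized over the vocabulary.
    component_probs: list of dicts mapping word -> probability.
    weights: one interpolation weight (lambda_i) per component.
    """
    vocab = set().union(*component_probs)
    scores = {
        w: math.exp(sum(lam * math.log(p.get(w, 1e-10))
                        for p, lam in zip(component_probs, weights)))
        for w in vocab
    }
    z = sum(scores.values())  # normalization constant
    return {w: s / z for w, s in scores.items()}

# Hypothetical term-distance (TD) and term-occurrence (TO) component
# distributions over a toy three-word vocabulary; numbers are made up.
td = {"stock": 0.5, "market": 0.3, "fell": 0.2}
to = {"stock": 0.2, "market": 0.5, "fell": 0.3}

# Down-weighting the TD component (lambda_TD < lambda_TO) mirrors the
# thesis's remedy for sparse position events in long contexts.
combined = log_linear_interpolate([td, to], weights=[0.3, 0.7])
```

With the TD weight tuned down, the combined distribution leans toward the TO component, which is exactly the lever the abstract describes for alleviating data scarcity in long contexts.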
author2 Chng Eng Siong
format Theses and Dissertations
author Chong, Tze Yuang
author_sort Chong, Tze Yuang
title Exploiting long context using joint distance and occurrence information for language modeling
title_sort exploiting long context using joint distance and occurrence information for language modeling
publishDate 2018
url http://hdl.handle.net/10356/75876
_version_ 1759854485380792320