Acoustic modeling for speech recognition under limited training data conditions

The development of a speech recognition system requires at least three resources: a large labeled speech corpus to build the acoustic model, a pronunciation lexicon to map words to phone sequences, and a large text corpus to build the language model. For many languages such as dialects or minority l...

Full description

Saved in:

Bibliographic Details
Main Author:	Do, Van Hai
Other Authors:	Chng Eng Siong
Format:	Theses and Dissertations
Language:	English
Published:	2015
Subjects:	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Online Access:	https://hdl.handle.net/10356/65409
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Nanyang Technological University
Language:	English

id	sg-ntu-dr.10356-65409
record_format	dspace
spelling	sg-ntu-dr.10356-654092023-03-04T00:50:22Z Acoustic modeling for speech recognition under limited training data conditions Do, Van Hai Chng Eng Siong Li Haizhou School of Computer Engineering Emerging Research Lab DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition The development of a speech recognition system requires at least three resources: a large labeled speech corpus to build the acoustic model, a pronunciation lexicon to map words to phone sequences, and a large text corpus to build the language model. For many languages such as dialects or minority languages, these resources are limited or even unavailable - we label these languages as under-resourced. In this thesis, the focus is to develop reliable acoustic models for under-resourced languages. The following three works have been proposed. In the first work, reliable acoustic models are built by transferring acoustic information from well-resourced languages (source) to under-resourced languages (target). Specifically, the phone models of the source language are reused to form the phone models of the target language. This is motivated by the fact that all human languages share a similar acoustic space, and hence some acoustic units e.g. phones, of two languages may have high correspondence and therefore allows the mapping of phones between languages. Unlike previous studies which examined only context-independent phone mapping, the thesis extends the studies to use context-dependent triphone states as the units to achieve higher acoustic resolution. In addition, linear and nonlinear mapping models with different training algorithms are also investigated. The results show that the nonlinear mapping with discriminative training criterion achieves the best performance in the proposed work. In the second work, rather than increasing the mapping resolution, the focus is to improve the quality of the cross-lingual feature used for mapping. Two approaches based on deep neural networks (DNNs) are examined. First, DNNs are used as the source language acoustic model to generate posterior features for phone mapping. Second, DNNs are used to replace multilayer perceptrons (MLPs) to realize the phone mapping. Experimental results show that better phone posteriors generated from the source DNNs result in a significant improvement in cross-lingual phone mapping, while deep structures for phone mapping are only useful when sufficient target language training data are available. The third work focuses on building a robust acoustic model using the exemplar-based modeling technique. Exemplar-based model is non-parametric and uses the training samples directly during recognition without training model parameters. This study uses a specific exemplar-based model, called kernel density, to estimate the likelihood of target language triphone states. To improve performance for under-resourced languages, cross-lingual bottleneck feature is used. In the exemplar-based technique, the major design consideration is the choice of distance function used to measure the similarity of a test sample and a training sample. This work proposed a Mahalanobis distance based metric optimized by minimizing the classification error rate on the training data. Results show that the proposed distance produces better results than the Euclidean distance. In addition, a discriminative score tuning network, using the same principle of minimizing training classification error, is also proposed. DOCTOR OF PHILOSOPHY (SCE) 2015-09-08T07:24:56Z 2015-09-08T07:24:56Z 2015 2015 Thesis Do, V. H. (2015). Acoustic modeling for speech recognition under limited training data conditions. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/65409 10.32657/10356/65409 en 144 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
spellingShingle	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition Do, Van Hai Acoustic modeling for speech recognition under limited training data conditions
description	The development of a speech recognition system requires at least three resources: a large labeled speech corpus to build the acoustic model, a pronunciation lexicon to map words to phone sequences, and a large text corpus to build the language model. For many languages such as dialects or minority languages, these resources are limited or even unavailable - we label these languages as under-resourced. In this thesis, the focus is to develop reliable acoustic models for under-resourced languages. The following three works have been proposed. In the first work, reliable acoustic models are built by transferring acoustic information from well-resourced languages (source) to under-resourced languages (target). Specifically, the phone models of the source language are reused to form the phone models of the target language. This is motivated by the fact that all human languages share a similar acoustic space, and hence some acoustic units e.g. phones, of two languages may have high correspondence and therefore allows the mapping of phones between languages. Unlike previous studies which examined only context-independent phone mapping, the thesis extends the studies to use context-dependent triphone states as the units to achieve higher acoustic resolution. In addition, linear and nonlinear mapping models with different training algorithms are also investigated. The results show that the nonlinear mapping with discriminative training criterion achieves the best performance in the proposed work. In the second work, rather than increasing the mapping resolution, the focus is to improve the quality of the cross-lingual feature used for mapping. Two approaches based on deep neural networks (DNNs) are examined. First, DNNs are used as the source language acoustic model to generate posterior features for phone mapping. Second, DNNs are used to replace multilayer perceptrons (MLPs) to realize the phone mapping. Experimental results show that better phone posteriors generated from the source DNNs result in a significant improvement in cross-lingual phone mapping, while deep structures for phone mapping are only useful when sufficient target language training data are available. The third work focuses on building a robust acoustic model using the exemplar-based modeling technique. Exemplar-based model is non-parametric and uses the training samples directly during recognition without training model parameters. This study uses a specific exemplar-based model, called kernel density, to estimate the likelihood of target language triphone states. To improve performance for under-resourced languages, cross-lingual bottleneck feature is used. In the exemplar-based technique, the major design consideration is the choice of distance function used to measure the similarity of a test sample and a training sample. This work proposed a Mahalanobis distance based metric optimized by minimizing the classification error rate on the training data. Results show that the proposed distance produces better results than the Euclidean distance. In addition, a discriminative score tuning network, using the same principle of minimizing training classification error, is also proposed.
author2	Chng Eng Siong
author_facet	Chng Eng Siong Do, Van Hai
format	Theses and Dissertations
author	Do, Van Hai
author_sort	Do, Van Hai
title	Acoustic modeling for speech recognition under limited training data conditions
title_short	Acoustic modeling for speech recognition under limited training data conditions
title_full	Acoustic modeling for speech recognition under limited training data conditions
title_fullStr	Acoustic modeling for speech recognition under limited training data conditions
title_full_unstemmed	Acoustic modeling for speech recognition under limited training data conditions
title_sort	acoustic modeling for speech recognition under limited training data conditions
publishDate	2015
url	https://hdl.handle.net/10356/65409
_version_	1759856941440434176

Acoustic modeling for speech recognition under limited training data conditions

Similar Items