Acoustic modeling for speech recognition under limited training data conditions

The development of a speech recognition system requires at least three resources: a large labeled speech corpus to build the acoustic model, a pronunciation lexicon to map words to phone sequences, and a large text corpus to build the language model. For many languages, such as dialects or minority languages, these resources are limited or even unavailable; such languages are labeled under-resourced. This thesis develops reliable acoustic models for under-resourced languages.


Bibliographic Details
Main Author: Do, Van Hai
Other Authors: Chng Eng Siong; Li Haizhou
Format: Theses and Dissertations
Language:English
Published: 2015
Subjects:
Online Access:https://hdl.handle.net/10356/65409
Institution: Nanyang Technological University
id sg-ntu-dr.10356-65409
record_format dspace
spelling sg-ntu-dr.10356-65409 2023-03-04T00:50:22Z
Title: Acoustic modeling for speech recognition under limited training data conditions
Author: Do, Van Hai
Supervisors: Chng Eng Siong; Li Haizhou
School: School of Computer Engineering (Emerging Research Lab)
Subject: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Abstract: The development of a speech recognition system requires at least three resources: a large labeled speech corpus to build the acoustic model, a pronunciation lexicon to map words to phone sequences, and a large text corpus to build the language model. For many languages, such as dialects or minority languages, these resources are limited or even unavailable; we label these languages under-resourced. The focus of this thesis is to develop reliable acoustic models for under-resourced languages, and three works are proposed. In the first work, reliable acoustic models are built by transferring acoustic information from well-resourced (source) languages to under-resourced (target) languages. Specifically, the phone models of the source language are reused to form the phone models of the target language. This is motivated by the fact that all human languages share a similar acoustic space, so some acoustic units, e.g. phones, of two languages may correspond closely, which allows phones to be mapped between languages. Unlike previous studies, which examined only context-independent phone mapping, this thesis uses context-dependent triphone states as the mapping units to achieve higher acoustic resolution. In addition, linear and nonlinear mapping models with different training algorithms are investigated. The results show that nonlinear mapping trained with a discriminative criterion achieves the best performance. In the second work, rather than increasing the mapping resolution, the focus is on improving the quality of the cross-lingual features used for mapping. Two approaches based on deep neural networks (DNNs) are examined. First, DNNs are used as the source-language acoustic model to generate posterior features for phone mapping. Second, DNNs replace multilayer perceptrons (MLPs) to realize the phone mapping. Experimental results show that the better phone posteriors generated by the source DNNs yield a significant improvement in cross-lingual phone mapping, while deep structures for the phone mapping itself are useful only when sufficient target-language training data are available. The third work focuses on building a robust acoustic model using exemplar-based modeling. Exemplar-based models are non-parametric: they use the training samples directly during recognition, without training model parameters. This study uses a specific exemplar-based model, kernel density estimation, to estimate the likelihood of target-language triphone states, and uses cross-lingual bottleneck features to improve performance for under-resourced languages. In the exemplar-based technique, the major design consideration is the choice of the distance function used to measure the similarity between a test sample and a training sample. This work proposes a Mahalanobis-distance-based metric optimized by minimizing the classification error rate on the training data; results show that it outperforms the Euclidean distance. In addition, a discriminative score-tuning network, based on the same principle of minimizing training classification error, is proposed.
Degree: DOCTOR OF PHILOSOPHY (SCE)
Date deposited: 2015-09-08T07:24:56Z
Date issued: 2015
Type: Thesis
Citation: Do, V. H. (2015). Acoustic modeling for speech recognition under limited training data conditions. Doctoral thesis, Nanyang Technological University, Singapore.
URL: https://hdl.handle.net/10356/65409
DOI: 10.32657/10356/65409
Language: en
Extent: 144 p. (application/pdf)
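As a rough illustration of the cross-lingual phone-mapping idea described in the abstract, the sketch below maps one frame of source-phone posteriors to target-phone posteriors through a small one-hidden-layer nonlinear network. All sizes, weights, and the input frame are hypothetical placeholders; the network here is untrained, whereas the thesis trains such mappings (including with a discriminative criterion) on the limited target-language data:

```python
import math
import random

# Hypothetical sizes: 39 source-language phones, 10 target-language phones.
N_SRC, N_TGT, N_HID = 39, 10, 16
random.seed(0)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Untrained random weights standing in for a learned mapping network.
W1 = [[random.gauss(0, 0.1) for _ in range(N_HID)] for _ in range(N_SRC)]
W2 = [[random.gauss(0, 0.1) for _ in range(N_TGT)] for _ in range(N_HID)]

def phone_map(src_post):
    """Map one frame of source-phone posteriors to target-phone posteriors."""
    hidden = [math.tanh(sum(p * W1[i][j] for i, p in enumerate(src_post)))
              for j in range(N_HID)]                      # nonlinear layer
    logits = [sum(h * W2[j][k] for j, h in enumerate(hidden))
              for k in range(N_TGT)]
    return softmax(logits)                                # target posteriors

# One frame of source posteriors (placeholder for a source DNN/MLP output).
frame = softmax([random.gauss(0, 1) for _ in range(N_SRC)])
tgt_post = phone_map(frame)
```

The same structure applies when the units are context-dependent triphone states rather than phones; only the input and output dimensions grow.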
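The exemplar-based kernel-density idea from the third work can be sketched in the same spirit. This hypothetical example scores a test frame against each class's stored exemplars with a Gaussian kernel over a diagonal Mahalanobis distance; the toy classes, features, and bandwidth are all illustrative, and the identity scaling used here reduces to the Euclidean distance, whereas the thesis instead learns the metric by minimizing classification error on the training data:

```python
import math

def mahalanobis_sq(x, y, inv_var):
    """Squared Mahalanobis distance with a diagonal covariance (assumption)."""
    return sum(iv * (a - b) ** 2 for a, b, iv in zip(x, y, inv_var))

def kernel_density_score(x, exemplars, inv_var, h=1.0):
    """Average Gaussian kernel over one class's exemplars (unnormalized)."""
    return sum(math.exp(-mahalanobis_sq(x, e, inv_var) / (2 * h * h))
               for e in exemplars) / len(exemplars)

# Toy exemplars for two hypothetical triphone-state classes in a 2-D
# bottleneck-feature space (real features would be higher-dimensional).
class_a = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
class_b = [[2.0, 2.0], [2.1, 1.9]]
inv_var = [1.0, 1.0]  # identity scaling: equivalent to Euclidean distance

test_frame = [0.05, 0.0]
score_a = kernel_density_score(test_frame, class_a, inv_var)
score_b = kernel_density_score(test_frame, class_b, inv_var)
# The frame lies near class_a's exemplars, so score_a exceeds score_b.
```

Because no parameters are trained, "building" such a model amounts to storing exemplars; the learned metric and the score-tuning network are where the discriminative training enters.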
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
author2 Chng Eng Siong
format Theses and Dissertations
author Do, Van Hai
title Acoustic modeling for speech recognition under limited training data conditions
publishDate 2015
url https://hdl.handle.net/10356/65409
_version_ 1759856941440434176