Acoustic modeling for speech recognition under limited training data conditions

Bibliographic Details
Main Author: Do, Van Hai
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2015
Online Access: https://hdl.handle.net/10356/65409
Institution: Nanyang Technological University
Description
The development of a speech recognition system requires at least three resources: a large labeled speech corpus to build the acoustic model, a pronunciation lexicon to map words to phone sequences, and a large text corpus to build the language model. For many languages, such as dialects or minority languages, these resources are limited or even unavailable; such languages are referred to as under-resourced. The focus of this thesis is to develop reliable acoustic models for under-resourced languages. Three works are proposed.

In the first work, reliable acoustic models are built by transferring acoustic information from well-resourced (source) languages to under-resourced (target) languages. Specifically, the phone models of the source language are reused to form the phone models of the target language. This is motivated by the fact that all human languages share a similar acoustic space, so some acoustic units, e.g. phones, of two languages may correspond closely, which allows phones to be mapped between languages. Unlike previous studies, which examined only context-independent phone mapping, this thesis uses context-dependent triphone states as the mapping units to achieve higher acoustic resolution. In addition, linear and nonlinear mapping models with different training algorithms are investigated. The results show that nonlinear mapping with a discriminative training criterion achieves the best performance (a minimal sketch of such a mapping appears after this summary).

In the second work, rather than increasing the mapping resolution, the focus is on improving the quality of the cross-lingual features used for mapping. Two approaches based on deep neural networks (DNNs) are examined: first, DNNs are used as the source-language acoustic model to generate posterior features for phone mapping; second, DNNs replace multilayer perceptrons (MLPs) to realize the phone mapping itself. Experimental results show that better phone posteriors generated by the source DNNs yield a significant improvement in cross-lingual phone mapping, while deep structures for the phone mapping are only useful when sufficient target-language training data are available.

The third work focuses on building a robust acoustic model using exemplar-based modeling. Exemplar-based models are non-parametric: they use the training samples directly during recognition instead of training model parameters. This study uses a specific exemplar-based model, kernel density estimation, to estimate the likelihood of target-language triphone states. To improve performance for under-resourced languages, cross-lingual bottleneck features are used. In the exemplar-based technique, the major design consideration is the choice of the distance function that measures the similarity between a test sample and a training sample. This work proposes a Mahalanobis-distance-based metric optimized by minimizing the classification error rate on the training data. Results show that the proposed distance outperforms the Euclidean distance (the second sketch below illustrates kernel density scoring with such a metric). In addition, a discriminative score tuning network, based on the same principle of minimizing the training classification error, is proposed.
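
To make the first work concrete, below is a minimal sketch of a nonlinear phone-mapping layer trained with a discriminative (cross-entropy) criterion, written with NumPy. It is illustrative only: the layer sizes, learning rate, and synthetic posterior data stand in for the thesis's actual source-language posteriors and target-language labels.

import numpy as np

rng = np.random.default_rng(0)

N_SRC = 40   # dimension of source-language phone posteriors (assumed)
N_TGT = 30   # number of target-language phone units (assumed)
N_HID = 64   # hidden-layer size of the nonlinear mapping (assumed)

# Synthetic stand-ins: rows of X behave like source-model posteriors,
# and each target label is tied to the dominant source phone to mimic
# a learnable phone correspondence between the two languages.
X = rng.random((500, N_SRC))
X /= X.sum(axis=1, keepdims=True)
y = X.argmax(axis=1) % N_TGT

W1 = rng.normal(0.0, 0.1, (N_SRC, N_HID)); b1 = np.zeros(N_HID)
W2 = rng.normal(0.0, 0.1, (N_HID, N_TGT)); b2 = np.zeros(N_TGT)

def forward(X):
    h = np.tanh(X @ W1 + b1)                      # the nonlinear hidden layer
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(logits)
    return h, p / p.sum(axis=1, keepdims=True)

lr = 0.5
for epoch in range(500):
    h, p = forward(X)
    grad = p.copy()                               # cross-entropy gradient w.r.t. logits
    grad[np.arange(len(y)), y] -= 1.0
    grad /= len(y)
    dW2 = h.T @ grad; db2 = grad.sum(axis=0)
    dh = (grad @ W2.T) * (1.0 - h ** 2)           # backprop through tanh
    dW1 = X.T @ dh;   db1 = dh.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

_, p = forward(X)
print("mapping accuracy on training data:", (p.argmax(axis=1) == y).mean())

The tanh hidden layer is what makes this mapping nonlinear; removing it (feeding X directly into the softmax layer) gives the linear mapping variant that the thesis compares against.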
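
Similarly, the kernel density scoring of the third work can be sketched as follows, again assuming NumPy. The diagonal Mahalanobis metric here is fixed for brevity, whereas the thesis learns it by minimizing training classification error, and the synthetic exemplars stand in for real cross-lingual bottleneck features.

import numpy as np

rng = np.random.default_rng(1)
D = 20   # bottleneck-feature dimension (assumed)

# Exemplars: stored training frames for each target triphone state,
# drawn from shifted Gaussians as a stand-in for real bottleneck features.
exemplars = {state: rng.normal(loc=state, scale=1.0, size=(100, D))
             for state in range(3)}

# Diagonal Mahalanobis metric; all ones reduces it to the Euclidean distance.
metric = np.ones(D)

def log_likelihood(x, E, metric, bandwidth=1.0):
    # Squared Mahalanobis distance from frame x to every exemplar in E.
    d2 = ((E - x) ** 2 * metric).sum(axis=1)
    # Gaussian kernel combined via log-sum-exp for numerical stability;
    # the shared normalization constant is dropped since it is identical
    # for all states and cancels when comparing scores.
    log_k = -d2 / (2.0 * bandwidth ** 2)
    m = log_k.max()
    return m + np.log(np.exp(log_k - m).sum()) - np.log(len(E))

x = rng.normal(loc=1.0, scale=1.0, size=D)   # a test frame
scores = {s: log_likelihood(x, E, metric) for s, E in exemplars.items()}
print("most likely triphone state:", max(scores, key=scores.get))

Because the metric enters only through the weighted squared distance, swapping the Euclidean distance for a learned one amounts to changing the metric vector, which is precisely what the proposed minimum-classification-error optimization adjusts.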