Feature-based robust techniques for speech recognition

Bibliographic Details
Main Author: Nguyen, Duc Hoang Ha
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2017
Online Access: http://hdl.handle.net/10356/69839
Institution: Nanyang Technological University
Description
Summary: Automatic speech recognition (ASR) decodes speech signals into text. While ASR can recognise words accurately in clean environments, its accuracy degrades considerably under noisy conditions; the robustness of ASR systems in real-world applications therefore remains a challenge. In this thesis, speech feature enhancement and model adaptation for robust speech recognition are studied, and three novel methods to improve performance are introduced.

The first work proposes a modification of the spectral subtraction method to reduce the non-stationary characteristics of additive noise in the speech. The main idea is to first normalise the noise's characteristics towards a Gaussian noise model, and then tackle the remaining noise with a model compensation method. In effect, the strategy reduces the noise handling problem to a back-end compensation problem. In this work, the back-end compensation is performed with the vector Taylor series (VTS) model compensation approach, and the method is called noise normalization VTS (NN-VTS).

The second work proposes an extension of particle filter compensation (PFC) to the large vocabulary continuous speech recognition (LVCSR) task. PFC is a method for tracking clean speech features that uses side information from hidden Markov models (HMMs) within the particle filter framework. However, for sub-word based LVCSR under noisy conditions, it is challenging to obtain an accurately aligned HMM state sequence that describes the underlying clean speech features, because the total number of triphone models involved can be very large. To improve identification of the correct phone sequence, this work proposes to use a noisy-model HMM, trained on noisy data, to estimate the state sequence, and a parallel clean-model HMM, trained on clean data, to generate the clean speech features. The two HMMs are trained jointly, and the alignment of states between the clean and noisy HMMs is obtained with the single pass retraining (SPR) technique. With this approach, the noisy-model HMM improves the accuracy of the state sequence estimate, and SPR provides the accurately aligned states. When the missing side information for PFC is available, a word error reduction of 28.46% compared with multi-condition training is observed for the Aurora-4 task.

The third work proposes a novel spectro-temporal (ST) transform framework to improve the word error rate in noisy and reverberant environments, motivated by findings that human speech comprehension relies on both the spectral content and the temporal envelope of the speech signal. The framework adapts the features to minimize the mismatch between the input features and the training data, using a cost function based on the Kullback-Leibler divergence. Two implementations are examined to overcome the issue of limited adaptation data. The first is a cross transform, which is a sparse spectro-temporal transform. The second is a cascade of a temporal transform and a spectral transform. Experiments are conducted on the REVERB Challenge 2014 task, where clean and multi-condition trained acoustic models are tested with real reverberant and noisy speech. The experimental results confirm that temporal information is important for reverberant speech recognition and that the simultaneous use of spectral and temporal information for feature adaptation is effective.
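
For orientation, the first work starts from spectral subtraction. The summary does not describe the thesis's modification, so the following is only a minimal sketch of the classic textbook form of power spectral subtraction, not NN-VTS. The function names `spectral_subtract` and `estimate_noise`, and the parameters `alpha` (over-subtraction factor) and `beta` (spectral floor), are assumptions introduced for illustration.

```python
def estimate_noise(frames):
    """Average per-bin power over frames assumed to contain noise only.

    frames: list of power spectra (lists of floats), all of equal length.
    This simple averaging is an illustrative assumption, not the thesis's
    noise estimator.
    """
    n_bins = len(frames[0])
    return [sum(frame[k] for frame in frames) / len(frames) for k in range(n_bins)]


def spectral_subtract(noisy_power, noise_power, alpha=1.0, beta=0.01):
    """Classic power spectral subtraction for a single frame.

    noisy_power: per-bin power spectrum of the noisy frame
    noise_power: per-bin noise power estimate
    alpha: over-subtraction factor (hypothetical default)
    beta: spectral floor as a fraction of the noisy power (hypothetical default)
    """
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    enhanced = []
    for y, n in zip(noisy_power, noise_power):
        # Subtract the scaled noise estimate, but never drop below the floor,
        # so every bin stays non-negative.
        enhanced.append(max(y - alpha * n, beta * y))
    return enhanced


# Toy example: two noise-only frames, then one noisy speech frame.
noise = estimate_noise([[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]])
print(spectral_subtract([10.0, 2.5, 0.5], noise))
```

The floor matters in bins where the noise estimate exceeds the observed power (the last bin in the toy example): plain subtraction would yield a negative power there, so the bin is clamped to a small fraction of the noisy power instead.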