Feature-based robust techniques for speech recognition

Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, its accuracy degrades considerably under noisy conditions. Thus, the robustness of ASR systems in real-world applications remains a challenge. In this thesis, speech feature enhancement and model adaptation for robust speech recognition are studied, and three novel methods to improve performance are introduced.

Bibliographic Details
Main Author: Nguyen, Duc Hoang Ha
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2017
Subjects:
Online Access:http://hdl.handle.net/10356/69839
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-69839
record_format dspace
spelling sg-ntu-dr.10356-698392023-03-04T00:52:18Z Feature-based robust techniques for speech recognition Nguyen, Duc Hoang Ha Chng Eng Siong Li Haizhou School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, its accuracy degrades considerably under noisy conditions. Thus, the robustness of ASR systems in real-world applications remains a challenge. In this thesis, speech feature enhancement and model adaptation for robust speech recognition are studied, and three novel methods to improve performance are introduced. The first work proposes a modification of the spectral subtraction method to reduce the non-stationary characteristics of additive noise in the speech. The main idea is to first normalize the noise's characteristics towards a Gaussian noise model, and then tackle the remaining noise with a model compensation method. The strategy is to shift the noise handling problem to the back-end process. In this work, the back-end compensation is performed using the vector Taylor series (VTS) model compensation approach, and we call this method noise normalization VTS (NN-VTS). The second work proposes an extension of particle filter compensation (PFC) to the large vocabulary continuous speech recognition (LVCSR) task. PFC tracks clean speech features using side information from hidden Markov models (HMMs) within the particle filter framework. However, under noisy conditions for sub-word based LVCSR, obtaining an accurately aligned HMM state sequence that describes the underlying clean speech features is challenging, because the total number of triphone models involved can be very large.
To improve the identification of the correct phone sequence, this work proposes to use an HMM trained from noisy data (the noisy model) to estimate the state sequence and a parallel HMM trained from clean data (the clean model) to generate the clean speech features. These two HMMs are trained jointly, and the alignment of states between the clean and noisy models is obtained by the single-pass retraining (SPR) technique. With this approach, the accuracy of the state sequence estimate is improved by the noisy model, and the accurately aligned states are obtained by SPR. With this side information available to PFC, a word error reduction of 28.46% over multi-condition training is observed on the Aurora-4 task. The third work proposes a novel spectro-temporal transform framework to reduce the word error rate in noisy and reverberant environments. Motivated by the finding that human speech comprehension relies on both the spectral content and the temporal envelope of the speech signal, a spectro-temporal (ST) transform framework is proposed. This framework adapts the features to minimize the mismatch between the input features and the training data using a Kullback-Leibler divergence based cost function. Two implementations are examined to overcome the limited adaptation data issue: the first is a cross transform, a sparse spectro-temporal transform; the second is a cascade of a temporal transform and a spectral transform. Experiments are conducted on the REVERB Challenge 2014 task, where clean and multi-condition trained acoustic models are tested with real reverberant and noisy speech. Experimental results confirm that temporal information is important for reverberant speech recognition and that the simultaneous use of spectral and temporal information for feature adaptation is effective. Doctor of Philosophy (SCE) 2017-03-29T08:06:24Z 2017-03-29T08:06:24Z 2017 Thesis Nguyen, D. H. H. (2017).
Feature-based robust techniques for speech recognition. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/69839 10.32657/10356/69839 en 124 p. application/pdf
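The abstract's first contribution builds on spectral subtraction. As a point of reference, a generic magnitude-domain spectral subtraction step can be sketched as follows; this is not the thesis's NN-VTS method itself (whose noise normalization and VTS back-end are detailed in the thesis), and the over-subtraction factor `alpha` and spectral floor `beta` are illustrative assumed parameters:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Generic magnitude-domain spectral subtraction.

    noisy_mag : (frames, bins) magnitude spectrogram of noisy speech
    noise_mag : (bins,) noise magnitude estimate, e.g. averaged over
                leading non-speech frames
    alpha     : over-subtraction factor (assumed illustrative value)
    beta      : spectral floor factor, keeps magnitudes positive
    """
    clean_est = noisy_mag - alpha * noise_mag   # subtract scaled noise estimate
    floor = beta * noisy_mag                    # per-bin spectral floor
    return np.maximum(clean_est, floor)         # clamp residual to the floor

# Toy usage: 3 frames, 4 frequency bins, flat unit-magnitude noise.
noisy = np.array([[5.0, 4.0, 3.0, 2.0],
                  [6.0, 5.0, 4.0, 3.0],
                  [5.5, 4.5, 3.5, 2.5]])
noise = np.ones(4)
clean = spectral_subtraction(noisy, noise)
```

The spectral floor is the usual guard against the negative magnitudes that plain subtraction would produce in low-energy bins; it is also a source of the residual "musical noise" that motivates handing the remaining mismatch to a back-end compensation stage such as VTS.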
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
spellingShingle DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Nguyen, Duc Hoang Ha
Feature-based robust techniques for speech recognition
description Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, its accuracy degrades considerably under noisy conditions. Thus, the robustness of ASR systems in real-world applications remains a challenge. In this thesis, speech feature enhancement and model adaptation for robust speech recognition are studied, and three novel methods to improve performance are introduced. The first work proposes a modification of the spectral subtraction method to reduce the non-stationary characteristics of additive noise in the speech. The main idea is to first normalize the noise's characteristics towards a Gaussian noise model, and then tackle the remaining noise with a model compensation method. The strategy is to shift the noise handling problem to the back-end process. In this work, the back-end compensation is performed using the vector Taylor series (VTS) model compensation approach, and we call this method noise normalization VTS (NN-VTS). The second work proposes an extension of particle filter compensation (PFC) to the large vocabulary continuous speech recognition (LVCSR) task. PFC tracks clean speech features using side information from hidden Markov models (HMMs) within the particle filter framework. However, under noisy conditions for sub-word based LVCSR, obtaining an accurately aligned HMM state sequence that describes the underlying clean speech features is challenging, because the total number of triphone models involved can be very large. To improve the identification of the correct phone sequence, this work proposes to use an HMM trained from noisy data (the noisy model) to estimate the state sequence and a parallel HMM trained from clean data (the clean model) to generate the clean speech features. These two HMMs are trained jointly, and the alignment of states between the clean and noisy models is obtained by the single-pass retraining (SPR) technique.
With this approach, the accuracy of the state sequence estimate is improved by the noisy model, and the accurately aligned states are obtained by SPR. With this side information available to PFC, a word error reduction of 28.46% over multi-condition training is observed on the Aurora-4 task. The third work proposes a novel spectro-temporal transform framework to reduce the word error rate in noisy and reverberant environments. Motivated by the finding that human speech comprehension relies on both the spectral content and the temporal envelope of the speech signal, a spectro-temporal (ST) transform framework is proposed. This framework adapts the features to minimize the mismatch between the input features and the training data using a Kullback-Leibler divergence based cost function. Two implementations are examined to overcome the limited adaptation data issue: the first is a cross transform, a sparse spectro-temporal transform; the second is a cascade of a temporal transform and a spectral transform. Experiments are conducted on the REVERB Challenge 2014 task, where clean and multi-condition trained acoustic models are tested with real reverberant and noisy speech. Experimental results confirm that temporal information is important for reverberant speech recognition and that the simultaneous use of spectral and temporal information for feature adaptation is effective.
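The Kullback-Leibler-divergence-based adaptation in the third work can be illustrated in a deliberately simplified setting. The thesis's cross and cascaded transforms are matrix-valued spectro-temporal transforms estimated on limited adaptation data; the sketch below instead assumes a per-dimension affine transform and diagonal Gaussians, for which the KL-minimizing transform has a closed form (effectively mean-variance normalization of the test features to the training statistics):

```python
import numpy as np

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """Per-dimension KL( N(mu_p, var_p) || N(mu_q, var_q) )."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def fit_affine_adapt(test_feats, train_mu, train_var):
    """Per-dimension affine transform y = a*x + b that drives the KL
    divergence between the adapted test features and the training
    Gaussian to zero (closed form under the diagonal assumption)."""
    mu_t = test_feats.mean(axis=0)
    var_t = test_feats.var(axis=0)
    a = np.sqrt(train_var / var_t)   # match variances
    b = train_mu - a * mu_t          # then match means
    return a, b

# Toy data: 2-D "clean" training features vs. shifted, scaled test features.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 2))
test = rng.normal(3.0, 2.0, size=(500, 2))   # mismatched conditions

mu0, var0 = train.mean(axis=0), train.var(axis=0)
a, b = fit_affine_adapt(test, mu0, var0)
adapted = test * a + b
kl_after = kl_gauss(adapted.mean(axis=0), adapted.var(axis=0), mu0, var0)
```

Because only a scale and an offset per dimension are estimated, very little adaptation data is needed; the cross and cascaded transforms in the thesis trade this simplicity for richer spectro-temporal coupling while still constraining the number of free parameters.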
author2 Chng Eng Siong
author_facet Chng Eng Siong
Nguyen, Duc Hoang Ha
format Theses and Dissertations
author Nguyen, Duc Hoang Ha
author_sort Nguyen, Duc Hoang Ha
title Feature-based robust techniques for speech recognition
title_short Feature-based robust techniques for speech recognition
title_full Feature-based robust techniques for speech recognition
title_fullStr Feature-based robust techniques for speech recognition
title_full_unstemmed Feature-based robust techniques for speech recognition
title_sort feature-based robust techniques for speech recognition
publishDate 2017
url http://hdl.handle.net/10356/69839
_version_ 1759853868378750976