Feature-based robust techniques for speech recognition

Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, its accuracy degrades considerably under noisy conditions. Thus, the robustness of ASR systems in real-world applications remains a challenge. In this thesis, speech feature enhancement and model adaptation for robust speech recognition are studied, and three novel methods to improve performance are introduced.

Bibliographic Details
Main Author: Nguyen, Duc Hoang Ha
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2017
Subjects:
Online Access:http://hdl.handle.net/10356/69839
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-69839
record_format dspace
spelling sg-ntu-dr.10356-698392023-03-04T00:52:18Z Feature-based robust techniques for speech recognition Nguyen, Duc Hoang Ha Chng Eng Siong Li Haizhou School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, its accuracy degrades considerably under noisy conditions. Thus, the robustness of ASR systems in real-world applications remains a challenge. In this thesis, speech feature enhancement and model adaptation for robust speech recognition are studied, and three novel methods to improve performance are introduced. The first work proposes a modification of the spectral subtraction method to reduce the non-stationary characteristics of additive noise in the speech. The main idea is to first normalize the noise's characteristics towards a Gaussian noise model, and then tackle the remaining noise with a model compensation method. The strategy is to shift the noise handling problem to the back-end process. In this work, the back-end compensation is performed using the vector Taylor series (VTS) model compensation approach, and we call this method noise normalization VTS (NN-VTS). The second work proposes an extension of particle filter compensation (PFC) to the large vocabulary continuous speech recognition (LVCSR) task. PFC tracks clean speech features using side information from hidden Markov models (HMMs) within the particle filter framework. However, under noisy conditions for sub-word based LVCSR, obtaining an accurately aligned HMM state sequence that describes the underlying clean speech features is challenging, because the total number of triphone models involved can be very large.
To improve the identification of the correct phone sequence, this work proposes to use an HMM trained from noisy data (the noisy model) to estimate the state sequence and a parallel HMM trained from clean data (the clean model) to generate the clean speech features. These two HMMs are trained jointly, and the alignment of states between the clean and noisy models is obtained by the single-pass retraining (SPR) technique. With this approach, the accuracy of the state sequence estimate is improved by the noisy model, and the accurately aligned states are obtained by SPR. With this side information available to PFC, a word error reduction of 28.46% over multi-condition training is observed on the Aurora-4 task. The third work proposes a novel spectro-temporal transform framework to reduce the word error rate in noisy and reverberant environments. Motivated by the finding that human speech comprehension relies on both the spectral content and the temporal envelope of the speech signal, a spectro-temporal (ST) transform framework is proposed. This framework adapts the features to minimize the mismatch between the input features and the training data using a Kullback-Leibler divergence based cost function. Two implementations are examined to overcome the limited adaptation data issue: the first is a cross transform, a sparse spectro-temporal transform; the second is a cascade of a temporal transform and a spectral transform. Experiments are conducted on the REVERB Challenge 2014 task, where clean and multi-condition trained acoustic models are tested with real reverberant and noisy speech. Experimental results confirm that temporal information is important for reverberant speech recognition and that the simultaneous use of spectral and temporal information for feature adaptation is effective. Doctor of Philosophy (SCE) 2017-03-29T08:06:24Z 2017-03-29T08:06:24Z 2017 Thesis Nguyen, D. H. H. (2017).
Feature-based robust techniques for speech recognition. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/69839 10.32657/10356/69839 en 124 p. application/pdf
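The abstract's first contribution builds on spectral subtraction. As a point of reference, a generic magnitude-domain spectral subtraction step can be sketched as follows; this is not the thesis's NN-VTS method itself (whose noise normalization and VTS back-end are detailed in the thesis), and the over-subtraction factor `alpha` and spectral floor `beta` are illustrative assumed parameters:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Generic magnitude-domain spectral subtraction.

    noisy_mag : (frames, bins) magnitude spectrogram of noisy speech
    noise_mag : (bins,) noise magnitude estimate, e.g. averaged over
                leading non-speech frames
    alpha     : over-subtraction factor (assumed illustrative value)
    beta      : spectral floor factor, keeps magnitudes positive
    """
    clean_est = noisy_mag - alpha * noise_mag   # subtract scaled noise estimate
    floor = beta * noisy_mag                    # per-bin spectral floor
    return np.maximum(clean_est, floor)         # clamp residual to the floor

# Toy usage: 3 frames, 4 frequency bins, flat unit-magnitude noise.
noisy = np.array([[5.0, 4.0, 3.0, 2.0],
                  [6.0, 5.0, 4.0, 3.0],
                  [5.5, 4.5, 3.5, 2.5]])
noise = np.ones(4)
clean = spectral_subtraction(noisy, noise)
```

The spectral floor is the usual guard against the negative magnitudes that plain subtraction would produce in low-energy bins; it is also a source of the residual "musical noise" that motivates handing the remaining mismatch to a back-end compensation stage such as VTS.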
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
spellingShingle DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Nguyen, Duc Hoang Ha
Feature-based robust techniques for speech recognition
description Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, its accuracy degrades considerably under noisy conditions. Thus, the robustness of ASR systems in real-world applications remains a challenge. In this thesis, speech feature enhancement and model adaptation for robust speech recognition are studied, and three novel methods to improve performance are introduced. The first work proposes a modification of the spectral subtraction method to reduce the non-stationary characteristics of additive noise in the speech. The main idea is to first normalize the noise's characteristics towards a Gaussian noise model, and then tackle the remaining noise with a model compensation method. The strategy is to shift the noise handling problem to the back-end process. In this work, the back-end compensation is performed using the vector Taylor series (VTS) model compensation approach, and we call this method noise normalization VTS (NN-VTS). The second work proposes an extension of particle filter compensation (PFC) to the large vocabulary continuous speech recognition (LVCSR) task. PFC tracks clean speech features using side information from hidden Markov models (HMMs) within the particle filter framework. However, under noisy conditions for sub-word based LVCSR, obtaining an accurately aligned HMM state sequence that describes the underlying clean speech features is challenging, because the total number of triphone models involved can be very large. To improve the identification of the correct phone sequence, this work proposes to use an HMM trained from noisy data (the noisy model) to estimate the state sequence and a parallel HMM trained from clean data (the clean model) to generate the clean speech features. These two HMMs are trained jointly, and the alignment of states between the clean and noisy models is obtained by the single-pass retraining (SPR) technique.
With this approach, the accuracy of the state sequence estimate is improved by the noisy model, and the accurately aligned states are obtained by SPR. With this side information available to PFC, a word error reduction of 28.46% over multi-condition training is observed on the Aurora-4 task. The third work proposes a novel spectro-temporal transform framework to reduce the word error rate in noisy and reverberant environments. Motivated by the finding that human speech comprehension relies on both the spectral content and the temporal envelope of the speech signal, a spectro-temporal (ST) transform framework is proposed. This framework adapts the features to minimize the mismatch between the input features and the training data using a Kullback-Leibler divergence based cost function. Two implementations are examined to overcome the limited adaptation data issue: the first is a cross transform, a sparse spectro-temporal transform; the second is a cascade of a temporal transform and a spectral transform. Experiments are conducted on the REVERB Challenge 2014 task, where clean and multi-condition trained acoustic models are tested with real reverberant and noisy speech. Experimental results confirm that temporal information is important for reverberant speech recognition and that the simultaneous use of spectral and temporal information for feature adaptation is effective.
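The Kullback-Leibler-divergence-based adaptation in the third work can be illustrated in a deliberately simplified setting. The thesis's cross and cascaded transforms are matrix-valued spectro-temporal transforms estimated on limited adaptation data; the sketch below instead assumes a per-dimension affine transform and diagonal Gaussians, for which the KL-minimizing transform has a closed form (effectively mean-variance normalization of the test features to the training statistics):

```python
import numpy as np

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """Per-dimension KL( N(mu_p, var_p) || N(mu_q, var_q) )."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def fit_affine_adapt(test_feats, train_mu, train_var):
    """Per-dimension affine transform y = a*x + b that drives the KL
    divergence between the adapted test features and the training
    Gaussian to zero (closed form under the diagonal assumption)."""
    mu_t = test_feats.mean(axis=0)
    var_t = test_feats.var(axis=0)
    a = np.sqrt(train_var / var_t)   # match variances
    b = train_mu - a * mu_t          # then match means
    return a, b

# Toy data: 2-D "clean" training features vs. shifted, scaled test features.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 2))
test = rng.normal(3.0, 2.0, size=(500, 2))   # mismatched conditions

mu0, var0 = train.mean(axis=0), train.var(axis=0)
a, b = fit_affine_adapt(test, mu0, var0)
adapted = test * a + b
kl_after = kl_gauss(adapted.mean(axis=0), adapted.var(axis=0), mu0, var0)
```

Because only a scale and an offset per dimension are estimated, very little adaptation data is needed; the cross and cascaded transforms in the thesis trade this simplicity for richer spectro-temporal coupling while still constraining the number of free parameters.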
author2 Chng Eng Siong
author_facet Chng Eng Siong
Nguyen, Duc Hoang Ha
format Theses and Dissertations
author Nguyen, Duc Hoang Ha
author_sort Nguyen, Duc Hoang Ha
title Feature-based robust techniques for speech recognition
title_short Feature-based robust techniques for speech recognition
title_full Feature-based robust techniques for speech recognition
title_fullStr Feature-based robust techniques for speech recognition
title_full_unstemmed Feature-based robust techniques for speech recognition
title_sort feature-based robust techniques for speech recognition
publishDate 2017
url http://hdl.handle.net/10356/69839
_version_ 1759853868378750976