Front-end noise reduction algorithms for automatic speech recognition

One of the biggest obstacles that hinders the widespread use of automatic speech recognition technology is the inability to handle noise, which includes environmental noise, channel distortion and speaker variability, etc. Towards this end, we propose several feature compensation approaches to impro...

Bibliographic Details
Main Author: Dai, Peng
Other Authors: Soon Ing Yann
Format: Theses and Dissertations
Language: English
Published: 2014
Subjects: DRNTU::Engineering::Electrical and electronic engineering::Electronic systems::Signal processing
Online Access: https://hdl.handle.net/10356/61677
Institution: Nanyang Technological University
id sg-ntu-dr.10356-61677
record_format dspace
spelling sg-ntu-dr.10356-61677 2023-07-04T16:21:06Z
School of Electrical and Electronic Engineering
DOCTOR OF PHILOSOPHY (EEE)
2014-08-12T01:18:59Z
Thesis
Dai, P. (2014). Front-end noise reduction algorithms for automatic speech recognition. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/61677
10.32657/10356/61677
en
154 p.
application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Electrical and electronic engineering::Electronic systems::Signal processing
description One of the biggest obstacles hindering the widespread use of automatic speech recognition technology is its inability to handle noise, which includes environmental noise, channel distortion, and speaker variability. To address this, we propose several feature compensation approaches to improve the robustness of automatic speech recognition (ASR) systems: 1) direct implementation of the masking effect; 2) a 2D psychoacoustic filter; 3) model-based noise reduction. The first two are based on psychoacoustics, and the last comprises several algorithms based on a novel feature model. More details are given as follows.

The human auditory system works properly in adverse environments, e.g. in a crowded shopping mall where thousands of people are talking loudly over background commercial broadcasts. Modeling the human auditory system is therefore a straightforward and logical approach to improving the performance of ASR systems. The first part of this thesis focuses on the study of masking effects, which describe how a clearly audible sound (the maskee) becomes less audible in the presence of another sound (the masker). Masking effects can be classified as temporal masking and frequency masking (a.k.a. simultaneous masking). Chapter 3 introduces a novel algorithm based on Mel-Frequency Cepstral Coefficients (MFCC) that simulates these properties of the human auditory system. It implements temporal masking in the time domain and then frequency masking in the frequency domain.

For the second contribution on psychoacoustics, we further investigate the special properties of the time-frequency domain and propose the 2D psychoacoustic filter. In the time-frequency domain, the speech signal is represented over both time and frequency, which gives us the chance to address another psychoacoustic problem, i.e. temporal frequency masking. Temporal frequency masking describes the situation where the masker and maskee differ in both frequency and onset time. The 2D psychoacoustic filter implements not only temporal masking and frequency masking, but also temporal frequency masking and temporal integration. We also propose a unified model for the 2D psychoacoustic filter, which effectively models the equivalent masking phenomena. Mathematical derivations based on the characteristic functions of masking effects are provided to show the correctness of the 2D psychoacoustic filter.

The degradation of ASR performance is mainly due to the mismatch between the statistical model trained on clean speech and the test features derived from noisy speech. To reduce the mismatch, we propose to recover the clean speech from the noisy speech. Two front-end noise reduction algorithms are presented, i.e. Smoothing & Noise Subtraction (SNS) and Newton & Log Power Subtraction (NLPS). SNS tries to recover the temporal structure of the speech power spectrum. The histogram of the average speech log power spectrum shows that noise contamination shifts the noise peak. A two-step scheme is proposed to remove noise by first reducing the noise variance and then shifting the noise mean. NLPS works by solving a nonlinear function derived from the MFCC feature extraction algorithm.
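The two-step idea described for SNS (reduce the noise variance, then shift the noise mean) can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes a single filterbank channel, a plain moving average as the smoother, a noise estimate taken from the leading frames (assumed to be speech-free), and a small spectral floor. The names `smooth`, `subtract_noise`, `two_step_denoise`, `noise_frames`, `width`, and `floor` are all hypothetical.

```python
import math


def smooth(track, width=3):
    """Step 1 (assumed smoother): moving average over time.

    Averaging neighbouring frames reduces the frame-to-frame
    variance of the noise in one channel's power trajectory.
    """
    half = width // 2
    out = []
    for t in range(len(track)):
        lo = max(0, t - half)
        hi = min(len(track), t + half + 1)
        window = track[lo:hi]
        out.append(sum(window) / len(window))
    return out


def subtract_noise(power, noise_power, floor=0.01):
    """Step 2 (assumed form): subtract the estimated noise mean.

    Each value is floored at a small fraction of its input so the
    result stays positive and its logarithm is defined.
    """
    return [max(p - noise_power, floor * p) for p in power]


def two_step_denoise(power_track, noise_frames=5, width=3):
    """Smooth, subtract the noise mean, and return log power.

    power_track: positive power values of one channel over time.
    noise_frames: number of leading frames assumed to hold noise only.
    """
    smoothed = smooth(power_track, width)
    lead = smoothed[:noise_frames]
    noise_est = sum(lead) / len(lead)
    cleaned = subtract_noise(smoothed, noise_est)
    return [math.log(p) for p in cleaned]


if __name__ == "__main__":
    # Synthetic trajectory: noisy silence, a burst of speech, silence.
    track = [0.8, 1.2, 0.8, 1.2, 0.8, 1.2, 10.0, 12.0, 11.0, 1.2, 0.8]
    for value in two_step_denoise(track):
        print(round(value, 3))
```

The smoothing step comes first so that the noise estimate and the subtraction both operate on a less variable trajectory. A real front end would apply this per filterbank channel before the logarithm and DCT stages of MFCC extraction.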
author2 Soon Ing Yann
author_facet Soon Ing Yann
Dai, Peng
format Theses and Dissertations
author Dai, Peng
author_sort Dai, Peng
title Front-end noise reduction algorithms for automatic speech recognition
publishDate 2014
url https://hdl.handle.net/10356/61677
_version_ 1772828366370504704