Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech

In this paper, we propose a framework for joint normalization of spectral and temporal statistics of speech features for robust speech recognition. Current feature normalization approaches normalize the spectral and temporal aspects of feature statistics separately to overcome noise and reverberatio...

Bibliographic Details
Main Authors: Xiao, Xiong; Chng, Eng Siong; Li, Haizhou
Other Authors: School of Computer Engineering
Format: Conference or Workshop Item
Language: English
Published: 2013
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/98409
http://hdl.handle.net/10220/13398
Institution: Nanyang Technological University
id sg-ntu-dr.10356-98409
record_format dspace
spelling sg-ntu-dr.10356-984092020-05-28T07:18:03Z Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech Xiao, Xiong Chng, Eng Siong Li, Haizhou School of Computer Engineering IEEE International Conference on Acoustics, Speech and Signal Processing (2012 : Kyoto, Japan) Temasek Laboratories DRNTU::Engineering::Computer science and engineering In this paper, we propose a framework for joint normalization of spectral and temporal statistics of speech features for robust speech recognition. Current feature normalization approaches normalize the spectral and temporal aspects of feature statistics separately to overcome noise and reverberation. As a result, the interaction between the spectral normalization (e.g. mean and variance normalization, MVN) and temporal normalization (e.g. temporal structure normalization, TSN) is ignored. We propose a joint spectral and temporal normalization (JSTN) framework to simultaneously normalize these two aspects of feature statistics. In JSTN, feature trajectories are filtered by linear filters and the filters' coefficients are optimized by maximizing a likelihood-based objective function. Experimental results on Aurora-5 benchmark task show that JSTN consistently out-performs the cascade of MVN and TSN on test data corrupted by both additive noise and reverberation, which validates our proposal. Specifically, JSTN reduces average word error rate by 8-9% relatively over the cascade of MVN and TSN for both artificial and real noisy data. 2013-09-09T06:59:14Z 2019-12-06T19:54:56Z 2013-09-09T06:59:14Z 2019-12-06T19:54:56Z 2012 2012 Conference Paper Xiao, X., Chng, E. S., & Li, H. (2012). Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4325-4328. 
https://hdl.handle.net/10356/98409 http://hdl.handle.net/10220/13398 10.1109/ICASSP.2012.6288876 en © 2012 IEEE.
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
spellingShingle DRNTU::Engineering::Computer science and engineering
Xiao, Xiong
Chng, Eng Siong
Li, Haizhou
Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech
description In this paper, we propose a framework for joint normalization of spectral and temporal statistics of speech features for robust speech recognition. Current feature normalization approaches normalize the spectral and temporal aspects of feature statistics separately to overcome noise and reverberation. As a result, the interaction between spectral normalization (e.g. mean and variance normalization, MVN) and temporal normalization (e.g. temporal structure normalization, TSN) is ignored. We propose a joint spectral and temporal normalization (JSTN) framework to simultaneously normalize these two aspects of feature statistics. In JSTN, feature trajectories are filtered by linear filters, and the filters' coefficients are optimized by maximizing a likelihood-based objective function. Experimental results on the Aurora-5 benchmark task show that JSTN consistently outperforms the cascade of MVN and TSN on test data corrupted by both additive noise and reverberation, which validates our proposal. Specifically, JSTN reduces the average word error rate by a relative 8-9% compared with the cascade of MVN and TSN for both artificial and real noisy data.
author2 School of Computer Engineering
author_facet School of Computer Engineering
Xiao, Xiong
Chng, Eng Siong
Li, Haizhou
format Conference or Workshop Item
author Xiao, Xiong
Chng, Eng Siong
Li, Haizhou
author_sort Xiao, Xiong
title Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech
publishDate 2013
url https://hdl.handle.net/10356/98409
http://hdl.handle.net/10220/13398
_version_ 1681056728648515584
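
The abstract in this record mentions three ideas that a short sketch may make more concrete: mean and variance normalization (MVN) of features, linear filtering of feature trajectories, and a relative reduction in word error rate. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names (mvn, filter_trajectories, relative_reduction) are hypothetical, the filter taps are supplied by the caller rather than optimized with the likelihood-based objective the abstract describes, and features are assumed to be a (num_frames, num_dims) NumPy array for one utterance.

```python
import numpy as np


def mvn(features):
    """Per-utterance mean and variance normalization (MVN).

    features: array of shape (num_frames, num_dims).
    Each dimension is shifted to zero mean and scaled to unit variance
    over the frames of the utterance.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    # Guard against constant dimensions to avoid division by zero.
    return (features - mean) / np.maximum(std, 1e-8)


def filter_trajectories(features, taps):
    """Apply one FIR filter along time to every feature trajectory.

    A trajectory is one feature dimension viewed as a sequence over frames.
    The taps here are a caller-supplied assumption; the record's abstract
    says the real filter coefficients are optimized, which is not done here.
    """
    columns = [
        np.convolve(features[:, d], taps, mode="same")
        for d in range(features.shape[1])
    ]
    return np.stack(columns, axis=1)


def relative_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer
```

Calling filter_trajectories(mvn(x), taps) with fixed taps corresponds loosely to a cascade of spectral and temporal processing; the joint approach described in the abstract differs in that the filter coefficients are chosen by maximizing a likelihood-based objective.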