Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech
In this paper, we propose a framework for joint normalization of spectral and temporal statistics of speech features for robust speech recognition. Current feature normalization approaches normalize the spectral and temporal aspects of feature statistics separately to overcome noise and reverberatio...
Main Authors: | Xiao, Xiong; Chng, Eng Siong; Li, Haizhou |
---|---|
Other Authors: | School of Computer Engineering |
Format: | Conference or Workshop Item |
Language: | English |
Published: | 2013 |
Subjects: | DRNTU::Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/98409 http://hdl.handle.net/10220/13398 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-98409 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-98409; 2020-05-28T07:18:03Z; Xiao, Xiong; Chng, Eng Siong; Li, Haizhou; School of Computer Engineering; IEEE International Conference on Acoustics, Speech and Signal Processing (2012 : Kyoto, Japan); Temasek Laboratories; DRNTU::Engineering::Computer science and engineering; 2013-09-09T06:59:14Z; 2019-12-06T19:54:56Z; 2012; Conference Paper; Xiao, X., Chng, E. S., & Li, H. (2012). Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4325-4328;
https://hdl.handle.net/10356/98409; http://hdl.handle.net/10220/13398; DOI 10.1109/ICASSP.2012.6288876; en; © 2012 IEEE. |
institution |
Nanyang Technological University |
building |
NTU Library |
country |
Singapore |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering::Computer science and engineering |
description |
In this paper, we propose a framework for joint normalization of spectral and temporal statistics of speech features for robust speech recognition. Current feature normalization approaches normalize the spectral and temporal aspects of feature statistics separately to overcome noise and reverberation. As a result, the interaction between spectral normalization (e.g. mean and variance normalization, MVN) and temporal normalization (e.g. temporal structure normalization, TSN) is ignored. We propose a joint spectral and temporal normalization (JSTN) framework to simultaneously normalize these two aspects of feature statistics. In JSTN, feature trajectories are filtered by linear filters, and the filters' coefficients are optimized by maximizing a likelihood-based objective function. Experimental results on the Aurora-5 benchmark task show that JSTN consistently outperforms the cascade of MVN and TSN on test data corrupted by both additive noise and reverberation, which validates our proposal. Specifically, JSTN reduces the average word error rate by 8-9% relative to the cascade of MVN and TSN for both artificial and real noisy data. |
author2 |
School of Computer Engineering |
author_facet |
School of Computer Engineering; Xiao, Xiong; Chng, Eng Siong; Li, Haizhou |
format |
Conference or Workshop Item |
author |
Xiao, Xiong; Chng, Eng Siong; Li, Haizhou |
author_sort |
Xiao, Xiong |
title |
Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech |
publishDate |
2013 |
url |
https://hdl.handle.net/10356/98409 http://hdl.handle.net/10220/13398 |
_version_ |
1681056728648515584 |
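
The description above mentions two building blocks: mean and variance normalization (MVN) of features, and filtering of feature trajectories with linear filters. The following is a minimal sketch of those two generic operations only, not the authors' JSTN method. The feature matrix, its shape, and the moving-average filter taps are all hypothetical; in the paper the filter coefficients are optimized with a likelihood-based objective, which is not reproduced here.

```python
import numpy as np


def mvn(features):
    """Per-utterance mean and variance normalization along time (axis 0)."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    # Guard against division by zero for constant feature dimensions.
    return (features - mean) / np.maximum(std, 1e-8)


def filter_trajectories(features, taps):
    """Filter each feature dimension's time trajectory with the same FIR taps."""
    return np.stack(
        [np.convolve(features[:, d], taps, mode="same")
         for d in range(features.shape[1])],
        axis=1,
    )


# Hypothetical data: 200 frames of 13-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(200, 13))

# Hypothetical fixed filter: 5-tap moving average (NOT the TSN or JSTN filter).
taps = np.ones(5) / 5.0

normalized = mvn(feats)                          # spectral-statistics step
smoothed = filter_trajectories(normalized, taps)  # temporal filtering step
```

Applying `mvn` and then `filter_trajectories` corresponds to a cascade of the two steps, which is the kind of separate processing the description contrasts with joint normalization.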