Feature Adaptation Using Linear Spectro-Temporal Transform for Robust Speech Recognition

Bibliographic Details
Main Authors: Nguyen, Duc Hoang Ha; Xiao, Xiong; Chng, Eng Siong; Li, Haizhou
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2016
Online Access: https://hdl.handle.net/10356/84664
http://hdl.handle.net/10220/41916
Institution: Nanyang Technological University
Description
Summary: Spectral information represents short-term speech information within a frame of a few tens of milliseconds, while temporal information captures the evolution of speech statistics over consecutive frames. Motivated by findings that human speech comprehension relies on the integrity of both the spectral content and the temporal envelope of the speech signal, we study a spectro-temporal transform framework that adapts run-time speech features to minimize the mismatch between run-time and training data, together with two implementations of it: the cross transform and the cascaded transform. A Kullback-Leibler divergence based cost function is proposed to estimate the transform parameters. We conducted experiments on the REVERB Challenge 2014 task, in which clean and multi-condition trained acoustic models are tested on real reverberant and noisy speech. We found that temporal information is important for reverberant speech recognition and that the simultaneous use of spectral and temporal information for feature adaptation is effective. We also investigated the combination of the cross transform with fMLLR, the combination of batch, utterance and speaker mode adaptation, and multi-condition adaptive training using the proposed transforms. All experiments consistently report significant word error rate reductions.
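This record does not give the transform equations, so the following is only an illustrative sketch of the two ingredients the summary names, not the authors' cross or cascaded transform. It assumes a generic linear transform applied to each frame stacked with its neighbours (so one matrix mixes spectral dimensions and temporal context), and it uses the standard closed-form Kullback-Leibler divergence between two diagonal Gaussians as a stand-in for a mismatch measure. The names `stack_context`, `apply_transform`, and `kl_diag_gauss` are hypothetical.

```python
import math

def stack_context(frames, left, right):
    """Stack each frame with `left` past and `right` future frames.

    Edge frames are repeated at the utterance boundaries (an assumption).
    Input: list of T frames, each a list of D floats.
    Output: list of T vectors of length (left + right + 1) * D.
    """
    T = len(frames)
    stacked = []
    for t in range(T):
        row = []
        for k in range(t - left, t + right + 1):
            row.extend(frames[min(max(k, 0), T - 1)])
        stacked.append(row)
    return stacked

def apply_transform(frames, W, b, left, right):
    """Compute y_t = W * [x_{t-left}; ...; x_{t+right}] + b for every frame.

    W has D_out rows and (left + right + 1) * D columns; b has D_out entries.
    """
    out = []
    for row in stack_context(frames, left, right):
        out.append([sum(w * x for w, x in zip(w_row, row)) + b_i
                    for w_row, b_i in zip(W, b)])
    return out

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for Gaussians with diagonal covariance."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )
```

In this sketch, a `W` that is non-zero only in the columns of the centre frame reduces to a purely spectral (frame-wise) transform, while non-zero entries in the neighbouring-frame columns bring in temporal context. How the transform parameters are actually estimated from the divergence is described in the article itself, not in this record.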