Mismatch problem in deep-learning based speech enhancement

Bibliographic Details
Main Author: Hou, Nana
Other Authors: Chng Eng Siong
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/159197
Institution: Nanyang Technological University
Description
Summary: Speech enhancement aims to suppress background noise in noisy speech signals in order to improve perceptual quality and intelligibility. Deep-learning approaches usually assume that the training and testing data share the same probability distribution, but real-life scenarios often violate this assumption. As a result, speech enhancement performance may degrade significantly when the training and testing distributions are mismatched. This thesis focuses on alleviating this mismatch problem for speech enhancement. The mismatch is caused by various factors, but this work focuses on three scenarios only: unseen noises in the test data, missing high-frequency information under radio-channel testing conditions (channel effect), and a sensitive time-domain encoder/decoder. Specifically, we clarify these three factors, analyze their impact on speech enhancement, and propose three methods to address them.

The first proposed method addresses the mismatch caused by unseen noises in the test data, both with and without target-domain data. Specifically, we use the domain adversarial training (DAT) technique for domain transfer. If sufficient noisy target-domain data are available, we propose a domain discriminator that learns general features with DAT in order to overcome the domain mismatch. If no target-domain data are available, we use the noise labels of the source-domain data to generate noise-agnostic features with DAT instead. The experiments show that the proposed method delivers voice quality comparable to that of other state-of-the-art supervised learning techniques.

The second proposed method addresses the mismatch caused by missing high-frequency content (i.e., the channel effect), which is common in radio-channel corpora. In such scenarios, the input signals are noisy and also lack high-frequency information due to the channel effect. To recover the missing information while reducing background noise, we combine speech enhancement and bandwidth extension through multi-task learning. Specifically, we propose an end-to-end time-domain framework for noise-robust bandwidth extension that jointly optimizes a mask-based speech enhancement module and a bandwidth extension module with a multi-task loss function. The proposed framework also avoids decomposing signals into magnitude and phase spectra and therefore requires no phase estimation. Experimental results show that the proposed method outperforms the best baseline while using fewer parameters.

The third proposed method addresses the mismatch caused by a sensitive time-domain encoder/decoder. Time-domain speech enhancement has recently made great progress thanks to the learned filterbanks in the speech encoder/decoder, as used in Conv-TasNet. However, these filterbanks are usually learned entirely from the training data, which makes them sensitive to unseen test data. To alleviate this problem, we propose a two-step hybrid filterbanks-based network (TSHFNet) consisting of fully-learned, semi-learned, and non-learned filters, which can improve the robustness of the speech encoder/decoder in both matched and unmatched testing environments. The experiments confirm that the proposed method is more robust than the best time-domain speech enhancement baseline.
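
As a rough illustration of the multi-task loss mentioned for the second method, the sketch below combines an enhancement term and a bandwidth-extension term into one weighted objective. The summary does not state the actual loss terms or their weighting, so the mean-squared-error terms, the weight `alpha`, and all function and argument names here are assumptions made for illustration only, not the formulation used in the thesis.

```python
from typing import Sequence


def mse(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty waveforms."""
    if len(target) == 0 or len(estimate) != len(target):
        raise ValueError("waveforms must be non-empty and of equal length")
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def multi_task_loss(
    enhanced: Sequence[float],           # output of the enhancement branch
    clean_narrowband: Sequence[float],   # noise-free, band-limited reference
    extended: Sequence[float],           # output of the bandwidth-extension branch
    clean_wideband: Sequence[float],     # noise-free, full-band reference
    alpha: float = 0.5,                  # hypothetical task weight (assumption)
) -> float:
    """Weighted sum of an enhancement loss and a bandwidth-extension loss.

    Both terms are plain MSE here purely for illustration; the thesis
    summary does not specify which losses are actually used.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    enhancement_loss = mse(enhanced, clean_narrowband)
    extension_loss = mse(extended, clean_wideband)
    return alpha * enhancement_loss + (1.0 - alpha) * extension_loss


# Toy usage with two-sample "waveforms".
loss = multi_task_loss(
    enhanced=[1.0, 2.0],
    clean_narrowband=[1.0, 2.0],
    extended=[0.0, 0.0],
    clean_wideband=[0.0, 2.0],
    alpha=0.5,
)
```

In a joint setup of this kind, a single scalar such as `loss` would be minimized so that both branches are trained together; choosing `alpha` closer to 1 would emphasize noise suppression, and closer to 0 would emphasize recovery of the missing high-frequency content.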