Mismatch problem in deep-learning based speech enhancement


Bibliographic Details
Main Author: Hou, Nana
Other Authors: Chng Eng Siong
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Subjects:
Online Access:https://hdl.handle.net/10356/159197
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-159197
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
spellingShingle Engineering::Computer science and engineering
Hou, Nana
Mismatch problem in deep-learning based speech enhancement
description Speech enhancement aims to suppress background noise in noisy speech signals in order to improve speech perceptual quality and intelligibility. Deep-learning approaches usually assume that the training and testing data share the same probability distribution. However, real-life scenarios often fail to meet this assumption, and speech enhancement performance may degrade significantly when the training and testing distributions are mismatched. This thesis focuses on alleviating the mismatch problem for speech enhancement. The mismatch arises from various factors; in this work, we focus on three scenarios: unseen noises in the test data, missing high-frequency information under radio-channel testing conditions (the channel effect), and a sensitive time-domain encoder/decoder. Specifically, we clarify these three factors, analyze their impact on speech enhancement, and propose three methods to address the problem. The first proposed method addresses the mismatch caused by unseen noises in the test data, under conditions with or without target-domain data. Specifically, we utilize domain adversarial training (DAT) for domain transfer. When sufficient noisy target-domain data are available, a domain discriminator is trained with DAT to learn domain-general features and overcome the domain mismatch. When no target-domain data are available, we use the noise labels of the source-domain data to generate noise-agnostic features with DAT. The experiments show that the proposed method delivers voice quality comparable to that of state-of-the-art supervised learning techniques. The second proposed method addresses the mismatch caused by missing high-frequency signals (i.e., the channel effect), which is common in radio-channel corpora. In such scenarios, the input signals are noisy and also lack high-frequency information due to the channel effect. To recover the missing information and reduce background noise, we combine speech enhancement and bandwidth extension through multi-task learning. Specifically, we propose an end-to-end time-domain framework for noise-robust bandwidth extension that jointly optimizes a mask-based speech enhancement module and a bandwidth extension module with a multi-task loss function. The framework also avoids decomposing signals into magnitude and phase spectra and therefore requires no phase estimation. Experimental results show that the proposed method outperforms the best baseline with fewer parameters. The third proposed method addresses the mismatch caused by a sensitive time-domain encoder/decoder. Time-domain speech enhancement has recently made great progress thanks to the learned filterbanks in the speech encoder/decoder, as used in Conv-TasNet. However, these learned filterbanks rely fully on the training data and are therefore sensitive to unseen test data. To alleviate this problem, we propose a two-step hybrid filterbanks-based network (TSHFNet) consisting of fully-learned, semi-learned, and non-learned filters, which improves the robustness of the speech encoder/decoder under both matched and unmatched testing environments. The experiments confirm that the proposed method is more robust than the best time-domain speech enhancement baseline.
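The DAT approach in the first method hinges on a gradient reversal layer: features pass through unchanged in the forward direction, while the gradient from the domain (or noise-label) discriminator is sign-flipped on the way back, pushing the encoder toward domain-general features. A minimal framework-free sketch of that mechanism, with the scaling factor `lam` and class name chosen for illustration (the thesis's actual architecture is not specified here):

```python
import numpy as np

class GradientReversal:
    """Sketch of the gradient reversal layer (GRL) used in
    domain adversarial training (DAT)."""

    def __init__(self, lam=1.0):
        # lam trades off the adversarial (domain) objective against
        # the enhancement objective; an illustrative hyperparameter.
        self.lam = lam

    def forward(self, x):
        # Identity in the forward pass: features are unchanged.
        return x

    def backward(self, grad_output):
        # Sign-flipped, scaled gradient in the backward pass, so the
        # encoder is updated to *confuse* the domain discriminator.
        return -self.lam * grad_output
```

In a full system this layer sits between the shared encoder and the domain discriminator, while the enhancement head receives ordinary gradients.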
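The second method's multi-task training can be summarized as a weighted sum of an enhancement loss and a bandwidth-extension loss. The sketch below uses plain time-domain MSE for both branches and a weight `alpha` purely for illustration; the loss actually used in the thesis (e.g., a scale-invariant objective) may differ:

```python
import numpy as np

def multitask_loss(enhanced, clean_nb, extended, clean_wb, alpha=0.5):
    """Illustrative multi-task loss for joint speech enhancement (SE)
    and bandwidth extension (BWE) in the time domain.

    enhanced/clean_nb: narrow-band SE output and its reference.
    extended/clean_wb: wide-band BWE output and its reference.
    alpha: assumed task-weighting hyperparameter.
    """
    l_se = np.mean((enhanced - clean_nb) ** 2)   # enhancement branch
    l_bwe = np.mean((extended - clean_wb) ** 2)  # extension branch
    return l_se + alpha * l_bwe
```

Both branches operate directly on waveforms, consistent with the abstract's point that no magnitude/phase decomposition (and hence no phase estimation) is needed.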
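The third method's hybrid encoder/decoder mixes trainable filters with fixed, analytically defined ones so that the front end does not depend entirely on the training data. The following sketch builds such a bank from randomly initialized "learned" filters plus a fixed cosine (DCT-like) basis; the split ratio and the choice of cosine basis are assumptions for illustration, not TSHFNet's exact design:

```python
import numpy as np

def hybrid_filterbank(n_filters=64, kernel=16, learned_frac=0.5, seed=0):
    """Illustrative hybrid filterbank: part learned, part fixed.

    Returns an (n_filters, kernel) matrix whose first rows would be
    updated by training while the remaining fixed rows never change.
    """
    rng = np.random.default_rng(seed)
    n_learned = int(n_filters * learned_frac)
    # Trainable filters: small random initialization (updated by SGD).
    learned = 0.1 * rng.standard_normal((n_learned, kernel))
    # Non-learned filters: fixed DCT-like cosine basis, kept frozen.
    t = np.arange(kernel)
    fixed = np.stack([np.cos(np.pi * (t + 0.5) * k / kernel)
                      for k in range(n_filters - n_learned)])
    return np.concatenate([learned, fixed], axis=0)
```

The intuition is that the frozen rows guarantee a stable, data-independent signal representation under unseen test conditions, while the learned rows adapt to the training distribution.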
author2 Chng Eng Siong
author_facet Chng Eng Siong
Hou, Nana
format Thesis-Doctor of Philosophy
author Hou, Nana
author_sort Hou, Nana
title Mismatch problem in deep-learning based speech enhancement
title_short Mismatch problem in deep-learning based speech enhancement
title_full Mismatch problem in deep-learning based speech enhancement
title_fullStr Mismatch problem in deep-learning based speech enhancement
title_full_unstemmed Mismatch problem in deep-learning based speech enhancement
title_sort mismatch problem in deep-learning based speech enhancement
publisher Nanyang Technological University
publishDate 2022
url https://hdl.handle.net/10356/159197
_version_ 1735491175832354816
spelling sg-ntu-dr.10356-1591972022-06-08T02:24:23Z Mismatch problem in deep-learning based speech enhancement Hou, Nana Chng Eng Siong School of Computer Science and Engineering ASESChng@ntu.edu.sg Engineering::Computer science and engineering
Doctor of Philosophy 2022-06-08T02:24:23Z 2022-06-08T02:24:23Z 2022 Thesis-Doctor of Philosophy Hou, N. (2022). Mismatch problem in deep-learning based speech enhancement. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/159197 https://hdl.handle.net/10356/159197 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University