Single channel multi-talker speech separation with deep learning
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2020
Online Access: https://hdl.handle.net/10356/143903
Institution: Nanyang Technological University
Summary: The objective of speech separation is to divide a mixture signal, i.e., multiple speakers with background noise, into a set of individual streams, each containing only a single speaker's voice. The study of speech separation is important because the performance of speech applications degrades dramatically in the presence of background noise and, especially, interfering speakers, which greatly limits many real-world speech applications. To this end, this thesis focuses on techniques that improve the performance and practicality of speech separation, and on the use of speech separation to address multi-talker speaker verification.
Speech is characterised by temporal continuity. However, the temporal continuity of separated speech is broken by the frame-leakage problem, in which a segment of one speaker's voice is wrongly assigned to another speaker's output stream. To address this, the thesis first proposes a temporal objective function to optimise the neural network and a grid long short-term memory (LSTM) network to learn spectro-temporal features. With this explicit temporal information serving as supervision and features, the temporal continuity broken by the windowing effect between frames is restored. The speech separation network is optimised through a multi-task learning framework, with a subtask that predicts an attribute (silence, single speaker, or overlapped) for each time-frequency (TF) bin of the mixture magnitude. Experiments show that the proposed method significantly outperforms the corresponding baseline in both objective and subjective evaluations.
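The multi-task idea above can be sketched in a few lines. This is an illustrative NumPy toy, not code from the thesis: the function names, the silence/dominance thresholds, and the loss weight `alpha` are all assumptions made here for clarity, and the attribute labelling rule is a plausible stand-in for however the thesis actually derives per-bin labels.

```python
import numpy as np

def tf_bin_attributes(mag_a, mag_b, silence_thresh=1e-3, ratio_thresh=10.0):
    """Label each time-frequency bin of a two-speaker mixture as
    0 = silence, 1 = one speaker dominant, 2 = overlapped.
    Thresholds here are illustrative, not taken from the thesis."""
    total = mag_a + mag_b
    labels = np.full(total.shape, 2, dtype=np.int64)  # default: overlapped
    dominant = np.maximum(mag_a, mag_b) > ratio_thresh * np.minimum(mag_a, mag_b)
    labels[dominant] = 1
    labels[total < silence_thresh] = 0  # silence overrides dominance
    return labels

def multitask_loss(est_mag, ref_mag, attr_logits, attr_labels, alpha=0.1):
    """Separation MSE plus an auxiliary cross-entropy over the three
    per-bin attribute classes, combined with a hand-picked weight."""
    mse = np.mean((est_mag - ref_mag) ** 2)
    # numerically stable softmax cross-entropy over the last axis
    z = attr_logits - attr_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -np.mean(np.take_along_axis(log_probs, attr_labels[..., None], axis=-1))
    return mse + alpha * ce
```

In a real system the logits would come from the subtask head of the separation network; here the point is only how the auxiliary attribute prediction enters the objective.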
In general, speech separation methods require the number of speakers in the mixture to be known or estimated in advance. However, the number of speakers is not always known in real-world applications. Moreover, speech separation may suffer from the global permutation ambiguity problem, where the separated voice of the same speaker may not stay in the same output stream across long pauses or utterances. These two problems greatly limit speech separation in realistic environments. To this end, the second and third contributions of this thesis are a frequency-domain and a time-domain speaker extraction solution, special cases of speech separation that address the aforementioned problems. The idea is to mimic the human ability of selective auditory attention by extracting only the target speaker's voice, given a reference speech sample from that speaker. The time-domain speaker extraction method further avoids the phase estimation problem inherent in the frequency-domain method at the signal reconstruction stage. Experiments show that the proposed methods significantly outperform a variety of baseline approaches in different evaluation environments, and confirm that they are more flexible and practical than traditional speech separation methods.
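The conditioning idea behind speaker extraction can be sketched as follows. This is a minimal, untrained NumPy illustration under assumptions made here, not the thesis's architecture: mean-pooling as the speaker encoder, concatenation as the conditioning mechanism, and a random linear layer in place of the actual separation network.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_embedding(ref_frames):
    """Crude utterance-level speaker embedding: mean-pool the
    reference frames (a real system would use a trained encoder)."""
    return ref_frames.mean(axis=0)

def extract_target(mix_frames, ref_frames, w=None):
    """Predict a per-bin soft mask for the target speaker by conditioning
    every mixture frame on the reference embedding, then apply the mask.
    The weights are random here, so the output is only shape-correct."""
    emb = speaker_embedding(ref_frames)
    cond = np.concatenate(
        [mix_frames, np.tile(emb, (len(mix_frames), 1))], axis=1)
    if w is None:
        w = rng.standard_normal((cond.shape[1], mix_frames.shape[1])) * 0.1
    mask = 1.0 / (1.0 + np.exp(-cond @ w))  # sigmoid mask in [0, 1]
    return mask * mix_frames                # masked target magnitude
```

The key property the sketch shows is that the network always emits exactly one output stream, selected by the reference speech, so neither the speaker count nor the output permutation ever needs to be resolved.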
The performance of speaker verification degrades significantly when the test speech is corrupted by interfering speakers, and speaker diarization also fails to segregate speakers in the presence of overlapped multi-talker speech. To the best of my knowledge, the fourth contribution of this thesis is the first solution that addresses overlapped multi-talker speaker verification with a tandem system, referred to as SE-SV, in which the proposed speaker extraction methods serve as the front-end processing of a traditional speaker verification system. Experimental results show that SE-SV significantly improves speaker verification performance on overlapped multi-talker speech and outperforms oracle speaker diarization.
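The tandem structure of such a system reduces to a short pipeline. The sketch below is a hypothetical skeleton, not the thesis's SE-SV implementation: the stand-in encoder, the cosine scoring, and the decision threshold are all assumptions, and the extractor is passed in as a black box.

```python
import numpy as np

def embed(frames):
    """Stand-in speaker encoder: mean-pool frames, then L2-normalise."""
    v = frames.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def verify(test_frames, enroll_frames, extractor, threshold=0.5):
    """Tandem pipeline: run speaker extraction on the (possibly
    overlapped) test speech using the enrolment speech as the
    reference, then score the result against the enrolment embedding
    by cosine similarity and threshold the score."""
    cleaned = extractor(test_frames, enroll_frames)
    score = float(embed(cleaned) @ embed(enroll_frames))
    return score, score >= threshold
```

The design point is that extraction and verification share the same enrolment speech: the front-end uses it as the reference to select whose voice to keep, and the back-end uses it to score the claim.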