Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM

Bibliographic Details
Main Authors:	Xu, Chenglin, Rao, Wei, Xiao, Xiong, Chng, Eng Siong, Li, Haizhou
Other Authors:	School of Computer Science and Engineering
Format:	Conference or Workshop Item
Language:	English
Published:	2020
Subjects:	Engineering::Computer science and engineering; Constrained Permutation Invariant Training; Grid LSTM
Online Access:	https://hdl.handle.net/10356/137336
Institution:	Nanyang Technological University

Description
Summary:	Utterance-level permutation invariant training (uPIT) is a state-of-the-art deep learning technique for speaker-independent multi-talker speech separation. uPIT solves the label ambiguity problem by minimizing the mean square error (MSE) over all permutations between outputs and targets. However, uPIT may be sub-optimal at the segmental level because the optimization is not calculated over individual frames. In this paper, we propose a constrained uPIT (cuPIT) that addresses this problem by computing a weighted MSE loss using dynamic information (i.e., delta and acceleration). The weighted loss ensures the temporal continuity of output frames belonging to the same speaker. Inspired by heuristics (i.e., vocal tract continuity) in computational auditory scene analysis, we then extend the model by adding a Grid LSTM layer, which we name cuPIT-Grid LSTM, to automatically learn temporal and spectral patterns over the input magnitude spectrum simultaneously. Experimental results show relative improvements of 9.6% and 8.5% on the WSJ0-2mix dataset under the closed and open conditions, respectively, compared with the uPIT baseline.
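The loss described in the summary can be illustrated with a minimal sketch. This record does not give the exact formulation, so several details here are assumptions rather than the authors' method: the function names (`delta`, `dynamic_mse`, `upit_loss`), the weight values, and the use of `np.gradient` as a simple stand-in for delta and acceleration features (the paper may well use a regression-based delta). Inputs are assumed to be arrays of shape (speakers, frames, frequency bins).

```python
from itertools import permutations

import numpy as np


def delta(x, axis=1):
    """Approximate frame-to-frame dynamics along the time axis.

    Assumption: a finite difference stands in for delta features.
    """
    return np.gradient(x, axis=axis)


def dynamic_mse(est, ref, weights=(1.0, 1.0, 1.0)):
    """Weighted MSE over static, delta, and acceleration terms.

    The weights are hypothetical; the record does not state their values.
    """
    w_static, w_delta, w_accel = weights
    d_est, d_ref = delta(est), delta(ref)
    a_est, a_ref = delta(d_est), delta(d_ref)
    return (
        w_static * np.mean((est - ref) ** 2)
        + w_delta * np.mean((d_est - d_ref) ** 2)
        + w_accel * np.mean((a_est - a_ref) ** 2)
    )


def upit_loss(est, ref, weights=(1.0, 0.0, 0.0)):
    """Utterance-level PIT: pick the speaker permutation with the lowest loss.

    With weights=(1, 0, 0) this reduces to a plain MSE over the whole
    utterance; non-zero delta/acceleration weights give a cuPIT-style
    constrained loss under the assumptions stated above.
    """
    n_spk = est.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(n_spk)):
        # Reorder the estimated streams and score the whole utterance at once.
        loss = dynamic_mse(est[list(perm)], ref, weights)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because the permutation is chosen once per utterance, each output stream stays assigned to a single speaker; the delta and acceleration terms additionally penalize estimates whose frame-to-frame trajectory departs from that of the target. Calling `upit_loss(est, ref, weights=(1.0, 0.5, 0.5))` returns the minimum loss and the permutation that achieves it.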

Similar Items