NSE-CATNet: deep neural speech enhancement using convolutional attention transformer network

Bibliographic Details
Main Authors:	Saleem, Nasir; Gunawan, Teddy Surya; Kartiwi, Mira; Nugroho, Bambang Setia; Wijayanto, Inung
Format:	Article
Language:	English
Published:	IEEE 2023
Subjects:	TK7885 Computer engineering
Online Access:	http://irep.iium.edu.my/106019/7/106019_NSE-CATNet%20deep%20neural%20speech%20enhancement.pdf http://irep.iium.edu.my/106019/8/106019_NSE-CATNet%20deep%20neural%20speech%20enhancement_Scopus.pdf http://irep.iium.edu.my/106019/ https://ieeexplore.ieee.org/abstract/document/10168245
Description
Summary:	Speech enhancement (SE) is a critical component of many speech-processing applications. Recent research in this field focuses on effective ways to capture the long-term contextual dependencies of speech signals in order to improve performance. Deep convolutional networks (DCN) using self-attention and the Transformer model have demonstrated competitive results in SE. Transformer models with convolution layers can capture both short-term and long-term temporal dependencies by leveraging multi-head self-attention, which allows the model to attend to the entire sequence. This study proposes a neural speech enhancement (NSE) model that combines a convolutional encoder-decoder (CED) with a convolutional attention Transformer (CAT), named NSE-CATNet. To effectively process the time-frequency (T-F) distribution of spectral components in speech signals, a T-F attention module is incorporated into the convolutional Transformer model. This module enables the model to explicitly leverage position information and generate a two-dimensional attention map for the time-frequency speech distribution. The performance of the proposed model is evaluated using objective speech quality and intelligibility metrics on two datasets, the VoiceBank-DEMAND corpus and the LibriSpeech dataset. The experimental results indicate that the proposed model outperforms the competitive baselines in speech enhancement performance at -5 dB, 0 dB, and 5 dB. Overall quality improves by 0.704 on VoiceBank-DEMAND and by 0.692 on LibriSpeech. Further, intelligibility on VoiceBank-DEMAND and LibriSpeech improves by 11.325% and 11.75%, respectively, over the noisy speech signals.
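
The abstract mentions a T-F attention module that produces a two-dimensional attention map over the time-frequency distribution. This record does not describe how that module is built, so the following is only a minimal sketch of the general idea, under stated assumptions: one attention weight per frequency bin and one per time frame are combined by an outer product into a 2-D map. The function names (`tf_attention_map`, `apply_attention`) and the mean-centred sigmoid used in place of learned layers are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Squash a real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tf_attention_map(spec):
    """Build a 2-D attention map for a magnitude spectrogram.

    spec: list of F rows (frequency bins), each a list of T frame magnitudes.
    Returns an F x T list of weights in (0, 1).
    """
    n_freq = len(spec)
    n_time = len(spec[0])

    # Frequency descriptor: average magnitude of each bin over time.
    freq_desc = [sum(row) / n_time for row in spec]
    # Time descriptor: average magnitude of each frame over frequency.
    time_desc = [
        sum(spec[f][t] for f in range(n_freq)) / n_freq for t in range(n_time)
    ]

    # ASSUMPTION: a real module would pass these descriptors through learned
    # layers; here they are only mean-centred and squashed with a sigmoid.
    freq_mean = sum(freq_desc) / n_freq
    time_mean = sum(time_desc) / n_time
    freq_att = [sigmoid(v - freq_mean) for v in freq_desc]
    time_att = [sigmoid(v - time_mean) for v in time_desc]

    # The outer product of the two 1-D attention vectors gives the 2-D map.
    return [[fa * ta for ta in time_att] for fa in freq_att]


def apply_attention(spec, att):
    """Weight every T-F cell of the spectrogram by its attention value."""
    return [
        [s * a for s, a in zip(s_row, a_row)] for s_row, a_row in zip(spec, att)
    ]


# Toy spectrogram: 2 frequency bins x 3 time frames.
toy_spec = [[0.1, 0.2, 0.1], [2.0, 3.0, 2.5]]
toy_map = tf_attention_map(toy_spec)
toy_out = apply_attention(toy_spec, toy_map)
```

In this sketch each weight depends on both the row (frequency) and the column (time) of the cell, which is one simple way a map can carry position information along both axes.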

Similar Items