NSE-CATNet: deep neural speech enhancement using convolutional attention transformer network
Speech enhancement (SE) is a critical aspect of various speech-processing applications. Recent research in this field focuses on identifying effective ways to capture the long-term contextual dependencies of speech signals to enhance performance. Deep convolutional networks (DCN) using self-attention and the Transformer model have demonstrated competitive results in SE. Transformer models with convolution layers can capture both short- and long-term temporal sequences by leveraging multi-head self-attention, which allows the model to attend to the entire sequence. This study proposes a neural speech enhancement (NSE) model, named NSE-CATNet, built from a convolutional encoder-decoder (CED) and a convolutional attention Transformer (CAT). To effectively process the time-frequency (T-F) distribution of spectral components in speech signals, a T-F attention module is incorporated into the convolutional Transformer model. This module enables the model to explicitly leverage position information and generate a two-dimensional attention map over the time-frequency distribution of speech. The performance of the proposed SE model is evaluated with objective speech quality and intelligibility metrics on two datasets, the VoiceBank-DEMAND corpus and the LibriSpeech dataset. The experimental results indicate that the proposed model outperforms competitive baselines at -5 dB, 0 dB, and 5 dB SNR: it improves overall quality by 0.704 on VoiceBank-DEMAND and by 0.692 on LibriSpeech, and improves intelligibility by 11.325% and 11.75%, respectively, over the noisy speech signals.
Saved in:
Main Authors: | Saleem, Nasir; Gunawan, Teddy Surya; Kartiwi, Mira; Nugroho, Bambang Setia; Wijayanto, Inung |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023 |
Subjects: | TK7885 Computer engineering |
Online Access: | http://irep.iium.edu.my/106019/7/106019_NSE-CATNet%20deep%20neural%20speech%20enhancement.pdf http://irep.iium.edu.my/106019/8/106019_NSE-CATNet%20deep%20neural%20speech%20enhancement_Scopus.pdf http://irep.iium.edu.my/106019/ https://ieeexplore.ieee.org/abstract/document/10168245 |
Institution: | Universiti Islam Antarabangsa Malaysia |
Language: | English |
id
my.iium.irep.106019
record_format
dspace
spelling
Saleem, Nasir and Gunawan, Teddy Surya and Kartiwi, Mira and Nugroho, Bambang Setia and Wijayanto, Inung (2023) NSE-CATNet: deep neural speech enhancement using convolutional attention transformer network. IEEE Access, 11. pp. 66979-66994. E-ISSN 2169-3536. DOI 10.1109/ACCESS.2023.3290908. Published by IEEE, 2023-06-29; peer-reviewed article, PDF, English. Subject: TK7885 Computer engineering. Repository record: http://irep.iium.edu.my/106019/ (2023-08-17T06:41:06Z). Full texts: http://irep.iium.edu.my/106019/7/106019_NSE-CATNet%20deep%20neural%20speech%20enhancement.pdf and http://irep.iium.edu.my/106019/8/106019_NSE-CATNet%20deep%20neural%20speech%20enhancement_Scopus.pdf. Publisher version: https://ieeexplore.ieee.org/abstract/document/10168245
institution |
Universiti Islam Antarabangsa Malaysia |
building |
IIUM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
International Islamic University Malaysia |
content_source |
IIUM Repository (IREP) |
url_provider |
http://irep.iium.edu.my/ |
language |
English
topic |
TK7885 Computer engineering |
description |
Speech enhancement (SE) is a critical aspect of various speech-processing applications. Recent research in this field focuses on identifying effective ways to capture the long-term contextual dependencies of speech signals to enhance performance. Deep convolutional networks (DCN) using self-attention and the Transformer model have demonstrated competitive results in SE. Transformer models with convolution layers can capture both short- and long-term temporal sequences by leveraging multi-head self-attention, which allows the model to attend to the entire sequence. This study proposes a neural speech enhancement (NSE) model, named NSE-CATNet, built from a convolutional encoder-decoder (CED) and a convolutional attention Transformer (CAT). To effectively process the time-frequency (T-F) distribution of spectral components in speech signals, a T-F attention module is incorporated into the convolutional Transformer model. This module enables the model to explicitly leverage position information and generate a two-dimensional attention map over the time-frequency distribution of speech. The performance of the proposed SE model is evaluated with objective speech quality and intelligibility metrics on two datasets, the VoiceBank-DEMAND corpus and the LibriSpeech dataset. The experimental results indicate that the proposed model outperforms competitive baselines at -5 dB, 0 dB, and 5 dB SNR: it improves overall quality by 0.704 on VoiceBank-DEMAND and by 0.692 on LibriSpeech, and improves intelligibility by 11.325% and 11.75%, respectively, over the noisy speech signals.
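The abstract describes a convolutional attention Transformer block in which convolution layers capture short-term structure, multi-head self-attention captures long-term dependencies across frames, and a T-F attention module produces a two-dimensional attention map over the time-frequency plane. Since the full paper is not reproduced in this record, the PyTorch sketch below is only a plausible reading of that description; the module names (`TFAttention`, `CATBlock`), layer sizes, and wiring are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of a T-F attention module and a convolutional attention
# Transformer (CAT) block in the spirit of the abstract. All dimensions
# and the exact wiring are assumptions for illustration.
import torch
import torch.nn as nn


class TFAttention(nn.Module):
    """Builds a 2-D attention map over the time and frequency axes of a
    spectrogram-like feature map of shape (batch, channels, time, freq)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Time branch: pool over frequency, attend along the time axis.
        self.time_branch = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )
        # Frequency branch: pool over time, attend along the frequency axis.
        self.freq_branch = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, F)
        t_attn = self.time_branch(x.mean(dim=3))  # (B, C, T)
        f_attn = self.freq_branch(x.mean(dim=2))  # (B, C, F)
        # Broadcasting the two branches yields a 2-D map over the T-F plane.
        attn = self.sigmoid(t_attn.unsqueeze(3) + f_attn.unsqueeze(2))  # (B, C, T, F)
        return x * attn


class CATBlock(nn.Module):
    """Convolutional attention Transformer block: depthwise convolution for
    local (short-term) context, multi-head self-attention over frames for
    long-term context, followed by T-F attention."""

    def __init__(self, channels: int, freq_bins: int, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
        )
        embed_dim = channels * freq_bins
        self.norm = nn.LayerNorm(embed_dim)
        self.mhsa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.tf_attention = TFAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, F)
        x = x + self.conv(x)                                 # local context
        b, c, t, f = x.shape
        seq = self.norm(x.permute(0, 2, 1, 3).reshape(b, t, c * f))  # frames as tokens
        attn_out, _ = self.mhsa(seq, seq, seq)               # attend over all frames
        x = x + attn_out.reshape(b, t, c, f).permute(0, 2, 1, 3)
        return self.tf_attention(x)


if __name__ == "__main__":
    # Toy shape check: 64 frames x 64 frequency bins, 16 feature channels.
    features = torch.randn(2, 16, 64, 64)       # (batch, channels, time, freq)
    block = CATBlock(channels=16, freq_bins=64)
    print(block(features).shape)                 # torch.Size([2, 16, 64, 64])
```

Summing a time branch and a frequency branch and broadcasting the result is one inexpensive way to obtain a full 2-D T-F map; the published model may construct its attention map differently, and the encoder-decoder (CED) surrounding such blocks is omitted here.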
format |
Article |
author |
Saleem, Nasir Gunawan, Teddy Surya Kartiwi, Mira Nugroho, Bambang Setia Wijayanto, Inung |
title |
NSE-CATNet: deep neural speech enhancement using convolutional attention transformer network |
publisher |
IEEE |
publishDate |
2023 |
url |
http://irep.iium.edu.my/106019/7/106019_NSE-CATNet%20deep%20neural%20speech%20enhancement.pdf http://irep.iium.edu.my/106019/8/106019_NSE-CATNet%20deep%20neural%20speech%20enhancement_Scopus.pdf http://irep.iium.edu.my/106019/ https://ieeexplore.ieee.org/abstract/document/10168245 |