SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
Sound event localization and detection (SELD) consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions.
Main Authors: | Nguyen, Thi Ngoc Tho; Watcharasupat, Karn N.; Nguyen, Ngoc Khanh; Jones, Douglas L.; Gan, Woon-Seng |
---|---|
Other Authors: | School of Electrical and Electronic Engineering |
Format: | Article |
Language: | English |
Published: | 2022 |
Subjects: | Engineering::Electrical and electronic engineering; Deep Learning; Microphone Array; Feature Extraction; Sound Event Localization and Detection; Spatial Cues |
Online Access: | https://hdl.handle.net/10356/157118 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-157118 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-157118 (2022-06-06T01:35:38Z)
SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
Authors: Nguyen, Thi Ngoc Tho; Watcharasupat, Karn N.; Nguyen, Ngoc Khanh; Jones, Douglas L.; Gan, Woon-Seng
Affiliations: School of Electrical and Electronic Engineering; Centre for Infocomm Technology (INFINITUS)
Subjects: Engineering::Electrical and electronic engineering; Deep Learning; Microphone Array; Feature Extraction; Sound Event Localization and Detection; Spatial Cues
Abstract: Sound event localization and detection (SELD) consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable to different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6% each, compared to multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased the F1 score and localization recall by 16% and 7%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.
Funding: Ministry of Education (MOE); Nanyang Technological University. Submitted/Accepted version. This work was supported in part by the Singapore Ministry of Education Academic Research Fund Tier-2 under Research Grant MOE2017-T2-2-060, and in part by the Google Cloud Research Credits Program under Award GCP205559654. K. N. Watcharasupat further acknowledges the support from the CN Yang Scholars Programme, Nanyang Technological University, Singapore.
Record date: 2022-06-06T01:35:38Z. Published: 2022. Type: Journal Article.
Citation: Nguyen, T. N. T., Watcharasupat, K. N., Nguyen, N. K., Jones, D. L. & Gan, W. (2022). SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1749-1762. https://dx.doi.org/10.1109/TASLP.2022.3173054
ISSN: 2329-9290. Handle: https://hdl.handle.net/10356/157118. DOI: 10.1109/TASLP.2022.3173054. Volume 30, pp. 1749-1762. Language: en. Grants: MOE2017-T2-2-060; GCP205559654. Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Rights: © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/TASLP.2022.3173054. File format: application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Electrical and electronic engineering; Deep Learning; Microphone Array; Feature Extraction; Sound Event Localization and Detection; Spatial Cues |
description |
Sound event localization and detection (SELD) consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable to different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6% each, compared to multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased the F1 score and localization recall by 16% and 7%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra. |
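To make the stacking described in the abstract concrete, the following is a minimal, illustrative sketch of the general idea for a MIC-style array, not the authors' released implementation: the function name `salsa_like_features` and all parameters are hypothetical, the spatial covariance here is a rank-1 per-bin estimate rather than the paper's local average, and the exact FOA/MIC normalizations and noise-floor handling are not reproduced.

```python
# Illustrative sketch only: stack multichannel log-spectrograms with a spatial
# cue taken from the principal eigenvector of the per-bin spatial covariance.
import numpy as np
from scipy.signal import stft

def salsa_like_features(x, fs=24000, n_fft=512, hop=300):
    """x: array of shape (n_channels, n_samples); returns (2*n_ch - 1, n_freq, n_frames)."""
    n_ch = x.shape[0]
    # Multichannel STFT; Zxx has shape (n_ch, n_freq, n_frames).
    _, _, Zxx = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    n_freq, n_frames = Zxx.shape[1], Zxx.shape[2]

    # 1) Multichannel log-power spectrograms.
    log_spec = np.log(np.abs(Zxx) ** 2 + 1e-12)

    # 2) Spatial cues from the principal eigenvector of the spatial covariance
    #    matrix at each time-frequency bin (rank-1 estimate in this sketch).
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    cues = np.zeros((n_ch - 1, n_freq, n_frames))
    for t in range(n_frames):
        for f in range(1, n_freq):                 # skip the DC bin
            v = Zxx[:, f, t]
            R = np.outer(v, v.conj())              # spatial covariance matrix
            _, eigvecs = np.linalg.eigh(R)
            u = eigvecs[:, -1]                     # principal eigenvector
            u = u / (u[0] + 1e-12)                 # reference to channel 0
            # MIC-style cue: inter-channel phase differences, scaled by
            # frequency so they behave like time-difference-of-arrival cues.
            cues[:, f, t] = np.angle(u[1:]) / (2 * np.pi * freqs[f])

    # Stack spectral and spatial channels along the feature axis so each
    # time-frequency bin carries both power and directional information.
    return np.concatenate([log_spec, cues], axis=0)
```

With a four-channel MIC recording this yields a seven-channel feature tensor that could be fed to a SELD network in place of log-mel spectrograms with GCC features, at the cost of a per-bin eigendecomposition.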
author2 |
School of Electrical and Electronic Engineering |
format |
Article |
author |
Nguyen, Thi Ngoc Tho; Watcharasupat, Karn N.; Nguyen, Ngoc Khanh; Jones, Douglas L.; Gan, Woon-Seng |
author_sort |
Nguyen, Thi Ngoc Tho |
title |
SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection |
publishDate |
2022 |
url |
https://hdl.handle.net/10356/157118 |
_version_ |
1735491158599008256 |