SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection

Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and...

Bibliographic Details
Main Authors:	Nguyen, Thi Ngoc Tho; Watcharasupat, Karn N.; Nguyen, Ngoc Khanh; Jones, Douglas L.; Gan, Woon-Seng
Other Authors:	School of Electrical and Electronic Engineering
Format:	Article
Language:	English
Published:	2022
Subjects:	Engineering::Electrical and electronic engineering; Deep Learning; Microphone Array; Feature Extraction; Sound Event Localization and Detection; Spatial Cues
Online Access:	https://hdl.handle.net/10356/157118
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-157118
record_format	dspace
spelling	sg-ntu-dr.10356-157118 2022-06-06T01:35:38Z SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection Nguyen, Thi Ngoc Tho Watcharasupat, Karn N. Nguyen, Ngoc Khanh Jones, Douglas L. Gan, Woon-Seng School of Electrical and Electronic Engineering Centre for Infocomm Technology (INFINITUS) Engineering::Electrical and electronic engineering Deep Learning Microphone Array Feature Extraction Sound Event Localization and Detection Spatial Cues Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable for different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6 % each, compared to the multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased F1 score and localization recall by 16 % and 7 %, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra. Ministry of Education (MOE) Nanyang Technological University Submitted/Accepted version This work was supported in part by the Singapore Ministry of Education Academic Research Fund Tier-2, under Research Grant MOE2017-T2-2-060, and in part by Google Cloud Research Credits Program under Award GCP205559654. K. N. Watcharasupat further acknowledges the support from the CN Yang Scholars Programme, Nanyang Technological University, Singapore. 2022-06-06T01:35:38Z 2022-06-06T01:35:38Z 2022 Journal Article Nguyen, T. N. T., Watcharasupat, K. N., Nguyen, N. K., Jones, D. L. & Gan, W. (2022). SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1749-1762. https://dx.doi.org/10.1109/TASLP.2022.3173054 2329-9290 https://hdl.handle.net/10356/157118 10.1109/TASLP.2022.3173054 30 1749 1762 en MOE2017-T2-2-060 GCP205559654 IEEE/ACM Transactions on Audio, Speech, and Language Processing © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/TASLP.2022.3173054. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Electrical and electronic engineering Deep Learning Microphone Array Feature Extraction Sound Event Localization and Detection Spatial Cues
spellingShingle	Engineering::Electrical and electronic engineering Deep Learning Microphone Array Feature Extraction Sound Event Localization and Detection Spatial Cues Nguyen, Thi Ngoc Tho Watcharasupat, Karn N. Nguyen, Ngoc Khanh Jones, Douglas L. Gan, Woon-Seng SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
description	Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable for different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6 % each, compared to the multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased F1 score and localization recall by 16 % and 7 %, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.
author2	School of Electrical and Electronic Engineering
author_facet	School of Electrical and Electronic Engineering Nguyen, Thi Ngoc Tho Watcharasupat, Karn N. Nguyen, Ngoc Khanh Jones, Douglas L. Gan, Woon-Seng
format	Article
author	Nguyen, Thi Ngoc Tho Watcharasupat, Karn N. Nguyen, Ngoc Khanh Jones, Douglas L. Gan, Woon-Seng
author_sort	Nguyen, Thi Ngoc Tho
title	SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
title_sort	salsa: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
publishDate	2022
url	https://hdl.handle.net/10356/157118
_version_	1735491158599008256
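
The description in this record says the feature stacks multichannel log-spectrograms with a normalized principal eigenvector of the spatial covariance matrix at each time-frequency bin. The following is a minimal NumPy sketch of that idea for a generic microphone array, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names `stft` and `salsa_like_features`, the STFT parameters, the local time averaging used to estimate the covariance (`context`), and the choice to normalize the eigenvector as phase differences relative to the first channel scaled to [-1, 1]. The record does not state the exact normalization used for the FOA or MIC formats.

```python
import numpy as np


def stft(x, n_fft=256, hop=128):
    """Hann-windowed STFT of a (channels, samples) array.

    Returns a complex array of shape (channels, frames, bins).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (x.shape[1] - n_fft) // hop
    frames = np.stack(
        [x[:, i * hop:i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=-1)


def salsa_like_features(x, n_fft=256, hop=128, context=2, eps=1e-8):
    """Stack log-power spectrograms with eigenvector-based phase cues.

    x: real array of shape (M, samples) with M >= 2 channels.
    Returns an array of shape (2 * M - 1, frames, bins).
    """
    X = stft(x, n_fft, hop)                      # (M, T, F)
    n_frames = X.shape[1]
    log_spec = np.log(np.abs(X) ** 2 + eps)      # (M, T, F)

    # Per-bin outer products x x^H, shape (T, F, M, M).
    Xtf = np.transpose(X, (1, 2, 0))
    outer = Xtf[..., :, None] * np.conj(Xtf[..., None, :])

    # Assumed covariance estimate: average over a few neighbouring frames.
    cov = np.zeros_like(outer)
    for t in range(n_frames):
        lo, hi = max(0, t - context), min(n_frames, t + context + 1)
        cov[t] = outer[lo:hi].mean(axis=0)

    # eigh sorts eigenvalues in ascending order, so the last column
    # holds the principal eigenvector of each Hermitian matrix.
    _, vecs = np.linalg.eigh(cov)
    u = vecs[..., :, -1]                         # (T, F, M)

    # Assumed normalization: phase of each channel relative to channel 0,
    # scaled from (-pi, pi] to (-1, 1].
    ipd = np.angle(np.conj(u[..., :1]) * u[..., 1:]) / np.pi
    spatial = np.transpose(ipd, (2, 0, 1))       # (M - 1, T, F)

    return np.concatenate([log_spec, spatial], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal((4, 2048))       # synthetic 4-channel input
    feats = salsa_like_features(audio)
    print(feats.shape)
```

Because the spectral channels and the spatial channels share the same frame and bin grid, each spatial value lines up with the signal power at the same time-frequency bin, which is the alignment property the description emphasizes.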

Similar Items