SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection

Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and...

Bibliographic Details
Main Authors: Nguyen, Thi Ngoc Tho, Watcharasupat, Karn N., Nguyen, Ngoc Khanh, Jones, Douglas L., Gan, Woon-Seng
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language: English
Published: 2022
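Subjects: Engineering::Electrical and electronic engineering; Deep Learning; Microphone Array; Feature Extraction; Sound Event Localization and Detection; Spatial Cues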
Online Access: https://hdl.handle.net/10356/157118
Institution: Nanyang Technological University
id sg-ntu-dr.10356-157118
record_format dspace
spelling sg-ntu-dr.10356-157118 2022-06-06T01:35:38Z
SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
Nguyen, Thi Ngoc Tho
Watcharasupat, Karn N.
Nguyen, Ngoc Khanh
Jones, Douglas L.
Gan, Woon-Seng
School of Electrical and Electronic Engineering
Centre for Infocomm Technology (INFINITUS)
Engineering::Electrical and electronic engineering
Deep Learning
Microphone Array
Feature Extraction
Sound Event Localization and Detection
Spatial Cues
Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable for different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6 % each, compared to the multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased F1 score and localization recall by 16 % and 7 %, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.
Ministry of Education (MOE)
Nanyang Technological University
Submitted/Accepted version
This work was supported in part by the Singapore Ministry of Education Academic Research Fund Tier-2, under Research Grant MOE2017-T2-2-060, and in part by Google Cloud Research Credits Program under Award GCP205559654. K. N. Watcharasupat further acknowledges the support from the CN Yang Scholars Programme, Nanyang Technological University, Singapore.
2022-06-06T01:35:38Z
2022-06-06T01:35:38Z
2022
Journal Article
Nguyen, T. N. T., Watcharasupat, K. N., Nguyen, N. K., Jones, D. L. & Gan, W. (2022). SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1749-1762. https://dx.doi.org/10.1109/TASLP.2022.3173054
2329-9290
https://hdl.handle.net/10356/157118
10.1109/TASLP.2022.3173054
30
1749
1762
en
MOE2017-T2-2-060
GCP205559654
IEEE/ACM Transactions on Audio, Speech, and Language Processing
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/TASLP.2022.3173054.
application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
Deep Learning
Microphone Array
Feature Extraction
Sound Event Localization and Detection
Spatial Cues
description Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable for different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6 % each, compared to the multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased F1 score and localization recall by 16 % and 7 %, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.
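The feature construction summarized in the description can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`stft`, `salsa_like_features`), the STFT settings, the number of neighbouring frames averaged into the covariance estimate (`context`), and the choice to normalize the eigenvector by its first element and keep only the relative phase are all assumptions made for illustration. The record only states that the normalization depends on the array format.

```python
import numpy as np


def stft(x, n_fft=256, hop=128):
    """Hann-windowed STFT. x: (channels, samples) -> (channels, frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (x.shape[1] - n_fft) // hop
    frames = np.stack(
        [x[:, i * hop : i * hop + n_fft] for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames * window, axis=-1)


def salsa_like_features(x, n_fft=256, hop=128, context=2, eps=1e-8):
    """Stack log-power spectrograms with eigenvector-based phase cues.

    Hypothetical helper: returns an array of shape (2*M - 1, frames, bins)
    for an M-channel input, i.e. M log-spectrograms plus M - 1 cue channels.
    """
    X = stft(x, n_fft, hop)                      # (M, T, F), complex
    T = X.shape[1]
    log_spec = np.log(np.abs(X) ** 2 + eps)      # (M, T, F)

    # Per-bin outer products x x^H, arranged as (T, F, M, M).
    Xt = np.transpose(X, (1, 2, 0))
    outer = Xt[..., :, None] * np.conj(Xt[..., None, :])

    # Assumed estimator: average the outer products over neighbouring frames
    # to obtain a spatial covariance matrix for every time-frequency bin.
    cov = np.zeros_like(outer)
    for t in range(T):
        lo, hi = max(0, t - context), min(T, t + context + 1)
        cov[t] = outer[lo:hi].mean(axis=0)

    # eigh sorts eigenvalues in ascending order, so the last eigenvector
    # is the principal one.
    _, vecs = np.linalg.eigh(cov)                # (T, F, M, M)
    principal = vecs[..., :, -1]                 # (T, F, M)

    # Assumed normalization: phase of each element relative to the first
    # element, which removes the arbitrary phase of the eigenvector.
    rel = principal[..., 1:] * np.conj(principal[..., :1])
    spatial = np.transpose(np.angle(rel), (2, 0, 1))   # (M-1, T, F)

    # Both parts share the same (frame, bin) grid, so each cue value lines up
    # with the signal power at the same time-frequency bin.
    return np.concatenate([log_spec, spatial], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal((4, 4096))       # synthetic 4-channel signal
    print(salsa_like_features(audio).shape)
```

Because the cue channels are computed on the same time-frequency grid as the log-spectrograms, a downstream network can read signal power and directional cue from the same position in the feature map, which is the alignment property the description emphasizes.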
author2 School of Electrical and Electronic Engineering
author_facet School of Electrical and Electronic Engineering
Nguyen, Thi Ngoc Tho
Watcharasupat, Karn N.
Nguyen, Ngoc Khanh
Jones, Douglas L.
Gan, Woon-Seng
format Article
author Nguyen, Thi Ngoc Tho
Watcharasupat, Karn N.
Nguyen, Ngoc Khanh
Jones, Douglas L.
Gan, Woon-Seng
author_sort Nguyen, Thi Ngoc Tho
title SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
title_sort salsa: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
publishDate 2022
url https://hdl.handle.net/10356/157118
_version_ 1735491158599008256