Input features for deep learning-based polyphonic sound event localization and detection

Bibliographic Details
Main Author: Nguyen, Thi Ngoc Tho
Other Authors: Gan Woon Seng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Electrical and electronic engineering::Electronic systems::Signal processing
Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Online Access:https://hdl.handle.net/10356/168245
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-168245
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering::Electronic systems::Signal processing
Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
spellingShingle Engineering::Electrical and electronic engineering::Electronic systems::Signal processing
Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Nguyen, Thi Ngoc Tho
Input features for deep learning-based polyphonic sound event localization and detection
description Sound event localization and detection (SELD) is an emerging research topic that combines the tasks of sound event detection (SED) and direction-of-arrival estimation (DOAE). The SELD task aims to jointly recognize the sound classes and estimate the directions of arrival (DOAs) and the temporal activities of detected sound events. SELD has many acoustic sensing and monitoring applications, such as urban sound sensing, surveillance, and wildlife monitoring, as well as context-aware devices such as hearing aids, smartphones, autonomous vehicles, and robots. The most successful methods for SELD to date have been based on deep learning. Because source localization requires spatial information, SELD typically takes multichannel audio input from a microphone array. Input features to deep SELD networks generally consist of spectral and spatial features stacked along the channel dimension. Examples of spectral features are the magnitude and log-mel spectrograms, while different types of microphone arrays offer different spatial features: common choices are the intensity vector (IV) for the first-order ambisonics (FOA) format and generalized cross-correlation with phase transform (GCC-PHAT) for the multichannel microphone array (MIC) format. The SED and DOAE subtasks require rather different information from the same multichannel audio input. While SED mainly relies on the spectrotemporal patterns of the spectral features to distinguish sound classes, DOAE primarily relies on the amplitude and/or phase differences between microphones to estimate source directions. As a result, it is challenging to extract effective features for the SELD task.

This thesis proposes two novel input features for two different SELD approaches to improve the performance of deep learning-based SELD models. The first input feature is a Short-time Spatial Histogram (SSH) that indicates the presence of a sound source in each direction at each time instant. The SSH is produced by the proposed single-source histogram analysis algorithm for DOAE. The SSH feature is helpful for both DOAE and SELD because it contains concise directional information and is robust to noise, reverberation, and multiple sources. Building on the SSH, the thesis proposes a two-step method for SELD, in which the SELD task is divided into SED, DOAE, and temporal matching subtasks. In the first step, the SED and DOAE subtasks are optimized independently, both to maximize each subtask's performance and to reduce the unwanted associations between sound classes and DOAs that arise from small training sets. The SED outputs are class probabilities, while the DOAE outputs are SSHs. In the second step, a temporal matching module, the Sequence Matching Network (SMN), learns the temporal activities in the SED and DOAE output sequences to associate the estimated DOAs with the corresponding sound classes of the detected sound events.
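As a rough illustration of the histogram construction (a minimal sketch, not the thesis's actual single-source analysis algorithm), the Python snippet below accumulates per-bin azimuth estimates from single-source time-frequency bins into a short-time histogram; the azimuth-only grid, the segment length, and the bin width are assumptions made for the example, and the per-bin DOA estimator and single-source test are placeholders.

    import numpy as np

    def short_time_spatial_histogram(azimuth_tf, single_source_mask,
                                     frames_per_segment=16, az_bin_deg=10):
        """azimuth_tf: (T, F) per-bin azimuth estimates in degrees, in [-180, 180).
        single_source_mask: (T, F) boolean, True where one source dominates the bin.
        Returns an (n_segments, n_az_bins) histogram of directional evidence."""
        n_az_bins = 360 // az_bin_deg
        n_segments = azimuth_tf.shape[0] // frames_per_segment
        ssh = np.zeros((n_segments, n_az_bins))
        for s in range(n_segments):
            t0 = s * frames_per_segment
            t1 = t0 + frames_per_segment
            # Keep only bins that passed the (placeholder) single-source test.
            az = azimuth_tf[t0:t1][single_source_mask[t0:t1]]
            idx = ((az + 180.0) // az_bin_deg).astype(int) % n_az_bins
            np.add.at(ssh[s], idx, 1.0)  # accumulate counts per azimuth bin
        return ssh

In the two-step method, sequences of such histograms (the DOAE outputs) together with the SED class-probability sequences would then be fed to the Sequence Matching Network for temporal matching.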
The second input feature, the Spatial cue-Augmented Log-SpectrogrAm (SALSA), was developed for end-to-end SELD learning. The SALSA feature consists of multichannel log-magnitude linear-frequency spectrograms stacked with the normalized principal eigenvector of the spatial covariance matrix at each time-frequency bin. The feature has an exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The principal eigenvector provides helpful spatial cues and can be normalized differently depending on the microphone array format to extract the amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable to different microphone array formats such as FOA and MIC. For the fast feature processing required by real-time applications, this thesis also proposes SALSA-Lite, a computationally efficient variant of SALSA for the MIC format that uses frequency-normalized phase information from the complex spectrograms instead of the principal eigenvectors.

Experimental results on self-collected datasets showed that the SSH was effective for deep learning-based DOAE. Experimental results on public datasets showed that the two-step method and deep models trained on the SALSA and SALSA-Lite features achieved similar or better performance than many state-of-the-art (SOTA) SELD systems in 2020 and 2021, respectively. In addition, SALSA-Lite was 30 times faster to compute than SALSA. An ensemble of the two-step method and an ensemble of deep models trained on the SALSA features ranked second in the team category of the Detection and Classification of Acoustic Scenes and Events (DCASE) SELD challenges in 2020 and 2021, respectively.
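To make the SALSA construction concrete, here is a minimal numpy sketch for the MIC format, following the description above: a lightly time-smoothed spatial covariance per time-frequency bin, its principal eigenvector referenced to channel 0, and inter-channel phase normalized by frequency. The smoothing window, STFT layout, and exact normalization constants are assumptions for the example, and the thesis's FOA variant (which extracts amplitude cues instead) is not shown.

    import numpy as np

    def salsa_features_mic(stft, fs):
        """stft: complex STFT, shape (M, T, F), channel 0 as reference;
        the F bins are assumed to span 0 .. fs/2. Returns a (2M-1, T, F)
        stack of M log-power spectrograms and M-1 spatial features."""
        M, T, F = stft.shape
        log_power = np.log(np.abs(stft) ** 2 + 1e-12)      # spectral part, (M, T, F)

        x = stft.transpose(1, 2, 0)                        # (T, F, M)
        outer = x[..., :, None] * x[..., None, :].conj()   # per-bin outer products
        # Average over +/-1 neighboring frames so the covariance is not rank-1.
        cov = np.stack([outer[max(0, t - 1):t + 2].mean(axis=0) for t in range(T)])

        _, vecs = np.linalg.eigh(cov)                      # eigenvalues ascending
        v = vecs[..., -1]                                  # principal eigenvector, (T, F, M)
        v = v / (v[..., :1] + 1e-12)                       # phase-reference to channel 0

        # Divide inter-channel phase by frequency so the spatial cue becomes
        # approximately frequency-independent (TDOA-like), per the description.
        freqs = np.maximum(np.linspace(0.0, fs / 2, F), 1.0)  # avoid divide-by-zero at DC
        spatial = np.angle(v[..., 1:]) / (2 * np.pi * freqs)[None, :, None]
        return np.concatenate([log_power, spatial.transpose(2, 0, 1)], axis=0)

SALSA-Lite, as described above, skips the per-bin eigendecomposition and instead applies the same frequency normalization to the phase of the complex spectrograms directly (e.g., np.angle(stft[1:] * stft[:1].conj()) in the sketch's notation), which is what makes it roughly 30 times cheaper to compute.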
author2 Gan Woon Seng
author_facet Gan Woon Seng
Nguyen, Thi Ngoc Tho
format Thesis-Doctor of Philosophy
author Nguyen, Thi Ngoc Tho
author_sort Nguyen, Thi Ngoc Tho
title Input features for deep learning-based polyphonic sound event localization and detection
title_short Input features for deep learning-based polyphonic sound event localization and detection
title_full Input features for deep learning-based polyphonic sound event localization and detection
title_fullStr Input features for deep learning-based polyphonic sound event localization and detection
title_full_unstemmed Input features for deep learning-based polyphonic sound event localization and detection
title_sort input features for deep learning-based polyphonic sound event localization and detection
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/168245
_version_ 1772826141439033344
spelling sg-ntu-dr.10356-168245 2023-07-04T17:03:50Z Input features for deep learning-based polyphonic sound event localization and detection Nguyen, Thi Ngoc Tho Gan Woon Seng School of Electrical and Electronic Engineering Digital Signal Processing Laboratory EWSGAN@ntu.edu.sg Engineering::Electrical and electronic engineering::Electronic systems::Signal processing Engineering::Computer science and engineering::Computing methodologies::Pattern recognition Doctor of Philosophy 2023-05-23T06:30:41Z 2023-05-23T06:30:41Z 2023 Thesis-Doctor of Philosophy Nguyen, T. N. T. (2023). Input features for deep learning-based polyphonic sound event localization and detection. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/168245 https://hdl.handle.net/10356/168245 10.32657/10356/168245 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University