Input features for deep learning-based polyphonic sound event localization and detection
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Online Access: https://hdl.handle.net/10356/168245
Institution: Nanyang Technological University
Summary: Sound event localization and detection (SELD) is an emerging research topic that combines the tasks of sound event detection (SED) and direction-of-arrival estimation (DOAE). The SELD task aims to jointly recognize the sound classes and estimate the directions of arrival (DOAs) and the temporal activities of detected sound events. SELD has many acoustic sensing and monitoring applications, such as urban sound sensing, surveillance, and wildlife monitoring, and is useful for context-aware devices such as hearing aids, smartphones, autonomous vehicles, and robots.
The most successful approach to SELD so far has been deep learning. Because source localization is required, SELD typically takes multichannel audio input from a microphone array. Input features to deep SELD networks generally consist of spectral and spatial features stacked along the channel dimension. Examples of spectral features are magnitude and log-mel spectrograms. Different types of microphone arrays, in turn, offer different spatial features: common spatial features for the first-order ambisonics (FOA) and multichannel microphone array (MIC) formats are the intensity vector (IV) and generalized cross-correlation with phase transform (GCC-PHAT), respectively.
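GCC-PHAT, mentioned above as a common spatial feature for the MIC format, whitens the cross-power spectrum so that only phase (i.e., time-difference) information remains. A minimal NumPy sketch of the technique (function name and parameters are illustrative, not taken from the thesis):

```python
import numpy as np

def gcc_phat(x, y, n_fft=512):
    """Generalized cross-correlation with phase transform (GCC-PHAT).

    Whitens the cross-power spectrum so only phase information remains;
    the peak lag of the result estimates the time difference of arrival
    between the two channels.
    """
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12           # phase transform: unit magnitude
    cc = np.fft.irfft(R, n=n_fft)
    return np.fft.fftshift(cc)       # put zero lag at the centre

# Toy usage: y is x circularly shifted by 5 samples, so the correlation
# peaks at a 5-sample lag (sign depends on the correlation convention).
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
y = np.roll(x, 5)
cc = gcc_phat(x, y)
lag = int(np.argmax(cc)) - len(cc) // 2
```

In SELD pipelines, such correlations are typically computed per STFT frame and truncated to a small lag range before being stacked with the spectral features.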
The SED and DOAE subtasks require rather different information from the same multichannel audio input. While SED mainly relies on the spectrotemporal patterns of the spectral features to distinguish sound classes, DOAE primarily relies on the amplitude and/or phase differences between microphones to estimate the source directions. As a result, it is challenging to extract features that are effective for the joint SELD task. This thesis proposes two novel input features for two different SELD approaches to improve the performance of deep learning-based SELD models.
The first input feature is the Short-time Spatial Histogram (SSH), which indicates the presence of a sound source in each direction at each time instant. The SSH is produced by our proposed single-source histogram analysis algorithm for DOAE. The SSH feature is helpful for both DOAE and SELD because it contains concise directional information and is robust to noise, reverberation, and multiple sources. This thesis proposes a two-step method for SELD, in which the SELD task is divided into SED, DOAE, and temporal matching subtasks. In the first step, the SED and DOAE subtasks are optimized independently to maximize each subtask's performance and to reduce unwanted associations between sound classes and DOAs caused by small training sets. The SED outputs are the class probabilities, while the DOAE outputs are the SSHs. In the second step, a temporal matching module, named the Sequence Matching Network (SMN), learns the temporal activity patterns in the SED and DOAE output sequences to associate the estimated DOAs with the corresponding sound classes of the detected sound events.
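To illustrate the general idea of a direction histogram (the thesis's single-source histogram analysis algorithm is more involved than this), one can accumulate per-frame azimuth estimates from time-frequency bins judged to be single-source into a short-time histogram. All names and shapes below are assumptions for illustration:

```python
import numpy as np

def short_time_spatial_histogram(frame_doas, n_bins=72):
    """Illustrative sketch, not the thesis algorithm: build a per-frame
    azimuth histogram (5-degree bins over [-180, 180]) from DOA estimates
    of time-frequency bins assumed to be dominated by a single source."""
    edges = np.linspace(-180.0, 180.0, n_bins + 1)
    return np.stack([np.histogram(d, bins=edges)[0] for d in frame_doas])

# Two frames: one with a source near 10 degrees, one near -90 degrees.
ssh = short_time_spatial_histogram([np.array([10.0, 12.0, 11.0]),
                                    np.array([-90.0])])
```

A peak in a frame's histogram then marks an active source direction for that frame, which is the kind of concise, per-frame directional evidence the SMN consumes.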
The second input feature, the Spatial cue-Augmented Log-SpectrogrAm (SALSA), is developed for end-to-end SELD learning. The SALSA feature consists of multichannel log-magnitude linear-frequency spectrograms stacked with the normalized principal eigenvector of the spatial covariance matrix at each time-frequency bin. The feature has an exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The principal eigenvector provides helpful spatial cues and can be normalized differently depending on the microphone array format to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable to different microphone array formats such as FOA and MIC. For the fast feature extraction required by real-time applications, this thesis also proposes a computationally efficient variant of SALSA for the MIC format, called SALSA-Lite, which uses frequency-normalized phase information from the complex spectrograms instead of the principal eigenvectors.
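The SALSA-Lite idea, frequency-normalized interchannel phase differences stacked with log spectrograms, can be sketched as follows. Shapes, names, and the reference-channel choice are assumptions for illustration, not the thesis's exact implementation:

```python
import numpy as np

def salsa_lite_features(stft, freqs, c=343.0):
    """Sketch of SALSA-Lite-style features for a microphone array (MIC format).

    stft  : complex STFT, shape (n_channels, n_frames, n_bins)
    freqs : bin centre frequencies in Hz, shape (n_bins,)
    Returns log-power spectrograms stacked with frequency-normalized
    interchannel phase differences (NIPD) relative to channel 0.
    """
    log_spec = np.log(np.abs(stft) ** 2 + 1e-12)        # spectral features
    phase = np.angle(stft[1:] * np.conj(stft[:1]))      # phase re channel 0
    # Dividing the phase by 2*pi*f/c makes the cue (approximately) a
    # frequency-independent path-length difference in metres.
    nipd = c * phase / (2 * np.pi * np.maximum(freqs, 1.0))
    return np.concatenate([log_spec, nipd], axis=0)

# Toy usage: 4-channel array, 10 frames, 257 frequency bins.
rng = np.random.default_rng(0)
stft = rng.standard_normal((4, 10, 257)) + 1j * rng.standard_normal((4, 10, 257))
freqs = np.linspace(0.0, 8000.0, 257)
feat = salsa_lite_features(stft, freqs)   # 4 spectral + 3 spatial channels
```

Because it replaces the per-bin eigendecomposition with an elementwise phase computation, this style of feature is far cheaper to compute, which is the motivation for SALSA-Lite stated above.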
Experimental results on self-collected datasets showed that SSH was effective for deep learning-based DOAE. Experimental results on public datasets showed that the two-step method and the deep models trained on the SALSA and SALSA-Lite features achieved similar or better performance than many state-of-the-art (SOTA) SELD systems in 2020 and 2021, respectively. In addition, SALSA-Lite was 30 times faster to compute than SALSA. An ensemble of the two-step method and an ensemble of deep models trained on the SALSA features ranked second in the team category of the Detection and Classification of Acoustic Scenes and Events (DCASE) SELD challenges in 2020 and 2021, respectively.