Autonomous soundscape augmentation with multimodal fusion of visual and participant-linked inputs

Autonomous soundscape augmentation systems typically use trained models to pick optimal maskers to effect a desired perceptual change. While acoustic information is paramount to such systems, contextual information, including participant demographics and the visual environment, also influences acoustic perception. Hence, we propose modular modifications to an existing attention-based deep neural network, to allow early, mid-level, and late feature fusion of participant-linked, visual, and acoustic features. Ablation studies on module configurations and corresponding fusion methods using the ARAUS dataset show that contextual features improve the model performance in a statistically significant manner on the normalized ISO Pleasantness, to a mean squared error of 0.1194±0.0012 for the best-performing all-modality model, against 0.1217±0.0009 for the audio-only model. Soundscape augmentation systems can thereby leverage multimodal inputs for improved performance. We also investigate the impact of individual participant-linked factors using trained models to illustrate improvements in model explainability.
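The abstract mentions early, mid-level, and late fusion of acoustic, visual, and participant-linked features. The sketch below is a minimal illustration of what those three fusion points can look like in a generic regression network; the module names, feature dimensions, and the output-averaging used for late fusion are assumptions made for illustration only and do not reproduce the paper's attention-based architecture.

```python
# Illustrative sketch only: early, mid-level, and late fusion of acoustic,
# visual, and participant-linked features in a toy PyTorch regressor.
# Layer sizes, module names, and the averaging-based late fusion are
# hypothetical, not the architecture evaluated in the paper.
import torch
import torch.nn as nn


class FusionRegressor(nn.Module):
    def __init__(self, d_audio=128, d_visual=64, d_participant=16,
                 d_hidden=64, fusion="mid"):
        super().__init__()
        self.fusion = fusion
        d_context = d_visual + d_participant

        if fusion == "early":
            # Concatenate raw feature vectors before any processing.
            self.backbone = nn.Sequential(
                nn.Linear(d_audio + d_context, d_hidden), nn.ReLU())
            self.head = nn.Linear(d_hidden, 1)
        elif fusion == "mid":
            # Encode each modality separately, then fuse the embeddings.
            self.audio_enc = nn.Sequential(nn.Linear(d_audio, d_hidden), nn.ReLU())
            self.context_enc = nn.Sequential(nn.Linear(d_context, d_hidden), nn.ReLU())
            self.head = nn.Linear(2 * d_hidden, 1)
        else:  # "late"
            # Separate per-modality predictors, fused by averaging their outputs.
            self.audio_branch = nn.Sequential(
                nn.Linear(d_audio, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))
            self.context_branch = nn.Sequential(
                nn.Linear(d_context, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

    def forward(self, audio, visual, participant):
        context = torch.cat([visual, participant], dim=-1)
        if self.fusion == "early":
            return self.head(self.backbone(torch.cat([audio, context], dim=-1)))
        if self.fusion == "mid":
            z = torch.cat([self.audio_enc(audio), self.context_enc(context)], dim=-1)
            return self.head(z)
        return 0.5 * (self.audio_branch(audio) + self.context_branch(context))


# Example: predict a pleasantness-like score for a batch of 4 stimuli and
# score it with mean squared error, the metric reported in the abstract.
model = FusionRegressor(fusion="mid")
pred = model(torch.randn(4, 128), torch.randn(4, 64), torch.randn(4, 16))
loss = nn.functional.mse_loss(pred, torch.zeros(4, 1))
```

In the paper, the fusion modules are added to an attention-based masker-selection model trained on the ARAUS dataset; the sketch only mirrors the choice of where fusion occurs, not the attention mechanism itself.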


Bibliographic Details
Main Authors: Ooi, Kenneth, Watcharasupat, Karn, Lam, Bhan, Ong, Zhen-Ting, Gan, Woon-Seng
Other Authors: School of Electrical and Electronic Engineering
Format: Conference or Workshop Item
Language:English
Published: 2023
Subjects: Engineering::Electrical and electronic engineering; Social sciences::Psychology::Affection and emotion; Auditory Masking; Neural Attention; Multimodal Fusion; Probabilistic Loss; Deep Learning
Online Access:https://hdl.handle.net/10356/165017
Institution: Nanyang Technological University
Conference: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023)
Research centre: Digital Signal Processing Laboratory
Citation: Ooi, K., Watcharasupat, K., Lam, B., Ong, Z. & Gan, W. (2023). Autonomous soundscape augmentation with multimodal fusion of visual and participant-linked inputs. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023). https://dx.doi.org/10.1109/ICASSP49357.2023.10094866
DOI: 10.1109/ICASSP49357.2023.10094866
ISBN: 978-1-7281-6327-7
Version: Submitted/Accepted version (application/pdf)
Date available: 2023-05-25
Collection: DR-NTU, NTU Library, Nanyang Technological University, Singapore
Funding: This research is supported by the Singapore Ministry of National Development (MND) and the National Research Foundation (NRF), Prime Minister's Office, under the Cities of Tomorrow Research Programme (Award No. COT-V4-2020-1).
Rights: © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ICASSP49357.2023.10094866.