Autonomous soundscape augmentation with multimodal fusion of visual and participant-linked inputs
Autonomous soundscape augmentation systems typically use trained models to pick optimal maskers to effect a desired perceptual change. While acoustic information is paramount to such systems, contextual information, including participant demographics and the visual environment, also influences acoustic perception. Hence, we propose modular modifications to an existing attention-based deep neural network, to allow early, mid-level, and late feature fusion of participant-linked, visual, and acoustic features. Ablation studies on module configurations and corresponding fusion methods using the ARAUS dataset show that contextual features improve the model performance in a statistically significant manner on the normalized ISO Pleasantness, to a mean squared error of 0.1194±0.0012 for the best-performing all-modality model, against 0.1217±0.0009 for the audio-only model. Soundscape augmentation systems can thereby leverage multimodal inputs for improved performance. We also investigate the impact of individual participant-linked factors using trained models to illustrate improvements in model explainability.
Main Authors: Ooi, Kenneth; Watcharasupat, Karn; Lam, Bhan; Ong, Zhen-Ting; Gan, Woon-Seng
Other Authors: School of Electrical and Electronic Engineering
Format: Conference or Workshop Item
Language: English
Published: 2023
Subjects: Engineering::Electrical and electronic engineering; Social sciences::Psychology::Affection and emotion; Auditory Masking; Neural Attention; Multimodal Fusion; Probabilistic Loss; Deep Learning
Online Access: https://hdl.handle.net/10356/165017
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-165017
Conference: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023)
Research Group: Digital Signal Processing Laboratory, School of Electrical and Electronic Engineering
Collection: DR-NTU (NTU Library)
Funding: Ministry of National Development (MND); National Research Foundation (NRF). This research is supported by the Singapore Ministry of National Development and the National Research Foundation, Prime Minister’s Office under the Cities of Tomorrow Research Programme (Award No. COT-V4-2020-1). Grant number: COT-V4-2020-1.
Version: Submitted/Accepted version
Citation: Ooi, K., Watcharasupat, K., Lam, B., Ong, Z. & Gan, W. (2023). Autonomous soundscape augmentation with multimodal fusion of visual and participant-linked inputs. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023). https://dx.doi.org/10.1109/ICASSP49357.2023.10094866
ISBN: 978-1-7281-6327-7
DOI: 10.1109/ICASSP49357.2023.10094866
Handle: https://hdl.handle.net/10356/165017
Rights: © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ICASSP49357.2023.10094866.
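The abstract above describes fusing participant-linked, visual, and acoustic features to predict the normalized ISO Pleasantness. The record contains no code, so the following is only a minimal late-fusion sketch in PyTorch, not the authors' released implementation: the class name, feature dimensions, encoder structure, and the plain MSE objective are all assumptions for illustration; the actual model in the paper is an attention-based network with early, mid-level, and late fusion options trained with a probabilistic loss on the ARAUS dataset.

```python
# Hypothetical late-fusion sketch (illustrative only, not the paper's model):
# each modality is embedded separately, the embeddings are concatenated, and a
# small head regresses the normalized ISO Pleasantness in [-1, 1].
import torch
import torch.nn as nn

class LateFusionPleasantnessModel(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, participant_dim=16, embed_dim=64):
        super().__init__()
        # Placeholder per-modality encoders standing in for the attention-based
        # audio branch and the visual / participant-linked branches.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, embed_dim), nn.ReLU())
        self.participant_enc = nn.Sequential(nn.Linear(participant_dim, embed_dim), nn.ReLU())
        # Late fusion: concatenate the modality embeddings, then regress.
        self.head = nn.Sequential(
            nn.Linear(3 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, 1),
            nn.Tanh(),  # normalized ISO Pleasantness lies in [-1, 1]
        )

    def forward(self, audio_feat, visual_feat, participant_feat):
        z = torch.cat(
            [
                self.audio_enc(audio_feat),
                self.visual_enc(visual_feat),
                self.participant_enc(participant_feat),
            ],
            dim=-1,
        )
        return self.head(z).squeeze(-1)

# Toy usage: mean squared error against ground-truth pleasantness ratings.
model = LateFusionPleasantnessModel()
audio = torch.randn(8, 128)        # e.g. pooled acoustic embeddings
visual = torch.randn(8, 512)       # e.g. an image embedding of the scene
participant = torch.randn(8, 16)   # e.g. encoded demographic / questionnaire items
target = torch.rand(8) * 2 - 1     # normalized ISO Pleasantness in [-1, 1]
loss = nn.functional.mse_loss(model(audio, visual, participant), target)
```

Early or mid-level fusion variants would instead combine features before or inside the encoders; the paper compares such module configurations and fusion methods by ablation.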