Audio-visual source separation under visual-agnostic condition
Audio-visual separation aims to isolate pure audio sources from a mixture under the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each. To solve this problem, some existing works make use of an object detector to find the sounding objects, or rely on pixel-by-pixel forwarding to separate each sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task. It simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, much like traditional audio-only separation models. Based on this new framework, the model is capable of performing separation even under visual-agnostic conditions and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results against other state-of-the-art appearance-based models on benchmark audio-visual datasets.
Saved in:
Main Author: | He, Yixuan |
---|---|
Other Authors: | Lihui Chen |
Format: | Thesis-Master by Coursework |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/169193 |
Institution: | Nanyang Technological University |
Language: | English |
id |
sg-ntu-dr.10356-169193 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-1691932023-07-08T05:40:12Z Audio-visual source separation under visual-agnostic condition He, Yixuan Lihui Chen School of Electrical and Electronic Engineering ELHCHEN@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Audio-visual separation aims to isolate pure audio sources from a mixture under the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each. To solve this problem, some existing works make use of an object detector to find the sounding objects, or rely on pixel-by-pixel forwarding to separate each sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task. It simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, much like traditional audio-only separation models. Based on this new framework, the model is capable of performing separation even under visual-agnostic conditions and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results against other state-of-the-art appearance-based models on benchmark audio-visual datasets. Master of Science (Signal Processing) 2023-07-05T07:03:23Z 2023-07-05T07:03:23Z 2023 Thesis-Master by Coursework He, Y. (2023). Audio-visual source separation under visual-agnostic condition. Master's thesis, Nanyang Technological University, Singapore. 
https://hdl.handle.net/10356/169193 https://hdl.handle.net/10356/169193 en application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
spellingShingle |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence He, Yixuan Audio-visual source separation under visual-agnostic condition |
description |
Audio-visual separation aims to isolate pure audio sources from a mixture under the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each. To solve this problem, some existing works make use of an object detector to find the sounding objects, or rely on pixel-by-pixel forwarding to separate each sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task. It simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, much like traditional audio-only separation models. Based on this new framework, the model is capable of performing separation even under visual-agnostic conditions and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results against other state-of-the-art appearance-based models on benchmark audio-visual datasets. |
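The abstract describes two ideas that can be illustrated concretely: attention over visual features to localize sound makers, and a fallback to learned audio-only queries under the visual-agnostic condition, with one soft mask predicted per source. The sketch below is a minimal NumPy illustration of that general scheme, not the thesis's actual architecture; all function names, shapes, and the learned-query fallback are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_queries(visual_feats, learned_queries):
    """Build one query vector per sound source.

    With visual cues, each learned query attends over the visual regions
    (attention-based localization); under the visual-agnostic condition
    (visual_feats is None), the learned audio-only queries are used directly.
    Shapes: visual_feats (n_regions, d), learned_queries (n_sources, d).
    """
    if visual_feats is None:
        return learned_queries
    scores = learned_queries @ visual_feats.T / np.sqrt(visual_feats.shape[1])
    attn = softmax(scores, axis=-1)        # (n_sources, n_regions)
    return attn @ visual_feats             # visually grounded queries

def separate(mixture_spec, queries, proj):
    """Predict a soft mask per source and apply it to the mixture.

    Shapes: mixture_spec (T, F), proj (F, d), queries (n_sources, d).
    The softmax over sources makes the masks sum to 1 per frame, so the
    separated spectrograms sum back to the mixture.
    """
    feats = mixture_spec @ proj                   # (T, d) frame features
    masks = softmax(feats @ queries.T, axis=-1)   # (T, n_sources)
    return np.stack([mixture_spec * masks[:, [k]]
                     for k in range(queries.shape[0])])
```

A quick usage sketch: build queries from (random stand-in) visual features, separate a mixture, and note that the same `separate` call works unchanged when `cross_attention_queries` is given `None`, which is the visual-agnostic case the title refers to.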
author2 |
Lihui Chen |
author_facet |
Lihui Chen He, Yixuan |
format |
Thesis-Master by Coursework |
author |
He, Yixuan |
author_sort |
He, Yixuan |
title |
Audio-visual source separation under visual-agnostic condition |
title_short |
Audio-visual source separation under visual-agnostic condition |
title_full |
Audio-visual source separation under visual-agnostic condition |
title_fullStr |
Audio-visual source separation under visual-agnostic condition |
title_full_unstemmed |
Audio-visual source separation under visual-agnostic condition |
title_sort |
audio-visual source separation under visual-agnostic condition |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/169193 |
_version_ |
1772827461562662912 |