Audio-visual source separation under visual-agnostic condition

Audio-visual separation aims to isolate pure audio sources from a mixture with the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separ...

Full description

Saved in:
Bibliographic Details
Main Author: He, Yixuan
Other Authors: Lihui Chen
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/169193
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-169193
record_format dspace
spelling sg-ntu-dr.10356-1691932023-07-08T05:40:12Z Audio-visual source separation under visual-agnostic condition He, Yixuan Lihui Chen School of Electrical and Electronic Engineering ELHCHEN@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Audio-visual separation aims to isolate pure audio sources from a mixture with the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each. To solve this problem, some existing works make use of an object detector to find the sounding objects or rely on pixel-by-pixel forwarding to separate every sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task; it simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, as traditional audio-only separation models do. Based on this new framework, the model is capable of performing separation even under visual-agnostic conditions and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results compared with other state-of-the-art appearance-based models on benchmark audio-visual datasets. Master of Science (Signal Processing) 2023-07-05T07:03:23Z 2023-07-05T07:03:23Z 2023 Thesis-Master by Coursework He, Y. (2023). Audio-visual source separation under visual-agnostic condition. Master's thesis, Nanyang Technological University, Singapore. 
https://hdl.handle.net/10356/169193 https://hdl.handle.net/10356/169193 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
spellingShingle Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
He, Yixuan
Audio-visual source separation under visual-agnostic condition
description Audio-visual separation aims to isolate pure audio sources from a mixture with the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each. To solve this problem, some existing works make use of an object detector to find the sounding objects or rely on pixel-by-pixel forwarding to separate every sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task; it simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, as traditional audio-only separation models do. Based on this new framework, the model is capable of performing separation even under visual-agnostic conditions and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results compared with other state-of-the-art appearance-based models on benchmark audio-visual datasets.
author2 Lihui Chen
author_facet Lihui Chen
He, Yixuan
format Thesis-Master by Coursework
author He, Yixuan
author_sort He, Yixuan
title Audio-visual source separation under visual-agnostic condition
title_short Audio-visual source separation under visual-agnostic condition
title_full Audio-visual source separation under visual-agnostic condition
title_fullStr Audio-visual source separation under visual-agnostic condition
title_full_unstemmed Audio-visual source separation under visual-agnostic condition
title_sort audio-visual source separation under visual-agnostic condition
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/169193
_version_ 1772827461562662912