Audio-visual source separation under visual-agnostic condition

Audio-visual separation aims to isolate pure audio sources from a mixture with the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separ...

Full description

Saved in:
Bibliographic Details
Main Author: He, Yixuan
Other Authors: Lihui Chen
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/169193
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-169193
record_format dspace
spelling sg-ntu-dr.10356-1691932023-07-08T05:40:12Z Audio-visual source separation under visual-agnostic condition He, Yixuan Lihui Chen School of Electrical and Electronic Engineering ELHCHEN@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Audio-visual separation aims to isolate pure audio sources from a mixture with the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each. To solve this problem, some existing works make use of an object detector to find the sounding objects or rely on pixel-by-pixel forwarding to separate every sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task; it simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, as traditional audio-only separation models do. Based on this new framework, the model is capable of performing separation even under visual-agnostic conditions and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results compared with other state-of-the-art appearance-based models on benchmark audio-visual datasets. Master of Science (Signal Processing) 2023-07-05T07:03:23Z 2023-07-05T07:03:23Z 2023 Thesis-Master by Coursework He, Y. (2023). Audio-visual source separation under visual-agnostic condition. Master's thesis, Nanyang Technological University, Singapore. 
https://hdl.handle.net/10356/169193 https://hdl.handle.net/10356/169193 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
spellingShingle Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
He, Yixuan
Audio-visual source separation under visual-agnostic condition
description Audio-visual separation aims to isolate pure audio sources from a mixture with the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each. To solve this problem, some existing works make use of an object detector to find the sounding objects or rely on pixel-by-pixel forwarding to separate every sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task; it simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, as traditional audio-only separation models do. Based on this new framework, the model is capable of performing separation even under visual-agnostic conditions and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results compared with other state-of-the-art appearance-based models on benchmark audio-visual datasets.
author2 Lihui Chen
author_facet Lihui Chen
He, Yixuan
format Thesis-Master by Coursework
author He, Yixuan
author_sort He, Yixuan
title Audio-visual source separation under visual-agnostic condition
title_short Audio-visual source separation under visual-agnostic condition
title_full Audio-visual source separation under visual-agnostic condition
title_fullStr Audio-visual source separation under visual-agnostic condition
title_full_unstemmed Audio-visual source separation under visual-agnostic condition
title_sort audio-visual source separation under visual-agnostic condition
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/169193
_version_ 1772827461562662912