Audio-visual source separation under visual-agnostic condition

Bibliographic Details
Main Author: He, Yixuan
Other Authors: Chen, Lihui
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University, 2023
Online Access:https://hdl.handle.net/10356/169193
Institution: Nanyang Technological University
Description
Summary: Audio-visual separation aims to isolate pure audio sources from a mixture with the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each. Toward solving this problem, some existing works use an object detector to find the sounding objects or rely on pixel-by-pixel forwarding to separate every sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task, which simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, as traditional audio-only separation models do. Based on this new framework, the model is capable of performing separation even under visual-agnostic situations and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results against other state-of-the-art appearance-based models on benchmark audio-visual datasets.