Audio-visual source separation under visual-agnostic condition
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Master by Coursework |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/169193 |
Institution: | Nanyang Technological University |
---|---|
Language: | English |
Summary: | Audio-visual separation aims to isolate pure audio sources from a mixture with the guidance of synchronized visual information. In real scenarios, where images or frames containing several sounding objects are given as visual cues, the task requires the model to first locate every sound maker and then separate the sound component corresponding to each of them. Towards solving this problem, some existing works make use of an object detector to find the sounding objects or rely on pixel-by-pixel forwarding to separate every sound component, resulting in a more complex system. In this dissertation, a novel approach named the "joint framework" is proposed to tackle the audio-visual separation task, which simplifies the visual module by introducing an attention mechanism to perform localization. In addition, the joint framework handles multiple visual inputs and produces multiple audio sources at one time, unlike traditional audio-only separation models. Based on this new framework, the model is capable of performing separation even under visual-agnostic situations and of leveraging extra audio-only data to train the audio-visual model, making it a more robust and data-friendly system. With the above advantages, the joint framework still achieves competitive results against other state-of-the-art appearance-based models on benchmark audio-visual datasets. |
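
The summary above describes an attention-based joint framework that takes several visual cues, separates the corresponding sources in a single forward pass, and can still operate when no visual cue is available. Below is a minimal PyTorch-style sketch of such a forward pass, given only to make the described interface concrete; the module names, tensor shapes, learned-query fallback, and mask-based output are illustrative assumptions and not the dissertation's actual implementation.

```python
import torch
import torch.nn as nn


class JointAVSeparator(nn.Module):
    """Sketch of a joint audio-visual separator: N visual cues -> N separated sources."""

    def __init__(self, audio_dim=512, visual_dim=512, n_heads=8, max_sources=4):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, audio_dim)
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        # Learned queries used when no visual cue is given (visual-agnostic mode).
        self.learned_queries = nn.Parameter(torch.randn(max_sources, audio_dim))
        # Cross-attention: visual queries attend over the audio mixture, standing in
        # for the attention-based localization described in the summary.
        self.cross_attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        # One mask per query, applied to the mixture features.
        self.mask_head = nn.Linear(audio_dim, audio_dim)

    def forward(self, mixture, visual_feats=None):
        # mixture:      (B, T, audio_dim)   mixture spectrogram features
        # visual_feats: (B, N, visual_dim)  one feature per candidate object, or None
        a = self.audio_enc(mixture)
        if visual_feats is None:
            # Visual-agnostic fallback: replace visual cues with learned queries.
            q = self.learned_queries.unsqueeze(0).expand(a.size(0), -1, -1)
        else:
            q = self.visual_proj(visual_feats)
        # Each query gathers the audio evidence it is responsible for.
        attended, _ = self.cross_attn(q, a, a)                        # (B, N, audio_dim)
        # Predict a per-source mask and apply it to the mixture features.
        masks = torch.sigmoid(self.mask_head(attended)).unsqueeze(2)  # (B, N, 1, audio_dim)
        return masks * a.unsqueeze(1)                                 # (B, N, T, audio_dim)


if __name__ == "__main__":
    model = JointAVSeparator()
    mix = torch.randn(2, 100, 512)    # two mixtures, 100 frames each
    vis = torch.randn(2, 3, 512)      # three candidate sounding objects per mixture
    print(model(mix, vis).shape)      # torch.Size([2, 3, 100, 512])
    print(model(mix, None).shape)     # visual-agnostic: torch.Size([2, 4, 100, 512])
```

One design point the sketch illustrates is that all visual cues are processed as a batch of queries, so the number of forward passes does not grow with the number of sounding objects, and swapping the visual queries for learned ones is what lets the same model fall back to audio-only training data.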