Visual Scene Graphs for Audio Source Separation
- URL: http://arxiv.org/abs/2109.11955v1
- Date: Fri, 24 Sep 2021 13:40:51 GMT
- Title: Visual Scene Graphs for Audio Source Separation
- Authors: Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, and Anoop Cherian
- Abstract summary: State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
- Score: 65.47212419514761
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: State-of-the-art approaches for visually-guided audio source separation
typically assume sources that have characteristic sounds, such as musical
instruments. These approaches often ignore the visual context of these sound
sources or avoid modeling object interactions that may be useful to better
characterize the sources, especially when the same object class may produce
varied sounds from distinct interactions. To address this challenging problem,
we propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning
model that embeds the visual structure of the scene as a graph and segments
this graph into subgraphs, each subgraph being associated with a unique sound
obtained by co-segmenting the audio spectrogram. At its core, AVSGS uses a
recursive neural network that emits mutually-orthogonal sub-graph embeddings of
the visual graph using multi-head attention. These embeddings are used for
conditioning an audio encoder-decoder towards source separation. Our pipeline
is trained end-to-end via a self-supervised task consisting of separating audio
sources using the visual graph from artificially mixed sounds. In this paper,
we also introduce an "in the wild" video dataset for sound source separation
that contains multiple non-musical sources, which we call Audio Separation in
the Wild (ASIW). This dataset is adapted from the AudioCaps dataset, and
provides a challenging, natural, and daily-life setting for source separation.
Thorough experiments on the proposed ASIW and the standard MUSIC datasets
demonstrate state-of-the-art sound separation performance of our method against
recent prior approaches.
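The abstract's two key mechanisms can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: a plain Gram-Schmidt step stands in for the multi-head-attention module that AVSGS trains to emit mutually-orthogonal sub-graph embeddings (orthogonality encourages different subgraphs to claim different sound sources), and ideal ratio masks stand in for the conditioned audio encoder-decoder; all shapes and names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonalize(embeddings):
    """Gram-Schmidt over row vectors: returns mutually orthonormal rows.
    Stands in for AVSGS's attention-based module, which is trained to
    emit mutually-orthogonal sub-graph embeddings."""
    basis = []
    for e in embeddings:
        for u in basis:
            e = e - (e @ u) * u        # remove component along u
        norm = np.linalg.norm(e)
        if norm > 1e-8:                # drop (near-)dependent vectors
            basis.append(e / norm)
    return np.stack(basis)

def mix_and_separate_loss(spec_a, spec_b, mask_a, mask_b):
    """Self-supervised 'mix-and-separate' objective: artificially mix two
    sources, predict a ratio mask per source over the mixture spectrogram,
    and score the masked estimates against the known clean sources."""
    mixture = spec_a + spec_b
    est_a = mask_a * mixture
    est_b = mask_b * mixture
    return np.abs(est_a - spec_a).mean() + np.abs(est_b - spec_b).mean()

# Toy magnitude spectrograms (freq bins x time frames) for two sources.
spec_a = rng.random((64, 32)) + 1e-3
spec_b = rng.random((64, 32)) + 1e-3

# The ideal ratio mask recovers each source exactly, so the loss vanishes;
# a trained network would approximate these masks, conditioned on the
# visual sub-graph embedding associated with each source.
ideal_a = spec_a / (spec_a + spec_b)
loss = mix_and_separate_loss(spec_a, spec_b, ideal_a, 1.0 - ideal_a)

# Three toy sub-graph embeddings, made mutually orthonormal.
U = orthogonalize(rng.random((3, 8)) - 0.5)
print(loss < 1e-6, np.allclose(U @ U.T, np.eye(3)))  # -> True True
```

The self-supervision comes entirely from the artificial mixing: the clean sources serve as free ground truth for the masked estimates, so no separation labels are needed.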
Related papers
- Semantic Grouping Network for Audio Source Separation [41.54814517077309]
We present a novel Semantic Grouping Network, termed SGN, that directly disentangles sound representations and extracts high-level semantic information for each source from an input audio mixture.
We conducted extensive experiments on music-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and VGG-Sound.
arXiv Detail & Related papers (2024-07-04T08:37:47Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - High-Quality Visually-Guided Sound Separation from Diverse Categories [56.92841782969847]
DAVIS is a Diffusion-based Audio-VIsual Separation framework.
It synthesizes separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information.
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets.
arXiv Detail & Related papers (2023-07-31T19:41:49Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z) - Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source
Separation [36.38300120482868]
We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
arXiv Detail & Related papers (2022-10-29T02:55:39Z) - Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z) - Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach, in which the network learns both what individual objects look like and what they sound like.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z) - Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of
On-Screen Sounds [33.4237979175049]
We present AudioScope, a novel audio-visual sound separation framework.
It can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos.
We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data.
arXiv Detail & Related papers (2020-11-02T17:36:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.