Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
- URL: http://arxiv.org/abs/2210.16472v1
- Date: Sat, 29 Oct 2022 02:55:39 GMT
- Title: Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
- Authors: Moitreya Chatterjee and Narendra Ahuja and Anoop Cherian
- Abstract summary: We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
- Score: 36.38300120482868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There exists an unequivocal distinction between the sound produced by a
static source and that produced by a moving one, especially when the source
moves towards or away from the microphone. In this paper, we propose to use
this connection between audio and visual dynamics for solving two challenging
tasks simultaneously, namely: (i) separating audio sources from a mixture using
visual cues, and (ii) predicting the 3D visual motion of a sounding source
using its separated audio. Towards this end, we present Audio Separator and
Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D
structure of the scene and the motion of sound sources for better audio source
separation. At the heart of ASMP is a 2.5D scene graph capturing various
objects in the video and their pseudo-3D spatial proximities. This graph is
constructed by registering together 2.5D monocular depth predictions from the
2D video frames and associating the 2.5D scene regions with the outputs of an
object detector applied on those frames. The ASMP task is then mathematically
modeled as the joint problem of: (i) recursively segmenting the 2.5D scene
graph into several sub-graphs, each associated with a constituent sound in the
input audio mixture (which is then separated) and (ii) predicting the 3D
motions of the corresponding sound sources from the separated audio. To
empirically evaluate ASMP, we present experiments on two challenging
audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio
Visual Event (AVE). Our results demonstrate that ASMP achieves a clear
improvement in source separation quality, outperforming prior works on both
datasets, while also estimating the direction of motion of the sound sources
better than other methods.
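Note: the abstract above describes building a 2.5D scene graph by associating 2D object detections with monocular depth predictions and weighting objects by their pseudo-3D proximities. The snippet below is a minimal, hypothetical Python sketch of that graph-construction step only; the names (SceneNode, build_25d_scene_graph), the median-depth heuristic, and the exponential proximity weighting are illustrative assumptions and not the authors' implementation.

```python
# Minimal sketch (not the ASMP code): build a pseudo-3D ("2.5D") scene graph
# from per-frame object detections and a monocular depth map.
from dataclasses import dataclass
import numpy as np


@dataclass
class SceneNode:
    label: str            # object class from the 2D detector
    position: np.ndarray  # pseudo-3D position (x, y, depth)


def build_25d_scene_graph(detections, depth_map, proximity_sigma=1.0):
    """Create nodes from detections and weight edges by pseudo-3D proximity.

    detections: list of (label, (x1, y1, x2, y2)) boxes in pixel coordinates.
    depth_map:  HxW array of monocular depth predictions for the same frame.
    Returns (nodes, adjacency), where adjacency[i, j] decays with 3D distance.
    """
    nodes = []
    for label, (x1, y1, x2, y2) in detections:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        # Median depth inside the box is a simple, robust pseudo-depth estimate.
        d = float(np.median(depth_map[int(y1):int(y2), int(x1):int(x2)]))
        nodes.append(SceneNode(label, np.array([cx, cy, d])))

    n = len(nodes)
    adjacency = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(nodes[i].position - nodes[j].position)
            weight = np.exp(-dist / proximity_sigma)  # closer objects -> stronger edge
            adjacency[i, j] = adjacency[j, i] = weight
    return nodes, adjacency


if __name__ == "__main__":
    depth = np.random.rand(240, 320)  # stand-in for a depth network's output
    dets = [("person", (40, 60, 100, 200)), ("guitar", (90, 120, 150, 210))]
    nodes, adj = build_25d_scene_graph(dets, depth, proximity_sigma=100.0)
    print([node.label for node in nodes], adj.round(3))
```

In ASMP, per the abstract, such a graph would then be recursively segmented into sub-graphs, one per constituent sound source, to drive the separation and 3D motion prediction; that learned stage is not sketched here.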
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition [26.828874753756523]
We propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition.
OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives.
Experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks.
arXiv Detail & Related papers (2023-05-30T23:53:12Z)
- Active Audio-Visual Separation of Dynamic Sound Sources [93.97385339354318]
We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone.
We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
arXiv Detail & Related papers (2022-02-02T02:03:28Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations [16.447597767676655]
We aim to pinpoint the source location in the input video sequence.
Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type.
We propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise to appearance and motion cues.
arXiv Detail & Related papers (2021-04-17T10:09:15Z)