Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
- URL: http://arxiv.org/abs/2210.16472v1
- Date: Sat, 29 Oct 2022 02:55:39 GMT
- Title: Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
- Authors: Moitreya Chatterjee and Narendra Ahuja and Anoop Cherian
- Abstract summary: We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
- Score: 36.38300120482868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There exists an unequivocal distinction between the sound produced by a
static source and that produced by a moving one, especially when the source
moves towards or away from the microphone. In this paper, we propose to use
this connection between audio and visual dynamics for solving two challenging
tasks simultaneously, namely: (i) separating audio sources from a mixture using
visual cues, and (ii) predicting the 3D visual motion of a sounding source
using its separated audio. Towards this end, we present Audio Separator and
Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D
structure of the scene and the motion of sound sources for better audio source
separation. At the heart of ASMP is a 2.5D scene graph capturing various
objects in the video and their pseudo-3D spatial proximities. This graph is
constructed by registering together 2.5D monocular depth predictions from the
2D video frames and associating the 2.5D scene regions with the outputs of an
object detector applied on those frames. The ASMP task is then mathematically
modeled as the joint problem of: (i) recursively segmenting the 2.5D scene
graph into several sub-graphs, each associated with a constituent sound in the
input audio mixture (which is then separated) and (ii) predicting the 3D
motions of the corresponding sound sources from the separated audio. To
empirically evaluate ASMP, we present experiments on two challenging
audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio
Visual Event (AVE). Our results demonstrate that ASMP achieves a clear
improvement in source separation quality, outperforming prior works on both
datasets, while also estimating the direction of motion of the sound sources
better than other methods.
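Note: the abstract above describes building a 2.5D scene graph by associating 2D object detections with monocular depth predictions and weighting objects by their pseudo-3D proximities. The snippet below is a minimal, hypothetical Python sketch of that graph-construction step only; the names (SceneNode, build_25d_scene_graph), the median-depth heuristic, and the exponential proximity weighting are illustrative assumptions and not the authors' implementation.

```python
# Minimal sketch (not the ASMP code): build a pseudo-3D ("2.5D") scene graph
# from per-frame object detections and a monocular depth map.
from dataclasses import dataclass
import numpy as np


@dataclass
class SceneNode:
    label: str            # object class from the 2D detector
    position: np.ndarray  # pseudo-3D position (x, y, depth)


def build_25d_scene_graph(detections, depth_map, proximity_sigma=1.0):
    """Create nodes from detections and weight edges by pseudo-3D proximity.

    detections: list of (label, (x1, y1, x2, y2)) boxes in pixel coordinates.
    depth_map:  HxW array of monocular depth predictions for the same frame.
    Returns (nodes, adjacency), where adjacency[i, j] decays with 3D distance.
    """
    nodes = []
    for label, (x1, y1, x2, y2) in detections:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        # Median depth inside the box is a simple, robust pseudo-depth estimate.
        d = float(np.median(depth_map[int(y1):int(y2), int(x1):int(x2)]))
        nodes.append(SceneNode(label, np.array([cx, cy, d])))

    n = len(nodes)
    adjacency = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(nodes[i].position - nodes[j].position)
            weight = np.exp(-dist / proximity_sigma)  # closer objects -> stronger edge
            adjacency[i, j] = adjacency[j, i] = weight
    return nodes, adjacency


if __name__ == "__main__":
    depth = np.random.rand(240, 320)  # stand-in for a depth network's output
    dets = [("person", (40, 60, 100, 200)), ("guitar", (90, 120, 150, 210))]
    nodes, adj = build_25d_scene_graph(dets, depth, proximity_sigma=100.0)
    print([node.label for node in nodes], adj.round(3))
```

In ASMP, per the abstract, such a graph would then be recursively segmented into sub-graphs, one per constituent sound source, to drive the separation and 3D motion prediction; that learned stage is not sketched here.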
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition [26.828874753756523]
We propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition.
OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives.
Experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks.
arXiv Detail & Related papers (2023-05-30T23:53:12Z)
- Active Audio-Visual Separation of Dynamic Sound Sources [93.97385339354318]
We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone.
We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
arXiv Detail & Related papers (2022-02-02T02:03:28Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations [16.447597767676655]
We aim to pinpoint the source location in the input video sequence.
Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type.
We propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise to appearance and motion cues.
arXiv Detail & Related papers (2021-04-17T10:09:15Z)