Active Audio-Visual Separation of Dynamic Sound Sources
- URL: http://arxiv.org/abs/2202.00850v1
- Date: Wed, 2 Feb 2022 02:03:28 GMT
- Title: Active Audio-Visual Separation of Dynamic Sound Sources
- Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
- Abstract summary: We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone. We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
- Score: 93.97385339354318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream emitted by an object of interest. The agent hears a mixed stream of multiple time-varying audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it must extract the target sound using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, improving its own estimates for past timesteps via self-attention. Using highly realistic SoundSpaces acoustic simulations in real-world scanned Matterport3D environments, we show that our model learns efficient behavior to carry out continuous separation of a time-varying audio target.
- Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/
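The core mechanism in the abstract, a transformer memory whose self-attention lets the agent revise its separation estimates for past timesteps as new observations arrive, can be sketched roughly as below. This is a minimal, hypothetical illustration in PyTorch, not the authors' released code; the module name, the fused audio-visual step embeddings, and all dimensions (feat_dim, mask_bins) are assumptions for illustration.

```python
# Hypothetical sketch (NOT the authors' implementation) of a transformer
# memory that refines separation estimates for past timesteps via
# self-attention, as described in the abstract.
import torch
import torch.nn as nn

class TransformerSeparationMemory(nn.Module):
    """Stores one fused audio-visual embedding per agent step; full
    self-attention lets earlier slots attend to newer observations, so
    masks for past timesteps are re-estimated at every step."""

    def __init__(self, feat_dim: int = 256, n_heads: int = 4,
                 n_layers: int = 2, mask_bins: int = 257):
        # feat_dim / mask_bins are illustrative (e.g., 257 frequency bins
        # would match a 512-point STFT); the paper's sizes may differ.
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Decodes each refined memory slot into a spectrogram mask.
        self.mask_head = nn.Linear(feat_dim, mask_bins)

    def forward(self, step_embeddings: torch.Tensor) -> torch.Tensor:
        # step_embeddings: (batch, T, feat_dim), one embedding per step
        # collected so far. No causal mask is applied: past slots attend
        # to later observations, which is what lets old estimates improve.
        refined = self.encoder(step_embeddings)
        return torch.sigmoid(self.mask_head(refined))  # (batch, T, mask_bins)

# Usage: at each agent step, append the newest embedding and re-run the
# memory, so masks for all earlier timesteps are revised along with the
# current one.
memory = TransformerSeparationMemory()
steps = torch.randn(1, 10, 256)  # 10 steps of fused audio-visual features
masks = memory(steps)            # re-estimated masks for all 10 steps
```

The key design point in this sketch is the absence of a causal attention mask: because every memory slot can attend to later observations, the estimate for an earlier timestep can be retroactively improved, matching the abstract's "improving its own estimates for past timesteps via self-attention".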
Related papers
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to ground video-to-audio generation faithfully in the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sounds and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Dynamical Audio-Visual Navigation: Catching Unheard Moving Sound Sources in Unmapped 3D Environments [0.0]
We introduce the novel dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped environment in the presence of distractors and noisy sounds.
Our approach outperforms the current state-of-the-art with better generalization to unheard sounds and better robustness to noisy scenarios.
arXiv Detail & Related papers (2022-01-12T03:08:03Z)
- Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds [5.002862602915434]
Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment.
We propose a novel dynamic audio-visual navigation benchmark which requires the agent to catch a moving sound source in an environment with noisy and distracting sounds.
We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments.
arXiv Detail & Related papers (2021-11-29T15:17:46Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations [16.447597767676655]
We aim to pinpoint the source location in the input video sequence.
Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type.
We propose a two-stage architecture, called the Appearance and Motion network (AMnet), whose stages specialise in appearance and motion cues, respectively.
arXiv Detail & Related papers (2021-04-17T10:09:15Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences arising from its use.