Move2Hear: Active Audio-Visual Source Separation
- URL: http://arxiv.org/abs/2105.07142v1
- Date: Sat, 15 May 2021 04:58:08 GMT
- Title: Move2Hear: Active Audio-Visual Source Separation
- Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
- Abstract summary: We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
- Score: 90.16327303008224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the active audio-visual source separation problem, where an
agent must move intelligently in order to better isolate the sounds coming from
an object of interest in its environment. The agent hears multiple audio
sources simultaneously (e.g., a person speaking down the hall in a noisy
household) and must use its eyes and ears to automatically separate out the
sounds originating from the target object within a limited time budget. Towards
this goal, we introduce a reinforcement learning approach that trains movement
policies controlling the agent's camera and microphone placement over time,
guided by the improvement in predicted audio separation quality. We demonstrate
our approach in scenarios motivated by both augmented reality (system is
already co-located with the target object) and mobile robotics (agent begins
arbitrarily far from the target object). Using state-of-the-art realistic
audio-visual simulations in 3D environments, we demonstrate our model's ability
to find minimal movement sequences with maximal payoff for audio source
separation. Project: http://vision.cs.utexas.edu/projects/move2hear.
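As a rough picture of how the improvement-in-separation-quality reward described in the abstract could drive a movement policy, here is a minimal sketch; the L1 spectrogram distance, the use of ground-truth target spectrograms during training in simulation, and all function names are illustrative assumptions rather than the paper's exact formulation:
```python
import numpy as np

def separation_quality(pred_spec: np.ndarray, gt_spec: np.ndarray) -> float:
    """Separation quality proxy: higher is better. Here, negative mean L1
    distance between the predicted target spectrogram and the ground-truth
    target spectrogram (assumed available only during training in simulation)."""
    return -float(np.abs(pred_spec - gt_spec).mean())

def movement_reward(prev_pred: np.ndarray,
                    curr_pred: np.ndarray,
                    gt_spec: np.ndarray) -> float:
    """Reward for one movement step: the improvement in separation quality
    obtained after relocating the camera/microphone and re-predicting."""
    return separation_quality(curr_pred, gt_spec) - separation_quality(prev_pred, gt_spec)
```
A standard RL algorithm (for example, an actor-critic policy over discrete motion actions) could then be trained to maximize the sum of these per-step rewards within the time budget.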
Related papers
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- Sound Adversarial Audio-Visual Navigation [43.962774217305935]
Existing audio-visual navigation works assume a clean environment that solely contains the target sound.
In this work, we design an acoustically complex environment in which there exists a sound attacker playing a zero-sum game with the agent.
Under certain constraints on the attacker, we can improve the robustness of the agent against unexpected sound attacks in audio-visual navigation.
arXiv Detail & Related papers (2022-02-22T14:19:42Z)
- Active Audio-Visual Separation of Dynamic Sound Sources [93.97385339354318]
We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone.
We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
arXiv Detail & Related papers (2022-02-02T02:03:28Z)
- Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations [16.447597767676655]
We aim to pinpoint the source location in the input video sequence.
Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type.
We propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise to appearance and motion cues.
arXiv Detail & Related papers (2021-04-17T10:09:15Z)
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Telling Left from Right: Learning Spatial Correspondence of Sight and Sound [16.99266133458188]
We propose a novel self-supervised task to leverage a principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream.
We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams.
We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines.
arXiv Detail & Related papers (2020-06-11T04:00:24Z)
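The left/right channel-flip pretext task summarized in the last entry above is concrete enough to sketch; the encoder dimensions, module names, and classifier head below are illustrative assumptions, not the authors' actual architecture:
```python
import random
import torch
import torch.nn as nn

class FlipClassifier(nn.Module):
    """Toy binary head: given pooled video and stereo-audio features, predict
    whether the left/right audio channels were swapped (1) or left intact (0)."""
    def __init__(self, video_dim: int = 512, audio_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(video_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, video_feat, audio_feat):
        return self.head(torch.cat([video_feat, audio_feat], dim=-1))

def flip_channels_maybe(stereo_wave):
    """stereo_wave: (2, T) waveform. Randomly swap the two channels; the swap
    flag is the free label for the self-supervised task (no human annotation)."""
    flipped = random.random() < 0.5
    wave = torch.flip(stereo_wave, dims=[0]) if flipped else stereo_wave
    return wave, torch.tensor([float(flipped)])
```
Training would then minimize binary cross-entropy on the flip label, pushing the audio and video encoders (not shown) to agree on where sound sources sit in the scene.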
This list is automatically generated from the titles and abstracts of the papers on this site.