Sound Localization from Motion: Jointly Learning Sound Direction and
Camera Rotation
- URL: http://arxiv.org/abs/2303.11329v2
- Date: Mon, 21 Aug 2023 14:59:10 GMT
- Title: Sound Localization from Motion: Jointly Learning Sound Direction and
Camera Rotation
- Authors: Ziyang Chen, Shengyi Qian, Andrew Owens
- Abstract summary: We use images and sounds that undergo subtle but geometrically consistent changes as we rotate our heads to estimate camera rotation and localize sound sources.
A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from sounds.
We train these models to generate predictions that agree with one another.
Our model can successfully estimate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches.
- Score: 26.867430697990674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The images and sounds that we perceive undergo subtle but geometrically
consistent changes as we rotate our heads. In this paper, we use these cues to
solve a problem we call Sound Localization from Motion (SLfM): jointly
estimating camera rotation and localizing sound sources. We learn to solve
these tasks solely through self-supervision. A visual model predicts camera
rotation from a pair of images, while an audio model predicts the direction of
sound sources from binaural sounds. We train these models to generate
predictions that agree with one another. At test time, the models can be
deployed independently. To obtain a feature representation that is well-suited
to solving this challenging problem, we also propose a method for learning an
audio-visual representation through cross-view binauralization: estimating
binaural sound from one view, given images and sound from another. Our model
can successfully estimate accurate rotations on both real and synthetic scenes,
and localize sound sources with accuracy competitive with state-of-the-art
self-supervised approaches. Project site: https://ificl.github.io/SLfM/
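To make the agreement objective above concrete, here is a minimal, hypothetical PyTorch-style sketch. The module and function names (VisualRotationNet, AudioDirectionNet, agreement_loss), the toy encoders, and the azimuth-only sign convention are illustrative assumptions, not the authors' implementation (the real code is linked from the project site above).

```python
# Hypothetical sketch of the cross-modal agreement idea: a visual net predicts the
# rotation between two views, an audio net predicts the sound direction heard at
# each pose, and the loss asks the change in apparent direction to match the
# (negated) camera rotation. Names and architectures are illustrative only.
import torch
import torch.nn as nn

class VisualRotationNet(nn.Module):
    """Toy encoder: predicts an azimuthal camera rotation from an image pair."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.head = nn.Linear(2 * feat_dim, 1)

    def forward(self, img_a, img_b):
        feats = torch.cat([self.encode(img_a), self.encode(img_b)], dim=-1)
        return self.head(feats).squeeze(-1)   # rotation angle (radians)

class AudioDirectionNet(nn.Module):
    """Toy encoder: predicts a source azimuth from a binaural spectrogram."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, binaural_spec):
        return self.head(self.encode(binaural_spec)).squeeze(-1)

def agreement_loss(vis_net, aud_net, img_a, img_b, audio_a, audio_b):
    """If the camera rotates by R between poses A and B, a static source's
    apparent direction should shift by -R; penalize any disagreement."""
    pred_rot = vis_net(img_a, img_b)
    dir_a = aud_net(audio_a)
    dir_b = aud_net(audio_b)
    return torch.mean((dir_b - (dir_a - pred_rot)) ** 2)

# Toy usage with random tensors standing in for image pairs / binaural spectrograms.
vis_net, aud_net = VisualRotationNet(), AudioDirectionNet()
img_a, img_b = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
audio_a, audio_b = torch.randn(4, 2, 128, 64), torch.randn(4, 2, 128, 64)
loss = agreement_loss(vis_net, aud_net, img_a, img_b, audio_a, audio_b)
loss.backward()
```

Because the loss only couples the two networks through their outputs, each network can be used on its own at test time, consistent with the abstract's statement that the models can be deployed independently.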
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes [69.03289331433874]
We present an end-to-end audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications.
We propose a novel neural-network-based sound propagation method to generate acoustic effects for 3D models of real environments.
arXiv Detail & Related papers (2023-02-02T04:09:23Z)
- Mix and Localize: Localizing Sound Sources in Mixtures [10.21507741240426]
We present a method for simultaneously localizing multiple sound sources within a visual scene.
Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al.
We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds.
arXiv Detail & Related papers (2022-11-28T04:30:50Z)
- Sound Localization by Self-Supervised Time Delay Estimation [22.125613860688357]
Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone; a brief classical cross-correlation sketch of this delay-estimation problem appears after this list.
We learn these correspondences through self-supervision, drawing on recent techniques from visual tracking.
We also propose a multimodal contrastive learning model that solves a visually-guided localization task.
arXiv Detail & Related papers (2022-04-26T17:59:01Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds are generated outside the camera's view and cannot be inferred from the video content alone.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
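As context for the time-delay entry above (Sound Localization by Self-Supervised Time Delay Estimation), below is a minimal classical baseline: GCC-PHAT cross-correlation between two microphone channels. It only illustrates the delay-estimation problem; the cited paper instead learns the correspondences through self-supervision, and the NumPy implementation and the 0.5 ms toy example here are assumptions made for illustration.

```python
# Classical GCC-PHAT delay estimation between two microphone channels, shown only
# to illustrate the problem; it is not the cited paper's self-supervised method.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Delay of `sig` relative to `ref`, in seconds (positive if `sig` arrives later)."""
    n = len(sig) + len(ref)
    SIG, REF = np.fft.rfft(sig, n=n), np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                      # PHAT weighting: keep only the phase
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)

# Toy example: a broadband source reaches the left mic 0.5 ms before the right mic.
fs = 16000
src = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as the source
d = 8                                                # 8 samples = 0.5 ms at 16 kHz
left, right = src, np.concatenate((np.zeros(d), src[:-d]))
print(gcc_phat(right, left, fs))                     # ~ +5.0e-4 s (right lags left)
```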
This list is automatically generated from the titles and abstracts of the papers on this site.