Points2Sound: From mono to binaural audio using 3D point cloud scenes
- URL: http://arxiv.org/abs/2104.12462v3
- Date: Fri, 19 May 2023 12:54:02 GMT
- Title: Points2Sound: From mono to binaural audio using 3D point cloud scenes
- Authors: Francesc Lluís, Vasileios Chatziioannou, Alex Hofmann
- Abstract summary: We propose Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes.
Results show that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: For immersive applications, the generation of binaural sound that matches its
visual counterpart is crucial to bring meaningful experiences to people in a
virtual environment. Recent studies have shown the possibility of using neural
networks for synthesizing binaural audio from mono audio by using 2D visual
information as guidance. Extending this approach by guiding the audio with 3D
visual information and operating in the waveform domain may allow for a more
accurate auralization of a virtual audio scene. We propose Points2Sound, a
multi-modal deep learning model which generates a binaural version from mono
audio using 3D point cloud scenes. Specifically, Points2Sound consists of a
vision network and an audio network. The vision network uses 3D sparse
convolutions to extract a visual feature from the point cloud scene. Then, the
visual feature conditions the audio network, which operates in the waveform
domain, to synthesize the binaural version. Results show that 3D visual
information can successfully guide multi-modal deep learning models for the
task of binaural synthesis. We also investigate how 3D point cloud attributes,
learning objectives, different reverberant conditions, and several types of
mono mixture signals affect the binaural audio synthesis performance of
Points2Sound for the different numbers of sound sources present in the scene.
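To make the described two-network pipeline concrete, below is a minimal, hypothetical PyTorch sketch: a point cloud encoder produces a single visual feature, which conditions a waveform-domain network that maps a mono input to a two-channel binaural output. This is not the authors' implementation; the real vision network uses 3D sparse convolutions on colored point clouds (replaced here by a dependency-free PointNet-style stand-in), and the FiLM-style conditioning and all layer sizes are illustrative assumptions.

```python
# Minimal sketch of the Points2Sound idea (not the authors' code).
# Assumptions: the paper's vision network uses 3D *sparse* convolutions;
# here a small PointNet-style encoder stands in. The audio network is a toy
# waveform-domain encoder/decoder; sizes are illustrative only.

import torch
import torch.nn as nn


class PointCloudEncoder(nn.Module):
    """Maps an (N, 6) point cloud (xyz + rgb) to one visual feature vector."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 6) -> per-point features -> global max pool -> (B, feat_dim)
        return self.mlp(points).max(dim=1).values


class ConditionedAudioNet(nn.Module):
    """Waveform-domain network: mono in, binaural (2-channel) out, conditioned
    on the visual feature via a FiLM-style modulation (an assumption, not
    necessarily the paper's exact conditioning scheme)."""
    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=15, stride=2, padding=7), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=15, stride=2, padding=7), nn.ReLU(),
        )
        self.film = nn.Linear(feat_dim, 2 * hidden)  # produces scale and shift
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hidden, hidden, kernel_size=16, stride=2, padding=7), nn.ReLU(),
            nn.ConvTranspose1d(hidden, 2, kernel_size=16, stride=2, padding=7),
        )

    def forward(self, mono: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # mono: (B, 1, T), visual_feat: (B, feat_dim) -> binaural: (B, 2, T)
        h = self.encoder(mono)
        scale, shift = self.film(visual_feat).chunk(2, dim=-1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        return self.decoder(h)


if __name__ == "__main__":
    points = torch.rand(2, 1024, 6)    # batch of toy colored point clouds
    mono = torch.randn(2, 1, 16384)    # batch of mono waveforms
    visual_feat = PointCloudEncoder()(points)
    binaural = ConditionedAudioNet()(mono, visual_feat)
    print(binaural.shape)              # torch.Size([2, 2, 16384])
```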
Related papers
- 3D Audio-Visual Segmentation [44.61476023587931]
Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR.
We propose a new approach, EchoSegnet, which integrates ready-to-use knowledge from pretrained 2D audio-visual foundation models.
Experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI.
arXiv Detail & Related papers (2024-11-04T16:30:14Z) - AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - Listen2Scene: Interactive material-aware binaural sound propagation for
reconstructed 3D scenes [69.03289331433874]
We present an end-to-end audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications.
We propose a novel neural-network-based sound propagation method to generate acoustic effects for 3D models of real environments.
arXiv Detail & Related papers (2023-02-02T04:09:23Z) - Neural Groundplans: Persistent Neural Scene Representations from a
Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z) - SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning [127.1119359047849]
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments.
It generates highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations.
SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
arXiv Detail & Related papers (2022-06-16T17:17:44Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - Bio-Inspired Audio-Visual Cues Integration for Visual Attention
Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z) - Music source separation conditioned on 3D point clouds [0.0]
This paper proposes a multi-modal deep learning model to perform music source separation conditioned on 3D point clouds of music performance recordings.
It extracts visual features using 3D sparse convolutions, while audio features are extracted using dense convolutions.
A fusion module combines the extracted features to perform the audio source separation (see the sketch after this list).
arXiv Detail & Related papers (2021-02-03T12:18:35Z) - Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
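For the point-cloud-conditioned music source separation entry above, the following is a hedged sketch of what a feature-fusion module might look like. The concatenation-based fusion, the spectrogram mask head, and all dimensions are assumptions for illustration, not that paper's implementation.

```python
# Illustrative PyTorch sketch of fusing a global visual feature with audio
# features for source separation (not the authors' implementation).
# Concatenation fusion and mask prediction are assumptions.

import torch
import torch.nn as nn


class FusionSeparator(nn.Module):
    """Fuses a global visual feature with frame-level audio features and
    predicts a separation mask over a magnitude spectrogram."""
    def __init__(self, audio_ch: int = 64, visual_dim: int = 128, n_bins: int = 257):
        super().__init__()
        self.fuse = nn.Conv1d(audio_ch + visual_dim, audio_ch, kernel_size=1)
        self.mask_head = nn.Sequential(
            nn.Conv1d(audio_ch, n_bins, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, audio_feats: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, audio_ch, frames); visual_feat: (B, visual_dim)
        v = visual_feat.unsqueeze(-1).expand(-1, -1, audio_feats.shape[-1])
        fused = torch.relu(self.fuse(torch.cat([audio_feats, v], dim=1)))
        return self.mask_head(fused)   # (B, n_bins, frames) mask in [0, 1]


if __name__ == "__main__":
    mask = FusionSeparator()(torch.randn(2, 64, 100), torch.randn(2, 128))
    print(mask.shape)  # torch.Size([2, 257, 100])
```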
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.