AudioEar: Single-View Ear Reconstruction for Personalized Spatial Audio
- URL: http://arxiv.org/abs/2301.12613v1
- Date: Mon, 30 Jan 2023 02:15:50 GMT
- Title: AudioEar: Single-View Ear Reconstruction for Personalized Spatial Audio
- Authors: Xiaoyang Huang, Yanjun Wang, Yang Liu, Bingbing Ni, Wenjun Zhang,
Jinxian Liu, Teng Li
- Abstract summary: We propose to achieve personalized spatial audio by reconstructing 3D human ears from single-view images.
To bridge the gap between the vision and acoustics communities, we develop a pipeline to integrate the reconstructed ear mesh with an off-the-shelf 3D human body.
- Score: 44.460995595847606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial audio, which focuses on immersive 3D sound rendering, is widely
applied in the acoustic industry. A key problem of current spatial audio
rendering methods is the lack of personalization to individuals' differing
anatomies, which is essential for producing accurate perceived sound source
positions. In this work, we address this problem from an interdisciplinary
perspective. The rendering of spatial audio is strongly correlated with the 3D
shape of human bodies, particularly ears. To this end, we propose to achieve
personalized spatial audio by reconstructing 3D human ears from single-view
images. First, to benchmark the ear reconstruction task, we introduce
AudioEar3D, a high-quality 3D ear dataset consisting of 112 point cloud ear
scans with RGB images. To train a reconstruction model in a self-supervised
manner, we further collect AudioEar2D, a 2D ear dataset of 2,000 images, each
with manual annotation of occlusion and 55 landmarks. To our knowledge, both
datasets are the largest and highest-quality publicly available datasets of
their kind. Further, we propose AudioEarM, a reconstruction method guided by a depth
estimation network that is trained on synthetic data, with two loss functions
tailored for ear data. Lastly, to bridge the gap between the vision and
acoustics communities, we develop a pipeline to integrate the reconstructed ear
mesh with
an off-the-shelf 3D human body and simulate a personalized Head-Related
Transfer Function (HRTF), which is the core of spatial audio rendering. Code
and data are publicly available at https://github.com/seanywang0408/AudioEar.
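Concretely, once a personalized HRTF (or its time-domain counterpart, a pair of head-related impulse responses) is available for a given source direction, spatial rendering amounts to filtering the mono source signal with the left- and right-ear responses. The sketch below illustrates only this final step; it is a minimal illustration using NumPy/SciPy with placeholder impulse responses and a 44.1 kHz sample rate, not the paper's simulation pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, hrir_left, hrir_right):
    """Filter a mono signal with the head-related impulse responses
    (time-domain HRTF) for one source direction.

    mono       : 1-D array, the dry source signal
    hrir_left  : 1-D array, left-ear impulse response for that direction
    hrir_right : 1-D array, right-ear impulse response for that direction
    returns    : (n_samples, 2) array of left/right channels
    """
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right], axis=-1)

# Illustrative usage with placeholder data; a real system would look up
# measured or simulated HRIRs for the desired azimuth/elevation.
fs = 44_100
mono = np.random.randn(fs)            # 1 s of noise as a stand-in source
hrir_l = np.random.randn(256) * 0.01  # placeholder left-ear impulse response
hrir_r = np.random.randn(256) * 0.01  # placeholder right-ear impulse response
binaural = render_binaural(mono, hrir_l, hrir_r)
print(binaural.shape)                 # (44100 + 256 - 1, 2)
```

In practice the impulse responses would be interpolated from a measured or simulated HRTF set for the desired direction rather than generated randomly.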
Related papers
- 3D Audio-Visual Segmentation [44.61476023587931]
Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR.
We propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models.
Experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI.
arXiv Detail & Related papers (2024-11-04T16:30:14Z) - Modeling and Driving Human Body Soundfields through Acoustic Primitives [79.38642644610592]
We present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body.
We demonstrate that we can render the full acoustic scene at any point in 3D space efficiently and accurately.
Our acoustic primitives result in an order of magnitude smaller soundfield representations and overcome deficiencies in near-field rendering compared to previous approaches.
arXiv Detail & Related papers (2024-07-18T01:05:13Z) - AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and
Audio [50.39279046238891]
We present a model that can generate accurate 3D spatial audio for full human bodies.
The system consumes, as input, audio signals from headset microphones and body pose.
We show that our model can produce accurate body-induced sound fields when trained with a suitable loss.
arXiv Detail & Related papers (2023-11-01T16:40:35Z) - Novel-View Acoustic Synthesis from 3D Reconstructed Rooms [17.72902700567848]
We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis.
We identify the main challenges of novel-view acoustic synthesis as sound source localization, separation, and dereverberation.
We show that incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms enables the same network to jointly tackle these tasks.
arXiv Detail & Related papers (2023-10-23T17:34:31Z) - Listen2Scene: Interactive material-aware binaural sound propagation for
reconstructed 3D scenes [69.03289331433874]
We present an end-to-end audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications.
We propose a novel neural-network-based sound propagation method to generate acoustic effects for 3D models of real environments.
arXiv Detail & Related papers (2023-02-02T04:09:23Z) - Learning to Separate Voices by Spatial Regions [5.483801693991577]
We consider the problem of audio voice separation for applications such as earphones and hearing aids.
We propose a two-stage self-supervised framework in which overheard voices from earphones are pre-processed to extract relatively clean personalized signals.
Results show promising performance, underscoring the importance of personalization over a generic supervised approach.
arXiv Detail & Related papers (2022-07-09T06:25:01Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - A Human Ear Reconstruction Autoencoder [19.72707659069644]
We aim to tackle the 3D ear reconstruction task, where the 2D ear input images exhibit especially subtle and difficult curves and features.
Our Human Ear Reconstruction Autoencoder (HERA) system predicts 3D ear poses and shape parameters for 3D ear meshes without any supervision on these parameters.
The resulting end-to-end self-supervised model is then evaluated both on 2D landmark localisation performance and on the appearance of the reconstructed 3D ears.
arXiv Detail & Related papers (2020-10-07T12:52:23Z)