Learning to Separate Voices by Spatial Regions
- URL: http://arxiv.org/abs/2207.04203v1
- Date: Sat, 9 Jul 2022 06:25:01 GMT
- Title: Learning to Separate Voices by Spatial Regions
- Authors: Zhongweiyang Xu and Romit Roy Choudhury
- Abstract summary: We consider the problem of audio voice separation for binaural applications, such as earphones and hearing aids.
We propose a two-stage self-supervised framework in which overheard voices from earphones are pre-processed to extract relatively clean personalized signals.
Results show promising performance, underscoring the importance of personalization over a generic supervised approach.
- Score: 5.483801693991577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of audio voice separation for binaural applications,
such as earphones and hearing aids. While today's neural networks perform
remarkably well (separating $4+$ sources with 2 microphones) they assume a
known or fixed maximum number of sources, K. Moreover, today's models are
trained in a supervised manner, using training data synthesized from generic
sources, environments, and human head shapes.
This paper intends to relax both these constraints at the expense of a slight
alteration in the problem definition. We observe that, when a received mixture
contains too many sources, it is still helpful to separate them by region,
i.e., isolating signal mixtures from each conical sector around the user's
head. This requires learning the fine-grained spatial properties of each
region, including the signal distortions imposed by a person's head. We propose
a two-stage self-supervised framework in which overheard voices from earphones
are pre-processed to extract relatively clean personalized signals, which are
then used to train a region-wise separation model. Results show promising
performance, underscoring the importance of personalization over a generic
supervised approach. (audio samples available at our project website:
https://uiuc-earable-computing.github.io/binaural/). We believe this result
could help real-world applications in selective hearing, noise cancellation,
and audio augmented reality.
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation [26.867430697990674]
We use images and sounds that undergo subtle but geometrically consistent changes as we rotate our heads to estimate camera rotation and localize sound sources.
A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from sounds.
We train these models to generate predictions that agree with one another.
Our model can successfully estimate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches.
arXiv Detail & Related papers (2023-03-20T17:59:55Z)
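To make the audio-visual agreement idea in the entry above concrete, here is a toy consistency loss: for a static source, the rotation predicted from an image pair should cancel the change in sound direction predicted from the corresponding audio. The sign convention, function name, and static-source assumption are ours for illustration, not the paper's implementation.

```python
import torch

def agreement_loss(rot_pred, azimuth_t0, azimuth_t1):
    """Penalty encouraging the visual and audio predictions to agree.

    rot_pred     : rotation about the vertical axis (radians) predicted by a
                   vision model from two frames (hypothetical output)
    azimuth_t0/1 : source azimuths (radians) predicted by an audio model at
                   the two corresponding time steps
    For a static source, rotating the listener by rot_pred shifts the apparent
    azimuth by -rot_pred, so the wrapped residual below should be near zero.
    """
    residual = azimuth_t1 - azimuth_t0 + rot_pred
    # wrap to (-pi, pi] so the loss respects the circular geometry of angles
    residual = torch.atan2(torch.sin(residual), torch.cos(residual))
    return (residual ** 2).mean()
```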
- AudioEar: Single-View Ear Reconstruction for Personalized Spatial Audio [44.460995595847606]
We propose to achieve personalized spatial audio by reconstructing 3D human ears with single-view images.
To fill the gap between the vision and acoustics community, we develop a pipeline to integrate the reconstructed ear mesh with an off-the-shelf 3D human body.
arXiv Detail & Related papers (2023-01-30T02:15:50Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse responses (HRIR) to identify the relationship between spatial locations and received audio.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
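The HRIR idea in the entry above can be pictured with a few lines of plain signal processing: convolving a mono source with a left/right head-related impulse response pair yields a binaural rendering at that direction. This is a generic sketch with assumed inputs, not the paper's spherical-harmonic pipeline; function names and the data layout are illustrative.

```python
"""Toy HRIR-based binaural synthesis ("pseudo-binaural" data generation)."""
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, hrir_left, hrir_right):
    """Render a mono signal at the direction encoded by the HRIR pair."""
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right], axis=0)   # (2, n_samples) binaural signal

def mix_scene(sources, hrirs):
    """Sum several spatialized sources into one binaural mixture.

    sources : list of 1-D mono arrays (same sample rate as the HRIRs)
    hrirs   : list of (hrir_left, hrir_right) pairs, one per source direction
    """
    rendered = [spatialize(s, hl, hr) for s, (hl, hr) in zip(sources, hrirs)]
    n = max(r.shape[1] for r in rendered)
    mix = np.zeros((2, n))
    for r in rendered:
        mix[:, : r.shape[1]] += r
    return mix
```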
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
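As a rough illustration of the region-wise stream weighting described in the data-fusion entry above, the snippet below combines per-region localization scores from an audio model and a video model with a per-region weight. The paper learns its weights dynamically; the fixed arrays and simple linear combination here are assumptions for illustration.

```python
import numpy as np

def fuse_region_scores(p_audio, p_video, stream_weights):
    """Combine audio and video speaker-location posteriors per spatial region.

    p_audio, p_video : shape (n_regions,) probabilities from each modality
    stream_weights   : shape (n_regions,) values in [0, 1]; 1 trusts audio,
                       0 trusts video for that region (illustrative convention)
    """
    w = np.clip(stream_weights, 0.0, 1.0)
    fused = w * p_audio + (1.0 - w) * p_video
    return fused / fused.sum()          # renormalize to a valid distribution

# Example: audio is trusted toward the sides, video toward the front
p_audio = np.array([0.10, 0.20, 0.60, 0.10])
p_video = np.array([0.05, 0.70, 0.15, 0.10])
weights = np.array([0.80, 0.30, 0.30, 0.80])
print(fuse_region_scores(p_audio, p_video, weights))
```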
- Multiple Sound Sources Localization from Coarse to Fine [41.56420350529494]
How to visually localize multiple sound sources in unconstrained videos is a formidable problem.
We develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes.
Our model achieves state-of-the-art results on a public sound localization dataset.
arXiv Detail & Related papers (2020-07-13T12:59:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.