Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers
- URL: http://arxiv.org/abs/2509.13548v2
- Date: Thu, 18 Sep 2025 01:20:59 GMT
- Title: Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers
- Authors: Manan Mittal, Thomas Deppisch, Joseph Forrer, Chris Le Sueur, Zamir Ben-Hur, David Lou Along, Daniel D. E. Wong
- Abstract summary: We propose a novel mixture-of-experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality.
- Score: 2.614081506519191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel mixture-of-experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on explicit direction-of-arrival estimation or operate in the Ambisonics domain, our signal-dependent framework combines multiple binaural filters in an online manner using implicit localization. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality. The method is agnostic to array geometry, offering a flexible solution for spatial audio capture and personalized playback in next-generation consumer audio devices.
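The core idea of the abstract can be sketched as follows. This is an illustrative toy, not the authors' algorithm: the 2xM per-direction "expert" filters, the matched-beamformer power score used as implicit localization, the recursive weight smoothing, and the field-of-view gain vector are all assumptions made for the example.

```python
import numpy as np

def moe_binauralize(frames, experts, steering, fov_gain, alpha=0.9):
    """Toy mixture-of-experts binaural renderer (illustrative sketch).

    frames:   (T, M, L) T frames of an M-channel, length-L mic signal
    experts:  (K, 2, M) K binaural filters, each mapping M mics -> 2 ears
    steering: (K, M)    steering vector per candidate direction
    fov_gain: (K,)      user-chosen field-of-view gain per direction
    alpha:    recursive smoothing factor for the mixture weights
    """
    T, M, L = frames.shape
    K = experts.shape[0]
    out = np.zeros((T, 2, L), dtype=frames.dtype)
    w = np.full(K, 1.0 / K)                            # smoothed weights
    for t in range(T):
        x = frames[t]                                  # (M, L) mic frame
        # implicit localization: power captured by each direction's beamformer
        power = np.mean(np.abs(steering @ x) ** 2, axis=1)      # (K,)
        w_inst = power / (power.sum() + 1e-12)         # instantaneous weights
        w = alpha * w + (1.0 - alpha) * w_inst         # online smoothing
        g = w * fov_gain                               # field-of-view emphasis
        g = g / (g.sum() + 1e-12)
        # mix the experts' binaural outputs: sum_k g_k * (experts[k] @ x)
        out[t] = np.einsum('k,kem,ml->el', g, experts, x)
    return out
```

Setting `fov_gain` to one for directions inside the desired field of view and zero elsewhere suppresses out-of-view sources while the weights track a moving talker frame by frame; no explicit direction-of-arrival estimate is ever produced.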
Related papers
- In-the-wild Audio Spatialization with Flexible Text-guided Localization [37.60344400859993]
To enhance immersive experiences, audio offers spatial awareness of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object user-interactive environments. We propose a Text-guided Audio Spatialization (TAS) framework that utilizes flexible text prompts and evaluates our model from unified generation and comprehension perspectives.
arXiv Detail & Related papers (2025-06-01T09:41:56Z) - SOAF: Scene Occlusion-aware Neural Acoustic Field [9.651041527067907]
We propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation. Our approach derives a global prior for the sound field using distance-aware parametric sound-propagation modeling. We extract features from the local acoustic field centered at the receiver using a Fibonacci Sphere to generate audio for novel views.
arXiv Detail & Related papers (2024-07-02T13:40:56Z) - AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio. We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment. Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z) - Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios [11.811571392419324]
Speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios.
This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings.
arXiv Detail & Related papers (2023-12-17T16:12:35Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse responses (HRIRs) to identify the relationship between spatial locations and received audio.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
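The HRIR half of that idea reduces to a convolution. The sketch below is a minimal illustration of how a direction's left/right HRIRs spatialize a mono signal without binaural recordings; it omits the spherical-harmonic decomposition the paper also uses, and the example impulse responses in the test are made up for demonstration.

```python
import numpy as np

def render_pseudo_binaural(mono, hrir_left, hrir_right):
    """Spatialize a mono signal by convolving it with the left/right
    head-related impulse responses for its direction (toy sketch).
    Both HRIRs must have the same length so the channels align."""
    left = np.convolve(mono, hrir_left)    # left-ear signal
    right = np.convolve(mono, hrir_right)  # right-ear signal
    return np.stack([left, right])         # (2, len(mono)+len(hrir)-1)
```

Feeding an impulse through a right-ear HRIR that is a delayed copy of the left-ear one reproduces the interaural time difference a listener would perceive for a lateral source.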
arXiv Detail & Related papers (2021-04-13T13:07:33Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use pseudo-labels derived from the previous iteration's localization results to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
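The region-wise dynamic stream weighting described in that last entry can be illustrated with a small sketch. This is not the paper's model: the per-region posteriors and the convex audio/video blend below are simplifying assumptions chosen for the example.

```python
import numpy as np

def fuse_stream_probs(p_audio, p_video, w_audio):
    """Toy region-wise dynamic stream weighting (illustrative sketch).

    p_audio, p_video: (R,) speaker-position posteriors per spatial region
                      from the audio and video streams
    w_audio:          (R,) per-region weight in [0, 1] giving the audio
                      stream's reliability in that region
    Returns a fused, renormalized posterior over regions.
    """
    fused = w_audio * p_audio + (1.0 - w_audio) * p_video
    return fused / fused.sum()
```

Regions where one modality is unreliable (e.g. a speaker outside the camera's view) can down-weight that stream locally instead of using a single global fusion weight.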
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.