Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios
- URL: http://arxiv.org/abs/2312.10756v1
- Date: Sun, 17 Dec 2023 16:12:35 GMT
- Title: Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios
- Authors: Yuzhu Wang, Archontis Politis, Tuomas Virtanen
- Abstract summary: Speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios.
This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings.
- Score: 11.811571392419324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current multichannel speech enhancement algorithms typically assume a
stationary sound source, a common mismatch with reality that limits their
performance in real-world scenarios. This paper focuses on attention-driven
spatial filtering techniques designed for dynamic settings. Specifically, we
study the application of linear and nonlinear attention-based methods for
estimating time-varying spatial covariance matrices used to design the filters.
We also investigate the direct estimation of spatial filters by attention-based
methods without explicitly estimating spatial statistics. The clean speech
clips from WSJ0 are employed for simulating speech signals of moving speakers
in a reverberant environment. The experimental dataset is built by mixing the
simulated speech signals with multichannel real noise from CHiME-3. Evaluation
results show that the attention-driven approaches are robust and consistently
outperform conventional spatial filtering approaches in both static and dynamic
sound environments.
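
The core idea in the abstract, estimating time-varying spatial covariance matrices (SCMs) with attention weights and designing a spatial filter from them, can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the attention weights below come from a fixed temporal-locality score rather than a trained network, the MVDR beamformer is only one common choice of filter, and all function names and parameters (`locality_attention`, `time_varying_scm`, `mvdr_filters`, `apply_filters`, `tau`, `eps`) are hypothetical.

```python
import numpy as np


def locality_attention(num_frames, tau=10.0):
    """Stand-in attention weights (assumption, not learned): frame t attends to
    frame s with score -|t - s| / tau, normalized by a softmax over s."""
    idx = np.arange(num_frames)
    scores = -np.abs(idx[:, None] - idx[None, :]) / tau
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def time_varying_scm(frames, weights):
    """Attention-weighted SCMs for one frequency bin.

    frames:  (T, M) complex STFT coefficients from M microphones.
    weights: (T, T) attention weights, each row summing to 1.
    Returns  (T, M, M) with R[t] = sum_s weights[t, s] * x_s x_s^H.
    """
    outer = np.einsum("sm,sn->smn", frames, frames.conj())
    return np.einsum("ts,smn->tmn", weights, outer)


def mvdr_filters(scm_speech, scm_noise, ref=0, eps=1e-6):
    """Per-frame MVDR filters, w_t = (R_n^-1 R_s) u_ref / trace(R_n^-1 R_s)."""
    num_mics = scm_noise.shape[-1]
    loaded = scm_noise + eps * np.eye(num_mics)  # diagonal loading
    ratio = np.linalg.solve(loaded, scm_speech)  # batched R_n^-1 R_s
    trace = np.trace(ratio, axis1=-2, axis2=-1)[:, None]
    return ratio[..., ref] / (trace + eps)


def apply_filters(filters, frames):
    """Beamformer output y_t = w_t^H x_t for every frame."""
    return np.einsum("tm,tm->t", filters.conj(), frames)
```

Because every frame gets its own weighted SCM, the filter can change from frame to frame, which is what a moving source requires. In a learned system the fixed locality score would be replaced by weights predicted from the signal.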
Related papers
- SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models [62.14165748145729]
We introduce SPUR, a lightweight, plug-in approach that equips large audio-language models with spatial perception. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning.
arXiv Detail & Related papers (2025-11-10T01:29:26Z) - Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection [53.689841037081834]
Ivan-ISTD is designed to address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD. Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios.
arXiv Detail & Related papers (2025-10-14T07:48:31Z) - Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers [3.5522191686718725]
We propose a novel mixture of experts framework for field-of-view enhancement in signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality.
arXiv Detail & Related papers (2025-09-16T21:30:06Z) - Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
arXiv Detail & Related papers (2025-08-28T15:43:15Z) - Self-Steering Deep Non-Linear Spatially Selective Filters for Efficient Extraction of Moving Speakers under Weak Guidance [14.16697537117357]
We present a novel strategy utilizing a low-complexity tracking algorithm in the form of a particle filter instead. We show how the autoregressive interplay between both algorithms drastically improves tracking accuracy and leads to strong enhancement performance.
arXiv Detail & Related papers (2025-07-03T16:54:56Z) - Leveraging Spatial Cues from Cochlear Implant Microphones to Efficiently Enhance Speech Separation in Real-World Listening Scenes [1.1081718316044291]
We quantify the impact of real-world acoustic scenes on speech separation and explore how spatial cues can enhance separation quality efficiently.
Our findings show that spatial cues (both implicit and explicit) improve separation for mixtures with spatially separated talkers.
Explicit spatial cues are particularly beneficial when implicit spatial cues are weak.
These results emphasize the importance of training models on real-world data to improve generalizability in everyday listening scenarios.
arXiv Detail & Related papers (2025-01-24T16:30:58Z) - ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model [2.2927722373373247]
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects.
arXiv Detail & Related papers (2024-10-19T02:28:53Z) - Sound event localization and classification using WASN in Outdoor Environment [2.234738672139924]
Methods for sound event localization and classification typically rely on a single microphone array.
We propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of a sound source.
arXiv Detail & Related papers (2024-03-29T11:44:14Z) - Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising [64.11157141177208]
We propose a spectral enhanced rectangle Transformer to model the spatial and spectral correlation in hyperspectral images.
For the former, we exploit the rectangle self-attention horizontally and vertically to capture the non-local similarity in the spatial domain.
For the latter, we design a spectral enhancement module that is capable of extracting global underlying low-rank property of spatial-spectral cubes to suppress noise.
arXiv Detail & Related papers (2023-04-03T09:42:13Z) - Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain [131.74762114632404]
The model is trained end-to-end and performs spatial processing implicitly.
We evaluate the proposed model on a real-world dataset and show that the model matches the performance of an oracle beamformer.
arXiv Detail & Related papers (2022-06-30T17:13:01Z) - Insights into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement [21.422488450492434]
In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately.
There is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter.
arXiv Detail & Related papers (2022-06-27T13:54:14Z) - Few-Shot Audio-Visual Learning of Environment Acoustics [89.16560042178523]
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener.
We explore how to infer RIRs based on a sparse set of images and echoes observed in the space.
In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs.
arXiv Detail & Related papers (2022-06-08T16:38:24Z) - Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.