Related papers: Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios

URL: http://arxiv.org/abs/2312.10756v1
Date: Sun, 17 Dec 2023 16:12:35 GMT
Title: Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios
Authors: Yuzhu Wang, Archontis Politis, Tuomas Virtanen
Abstract summary: Speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios. This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings.
Score: 11.811571392419324
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current multichannel speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios. This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings. Specifically, we study the application of linear and nonlinear attention-based methods for estimating time-varying spatial covariance matrices used to design the filters. We also investigate the direct estimation of spatial filters by attention-based methods without explicitly estimating spatial statistics. The clean speech clips from WSJ0 are employed for simulating speech signals of moving speakers in a reverberant environment. The experimental dataset is built by mixing the simulated speech signals with multichannel real noise from CHiME-3. Evaluation results show that the attention-driven approaches are robust and consistently outperform conventional spatial filtering approaches in both static and dynamic sound environments.
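The abstract's idea of estimating time-varying spatial covariance matrices with attention and then designing a filter from them can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dot-product attention over generic per-frame features, the `temperature` parameter, the MVDR filter choice, and all function names are assumptions made for illustration only.

```python
import numpy as np

def attention_weights(features, temperature=1.0):
    """Softmax dot-product attention over time frames (illustrative assumption).

    features: real array of shape (T, D), one feature vector per frame.
    Returns a (T, T) matrix whose rows sum to 1.
    """
    scores = features @ features.T / (temperature * np.sqrt(features.shape[1]))
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def time_varying_scm(stft_frames, weights):
    """Attention-weighted spatial covariance per frame for one frequency bin.

    stft_frames: complex array of shape (T, M), M microphone channels.
    weights: (T, T) attention matrix.
    Returns (T, M, M) with R_t = sum_s weights[t, s] * x_s x_s^H.
    """
    outer = stft_frames[:, :, None] * stft_frames[:, None, :].conj()  # (T, M, M)
    return np.einsum("ts,sij->tij", weights, outer)

def mvdr_weights(noise_scm, steering, diag_load=1e-6):
    """MVDR beamformer w = R^{-1} d / (d^H R^{-1} d) with diagonal loading."""
    m = noise_scm.shape[-1]
    r = noise_scm + diag_load * np.eye(m)
    num = np.linalg.solve(r, steering)
    return num / (steering.conj() @ num)
```

In this sketch each frame gets its own covariance estimate, so the resulting filter can follow a moving source instead of relying on a single utterance-level average, which is the limitation the abstract attributes to stationary-source methods.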

Related papers

Leveraging Spatial Cues from Cochlear Implant Microphones to Efficiently Enhance Speech Separation in Real-World Listening Scenes [1.1081718316044291]
We quantify the impact of real-world acoustic scenes on speech separation and explore how spatial cues can enhance separation quality efficiently. Our findings show that spatial cues (both implicit and explicit) improve separation for mixtures with spatially separated talkers. Explicit spatial cues are particularly beneficial when implicit spatial cues are weak. These results emphasize the importance of training models on real-world data to improve generalizability in everyday listening scenarios.
arXiv Detail & Related papers (2025-01-24T16:30:58Z)
ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model [2.2927722373373247]
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects.
arXiv Detail & Related papers (2024-10-19T02:28:53Z)
Sound event localization and classification using WASN in Outdoor Environment [2.234738672139924]
Methods for sound event localization and classification typically rely on a single microphone array. We propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of a sound source.
arXiv Detail & Related papers (2024-03-29T11:44:14Z)
Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising [64.11157141177208]
We propose a spectral enhanced rectangle Transformer to model the spatial and spectral correlation in hyperspectral images. For the former, we exploit rectangle self-attention horizontally and vertically to capture the non-local similarity in the spatial domain. For the latter, we design a spectral enhancement module that is capable of extracting the global underlying low-rank property of spatial-spectral cubes to suppress noise.
arXiv Detail & Related papers (2023-04-03T09:42:13Z)
Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain [131.74762114632404]
The model is trained end-to-end and performs spatial processing implicitly. We evaluate the proposed model on a real-world dataset and show that the model matches the performance of an oracle beamformer.
arXiv Detail & Related papers (2022-06-30T17:13:01Z)
Insights into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement [21.422488450492434]
In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. There is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter.
arXiv Detail & Related papers (2022-06-27T13:54:14Z)
Few-Shot Audio-Visual Learning of Environment Acoustics [89.16560042178523]
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener. We explore how to infer RIRs based on a sparse set of images and echoes observed in the space. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs.
arXiv Detail & Related papers (2022-06-08T16:38:24Z)
Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model. It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z)
Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals. Two main challenges are the complex acoustic environment and the real-time processing requirement. We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.