DOA Estimation with Lightweight Network on LLM-Aided Simulated Acoustic Scenes
- URL: http://arxiv.org/abs/2511.08012v1
- Date: Wed, 12 Nov 2025 01:34:01 GMT
- Title: DOA Estimation with Lightweight Network on LLM-Aided Simulated Acoustic Scenes
- Authors: Haowen Li, Zhengding Luo, Dongyuan Shi, Boxiang Wang, Junwei Ji, Ziyi Yang, Woon-Seng Gan
- Abstract summary: Direction-of-Arrival (DOA) estimation is critical in spatial audio and acoustic signal processing. We propose LightDOA, a lightweight DOA estimation model based on depthwise separable convolutions. Experimental results show that LightDOA achieves satisfactory accuracy and robustness across various acoustic scenes.
- Score: 46.0445214387366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direction-of-Arrival (DOA) estimation is critical in spatial audio and acoustic signal processing, with wide-ranging real-world applications. Most existing DOA models are trained on synthetic data by convolving clean speech with room impulse responses (RIRs), which limits their generalizability due to constrained acoustic diversity. In this paper, we revisit DOA estimation using a recently introduced dataset constructed with the assistance of large language models (LLMs), which provides more realistic and diverse spatial audio scenes. We benchmark several representative neural-based DOA methods on this dataset and propose LightDOA, a lightweight DOA estimation model based on depthwise separable convolutions, specifically designed for multi-channel input in varying environments. Experimental results show that LightDOA achieves satisfactory accuracy and robustness across various acoustic scenes while maintaining low computational complexity. This study not only highlights the potential of spatial audio synthesized with the assistance of LLMs in advancing robust and efficient DOA estimation research, but also positions LightDOA as an efficient solution for resource-constrained applications.
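The depthwise separable convolution that LightDOA is built on factorizes a standard convolution into a per-channel spatial filter followed by a 1x1 channel-mixing step, which is where the low parameter count comes from. A minimal NumPy sketch of that factorization (shapes and layer sizes here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def depthwise_separable_conv(x, depth_k, point_w):
    """Depthwise separable 2-D convolution (valid padding, stride 1).

    x:       (C, H, W)    multi-channel input, e.g. per-microphone spectrograms
    depth_k: (C, kh, kw)  one spatial kernel per input channel (depthwise stage)
    point_w: (Cout, C)    1x1 pointwise weights that mix channels
    returns: (Cout, H - kh + 1, W - kw + 1)
    """
    C, H, W = x.shape
    _, kh, kw = depth_k.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    depth_out = np.zeros((C, Ho, Wo))
    for c in range(C):                      # each channel is filtered independently
        for i in range(Ho):
            for j in range(Wo):
                depth_out[c, i, j] = np.sum(x[c, i:i + kh, j:j + kw] * depth_k[c])
    return np.einsum('oc,chw->ohw', point_w, depth_out)  # pointwise channel mix

# Parameter count: a standard conv needs Cout*C*kh*kw weights; the separable
# version needs only C*kh*kw + Cout*C. For C=4 mics, Cout=16, 3x3 kernels:
C, Cout, kh, kw = 4, 16, 3, 3
standard_params = Cout * C * kh * kw        # 576
separable_params = C * kh * kw + Cout * C   # 100
```

The roughly 6x parameter reduction in this toy configuration is the generic benefit of the factorization; the actual savings in LightDOA depend on its layer sizes, which are not given here.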
Related papers
- Reciprocal Latent Fields for Precomputed Sound Propagation [0.6474760227870046]
We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting acoustic parameters. We show that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude.
arXiv Detail & Related papers (2026-02-06T18:31:11Z) - SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models [49.313324100819955]
Signal Embedding Energy (SEE) is a method for quantifying the impact of noise intensity on LALM inputs. SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
arXiv Detail & Related papers (2026-01-12T08:57:55Z) - SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models [62.14165748145729]
We introduce SPUR, a lightweight, plug-in approach that equips large audio-language models (LALMs) with spatial perception. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning.
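For context, the directional cue an FOA encoder can exploit is already present in the raw channels: the classical (non-learned) active-intensity method recovers azimuth and elevation directly from the W/X/Y/Z channels. A hedged NumPy sketch, assuming an idealized plane-wave encoding and ignoring Ambisonics normalization conventions (this is not SPUR's encoder):

```python
import numpy as np

def foa_intensity_doa(w, x, y, z):
    """Broadband DOA from first-order Ambisonics via the active intensity vector.

    w, x, y, z: 1-D arrays, the four FOA channels (idealized, identical scaling).
    Returns (azimuth, elevation) in radians.
    """
    # Time-averaged active intensity: pressure (W) times each velocity channel.
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.hypot(ix, iy))
    return azimuth, elevation
```

For a plane wave from azimuth theta at elevation 0, the idealized encoding is W = s, X = s*cos(theta), Y = s*sin(theta), Z = 0, so the time-averaged intensity vector points back at the source.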
arXiv Detail & Related papers (2025-11-10T01:29:26Z) - Fun-ASR Technical Report [89.84148151617022]
We present Fun-ASR, a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning. Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world application requirements. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
arXiv Detail & Related papers (2025-09-15T23:19:36Z) - LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech [35.36044093564255]
LENS-DF is a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization. We conduct experiments based on a self-supervised learning front-end and a simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes.
arXiv Detail & Related papers (2025-07-22T04:31:13Z) - Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach [0.0]
We introduce the Latent Acoustic Mapping (LAM) model, a self-supervised framework that bridges the interpretability of traditional methods with the adaptability and efficiency of deep learning methods. LAM generates high-resolution acoustic maps, adapts to varying acoustic conditions, and operates efficiently across different microphone arrays. We show that LAM's acoustic maps can serve as effective features for supervised models, further enhancing DoAE accuracy and underscoring its potential to advance adaptive, high-performance sound localization systems.
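The traditional acoustic maps that learned approaches like LAM are measured against are typically built from pairwise GCC-PHAT delay estimates. A minimal two-microphone sketch of that textbook baseline (standard method, not LAM itself):

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay of sig relative to ref with GCC-PHAT.

    sig, ref: 1-D microphone signals; fs: sample rate in Hz.
    Returns the delay in seconds (positive if sig lags ref).
    """
    n = sig.size + ref.size                   # zero-pad for linear correlation
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                    # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

With two microphones spaced d meters apart, the estimated delay tau maps to a broadside angle via theta = arcsin(c * tau / d), with c ~ 343 m/s; scanning many candidate directions with many pairs yields the classical acoustic map.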
arXiv Detail & Related papers (2025-07-08T03:35:00Z) - ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling [57.1025908604556]
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment.
We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment.
We introduce ActiveRIR, a reinforcement learning policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions.
arXiv Detail & Related papers (2024-04-24T21:30:01Z) - Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z) - AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.