Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs
- URL: http://arxiv.org/abs/2511.13273v1
- Date: Mon, 17 Nov 2025 11:45:41 GMT
- Title: Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs
- Authors: Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang
- Abstract summary: Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. We introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding.
- Score: 39.209987830131816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding: a controlled question-answering benchmark that tests whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns; average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights the gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.
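To make the task concrete, here is a minimal sketch, assuming numpy, of how a moving binaural stimulus of the kind AMPBench probes could be synthesized from a mono signal by sweeping the interaural time difference (ITD) and interaural level difference (ILD). The function name and parameter values are illustrative, not the paper's actual generation pipeline.

```python
# Illustrative only: render a mono signal so its apparent azimuth sweeps
# from azim_start to azim_end, using time-varying ITD/ILD cues.
import numpy as np

def moving_source_binaural(mono, sr, azim_start=-90.0, azim_end=90.0,
                           max_itd_s=660e-6, max_ild_db=10.0):
    n = len(mono)
    azim = np.linspace(azim_start, azim_end, n)   # azimuth per sample (deg)
    pan = np.sin(np.deg2rad(azim))                # -1 = hard left, +1 = hard right

    # ILD: make the near ear louder and the far ear quieter, up to max_ild_db.
    gain = 10.0 ** (max_ild_db * pan / 20.0)
    left = mono / np.sqrt(gain)
    right = mono * np.sqrt(gain)

    # ITD: delay the far ear by up to max_itd_s (~660 us, the human maximum).
    itd = pan * max_itd_s * sr                    # signed delay in samples
    t = np.arange(n, dtype=float)
    left = np.interp(t - np.clip(itd, 0, None), t, left)    # source on the right
    right = np.interp(t + np.clip(itd, None, 0), t, right)  # source on the left
    return np.stack([left, right], axis=-1)

sr = 16000
mono = 0.1 * np.random.randn(2 * sr)              # 2 s of noise as a toy source
stereo = moving_source_binaural(mono, sr)         # left-to-right sweep
```

A benchmark item would then pair such a clip with a question like "Is the source moving left-to-right or right-to-left?" and score the model's answer.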
Related papers
- Spatial Audio Question Answering and Reasoning on Dynamic Source Movements [8.029049649310211]
First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events. Second, we propose an end-to-end multimodal finetuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes.
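As a rough illustration of what a thinking-mode prompt might look like (the paper's actual prompt wording is not shown here, so this template is purely hypothetical):

```python
# Hypothetical prompt template: elicit intermediate spatial reasoning
# before the final answer. Not the authors' actual prompt.
THINKING_PROMPT = (
    "Listen to the binaural clip and answer the question.\n"
    "Question: {question}\n"
    "First describe the spatial cues you hear (level and timing differences "
    "between the ears, and how they change over time), then give your final "
    "answer on a new line as: Answer: <direction>."
)

def build_prompt(question: str) -> str:
    return THINKING_PROMPT.format(question=question)

print(build_prompt("Is the source moving left-to-right or right-to-left?"))
```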
arXiv Detail & Related papers (2026-02-18T10:16:30Z)
- WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations [67.6147632074449]
We introduce the World-of-Whale benchmark (WoW-Bench) to evaluate low-level auditory perception and cognition using marine mammal vocalizations. WoW-Bench is composed of a Perception benchmark for categorizing novel sounds and a Cognition benchmark, inspired by Bloom's taxonomy, to assess the abilities to remember, understand, apply, and analyze sound events. Experiments with state-of-the-art LALMs show performance far below human levels, indicating a need for stronger auditory grounding in LALMs.
arXiv Detail & Related papers (2025-08-28T16:29:46Z)
- Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
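A minimal sketch of the feature-distillation idea, assuming a generic PyTorch setup; `student`, `teacher`, and `corrupt` are placeholders, and the paper's actual objective and architecture may differ:

```python
# Label-free distillation step: the student sees a corrupted (noisy or
# reverberant) copy of the input and must match the frozen teacher's
# features computed on the clean input.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, clean_batch, corrupt, optimizer):
    teacher.eval()
    with torch.no_grad():
        target = teacher(clean_batch)     # spatial features from clean audio
    pred = student(corrupt(clean_batch))  # student works from degraded audio
    loss = F.mse_loss(pred, target)       # match teacher features, no labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```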
arXiv Detail & Related papers (2025-08-28T15:43:15Z)
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual form.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
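Consistency between an LLM judge and human raters is typically quantified with a rank correlation; a toy check with made-up scores (not AIR-Bench's actual data or protocol) might look like:

```python
# Toy agreement check between GPT-4-assigned and human ratings.
from scipy.stats import spearmanr

gpt4_scores = [4, 3, 5, 2, 4, 1]    # hypothetical judge ratings per response
human_scores = [5, 3, 4, 2, 4, 1]   # hypothetical human ratings, same items

rho, p = spearmanr(gpt4_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # high rho => consistent judges
```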
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- BAT: Learning to Reason about Spatial Sounds with Large Language Models [45.757161909533714]
We present BAT, which combines the sound perception ability of a spatial scene analysis model with the natural language reasoning capabilities of a large language model (LLM). Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning.
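The common pattern behind such systems is an audio encoder whose features are projected into the LLM's embedding space; a skeletal PyTorch sketch with placeholder modules and an assumed HuggingFace-style `inputs_embeds` interface (not BAT's actual code):

```python
# Skeletal "spatial audio encoder -> projection -> LLM" wiring.
import torch
import torch.nn as nn

class SpatialAudioLLM(nn.Module):
    def __init__(self, audio_encoder, llm, audio_dim, llm_dim):
        super().__init__()
        self.audio_encoder = audio_encoder         # spatial scene analysis model
        self.proj = nn.Linear(audio_dim, llm_dim)  # map audio features to LLM space
        self.llm = llm                             # pretrained language model

    def forward(self, binaural_audio, text_embeds):
        feats = self.audio_encoder(binaural_audio)  # (B, T_audio, audio_dim)
        audio_tokens = self.proj(feats)             # (B, T_audio, llm_dim)
        # Prepend projected audio tokens to the text embeddings.
        inputs = torch.cat([audio_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```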
arXiv Detail & Related papers (2024-02-02T17:34:53Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)