SonicBench: Dissecting the Physical Perception Bottleneck in Large Audio Language Models
- URL: http://arxiv.org/abs/2601.11039v1
- Date: Fri, 16 Jan 2026 07:10:57 GMT
- Title: SonicBench: Dissecting the Physical Perception Bottleneck in Large Audio Language Models
- Authors: Yirong Sun, Yanjun Chen, Xin Qiu, Gang Zhang, Hongyu Chen, Daokuan Wu, Chengming Li, Min Yang, Dawei Zhu, Wei Zhang, Xiaoyu Shen
- Abstract summary: Large Audio Language Models (LALMs) excel at semantic and paralinguistic tasks, yet their ability to perceive the fundamental physical attributes of audio remains under-explored. We introduce SonicBench, a psychophysically grounded benchmark that systematically evaluates 12 core physical attributes across five dimensions. Our evaluation reveals a substantial deficiency in LALMs' foundational auditory understanding.
- Score: 30.62556746827114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Audio Language Models (LALMs) excel at semantic and paralinguistic tasks, yet their ability to perceive the fundamental physical attributes of audio, such as pitch, loudness, and spatial location, remains under-explored. To bridge this gap, we introduce SonicBench, a psychophysically grounded benchmark that systematically evaluates 12 core physical attributes across five perceptual dimensions. Unlike previous datasets, SonicBench uses a controllable generation toolbox to construct stimuli for two complementary paradigms: recognition (absolute judgment) and comparison (relative judgment). This design allows us to probe not only sensory precision but also relational reasoning capabilities, a domain where humans typically exhibit greater proficiency. Our evaluation reveals a substantial deficiency in LALMs' foundational auditory understanding; most models perform near random guessing and, contrary to human patterns, fail to show the expected advantage on comparison tasks. Furthermore, explicit reasoning yields minimal gains. Crucially, however, our linear probing analysis demonstrates that frozen audio encoders do successfully capture these physical cues (at least 60% accuracy), suggesting that the primary bottleneck lies in the alignment and decoding stages, where models fail to leverage the sensory signals they have already captured.
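As a rough, hypothetical illustration of the linear-probing analysis described above, the sketch below fits a linear classifier on pooled embeddings from a frozen audio encoder to measure how much physical-attribute information is linearly decodable; the feature source, label binning, and split are placeholder assumptions, not the paper's exact protocol.

```python
# Minimal linear-probe sketch: test whether a *frozen* encoder's embeddings
# linearly encode a physical attribute (e.g., pitch bins). Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """features: (n_clips, dim) pooled embeddings from a frozen audio encoder.
    labels: (n_clips,) discrete physical-attribute classes."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0, stratify=labels)
    probe = LogisticRegression(max_iter=1000)  # the probe itself stays linear
    probe.fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```

Held-out probe accuracy of at least 0.6, as reported in the abstract, would indicate that the encoder captures the cue even when the full model's answers do not reflect it.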
Related papers
- PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation [63.3417467957431]
Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content. We present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to evaluate the audio physics grounding capabilities of existing T2AV models. Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models' understanding of the physical mechanisms underlying sound generation.
arXiv Detail & Related papers (2025-12-30T05:22:31Z)
- Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model [0.27528170226206433]
We introduce Brain-Semantoks, a self-supervised framework to learn abstract representations of brain dynamics. Its architecture is built on two core innovations, including a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks. We show that the learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe.
arXiv Detail & Related papers (2025-12-12T14:11:20Z)
- Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs [39.209987830131816]
Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive dynamics, particularly the motion of sound sources, remains unclear. We introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding.
arXiv Detail & Related papers (2025-11-17T11:45:41Z)
- STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence [81.94084852268468]
We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space. STAR-Bench combines a Foundational Acoustic Perception setting with a Holistic Spatio-Temporal Reasoning setting. Our data curation pipeline uses two methods to ensure high-quality samples.
arXiv Detail & Related papers (2025-10-28T17:50:34Z)
- SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams [57.84331423686738]
Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, a computational step not found in biological intelligence. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway, processing raw, asynchronous events from stereo spike cameras, much as retinas do, to directly infer grasp poses. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud.
arXiv Detail & Related papers (2025-10-12T13:36:40Z)
- Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
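A minimal sketch of one way such feature distillation could look, assuming a frozen teacher fed clean binaural audio and a student fed a degraded copy; the toy networks and the noise stand-in below are illustrative, not the paper's models.

```python
# Feature-distillation step (sketch): the student matches the frozen
# teacher's features while seeing a degraded version of the same input.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Flatten(), nn.Linear(2 * 16000, 128)).eval()
student = nn.Sequential(nn.Flatten(), nn.Linear(2 * 16000, 128))
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher stays frozen during pretraining

def degrade(x: torch.Tensor) -> torch.Tensor:
    return x + 0.05 * torch.randn_like(x)  # stand-in for noise/reverberation

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
clean = torch.randn(8, 2, 16000)  # batch of 1 s binaural clips at 16 kHz
loss = nn.functional.mse_loss(student(degrade(clean)), teacher(clean))
loss.backward()
optimizer.step()  # one distillation update; no labels required
```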
arXiv Detail & Related papers (2025-08-28T15:43:15Z)
- Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis [6.858439600092057]
We explore the use of prosody, or the high-level linguistic features of human speech, as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models. We also show that we can explain which prosodic features have the highest impact on the model's decision.
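For illustration, a hedged sketch of a classical prosody pipeline in this spirit: extract a handful of prosodic statistics per utterance and fit an off-the-shelf classifier. The six statistics below (pitch and energy moments, voicing rate, duration) are common choices assumed here, not necessarily the paper's feature set.

```python
# Prosody-based detector sketch: six simple prosodic statistics per clip.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def prosodic_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    if f0.size == 0:
        f0 = np.zeros(1)  # guard against fully unvoiced clips
    rms = librosa.feature.rms(y=y)[0]
    return np.array([f0.mean(), f0.std(), rms.mean(), rms.std(),
                     voiced_flag.mean(), len(y) / sr])

# Usage with the caller's own file paths and real/fake labels:
# X = np.stack([prosodic_features(p) for p in paths])
# clf = RandomForestClassifier(random_state=0).fit(X, labels)
```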
arXiv Detail & Related papers (2025-02-20T16:52:55Z)
- BAT: Learning to Reason about Spatial Sounds with Large Language Models [45.757161909533714]
We present BAT, which combines the sound perception ability of a spatial scene analysis model with the natural language reasoning capabilities of a large language model (LLM). Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning.
arXiv Detail & Related papers (2024-02-02T17:34:53Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
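Work in this area typically fits linear encoding models from frozen model activations to measured brain responses; the sketch below shows that standard recipe with ridge regression, using random arrays as stand-ins for real aligned data. The shapes and the recipe itself are assumptions about common practice, not details taken from the paper.

```python
# Encoding-model sketch: ridge-regress cortical responses onto frozen
# speech-model features; scoring would use held-out data in practice.
import numpy as np
from sklearn.linear_model import RidgeCV

features = np.random.randn(1000, 768)   # (timepoints, model feature dim)
responses = np.random.randn(1000, 50)   # (timepoints, voxels/electrodes)
encoder = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(features, responses)
print(encoder.score(features, responses))  # in-sample R^2 for the sketch
```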
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show advantages in compact feature dimensionality and improved computational speed at test time.
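A small sketch of the general idea, assuming hand-crafted features tracked over time: the pairwise correlations of feature trajectories form a compact, fixed-size vector. The paper's exact construction may differ.

```python
# Correlation-based representation sketch: summarize feature trajectories
# by the upper triangle of their correlation matrix.
import numpy as np

def correlation_representation(feats: np.ndarray) -> np.ndarray:
    """feats: (n_features, n_frames) hand-crafted feature trajectories."""
    corr = np.corrcoef(feats)             # (n_features, n_features)
    iu = np.triu_indices_from(corr, k=1)  # unique feature pairs only
    return corr[iu]                       # fixed-size, compact vector
```

For n features this yields n(n-1)/2 values, which is how the dimensionality stays compact.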
arXiv Detail & Related papers (2020-03-22T13:34:37Z)