Do Models Hear Like Us? Probing the Representational Alignment of Audio LLMs and Naturalistic EEG
- URL: http://arxiv.org/abs/2601.16540v1
- Date: Fri, 23 Jan 2026 08:18:29 GMT
- Title: Do Models Hear Like Us? Probing the Representational Alignment of Audio LLMs and Naturalistic EEG
- Authors: Haoyun Yang, Xin Xiao, Jiang Zhong, Yu Tian, Dong Xiaohua, Yu Mao, Hao Wu, Kaiwen Wei,
- Abstract summary: We examine layer-wise representational alignment between 12 open-source Audio LLMs and Electroencephalogram (EEG) signals across 2 datasets.<n>Our analysis reveals 3 key findings: (1) we observe a rank-related split, in which model rankings vary substantially across different similarity metrics; (2) we identify-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400 neural dynamics; and (3) we find an affective whereby negativeody capabilities, identified using a proposed Trimodal- Consistency Neighborhood (TNC) criterion, reduces geometric similarity dependence while enhancing
- Score: 21.253523606290685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and Electroencephalogram (EEG) signals across 2 datasets. Specifically, we employ 8 similarity metrics, such as Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals 3 key findings: (1) we observe a rank-dependence split, in which model rankings vary substantially across different similarity metrics; (2) we identify spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400-related neural dynamics; (3) we find an affective dissociation whereby negative prosody, identified using a proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings provide new neurobiological insights into the representational mechanisms of Audio LLMs.
Related papers
- STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence [81.94084852268468]
We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space.<n> STAR-Bench combines a Foundational Acoustic Perception setting with a Holistic Spatio-Temporal Reasoning setting.<n>Our data curation pipeline uses two methods to ensure high-quality samples.
arXiv Detail & Related papers (2025-10-28T17:50:34Z) - WaveMind: Towards a Conversational EEG Foundation Model Aligned to Textual and Visual Modalities [55.00677513249723]
EEG signals simultaneously encode both cognitive processes and intrinsic neural states.<n>We map EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation.<n>The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations.
arXiv Detail & Related papers (2025-09-26T06:21:51Z) - Concept-Guided Interpretability via Neural Chunking [64.6429903327095]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data.<n>We propose three methods to extract recurring chunks on a neural population level.<n>Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z) - A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework [10.354955365036181]
Despite the crucial role relational thinking plays in human understanding of speech, it has yet to be leveraged in any artificial speech recognition systems.
This paper presents a novel spectro-temporal relational thinking based acoustic modeling framework.
Models built upon this framework outperform stateof-the-art systems with a 7.82% improvement in phoneme recognition tasks over the TIMIT dataset.
arXiv Detail & Related papers (2024-09-17T05:45:33Z) - Analysis of Argument Structure Constructions in a Deep Recurrent Language Model [0.0]
We explore the representation and processing of Argument Structure Constructions (ASCs) in a recurrent neural language model.
Our results show that sentence representations form distinct clusters corresponding to the four ASCs across all hidden layers.
This indicates that even a relatively simple, brain-constrained recurrent neural network can effectively differentiate between various construction types.
arXiv Detail & Related papers (2024-08-06T09:27:41Z) - Interpretable Spatio-Temporal Embedding for Brain Structural-Effective Network with Ordinary Differential Equation [56.34634121544929]
In this study, we first construct the brain-effective network via the dynamic causal model.
We then introduce an interpretable graph learning framework termed Spatio-Temporal Embedding ODE (STE-ODE)
This framework incorporates specifically designed directed node embedding layers, aiming at capturing the dynamic interplay between structural and effective networks.
arXiv Detail & Related papers (2024-05-21T20:37:07Z) - Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that these results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Insights on Neural Representations for End-to-End Speech Recognition [28.833851817220616]
End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation.
Previous investigations of network similarities using correlation analysis techniques have not been explored for End-to-End ASR models.
This paper analyses and explores the internal dynamics between layers during training with CNN, LSTM and Transformer based approaches.
arXiv Detail & Related papers (2022-05-19T10:19:32Z) - Learning Signal-Agnostic Manifolds of Neural Fields [50.066449953522685]
We leverage neural fields to capture the underlying structure in image, shape, audio and cross-modal audiovisual domains.
We show that by walking across the underlying manifold of GEM, we may generate new samples in our signal domains.
arXiv Detail & Related papers (2021-11-11T18:57:40Z) - Extracting the Locus of Attention at a Cocktail Party from Single-Trial
EEG using a Joint CNN-LSTM Model [0.1529342790344802]
Human brain performs remarkably well in segregating a particular speaker from interfering speakers in a multi-speaker scenario.
We present a joint convolutional neural network (CNN) - long short-term memory (LSTM) model to infer the auditory attention.
arXiv Detail & Related papers (2021-02-08T01:06:48Z) - A journey in ESN and LSTM visualisations on a language task [77.34726150561087]
We trained ESNs and LSTMs on a Cross-Situationnal Learning (CSL) task.
The results are of three kinds: performance comparison, internal dynamics analyses and visualization of latent space.
arXiv Detail & Related papers (2020-12-03T08:32:01Z) - Analyzing analytical methods: The case of phonology in neural models of
spoken language [44.00588930401902]
We study the case of representations of phonology in neural network models of spoken language.
We use two commonly applied analytical techniques to quantify to what extent neural activation patterns encode phonemes and phoneme sequences.
arXiv Detail & Related papers (2020-04-15T13:04:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.