AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models
- URL: http://arxiv.org/abs/2506.05140v1
- Date: Thu, 05 Jun 2025 15:22:47 GMT
- Title: AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models
- Authors: Chih-Kai Yang, Neo Ho, Yi-Jyun Lee, Hung-yi Lee
- Abstract summary: This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. Our results offer insights into auditory attribute processing, paving the way for future improvements.
- Score: 44.99833362998488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.
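For readers unfamiliar with vocabulary projection (often called the logit lens), the mechanics reduce to a few lines: each layer's hidden state at a chosen token position is passed through the model's unembedding head, and the resulting vocabulary distribution is inspected for attribute words (e.g. "female" for speaker gender). The sketch below is a generic PyTorch illustration, not the paper's code; the `final_norm` handling and the toy demo are assumptions about a typical host model.

```python
import torch

@torch.no_grad()
def vocabulary_projection(hidden_states, lm_head, tokenizer, position=-1,
                          final_norm=None, top_k=5):
    """Project every layer's hidden state at `position` through the output
    head (the "logit lens") and return the top-k tokens per layer.
    `hidden_states`: one (batch, seq, dim) tensor per layer (embeddings + L
    transformer layers, as HuggingFace's output_hidden_states returns them).
    `final_norm`: many LMs apply a final LayerNorm before the head; passing
    it here makes intermediate projections more faithful (an assumption
    about the host model, not something the abstract specifies)."""
    results = []
    for layer_idx, h in enumerate(hidden_states):
        state = h[0, position]
        if final_norm is not None:
            state = final_norm(state)
        probs = lm_head(state).softmax(dim=-1)
        top = probs.topk(top_k)
        tokens = [tokenizer.decode([i]) for i in top.indices.tolist()]
        results.append((layer_idx, list(zip(tokens, top.values.tolist()))))
    return results

if __name__ == "__main__":
    # Toy demo with random weights; real use passes a LALM's hidden states,
    # its lm_head, and its tokenizer.
    dim, vocab, n_layers, seq = 16, 100, 4, 10
    head = torch.nn.Linear(dim, vocab, bias=False)

    class StubTokenizer:
        def decode(self, ids):
            return f"<tok{ids[0]}>"

    states = [torch.randn(1, seq, dim) for _ in range(n_layers + 1)]
    for layer, top in vocabulary_projection(states, head, StubTokenizer()):
        print(layer, top)
```

In practice one tracks the rank or probability of the attribute word across layers; the paper's finding is that resolving the attribute at earlier layers correlates with correct recognition.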
Related papers
- The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents [21.736748922886555]
This research uncovers a novel privacy risk associated with multimodal large language models (MLLMs). The ability to infer sensitive personal attributes from audio data -- a technique we term audio private attribute profiling -- poses a significant threat. We propose Gifts, a hybrid multi-agent framework that leverages the complementary strengths of audio-language models (ALMs) and large language models (LLMs) to enhance inference capabilities.
arXiv Detail & Related papers (2025-07-14T07:51:56Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
We introduce LISTEN, a contrastive-like training method designed to improve ALLMs' ability to distinguish between present and absent sounds. We also extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption. Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
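The summary above doesn't spell out LISTEN's exact objective; as a generic illustration, a "contrastive-like" present-vs-absent loss can be written as a margin between the model's affirmative scores on paired positive and negative clips. The hinge form, the margin, and the random logits below are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def present_absent_loss(score_present, score_absent, margin=1.0):
    """Hinge-style contrastive loss over paired examples: `score_present`
    and `score_absent` are (batch,) logits the model assigns to "yes, the
    sound is there" for a clip that contains the sound vs. one that does
    not. The loss pushes the present score above the absent one by a margin."""
    return F.relu(margin - (score_present - score_absent)).mean()

# Toy usage with random logits standing in for model outputs.
pos, neg = torch.randn(8), torch.randn(8)
print(present_absent_loss(pos, neg))
```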
- Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration [0.8702432681310401]
We investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system. Our analysis reveals that the LLM contributes significantly to improvements in rare word error rate (R-WER). Through extensive ablation studies, we highlight the importance of adapter integration in aligning speech encoder outputs with the LLM's linguistic capabilities.
arXiv Detail & Related papers (2025-02-22T08:30:38Z)
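In such ASR-plus-LLM stacks, "adapter integration" commonly means a small trainable module that downsamples and projects speech-encoder frames into the LLM's embedding space. A hypothetical sketch; the dimensions, stride, and two-layer MLP are not from the paper.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Illustrative adapter: stack groups of `stride` consecutive speech
    frames, then project them to the LLM's embedding width so they can be
    prepended to the text-token embeddings."""
    def __init__(self, speech_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        self.stride = stride
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stride, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frames):                  # frames: (batch, T, speech_dim)
        b, t, d = frames.shape
        t = t - t % self.stride                 # drop a ragged tail, if any
        x = frames[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(x)                     # (batch, T // stride, llm_dim)

# Toy usage: 2 clips, 50 encoder frames each.
print(SpeechAdapter()(torch.randn(2, 50, 1024)).shape)  # torch.Size([2, 12, 4096])
```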
- Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders [29.356200147371275]
Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. We propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective. We also propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations.
arXiv Detail & Related papers (2025-02-21T16:36:42Z)
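Runtime steering over SAE features, as described above, can be sketched generically: decompose a hidden state into feature activations, rescale the feature tied to the target explanation, and decode back while preserving the SAE's reconstruction error. The sizes and the steering rule below are illustrative; the paper's mutual-information objective is not reproduced here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE of the kind used for LLM interpretability (illustrative sizes)."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, h):
        feats = torch.relu(self.enc(h))     # sparse, interpretable activations
        return self.dec(feats), feats

@torch.no_grad()
def steer(sae, h, feature_idx, scale):
    """Rescale one learned feature and map back, keeping the part of `h`
    the SAE fails to reconstruct (the residual error term)."""
    recon, feats = sae(h)
    feats = feats.clone()
    feats[..., feature_idx] *= scale
    return h + (sae.dec(feats) - recon)

sae = SparseAutoencoder()
h = torch.randn(1, 512)
print(steer(sae, h, feature_idx=7, scale=4.0).shape)  # torch.Size([1, 512])
```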
- The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis [0.0]
Large language models (LLMs) represent and recall multi-associated attributes across transformer layers. Intermediate layers encode factual knowledge by superimposing related attributes in overlapping spaces, while later layers refine linguistic patterns and progressively separate attribute representations.
arXiv Detail & Related papers (2025-02-15T18:08:51Z)
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
- Understanding Ranking LLMs: A Mechanistic Analysis for Information Retrieval [20.353393773305672]
We employ a probing-based analysis to examine neuron activations in ranking LLMs. Our study spans a broad range of feature categories, including lexical signals, document structure, query-document interactions, and complex semantic representations. Our findings offer crucial insights for developing more transparent and reliable retrieval systems.
arXiv Detail & Related papers (2024-10-24T08:20:10Z)
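Probing analyses of this kind reduce to fitting small supervised classifiers on cached activations. Below is a synthetic stand-in; the data, probe shape, and the feature category named in the comment are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

def fit_linear_probe(acts, labels, epochs=200, lr=1e-2):
    """Fit a binary linear probe on cached activations (N, d) for one feature
    category, e.g. "a query term appears verbatim in the document"."""
    probe = nn.Linear(acts.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(
            probe(acts).squeeze(-1), labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean()
    return probe, acc.item()

# Synthetic demo: the "signal" lives in the first activation dimension.
acts = torch.randn(256, 64)
labels = (acts[:, 0] > 0).float()
probe, acc = fit_linear_probe(acts, labels)
print(f"probe accuracy: {acc:.2f}")  # high accuracy => linearly decodable
```

In a real study the accuracy would of course be measured on a held-out split rather than the training activations.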
- Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models [49.87432626548563]
We introduce methods to assess the extent of object hallucination of publicly available LALMs.
Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content.
We explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
arXiv Detail & Related papers (2024-06-12T16:51:54Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model [30.88357561791563]
This study focuses on understanding and quantifying the change in phoneme and prosody information encoded in a self-supervised learning (SSL) model.
Results show that the AID fine-tuning task steers the top 2 layers to learn richer phoneme and prosody representations.
arXiv Detail & Related papers (2023-06-10T21:20:47Z)
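Layer-wise probing of this kind fits one classifier per layer and compares accuracies across depth; a rising curve in the top layers would mirror the finding above. Inputs here are synthetic; real inputs would be wav2vec2 frame representations paired with phoneme or prosody labels.

```python
import torch
import torch.nn as nn

def layerwise_probe_accuracy(layer_reps, labels, num_classes, epochs=100, lr=1e-2):
    """`layer_reps`: one (N, dim) tensor of frame representations per layer;
    `labels`: (N,) phoneme (or prosody-bin) ids. Returns accuracy per layer."""
    scores = []
    for reps in layer_reps:
        probe = nn.Linear(reps.shape[1], num_classes)
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            nn.functional.cross_entropy(probe(reps), labels).backward()
            opt.step()
        with torch.no_grad():
            scores.append((probe(reps).argmax(-1) == labels).float().mean().item())
    return scores

# Synthetic demo: 12 "layers" of random features, 40 phoneme classes.
reps = [torch.randn(512, 32) for _ in range(12)]
labels = torch.randint(0, 40, (512,))
print(layerwise_probe_accuracy(reps, labels, num_classes=40))
```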
- Dissecting Recall of Factual Associations in Auto-Regressive Language Models [41.71388509750695]
Transformer-based language models (LMs) are known to capture factual knowledge in their parameters.
We study how the model aggregates information about the subject and relation to predict the correct attribute.
Our findings introduce a comprehensive view of how factual associations are stored and extracted internally in LMs.
arXiv Detail & Related papers (2023-04-28T11:26:17Z)
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
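The mechanism named above, an extra SSL loss on intermediate layers, can be sketched schematically. The HuBERT-style discrete targets, the layer choices, the projection heads, and the weight `alpha` below are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def ils_ssl_loss(layer_states, targets, mask, final_head, inter_heads,
                 inter_layers, alpha=1.0):
    """`layer_states`: per-layer (batch, T, dim) encoder outputs;
    `targets`: (batch, T) discrete pseudo-labels (HuBERT-style clusters);
    `mask`: (batch, T) bool, True where frames were masked. The usual
    top-layer masked-prediction loss is augmented with the same loss on
    selected intermediate layers, pushing them toward content information."""
    def masked_loss(states, head):
        return nn.functional.cross_entropy(head(states[mask]), targets[mask])

    loss = masked_loss(layer_states[-1], final_head)
    for head, layer in zip(inter_heads, inter_layers):
        loss = loss + alpha * masked_loss(layer_states[layer], head)
    return loss

# Toy shapes: 12 layers, 500 clusters, supervising layers 4 and 6 as well.
dim, clusters = 64, 500
states = [torch.randn(2, 30, dim) for _ in range(12)]
targets = torch.randint(0, clusters, (2, 30))
mask = torch.rand(2, 30) > 0.5
final_head = nn.Linear(dim, clusters)
inter_heads = [nn.Linear(dim, clusters) for _ in range(2)]
print(ils_ssl_loss(states, targets, mask, final_head, inter_heads, [4, 6]))
```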
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.