Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models
- URL: http://arxiv.org/abs/2509.18816v1
- Date: Tue, 23 Sep 2025 09:02:15 GMT
- Title: Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models
- Authors: Junyu Wang, Ziyang Ma, Zhengding Luo, Tianrui Wang, Meng Ge, Xiaobao Wang, Longbiao Wang,
- Abstract summary: MATA is a training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains.
- Score: 60.857389526958485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose MATA, a novel training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Specifically, MATA intervenes after the raw attention scoring step, targeting only the last token in intermediate layers without introducing additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
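The abstract pins down where MATA intervenes (the raw, pre-softmax attention scores, the last token only, intermediate layers) but not the exact update rule. The PyTorch snippet below is a minimal sketch of that general idea, additively boosting the final query token's scores at audio-token positions; the additive form, the strength alpha, and how layers are selected are illustrative assumptions, not the paper's actual formula.

```python
import torch

def boost_audio_attention(scores: torch.Tensor,
                          audio_mask: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Nudge the last query token's pre-softmax attention toward audio tokens.

    scores:     (batch, heads, q_len, k_len) raw attention scores
    audio_mask: (batch, k_len) bool, True at audio-token key positions
    alpha:      boost strength (hypothetical; MATA's exact rule may differ)
    """
    out = scores.clone()
    # Modify only the final query row, mirroring the abstract's
    # "targeting only the last token"; all other rows stay untouched.
    out[:, :, -1, :] += alpha * audio_mask[:, None, :].float()
    return out
```

Applied as a hook on the attention modules of a band of intermediate layers, before the softmax, a rule of this shape adds no parameters and essentially no compute, which is consistent with the overhead claim in the abstract.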
Related papers
- DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding [58.29124051111574]
We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA.
arXiv Detail & Related papers (2026-01-30T16:44:23Z)
- Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models [49.097347801692166]
We introduce Thinking-with-Sound (TwS), a framework that equips Large Audio-Language Models with Audio CoT. TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning; a toy reasoning loop is sketched after this entry. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio.
arXiv Detail & Related papers (2025-09-26T01:27:59Z)
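The TwS blurb above describes a model that actively manipulates and re-examines the audio while reasoning, but the interface is not spelled out there. As a hedged toy illustration only, here is one way such an audio chain-of-thought loop could look; model.generate, the TOOL:/ANSWER: protocol, and the tool set are hypothetical stand-ins, not the paper's API.

```python
def audio_cot(model, audio, question, tools, max_steps=4):
    """Toy audio chain-of-thought loop (all interfaces hypothetical).

    model: object with a generate(audio, prompt) -> str method (assumed)
    tools: dict mapping an action name to a function audio -> audio
    """
    context = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model.generate(audio, "\n".join(context))  # hypothetical API
        context.append(step)
        if step.startswith("ANSWER:"):       # the model commits to an answer
            return step.removeprefix("ANSWER:").strip()
        if step.startswith("TOOL:"):         # the model asks to edit the audio
            name = step.removeprefix("TOOL:").strip()
            if name in tools:
                audio = tools[name](audio)   # e.g., denoise, trim, normalize
    # Budget exhausted: force a final answer on the latest audio state.
    return model.generate(audio, "\n".join(context) + "\nANSWER:")
```

The point of the sketch is only the control flow: the audio object itself is updated between reasoning steps, so later steps reason over the manipulated signal rather than a frozen encoding.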
- When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models [18.160420407067743]
MCR-BENCH is the first benchmark designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. We reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input; a small example of such an inconsistency probe follows this entry. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications.
arXiv Detail & Related papers (2025-08-21T09:58:24Z)
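The MCR-BENCH summary does not include its prompt format; the snippet below is a small hypothetical example of how an inconsistent audio-text pair can be assembled to probe text bias, with the wording and fields invented for illustration.

```python
def build_inconsistent_pair(audio_label: str, caption_label: str) -> dict:
    """Assemble a probe whose caption contradicts the audio.

    audio_label:   the sound actually present in the clip
    caption_label: a deliberately wrong label injected through text
    (Prompt wording and scoring fields are assumptions, not MCR-BENCH's.)
    """
    prompt = (f"A caption claims this clip contains {caption_label}. "
              f"Based on the audio itself, what sound is present?")
    # A text-biased model tends to echo caption_label; an audio-grounded
    # model should answer audio_label.
    return {"prompt": prompt,
            "audio_grounded_answer": audio_label,
            "text_biased_answer": caption_label}
```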
- Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model [85.72664004969182]
We introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence.
arXiv Detail & Related papers (2025-06-10T16:37:39Z)
- Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models [13.887164304514101]
The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs). In current AV-LLMs, audio and video features are typically processed jointly in the decoder. We propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications; one plausible shape of the fork-merge idea is sketched after this entry.
arXiv Detail & Related papers (2025-05-27T08:22:56Z)
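The FMD blurb names an inference-time fork-and-merge strategy without detailing the procedure. The sketch below shows one plausible reading, running the early decoder layers separately on audio-suppressed and video-suppressed copies and averaging before the joint layers; the suppression method, the merge rule, and fork_depth are assumptions, not the paper's recipe.

```python
import torch

def fork_merge_decode(layers, hidden, audio_pos, video_pos, fork_depth=8):
    """Sketch of a fork-merge style forward pass (details assumed).

    layers:    list of transformer blocks, each a callable on hidden states
    hidden:    (batch, seq, dim) input hidden states
    audio_pos: sequence indices of audio tokens
    video_pos: sequence indices of video tokens
    """
    # Fork: each branch sees one modality emphasized by zeroing the other.
    h_audio, h_video = hidden.clone(), hidden.clone()
    h_audio[:, video_pos] = 0.0   # audio-centric branch
    h_video[:, audio_pos] = 0.0   # video-centric branch
    for layer in layers[:fork_depth]:
        h_audio, h_video = layer(h_audio), layer(h_video)
    # Merge: combine branch states, then finish with the joint layers.
    h = 0.5 * (h_audio + h_video)
    for layer in layers[fork_depth:]:
        h = layer(h)
    return h
```

Because the intervention touches only the forward pass, it matches the blurb's claim of needing no additional training or architectural changes.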
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds; a toy example of such pair construction follows this entry.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
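As a toy illustration of "contrastive-like" data that contrasts present and absent sounds, the snippet below pairs a yes-question about an event actually in the clip with a no-question about a sampled distractor; the event fields and question templates are hypothetical, and the paper's framework generates richer data than this.

```python
import random

def make_contrastive_qa(clip_events, vocabulary):
    """Build a present/absent question pair for one clip (toy version).

    clip_events: sound events actually present in the clip
    vocabulary:  pool of candidate sound events to draw distractors from
    """
    present = random.choice(clip_events)
    absent = random.choice([e for e in vocabulary if e not in clip_events])
    return [
        {"question": f"Is there a {present} sound in the audio?", "answer": "yes"},
        {"question": f"Is there a {absent} sound in the audio?", "answer": "no"},
    ]
```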
- Audio Mamba: Bidirectional State Space Model for Audio Representation Learning [15.472819870523093]
We introduce Audio Mamba (AuM), the first self-attention-free, purely SSM-based model for audio classification.
We evaluate AuM on various audio datasets - comprising six different benchmarks - where it achieves comparable or better performance.
arXiv Detail & Related papers (2024-06-05T15:00:59Z)
- SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model [12.399378490833818]
Self-Supervised Audio Mamba (SSAMBA) is the first self-supervised, attention-free, and SSM-based model for audio representation learning. Our results demonstrate that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) in most tasks.
arXiv Detail & Related papers (2024-05-20T06:58:47Z)
- uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures [16.59243476473915]
Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data.
Instance discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs.
We introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures; a minimal stand-in for the mixing step is sketched after this entry.
arXiv Detail & Related papers (2024-03-14T17:13:37Z)
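The uaMix-MAE blurb does not define the mixture operation itself. A minimal stand-in, assuming standard mixup applied to batched log-mel spectrograms, is shown below; the real method's mixing strategy and schedule may differ.

```python
import torch

def unsupervised_audio_mixture(batch: torch.Tensor, alpha: float = 0.4):
    """Mix each example with a random partner from the same batch.

    batch: (N, T, F) log-mel spectrograms, no labels required
    alpha: Beta-distribution concentration (a common mixup default)
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(batch.size(0))
    mixed = lam * batch + (1.0 - lam) * batch[perm]
    # perm identifies each example's mixing partner, which an
    # instance-discrimination objective can treat as a positive pair.
    return mixed, perm, lam
```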
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA); the general shape of such a query-based membership signal is sketched after this entry.
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
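The entry reports competitive membership inference with only two queries but does not restate the statistic. Below is a generic sketch of a query-based membership signal for a diffusion model, scoring a candidate by its denoising error at one probed timestep; the model and noise_sched interfaces are assumptions, and plain Gaussian noise is used here instead of the paper's proximal initialization.

```python
import torch

@torch.no_grad()
def membership_score(model, x0, t, noise_sched):
    """Higher score suggests x0 was in the training set (toy signal).

    model:       eps-prediction network, model(x_t, t) -> predicted noise
    x0:          candidate samples, shape (B, ...)
    t:           integer timestep to probe
    noise_sched: object exposing an alphas_cumprod tensor (assumed)
    """
    a_bar = noise_sched.alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # forward diffusion
    err = (model(x_t, t) - eps).flatten(1).pow(2).mean(dim=1)
    return -err  # members are typically denoised more accurately
```

Thresholding this score yields a membership decision; the paper's proximal-initialization trick is a different, more query-efficient choice of the probe noise.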