Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
- URL: http://arxiv.org/abs/2509.21749v1
- Date: Fri, 26 Sep 2025 01:27:59 GMT
- Title: Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
- Authors: Zhen Xiong, Yujun Cai, Zhecheng Li, Junsong Yuan, Yiwei Wang
- Abstract summary: We introduce Thinking-with-Sound (TwS), a framework that equips Large Audio-Language Models with Audio CoT. TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio.
- Score: 49.097347801692166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain 24.73% absolute accuracy, with improvements scaling consistently up to 36.61% for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.
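The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical Python sketch of what an Audio CoT loop of this kind might look like: the model measures a signal property, optionally invokes an audio tool (here, a toy spectral gate standing in for real noise suppression), and then answers conditioned on the manipulated audio. Every name here (estimate_snr, suppress_noise, the lalm object and its generate method) is an illustrative assumption, not the paper's actual API.

```python
# Hypothetical Audio CoT loop in the spirit of TwS. All names are
# illustrative assumptions, not the paper's implementation.
import numpy as np

def estimate_snr(audio: np.ndarray, frame: int = 512) -> float:
    """Crude SNR proxy: loud-frame energy vs. quiet-frame energy, in dB."""
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)
    signal = np.percentile(energy, 90)
    noise = np.percentile(energy, 10) + 1e-12
    return float(10 * np.log10(signal / noise + 1e-12))

def suppress_noise(audio: np.ndarray, frame: int = 512) -> np.ndarray:
    """Toy spectral gate standing in for a real enhancement model."""
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    spec = np.fft.rfft(frames, axis=1)
    floor = np.percentile(np.abs(spec), 20, axis=0)   # per-bin noise floor
    gate = np.abs(spec) > 2.0 * floor                 # keep bins above floor
    return np.fft.irfft(spec * gate, n=frame, axis=1).ravel()

def audio_cot_answer(audio: np.ndarray, question: str, lalm) -> str:
    """Interleave text reasoning with on-the-fly audio manipulation."""
    trace = [f"Measured SNR ~ {estimate_snr(audio):.1f} dB."]
    if estimate_snr(audio) < 10.0:        # heuristic: clip is too noisy
        audio = suppress_noise(audio)
        trace.append("Low SNR: applied noise suppression; re-listening.")
    prompt = "\n".join(trace) + "\n" + question
    return lalm.generate(audio=audio, prompt=prompt)  # hypothetical API
```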
Related papers
- Eureka-Audio: Triggering Audio Intelligence in Compact Language Models [28.38037427018435]
We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against larger models. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed-loop audio instruction data synthesis and verification pipeline.
arXiv Detail & Related papers (2026-02-15T02:01:08Z)
- Towards Audio Token Compression in Large Audio Language Models [26.379508239446935]
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. This paper explores techniques to reduce the number of audio tokens after they are produced by the LALM's audio encoder but before they are consumed by the LLM decoder. (A hedged pooling sketch follows this entry.)
arXiv Detail & Related papers (2025-11-26T02:00:38Z)
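The abstract above does not pin down a specific compression scheme; the following is a hedged sketch of one common approach consistent with its framing, where encoder output frames are pooled along time before the LLM decoder sees them. The stride of 4, the linear projection, and the module name AudioTokenPooler are assumptions for illustration.

```python
# Hedged sketch of audio token compression by temporal pooling. The 4x
# rate and module names are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class AudioTokenPooler(nn.Module):
    """Compress a (batch, time, dim) sequence of audio tokens by `stride`."""
    def __init__(self, dim: int, stride: int = 4):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(dim * stride, dim)  # mix the pooled frames

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, t, d = tokens.shape
        t = t - t % self.stride                   # drop the ragged tail
        grouped = tokens[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(grouped)                 # (b, t // stride, d)

# Usage: 3000 encoder frames become 750 tokens for the LLM decoder.
pooler = AudioTokenPooler(dim=768, stride=4)
compressed = pooler(torch.randn(1, 3000, 768))    # -> (1, 750, 768)
```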
- AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs [53.248502396225724]
AudioMarathon is a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. The results show large gaps across current LALMs and highlight the need for better temporal reasoning.
arXiv Detail & Related papers (2025-10-08T17:50:16Z)
- Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models [60.857389526958485]
MATA is a training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains. (A hedged sketch of this kind of attention biasing follows this entry.)
arXiv Detail & Related papers (2025-09-23T09:02:15Z)
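MATA's exact intervention is described in the paper; as a hedged illustration of the general idea, the snippet below adds a positive bias to pre-softmax self-attention logits at audio-token key positions, which shifts probability mass toward audio tokens. The bias value and injection point are assumptions, not MATA's mechanism.

```python
# Hedged illustration of biasing self-attention toward audio tokens.
# The bias value and injection point are assumptions, not MATA's mechanism.
import torch

def boost_audio_attention(logits: torch.Tensor,
                          audio_key_mask: torch.Tensor,
                          bias: float = 1.0) -> torch.Tensor:
    """logits: (batch, heads, q_len, k_len) pre-softmax attention scores.
    audio_key_mask: (batch, k_len) bool, True where the key is an audio token.
    Adding a constant to audio-key logits shifts attention mass to audio."""
    logits = logits + bias * audio_key_mask[:, None, None, :].float()
    return torch.softmax(logits, dim=-1)

# Usage: 20 of 128 key positions are audio tokens.
scores = torch.randn(2, 8, 128, 128)
mask = torch.zeros(2, 128, dtype=torch.bool)
mask[:, :20] = True
attn = boost_audio_attention(scores, mask)  # rows still sum to 1
```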
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds. (A hedged sketch of such negative-sample synthesis follows this entry.)
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
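As a hedged sketch of how contrastive-like negative samples of this sort could be synthesized (relevant to both this entry and the companion entry above), the snippet pairs a question about a sound present in a labeled clip with one about a sound sampled from outside the clip's labels. The vocabulary, template, and record format are illustrative assumptions, not the papers' pipeline.

```python
# Hedged sketch of contrastive-like negative-sample synthesis. Vocabulary,
# template, and record format are illustrative assumptions.
import random

SOUND_VOCAB = ["dog barking", "siren", "applause", "rain", "door knock"]

def make_pair(clip_id: str, present_events: list, rng: random.Random) -> list:
    """Pair a present-sound question (answer: yes) with an absent-sound one."""
    present = rng.choice(present_events)
    absent = rng.choice([s for s in SOUND_VOCAB if s not in present_events])
    q = "Is there a {} in the audio?"
    return [
        {"audio": clip_id, "question": q.format(present), "answer": "Yes."},
        {"audio": clip_id, "question": q.format(absent), "answer": "No."},
    ]

pairs = make_pair("clip_0001.wav", ["dog barking", "rain"], random.Random(0))
```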
- Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [91.11904427660043]
We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We train Audio-Reasoner on CoTA, enabling it to achieve strong logical reasoning capabilities on audio tasks. Our findings underscore the central role of structured CoT training in advancing audio reasoning. (A hedged sketch of one possible structured-CoT record follows this entry.)
arXiv Detail & Related papers (2025-03-04T06:18:34Z)
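The abstract names CoTA but not its schema; below is a hedged sketch of what one structured-CoT training record might look like under an assumed planning/caption/reasoning/answer layout. The field names and content are illustrative, not the dataset's actual format.

```python
# Hedged sketch of a structured-CoT record under an assumed schema; the
# paper defines CoTA's actual format.
example = {
    "audio": "clip_0042.wav",
    "question": "Which speaker sounds angrier, and why?",
    "chain_of_thought": {
        "planning": "Identify speakers, then compare pitch, energy, and rate.",
        "caption": "Two speakers; the second raises volume and speaks faster.",
        "reasoning": "Higher energy and speech rate suggest speaker 2 is angrier.",
    },
    "answer": "Speaker 2, based on raised volume and faster speech.",
}
```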
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer [2.443213094810588]
Efficient Audio Transformer (EAT) is inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality.
A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events.
Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks. (A hedged sketch of an utterance-frame style objective follows this entry.)
arXiv Detail & Related papers (2024-01-07T14:31:27Z)
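EAT's UFO couples utterance-level and frame-level modeling; as a hedged sketch under assumed details, the snippet below combines a per-frame masked-prediction loss with a global utterance-level loss using plain MSE and fixed weights. EAT's actual formulation is defined in the paper.

```python
# Hedged sketch of an utterance-frame style objective. Plain MSE and the
# equal default weights are assumptions; see the paper for EAT's actual UFO.
import torch
import torch.nn.functional as F

def ufo_loss(student_frames: torch.Tensor, teacher_frames: torch.Tensor,
             student_utt: torch.Tensor, teacher_utt: torch.Tensor,
             frame_w: float = 1.0, utt_w: float = 1.0) -> torch.Tensor:
    """*_frames: (batch, time, dim) per-frame features;
    *_utt: (batch, dim) global (utterance-level) features.
    Teacher targets are detached, as in data2vec-style self-distillation."""
    frame_loss = F.mse_loss(student_frames, teacher_frames.detach())
    utt_loss = F.mse_loss(student_utt, teacher_utt.detach())
    return frame_w * frame_loss + utt_w * utt_loss
```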