Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
- URL: http://arxiv.org/abs/2602.11909v1
- Date: Thu, 12 Feb 2026 13:06:34 GMT
- Title: Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
- Authors: Daiqing Wu, Xuan Zhang, Dongbao Yang, Jiashu Yao, Longfei Chen, Qingsong Liu, Sicheng Zhao, Can Ma, Yangyang Kang, Yu Zhou
- Abstract summary: Current efforts replicate text-based reasoning by contextualizing audio content through a one-time encoding. We propose audio-interleaved reasoning to break through this bottleneck. We present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning.
- Score: 39.264735719707154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.
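The re-listening loop the abstract describes can be sketched in simplified form. This is a minimal illustration, not the paper's actual implementation: the model names, segment times, and the action interface (`"listen"` vs. `"answer"`) are all hypothetical stand-ins for the idea that the model may request salient audio segments mid-reasoning, which are re-encoded and appended to its context before generation continues.

```python
def mock_model(context):
    """Hypothetical stand-in policy: requests one re-listen, then answers."""
    if not any(step[0] == "audio" for step in context):
        return ("listen", (2.5, 4.0))       # ask to re-hear segment [2.5 s, 4.0 s]
    return ("answer", "a dog barking")

def encode_segment(audio, start, end, sr=16000):
    """Stand-in encoder: slice raw samples for the requested time window."""
    return audio[int(start * sr):int(end * sr)]

def audio_interleaved_reasoning(model, audio, question, max_steps=4):
    # The audio is not consumed once up front; the model can pull segments
    # back into its context at any reasoning step.
    context = [("text", question)]
    for _ in range(max_steps):
        action, payload = model(context)
        if action == "listen":              # re-listen: ground reasoning in audio
            start, end = payload
            context.append(("audio", encode_segment(audio, start, end)))
        else:                               # final answer terminates the loop
            return payload, context
    return None, context

audio = [0.0] * (16000 * 10)                # 10 s of dummy samples at 16 kHz
answer, trace = audio_interleaved_reasoning(mock_model, audio, "What sound occurs?")
print(answer)
```

The contrast with one-time encoding is the loop itself: instead of a fixed audio prefix, the context grows with perception steps the model chose, which is what the paper's SFT stage (localizing salient segments) and RL stage (incentivizing proficient re-listening) would train.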
Related papers
- Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation [30.42124709340273]
We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations.
arXiv Detail & Related papers (2025-11-20T19:17:35Z) - SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models [18.802543558300044]
We present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student. Results show that SightSound-R1 improves LALM reasoning performance both on the in-domain AVQA test set and on unseen auditory scenes and questions.
arXiv Detail & Related papers (2025-09-19T06:39:39Z) - AudioStory: Generating Long-Form Narrative Audio with Large Language Models [87.23256929520743]
AudioStory is a framework that integrates large language models with text-to-audio systems to generate structured, long-form audio narratives. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.
arXiv Detail & Related papers (2025-08-27T17:55:38Z) - ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [47.14083940177122]
ThinkSound is a novel framework that enables stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent foley generation, interactive object-centric refinement, and targeted editing. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Probing Audio-Generation Capabilities of Text-Based Language Models [5.4211188445379825]
This research investigates the extent to which Large Language Models can be prompted to generate audio. We employ a three-tier approach, progressively increasing the complexity of audio generation. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases.
arXiv Detail & Related papers (2025-05-04T23:46:01Z) - Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [91.11904427660043]
We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We train Audio-Reasoner on CoTA, enabling it to achieve strong logical reasoning capabilities on audio tasks. Our findings underscore the central role of structured CoT training in advancing audio reasoning.
arXiv Detail & Related papers (2025-03-04T06:18:34Z) - Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. LALMs excel at general audio understanding, but are limited in temporal reasoning. This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.