AudioRouter: Data Efficient Audio Understanding via RL based Dual Reasoning
- URL: http://arxiv.org/abs/2602.10439v1
- Date: Wed, 11 Feb 2026 02:30:48 GMT
- Title: AudioRouter: Data Efficient Audio Understanding via RL based Dual Reasoning
- Authors: Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang,
- Abstract summary: Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools.
- Score: 29.443084496227026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine grained auditory perception remains unreliable, and existing approaches largely rely on data intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data efficient and scalable alternative to internalizing perceptual abilities in LALMs.
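The abstract describes tool use as an explicit decision-making problem: a lightweight routing policy is optimized with RL while the reasoning model stays frozen. The toy sketch below illustrates that separation with a two-context Bernoulli bandit trained by REINFORCE. The environment, reward values, and all function names are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import math
import random

random.seed(0)

def frozen_lalm_reward(fine_grained: bool, use_tool: bool) -> float:
    """Stub for the frozen reasoning model: returns a task reward for a
    routing decision. Assumed for this toy environment: the external tool
    helps on fine-grained perception queries but adds overhead on coarse
    ones. These values are illustrative, not from the paper."""
    if fine_grained:
        return 1.0 if use_tool else 0.2
    return 0.6 if use_tool else 0.9

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Routing policy: one logit per query type. Only these logits are trained;
# the "reasoning model" above is never updated.
logits = {True: 0.0, False: 0.0}
lr = 0.5

for _ in range(2000):
    fine_grained = random.random() < 0.5       # sample a query type
    p_tool = sigmoid(logits[fine_grained])     # P(call tool | context)
    use_tool = random.random() < p_tool        # sample the routing action
    reward = frozen_lalm_reward(fine_grained, use_tool)
    # REINFORCE update for a Bernoulli policy (no baseline, for brevity):
    # grad of log-prob is (action - p), scaled by the observed reward.
    logits[fine_grained] += lr * (float(use_tool) - p_tool) * reward

# The learned router should invoke the tool only on fine-grained queries.
policy = {ctx: sigmoid(logits[ctx]) > 0.5 for ctx in (True, False)}
print(policy)
```

Because only the per-context logits are optimized, the sample budget is tiny compared with retraining the underlying model, which is the intuition behind the paper's reported data efficiency.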
Related papers
- AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning [36.67330306977483]
Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. We propose AuTAgent, a reinforcement learning framework that learns when and which tools to invoke.
arXiv Detail & Related papers (2026-02-14T09:12:20Z)
- DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding [58.29124051111574]
We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA.
arXiv Detail & Related papers (2026-01-30T16:44:23Z)
- Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
- An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM [15.340075567628466]
This work examined the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Our findings show that while even zero-shot interleaved prompting improves performance on our reasoning tasks, a small amount of fine-tuning improves the results further.
arXiv Detail & Related papers (2025-11-04T03:54:55Z)
- Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models [49.097347801692166]
We introduce Thinking-with-Sound (TwS), a framework that equips Large Audio-Language Models with Audio CoT. TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio.
arXiv Detail & Related papers (2025-09-26T01:27:59Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. These models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
- AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model [53.492751392755636]
We propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality.
We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performance on the widely-used LRS3 dataset.
arXiv Detail & Related papers (2023-08-15T06:38:38Z)
- Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional audio-visual speech enhancement (AV SE) model that uses a modified short-time objective intelligibility (STOI) metric as its training cost function. Our proposed intelligibility-oriented (I-O) AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.