Step-Audio-R1 Technical Report
- URL: http://arxiv.org/abs/2511.15848v2
- Date: Wed, 26 Nov 2025 14:55:41 GMT
- Title: Step-Audio-R1 Technical Report
- Authors: Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
- Abstract summary: We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro.
- Score: 70.37077572409319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
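The report is summarized above without code; as a purely illustrative sketch, the snippet below shows one way a distillation-and-filtering loop in the spirit of MGRD could be organized: a teacher drafts audio-conditioned reasoning chains, chains that fail an acoustic-grounding check are re-drafted or dropped, and the survivors become fine-tuning targets. Every function, field, and threshold here is a hypothetical stand-in, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class ReasoningSample:
    audio_id: str
    question: str
    chain: str    # teacher-drafted chain-of-thought
    answer: str

# Hypothetical stubs: a real pipeline would call a teacher model and an
# acoustic-grounding scorer; here they are placeholders.
def draft_chain(audio_id: str, question: str) -> ReasoningSample:
    """Ask the teacher to reason about the audio before answering."""
    raise NotImplementedError

def grounding_score(sample: ReasoningSample) -> float:
    """Estimate how much the chain cites audible evidence (e.g., overlap
    with acoustic-event tags) rather than generic text-only deliberation."""
    raise NotImplementedError

def build_distillation_set(items, threshold: float = 0.5, retries: int = 4):
    """Keep only chains that stay anchored in the audio; re-draft the rest."""
    kept = []
    for audio_id, question in items:
        for _ in range(retries):
            sample = draft_chain(audio_id, question)
            if grounding_score(sample) >= threshold:
                kept.append(sample)  # audio-grounded chain: accept
                break
    return kept  # fine-tune the student on these accepted chains
```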
Related papers
- Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning [39.264735719707154]
Current efforts replicate text-based reasoning by contextualizing audio content through a one-time encoding. We propose audio-interleaved reasoning to break through this bottleneck. We present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning.
arXiv Detail & Related papers (2026-02-12T13:06:34Z)
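Echo's abstract describes re-listening to the audio on demand during reasoning instead of relying on a one-time encoding. A minimal sketch of such an interleaved loop, with a hypothetical model interface that can either request another listen or commit to an answer, might look like this:

```python
# Hypothetical interfaces: `model.step` continues the reasoning and may emit
# a re-listen request; `model.encode` re-encodes an audio span on demand.
def answer_with_relistening(model, audio, question, max_steps=8):
    context = [("question", question), ("audio", model.encode(audio))]
    for _ in range(max_steps):
        action = model.step(context)  # ("listen", t0, t1) or ("answer", text)
        if action[0] == "answer":
            return action[1]
        _, t0, t1 = action
        # Re-listen: inject fresh features for the requested span so later
        # reasoning steps can condition on them directly.
        context.append(("audio", model.encode(audio[t0:t1])))
    return model.finalize(context)  # force an answer if steps run out
```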
- Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception [142.4692205981783]
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable primitive proves effective in preventing the model from being derailed by flawed external candidates.
arXiv Detail & Related papers (2026-01-14T12:06:50Z)
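The self-reflection decision Speech-Hands describes, trusting the model's own hypothesis versus consulting external audio perception, can be illustrated with a toy gate. The confidence scores, threshold, and candidate list below are illustrative assumptions, not the paper's mechanism:

```python
# A toy rendering of the "trust self vs. consult external perception"
# decision described in the abstract. Confidence estimation and the external
# candidates are hypothetical stand-ins.
def transcribe_with_reflection(own_hypothesis: str,
                               own_confidence: float,
                               external_candidates: list[str],
                               external_scores: list[float],
                               trust_threshold: float = 0.8) -> str:
    # Self-reflection step: is the model's own hypothesis trustworthy?
    if own_confidence >= trust_threshold or not external_candidates:
        return own_hypothesis
    # Otherwise consult external perception, but guard against flawed
    # candidates by requiring them to beat the model's own confidence.
    best = max(range(len(external_candidates)), key=external_scores.__getitem__)
    if external_scores[best] > own_confidence:
        return external_candidates[best]
    return own_hypothesis
```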
- SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models [18.802543558300044]
We present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student. Results show that SightSound-R1 improves LALM reasoning performance both on the in-domain AVQA test set and on unseen auditory scenes and questions.
arXiv Detail & Related papers (2025-09-19T06:39:39Z)
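One plausible reading of this cross-modal distillation setup is a data pipeline in which an LVLM teacher reasons over the visual stream and only answer-verified traces supervise the audio-only student. The sketch below encodes that reading; the interfaces and the answer-verification filter are assumptions:

```python
# Hypothetical pipeline: an LVLM teacher reasons over video frames, and the
# resulting (audio, question, reasoning, answer) tuples supervise the LALM.
def build_cross_modal_targets(clips, lvlm_teacher, answer_key):
    targets = []
    for clip in clips:
        reasoning, answer = lvlm_teacher.reason(clip.frames, clip.question)
        # Keep only teacher traces that reach the reference answer, so the
        # student is not distilled on confidently wrong reasoning.
        if answer == answer_key[clip.clip_id]:
            targets.append({
                "audio": clip.audio,      # the student sees audio only
                "question": clip.question,
                "reasoning": reasoning,   # teacher's chain becomes the target
                "answer": answer,
            })
    return targets
```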
- AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning [3.949628618389608]
AURA is a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains: causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling. We propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity.
arXiv Detail & Related papers (2025-08-10T20:06:42Z)
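The abstract does not spell out how AuraScore decomposes reasoning fidelity; the toy metric below simply separates final-answer correctness from the fraction of reasoning steps judged supported by the evidence, as one hedged illustration of a decomposed score:

```python
# A toy decomposition in the spirit of a reasoning-fidelity metric: score
# the answer and each reasoning step separately, then average the parts.
# The two-way split and `step_supported` judge are illustrative assumptions,
# not the paper's exact definition of AuraScore.
def aura_style_score(steps: list[str], answer: str, reference: str,
                     step_supported) -> dict:
    answer_ok = 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0
    # Fidelity: share of reasoning steps judged grounded in the A/V evidence.
    fidelity = sum(step_supported(s) for s in steps) / len(steps) if steps else 0.0
    return {
        "answer": answer_ok,
        "fidelity": fidelity,
        "overall": 0.5 * (answer_ok + fidelity),
    }
```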
- Step-Audio 2 Technical Report [120.58375054866815]
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding.
arXiv Detail & Related papers (2025-07-22T14:23:55Z)
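The abstract gives no detail on the reward used in its reasoning-centric RL; as a generic illustration only, a verifiable reward for reasoning-then-answer outputs often combines answer correctness with a format check, roughly as follows:

```python
# A minimal verifiable-reward sketch, not Step-Audio 2's actual reward:
# reward correct final answers and give a small bonus for a non-trivial
# reasoning trace. Coefficients and the output format are assumptions.
def rl_reward(response: str, reference_answer: str) -> float:
    # Expect responses shaped like "<think> ... </think> answer".
    think, sep, answer = response.partition("</think>")
    if not sep:   # malformed output: missing reasoning delimiter
        return -1.0
    answer_ok = answer.strip().lower() == reference_answer.strip().lower()
    has_reasoning = len(think.replace("<think>", "").split()) >= 10
    return (1.0 if answer_ok else 0.0) + (0.2 if has_reasoning else 0.0)
```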
- Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.84031769492708]
This task defines three QA subsets to test audio-language models on interactive question-answering over diverse acoustic scenes. Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity.
arXiv Detail & Related papers (2025-05-12T09:04:16Z)
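A benchmark with several subsets is usually scored per subset so that the cross-subset variation mentioned above stays visible. A minimal harness, with a hypothetical item layout and `predict` callable, could be:

```python
from collections import defaultdict

# Hypothetical data layout: each item carries its subset name, a question,
# answer choices, and a reference; `predict` is the model under test.
def evaluate_by_subset(items, predict):
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = predict(item["audio"], item["question"], item["choices"])
        total[item["subset"]] += 1
        correct[item["subset"]] += int(pred == item["answer"])
    # Per-subset accuracy exposes the strong cross-subset variation the
    # organizers report, which a single pooled score would hide.
    return {name: correct[name] / total[name] for name in total}
```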
- Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [91.11904427660043]
We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We train Audio-Reasoner on CoTA, enabling it to achieve strong logical reasoning capabilities in audio tasks. Our findings underscore the central role of structured CoT training in advancing audio reasoning.
arXiv Detail & Related papers (2025-03-04T06:18:34Z)
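The CoTA schema is not reproduced here; the snippet below merely illustrates what a structured CoT training target with separate planning, captioning, reasoning, and summary sections might look like. The tag names are assumptions:

```python
# Illustrative formatting of a structured CoT training target. The four
# section names are an assumption about what "structured CoT" could look
# like, not the exact CoTA schema.
def format_cot_target(planning: str, caption: str,
                      reasoning: str, summary: str) -> str:
    return (
        f"<PLANNING>{planning}</PLANNING>\n"
        f"<CAPTION>{caption}</CAPTION>\n"   # what was heard in the audio
        f"<REASONING>{reasoning}</REASONING>\n"
        f"<SUMMARY>{summary}</SUMMARY>"     # final answer the model commits to
    )

example = format_cot_target(
    planning="Identify the sound sources, then compare their onset times.",
    caption="A dog barks twice, followed by a car horn.",
    reasoning="The bark precedes the horn, so the dog is heard first.",
    summary="The dog barking comes first.",
)
```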
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
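Prompt-enhanced text-to-audio diffusion models typically rely on classifier-free guidance at sampling time. The schematic step below shows that generic mechanism, not Make-An-Audio's released code; `eps_model` and the embeddings are placeholders:

```python
# A generic classifier-free-guidance step for text-conditioned diffusion,
# of the kind text-to-audio latent diffusion models commonly use.
# `eps_model` predicts noise given a latent, a timestep, and a conditioning
# embedding; `null_emb` is the embedding of the empty prompt.
def guided_noise(eps_model, latent, t, text_emb, null_emb, scale=3.0):
    eps_uncond = eps_model(latent, t, null_emb)   # unconditional branch
    eps_cond = eps_model(latent, t, text_emb)     # prompt-conditioned branch
    # Push the prediction toward the prompt-conditioned direction.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```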