The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents
- URL: http://arxiv.org/abs/2602.14224v1
- Date: Sun, 15 Feb 2026 16:38:09 GMT
- Title: The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents
- Authors: Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li, Jaeyeon Kim, Jin Xu, Jinyu Li, Carlos Busso, Kai Yu, Eng Siong Chng, Xie Chen
- Abstract summary: We organized the Audio Reasoning Challenge at Interspeech 2026. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions.
- Score: 83.79481911755481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions. Results show that agent systems currently lead in reasoning quality, leveraging iterative tool orchestration and cross-modal analysis, while single models are rapidly advancing via reinforcement learning and sophisticated data pipelines. We detail the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.
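To make the instance-level rubric idea concrete, here is a minimal Python sketch of rubric-based scoring for a reasoning chain. The rubric fields, the weights, and the keyword-matching judge are illustrative assumptions for this sketch, not the actual MMAR-Rubrics schema or grader.

```python
# A minimal sketch of instance-level rubric scoring for a reasoning chain.
# RubricItem fields, weights, and the placeholder judge are assumptions,
# not the MMAR-Rubrics specification.
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str   # e.g., "mentions the siren" (hypothetical criterion)
    weight: float    # relative importance within this instance

def score_reasoning_chain(chain_steps, rubric, judge):
    """Return a weighted rubric score in [0, 1] for one reasoning chain.

    `judge(criterion, chain_text) -> float in [0, 1]` stands in for whatever
    grader (human or LLM) checks each criterion's factuality and logic.
    """
    chain_text = " ".join(chain_steps).lower()
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight * judge(item.criterion, chain_text) for item in rubric)
    return earned / total if total else 0.0

# Toy usage with a keyword-matching placeholder judge.
rubric = [RubricItem("mentions the siren", 2.0),
          RubricItem("infers an emergency", 1.0)]
chain = ["Step 1: A siren is audible in the background.",
         "Step 2: A siren usually signals an emergency vehicle nearby."]
keyword_judge = lambda crit, text: float(crit.split()[-1] in text)
print(score_reasoning_chain(chain, rubric, keyword_judge))  # -> 1.0
```

In a real protocol the judge would be a calibrated human or LLM grader rather than keyword matching; the weighted average merely shows how per-criterion checks can roll up into one instance-level score.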
Related papers
- Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering [13.757806950813995]
Audio-Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. We propose a novel Query-guided Spatial-Temporal-Frequency interaction method to enhance audio-visual understanding. Our proposed method achieves significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches.
arXiv Detail & Related papers (2026-01-27T17:24:32Z)
- Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering [20.893202481783444]
We propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought. We show that Omni-CLST achieves 73.80% on MMAU-mini and a new state of the art of 64.30% on MMAR.
arXiv Detail & Related papers (2025-09-14T06:54:12Z)
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of the Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification, and speech quality, as well as system-level human preference simulation for automated benchmarking. We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on … (a minimal sketch of this ensemble-plus-correlation pattern appears after the paper list below).
arXiv Detail & Related papers (2025-07-17T00:39:18Z)
- Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.84031769492708]
This task defines three QA subsets to test audio-language models on interactive question-answering over diverse acoustic scenes. Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity.
arXiv Detail & Related papers (2025-05-12T09:04:16Z)
- Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [91.11904427660043]
We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We train Audio-Reasoner on CoTA, enabling it to achieve strong logical capabilities in audio reasoning. Our findings stress the central role of structured CoT training in advancing audio reasoning.
arXiv Detail & Related papers (2025-03-04T06:18:34Z)
- Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model [26.20569269005708]
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding. However, their reasoning capabilities, critical for solving complex real-world problems, remain underexplored. We conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities.
arXiv Detail & Related papers (2025-01-13T11:54:40Z)
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better [9.378013909890374]
We present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024).
To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy.
Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
arXiv Detail & Related papers (2024-09-12T05:05:34Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
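As noted under the AudioJudge entry above, here is a minimal sketch of the ensemble-of-judges pattern validated with Spearman correlation. The aspect judges, their per-system scores, and the simple mean aggregation are assumptions for illustration; only the rank-correlation check via scipy.stats.spearmanr reflects the evaluation described in the abstract.

```python
# A minimal sketch, assuming specialized per-aspect judges whose scores are
# averaged per system and then rank-correlated with human preferences.
# Judge names, scores, and mean aggregation are illustrative assumptions.
from statistics import mean
from scipy.stats import spearmanr

def ensemble_score(system, judges):
    """Average the per-aspect scores that each specialized judge assigns."""
    return mean(judge(system) for judge in judges)

# Toy data: hypothetical system-level scores from two aspect judges.
systems = ["sys_a", "sys_b", "sys_c", "sys_d"]
judges = [
    lambda s: {"sys_a": 0.9, "sys_b": 0.4, "sys_c": 0.7, "sys_d": 0.2}[s],  # lexical content
    lambda s: {"sys_a": 0.8, "sys_b": 0.5, "sys_c": 0.6, "sys_d": 0.3}[s],  # speech quality
]
model_scores = [ensemble_score(s, judges) for s in systems]
human_scores = [0.95, 0.40, 0.70, 0.10]  # hypothetical human preference scores

# Spearman's rho measures agreement between the two rankings.
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")  # perfect rank agreement here -> 1.00
```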