Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
- URL: http://arxiv.org/abs/2509.12275v3
- Date: Thu, 18 Sep 2025 07:19:29 GMT
- Title: Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
- Authors: Jinghua Zhao, Hang Su, Lichun Fan, Zhenbo Luo, Hui Wang, Haoqin Sun, Yong Qin
- Abstract summary: We propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought. We show that Omni-CLST achieves 73.80% on MMAU-mini and a new state of the art of 64.30% on MMAR.
- Score: 20.893202481783444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid progress of large audio-language models (LALMs), audio question answering (AQA) has emerged as a challenging task requiring both fine-grained audio understanding and complex reasoning. While current methods mainly rely on constructing new datasets via captioning or reasoning traces, existing high-quality AQA data remains underutilized. To address this, we propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought. The framework efficiently leverages existing high-quality datasets through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Experiments show that Omni-CLST achieves 73.80% on MMAU-mini and a new state of the art of 64.30% on MMAR, demonstrating robust generalization in multimodal audio-language understanding.
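A minimal sketch of the two strategies as the abstract describes them: order training samples easy-to-hard by a difficulty signal, and drop chain-of-thought traces for easy cases so reasoning effort is spent on hard ones. The `error_rate` callable and the threshold below are assumptions for illustration, not details from the paper.

```python
def build_curriculum(samples, error_rate):
    """Error-aware curriculum: order samples easy-to-hard by the error
    rate of a reference model (error_rate is an assumed callable
    returning a float in [0, 1])."""
    return sorted(samples, key=error_rate)

def guided_thought_dropout(sample, error_rate, keep_threshold=0.5):
    """Guided thought dropout: keep the chain-of-thought trace only
    for hard samples, focusing reasoning on challenging cases."""
    if error_rate(sample) < keep_threshold:
        return {**sample, "thought": None}  # easy case: drop the trace
    return sample
```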
Related papers
- The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents [83.79481911755481]
We organized the Audio Reasoning Challenge at Interspeech 2026. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions.
arXiv Detail & Related papers (2026-02-15T16:38:09Z)
- SICL-AT: Another way to adapt Auditory LLM to low-resource task [34.82834349882226]
Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. However, they often struggle when applied to low-resource or unfamiliar tasks. In-Context Learning (ICL) provides a training-free, inference-time solution.
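As a rough illustration of why ICL is training-free: adaptation happens entirely at inference time by prepending a few labeled exemplars to the query prompt, with no weight updates. The field names below are hypothetical.

```python
def build_icl_prompt(exemplars, query):
    """Assemble an in-context prompt from labeled exemplars plus the
    test query; no model parameters are updated."""
    blocks = [
        f"Audio: {ex['audio_desc']}\nQ: {ex['question']}\nA: {ex['answer']}"
        for ex in exemplars
    ]
    blocks.append(f"Audio: {query['audio_desc']}\nQ: {query['question']}\nA:")
    return "\n\n".join(blocks)
```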
arXiv Detail & Related papers (2026-01-26T19:15:16Z)
- AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs [8.918587474371321]
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution.
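The named speedup mechanisms are batching and parallel execution; a generic sketch of that pattern (not AU-Harness's actual API) might look like this.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_in_parallel(score_batch, dataset, batch_size=16, workers=4):
    """Split the dataset into batches and score them concurrently;
    score_batch is an assumed callable mapping a batch to predictions."""
    batches = [dataset[i:i + batch_size]
               for i in range(0, len(dataset), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_batch = pool.map(score_batch, batches)
    return [pred for preds in per_batch for pred in preds]
```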
arXiv Detail & Related papers (2025-09-09T15:30:40Z)
- Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning [5.148672971653068]
Multimodal Large Language Models (MLLMs) offer unprecedented opportunities for comprehensive Automated Speaking Assessment (ASA). We propose Speech-First Multimodal Training (SFMT) to establish a more robust modeling foundation for speech before cross-modal synergetic fusion. In particular, SFMT excels at evaluating the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches.
arXiv Detail & Related papers (2025-08-18T02:57:43Z)
- AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z)
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [63.741916531380696]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
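A sketch of what such contrastive-like data generation could look like: for each clip, pair a question about a sound that is present with one about a sound that is absent. The field names are assumptions for illustration.

```python
import random

def make_contrastive_pairs(clips, all_event_labels):
    """For each clip, build a yes-question about a present sound and a
    no-question about an absent one, forcing the model to tell them apart."""
    pairs = []
    for clip in clips:
        present = random.choice(clip["events"])
        absent = random.choice(
            [e for e in all_event_labels if e not in clip["events"]])
        pairs.append({
            "audio": clip["audio"],
            "positive_q": f"Is there a {present} sound?",  # gold answer: yes
            "negative_q": f"Is there a {absent} sound?",   # gold answer: no
        })
    return pairs
```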
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models [79.90523648823522]
Multi-stage continual learning can lead to catastrophic forgetting. This paper evaluates three mitigation strategies: model merging, discounting the LoRA scaling factor, and experience replay. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods.
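A minimal sketch of the experience-replay idea: mix a fraction of earlier-stage samples into each new-stage batch. The batch size and replay ratio are illustrative values, not the paper's settings.

```python
import random

def replay_batches(new_data, replay_buffer, batch_size=8, replay_ratio=0.25):
    """Yield training batches in which roughly replay_ratio of the items
    come from earlier stages, to counteract forgetting."""
    n_replay = max(1, int(batch_size * replay_ratio))
    step = batch_size - n_replay
    random.shuffle(new_data)
    for i in range(0, len(new_data), step):
        batch = new_data[i:i + step]
        batch += random.sample(replay_buffer,
                               min(n_replay, len(replay_buffer)))
        random.shuffle(batch)
        yield batch
```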
arXiv Detail & Related papers (2025-05-23T05:50:14Z)
- Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest. The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research. We propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, via a temporal instillation method, TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
- CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [69.94398424864595]
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components.
We introduce CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways.
CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
arXiv Detail & Related papers (2024-03-07T16:31:02Z)
- SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models [71.78800549517298]
Continual learning (CL) ability is vital for deploying large language models (LLMs) in the dynamic world.
Existing methods devise a learning module that acquires task-specific knowledge via parameter-efficient tuning (PET) blocks, and a selection module that picks out the corresponding block for the test input.
We propose a novel Shared Attention Framework (SAPT) to align PET learning and selection via a Shared Attentive Learning & Selection module.
arXiv Detail & Related papers (2024-01-16T11:45:03Z)
- Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual, and audio-visual event instances, and to identify the corresponding event categories, with only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z)
- Coverage-based Example Selection for In-Context Learning [27.215972147196805]
We show that BERTScore-Recall (BSR) selects better examples that demonstrate more of the salient aspects of the test input.
On 15 datasets spanning 6 tasks and with 7 diverse LLMs, we show that (1) BSR is the superior metric for in-context example selection across the board, and (2) for compositional tasks, Set-BSR outperforms independent ranking by up to 17 points on average.
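In spirit, BSR matches each test-input token to its best-matching token in a candidate example, and Set-BSR greedily picks examples that jointly cover the test tokens. The sketch below assumes L2-normalized token embeddings as numpy arrays of shape [tokens, dim]; it illustrates the coverage idea, not the authors' exact implementation.

```python
import numpy as np

def bsr(test_emb, cand_emb):
    """BERTScore-Recall: average, over test tokens, of the best cosine
    similarity to any candidate token."""
    return (test_emb @ cand_emb.T).max(axis=1).mean()

def set_bsr_select(test_emb, candidates, k=4):
    """Greedy Set-BSR: repeatedly add the candidate whose tokens most
    improve per-test-token coverage."""
    coverage = np.full(test_emb.shape[0], -np.inf)
    chosen, pool = [], list(range(len(candidates)))
    for _ in range(min(k, len(pool))):
        def gain(j):
            best = (test_emb @ candidates[j].T).max(axis=1)
            return np.maximum(coverage, best).mean()
        j = max(pool, key=gain)
        pool.remove(j)
        coverage = np.maximum(coverage,
                              (test_emb @ candidates[j].T).max(axis=1))
        chosen.append(j)
    return chosen
```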
arXiv Detail & Related papers (2023-05-24T08:58:28Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses from a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
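A toy sketch of the conditioning step: pool the candidate's other responses into a speaker context vector and concatenate it with the target response representation. The paper uses a learned hierarchical network; the mean pooling here is an assumption.

```python
import numpy as np

def speaker_conditioned_input(response_embs, target_idx):
    """Build the scoring input for one response: its own embedding plus
    the mean of the same speaker's other responses as context."""
    others = np.delete(response_embs, target_idx, axis=0)
    return np.concatenate([response_embs[target_idx], others.mean(axis=0)])
```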