Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
- URL: http://arxiv.org/abs/2509.12275v3
- Date: Thu, 18 Sep 2025 07:19:29 GMT
- Title: Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
- Authors: Jinghua Zhao, Hang Su, Lichun Fan, Zhenbo Luo, Hui Wang, Haoqin Sun, Yong Qin
- Abstract summary: We propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought. We show that Omni-CLST achieves 73.80% on MMAU-mini and a new state of the art of 64.30% on MMAR.
- Score: 20.893202481783444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid progress of large audio-language models (LALMs), audio question answering (AQA) has emerged as a challenging task requiring both fine-grained audio understanding and complex reasoning. While current methods mainly rely on constructing new datasets via captioning or reasoning traces, existing high-quality AQA data remains underutilized. To address this, we propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought. The framework efficiently leverages existing high-quality datasets through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Experiments show that Omni-CLST achieves 73.80% on MMAU-mini and a new state of the art of 64.30% on MMAR, demonstrating robust generalization in multimodal audio-language understanding.
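A minimal sketch of the two strategies as the abstract describes them: order training samples easy-to-hard by a difficulty signal, and drop chain-of-thought traces for easy cases so reasoning effort is spent on hard ones. The `error_rate` callable and the threshold below are assumptions for illustration, not details from the paper.

```python
def build_curriculum(samples, error_rate):
    """Error-aware curriculum: order samples easy-to-hard by the error
    rate of a reference model (error_rate is an assumed callable
    returning a float in [0, 1])."""
    return sorted(samples, key=error_rate)

def guided_thought_dropout(sample, error_rate, keep_threshold=0.5):
    """Guided thought dropout: keep the chain-of-thought trace only
    for hard samples, focusing reasoning on challenging cases."""
    if error_rate(sample) < keep_threshold:
        return {**sample, "thought": None}  # easy case: drop the trace
    return sample
```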
Related papers
- The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents [83.79481911755481]
We organized the Audio Reasoning Challenge at Interspeech 2026. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions.
arXiv Detail & Related papers (2026-02-15T16:38:09Z)
- SICL-AT: Another way to adapt Auditory LLM to low-resource task [34.82834349882226]
Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. However, they often struggle when applied to low-resource or unfamiliar tasks. In-Context Learning (ICL) provides a training-free, inference-time solution.
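As a rough illustration of why ICL is training-free: adaptation happens entirely at inference time by prepending a few labeled exemplars to the query prompt, with no weight updates. The field names below are hypothetical.

```python
def build_icl_prompt(exemplars, query):
    """Assemble an in-context prompt from labeled exemplars plus the
    test query; no model parameters are updated."""
    blocks = [
        f"Audio: {ex['audio_desc']}\nQ: {ex['question']}\nA: {ex['answer']}"
        for ex in exemplars
    ]
    blocks.append(f"Audio: {query['audio_desc']}\nQ: {query['question']}\nA:")
    return "\n\n".join(blocks)
```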
arXiv Detail & Related papers (2026-01-26T19:15:16Z)
- AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs [8.918587474371321]
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution.
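The named speedup mechanisms are batching and parallel execution; a generic sketch of that pattern (not AU-Harness's actual API) might look like this.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_in_parallel(score_batch, dataset, batch_size=16, workers=4):
    """Split the dataset into batches and score them concurrently;
    score_batch is an assumed callable mapping a batch to predictions."""
    batches = [dataset[i:i + batch_size]
               for i in range(0, len(dataset), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_batch = pool.map(score_batch, batches)
    return [pred for preds in per_batch for pred in preds]
```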
arXiv Detail & Related papers (2025-09-09T15:30:40Z)
- Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning [5.148672971653068]
Multimodal Large Language Models (MLLMs) offer unprecedented opportunities for comprehensive Automated Speaking Assessment (ASA). We propose Speech-First Multimodal Training (SFMT) to establish a more robust modeling foundation for speech before cross-modal synergetic fusion. In particular, SFMT excels at evaluating the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches.
arXiv Detail & Related papers (2025-08-18T02:57:43Z)
- AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z)
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [63.741916531380696]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
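A sketch of what such contrastive-like data generation could look like: for each clip, pair a question about a sound that is present with one about a sound that is absent. The field names are assumptions for illustration.

```python
import random

def make_contrastive_pairs(clips, all_event_labels):
    """For each clip, build a yes-question about a present sound and a
    no-question about an absent one, forcing the model to tell them apart."""
    pairs = []
    for clip in clips:
        present = random.choice(clip["events"])
        absent = random.choice(
            [e for e in all_event_labels if e not in clip["events"]])
        pairs.append({
            "audio": clip["audio"],
            "positive_q": f"Is there a {present} sound?",  # gold answer: yes
            "negative_q": f"Is there a {absent} sound?",   # gold answer: no
        })
    return pairs
```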
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models [79.90523648823522]
Multi-stage continual learning can lead to catastrophic forgetting. This paper evaluates three mitigation strategies: model merging, discounting the LoRA scaling factor, and experience replay. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods.
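A minimal sketch of the experience-replay idea: mix a fraction of earlier-stage samples into each new-stage batch. The batch size and replay ratio are illustrative values, not the paper's settings.

```python
import random

def replay_batches(new_data, replay_buffer, batch_size=8, replay_ratio=0.25):
    """Yield training batches in which roughly replay_ratio of the items
    come from earlier stages, to counteract forgetting."""
    n_replay = max(1, int(batch_size * replay_ratio))
    step = batch_size - n_replay
    random.shuffle(new_data)
    for i in range(0, len(new_data), step):
        batch = new_data[i:i + step]
        batch += random.sample(replay_buffer,
                               min(n_replay, len(replay_buffer)))
        random.shuffle(batch)
        yield batch
```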
arXiv Detail & Related papers (2025-05-23T05:50:14Z)
- Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest. The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research. We propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, via a temporal instillation method, TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
- CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [69.94398424864595]
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components.
We introduce CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways.
CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
arXiv Detail & Related papers (2024-03-07T16:31:02Z)
- SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models [71.78800549517298]
Continual learning (CL) ability is vital for deploying large language models (LLMs) in the dynamic world.
Existing methods devise a learning module that acquires task-specific knowledge via parameter-efficient tuning (PET) blocks, and a selection module that picks out the corresponding block for the test input.
We propose a novel Shared Attention Framework (SAPT) to align PET learning and selection via a Shared Attentive Learning & Selection module.
arXiv Detail & Related papers (2024-01-16T11:45:03Z)
- Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual, and audio-visual event instances, and to identify the corresponding event categories, with only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z)
- Coverage-based Example Selection for In-Context Learning [27.215972147196805]
We show that BERTScore-Recall (BSR) selects better examples that demonstrate more of the salient aspects of the test input.
On 15 datasets spanning 6 tasks and with 7 diverse LLMs, we show that (1) BSR is the superior metric for in-context example selection across the board, and (2) for compositional tasks, Set-BSR outperforms independent ranking by up to 17 points on average.
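In spirit, BSR matches each test-input token to its best-matching token in a candidate example, and Set-BSR greedily picks examples that jointly cover the test tokens. The sketch below assumes L2-normalized token embeddings as numpy arrays of shape [tokens, dim]; it illustrates the coverage idea, not the authors' exact implementation.

```python
import numpy as np

def bsr(test_emb, cand_emb):
    """BERTScore-Recall: average, over test tokens, of the best cosine
    similarity to any candidate token."""
    return (test_emb @ cand_emb.T).max(axis=1).mean()

def set_bsr_select(test_emb, candidates, k=4):
    """Greedy Set-BSR: repeatedly add the candidate whose tokens most
    improve per-test-token coverage."""
    coverage = np.full(test_emb.shape[0], -np.inf)
    chosen, pool = [], list(range(len(candidates)))
    for _ in range(min(k, len(pool))):
        def gain(j):
            best = (test_emb @ candidates[j].T).max(axis=1)
            return np.maximum(coverage, best).mean()
        j = max(pool, key=gain)
        pool.remove(j)
        coverage = np.maximum(coverage,
                              (test_emb @ candidates[j].T).max(axis=1))
        chosen.append(j)
    return chosen
```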
arXiv Detail & Related papers (2023-05-24T08:58:28Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses from a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
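A toy sketch of the conditioning step: pool the candidate's other responses into a speaker context vector and concatenate it with the target response representation. The paper uses a learned hierarchical network; the mean pooling here is an assumption.

```python
import numpy as np

def speaker_conditioned_input(response_embs, target_idx):
    """Build the scoring input for one response: its own embedding plus
    the mean of the same speaker's other responses as context."""
    others = np.delete(response_embs, target_idx, axis=0)
    return np.concatenate([response_embs[target_idx], others.mean(axis=0)])
```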