SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning
- URL: http://arxiv.org/abs/2504.15900v3
- Date: Tue, 29 Apr 2025 02:51:19 GMT
- Title: SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning
- Authors: Cheng Wen, Tingwei Guo, Shuaijiang Zhao, Wei Zou, Xiangang Li
- Abstract summary: Reinforcement learning can sharpen the reasoning ability of large language models (LLMs) by prompting them to "think before answering." We show that explicit, structured reasoning and curriculum learning substantially enhance audio-language understanding.
- Score: 21.36638095182274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work shows that reinforcement learning (RL) can markedly sharpen the reasoning ability of large language models (LLMs) by prompting them to "think before answering." Yet whether and how these gains transfer to audio-language reasoning remains largely unexplored. We extend the Group-Relative Policy Optimization (GRPO) framework from DeepSeek-R1 to a Large Audio-Language Model (LALM), and construct a 32k-sample multiple-choice corpus. Using a two-stage regimen (supervised fine-tuning on structured and unstructured chains-of-thought, followed by curriculum-guided GRPO), we systematically compare implicit vs. explicit and structured vs. free-form reasoning under identical architectures. Our structured audio reasoning model, SARI (Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning), achieves a 16.35% improvement in average accuracy over the base model Qwen2-Audio-7B-Instruct. Furthermore, the variant built upon Qwen2.5-Omni reaches state-of-the-art performance of 67.08% on the MMAU test-mini benchmark. Ablation experiments show that on the base model we use: (i) SFT warm-up is important for stable RL training, (ii) structured chains yield more robust generalization than unstructured ones, and (iii) easy-to-hard curricula accelerate convergence and improve final performance. These findings demonstrate that explicit, structured reasoning and curriculum learning substantially enhance audio-language understanding.
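As a rough illustration of the two training ideas named in the abstract, the sketch below shows (a) a group-relative advantage of the kind GRPO is generally described as using, where each sampled answer's reward is normalized against the other answers drawn for the same question, and (b) an easy-to-hard ordering of training samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the 0/1 correctness reward, the population standard deviation, the `eps` constant, and the `pass_rate` difficulty field are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per sampled answer to the same question.
    Assumption: population standard deviation plus a small `eps` to avoid
    division by zero; real implementations may differ in both details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def easy_to_hard(samples, difficulty):
    """Return the samples sorted from lowest to highest difficulty score."""
    return sorted(samples, key=difficulty)


# Hypothetical group of four sampled answers: 1.0 = correct choice, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical samples; `pass_rate` stands in for any difficulty estimate.
samples = [
    {"id": "q1", "pass_rate": 0.2},
    {"id": "q2", "pass_rate": 0.9},
    {"id": "q3", "pass_rate": 0.5},
]
curriculum = easy_to_hard(samples, difficulty=lambda s: 1.0 - s["pass_rate"])
```

Correct answers in a mixed group receive positive advantages and wrong ones negative; when every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal. That is one reason the order in which questions of different difficulty are presented can matter.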
Related papers
- Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO [24.91321958525287]
Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge. Existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition.
arXiv Detail & Related papers (2026-02-05T05:27:11Z)
- Structured Reasoning for Large Language Models [59.215789462977206]
We propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. SCR substantially improves reasoning efficiency and self-verification. Compared with existing reasoning paradigms, it reduces output token length by up to 50%.
arXiv Detail & Related papers (2026-01-12T04:04:01Z)
- Coupled Variational Reinforcement Learning for Language Model General Reasoning [83.82392089177841]
We propose Coupled Variational Reinforcement Learning (CoVRL) to bridge variational inference and reinforcement learning. CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines.
arXiv Detail & Related papers (2025-12-14T07:03:51Z)
- Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning [29.722512436773638]
We propose Structure-R1, a framework that transforms retrieved content into structured representations optimized for reasoning. We show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model. Our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity.
arXiv Detail & Related papers (2025-10-16T23:19:28Z)
- Exploring Superior Function Calls via Reinforcement Learning [9.278264697070306]
We present a novel reinforcement learning framework designed to enhance group relative policy optimization. We address three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios.
arXiv Detail & Related papers (2025-08-07T07:51:38Z)
- AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z)
- Checklists Are Better Than Reward Models For Aligning Language Models [99.1896531064102]
We propose "Reinforcement Learning from Checklist Feedback" (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item. Using both AI judges and specialized verifier programs, we combine these scores to compute rewards for RL.
arXiv Detail & Related papers (2025-07-24T17:58:00Z)
- StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs [1.5134121362787107]
StrucSum is a training-free prompting framework for large language models (LLMs). It injects structural signals into prompts via three targeted strategies. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency.
arXiv Detail & Related papers (2025-05-29T00:10:23Z)
- EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [108.73513190593232]
Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet struggle with structured cross-modal reasoning. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs.
arXiv Detail & Related papers (2025-05-07T17:59:49Z)
- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [22.88876323500893]
Reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs). We conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task. Our experiments demonstrated state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of 64.5%.
arXiv Detail & Related papers (2025-03-14T08:43:53Z)
- Reasoning with Reinforced Functional Token Tuning [70.96651128307985]
We propose Reinforced Functional Token Tuning (RFTT) to empower Large Language Models (LLMs) with self-play learn-to-reason capabilities. RFTT embeds a rich set of learnable functional tokens directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors.
arXiv Detail & Related papers (2025-02-19T02:59:42Z)
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
- Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models [30.09120709652445]
This paper is centred around a focused study: whether current state-of-the-art generalist LLMs can leverage the structures in a few examples to better construct proof structures with in-context learning.
arXiv Detail & Related papers (2024-10-11T00:45:50Z)
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Self-Discover: Large Language Models Self-Compose Reasoning Structures [136.48389510481758]
We introduce SELF-DISCOVER, a framework for self-discovering task-intrinsic reasoning structures.
SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks.
We show that the self-discovered reasoning structures are universally applicable across model families.
arXiv Detail & Related papers (2024-02-06T01:13:53Z)
- SEER: Facilitating Structured Reasoning and Explanation via Reinforcement Learning [29.514755268807868]
We propose SEER, a novel method that maximizes a structure-based return to facilitate structured reasoning and explanation.
Our proposed structure-based return precisely describes the hierarchical and branching structure inherent in structured reasoning.
Our experiments show that SEER significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-01-24T06:10:51Z)
- Unifying Structure and Language Semantic for Efficient Contrastive Knowledge Graph Completion with Structured Entity Anchors [0.3913403111891026]
The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known.
We propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning.
arXiv Detail & Related papers (2023-11-07T11:17:55Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhance Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.