RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
- URL: http://arxiv.org/abs/2602.21628v2
- Date: Tue, 03 Mar 2026 09:05:32 GMT
- Title: RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
- Authors: Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li, Zhen Qin, Hengyu Chang, Ancheng Xu, Zhihao Yang, Hamid Alinejad-Rokny, Qiang Qu, Bo Zheng, Min Yang
- Abstract summary: Stratified Rubric-based Curriculum Learning (RuCL) is a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. Experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model.
- Score: 37.197149670957394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
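The abstract's core mechanism, dynamically shifting rubric weights from foundational perception toward advanced logical reasoning as the model's competence grows, can be illustrated with a minimal sketch. Everything below (the strata names, the Gaussian weighting schedule, the scalar competence estimate) is an assumption for illustration, not RuCL's actual implementation:

```python
import math

# Hypothetical sketch of RuCL-style dynamic rubric weighting. Each rubric
# belongs to a stratum; easier strata dominate the reward early in
# training, harder strata take over as the model's competence grows.

STRATA = ["perception", "grounding", "logical_reasoning"]  # easy -> hard

def stratum_weights(competence: float) -> dict:
    """Soft weighting over strata given a competence estimate in [0, 1]."""
    centers = {s: i / (len(STRATA) - 1) for i, s in enumerate(STRATA)}
    raw = {s: math.exp(-((competence - c) ** 2) / 0.08) for s, c in centers.items()}
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}

def rubric_reward(rubric_scores: dict, competence: float) -> float:
    """Weighted sum of per-stratum rubric satisfaction scores in [0, 1]."""
    weights = stratum_weights(competence)
    return sum(weights[s] * rubric_scores.get(s, 0.0) for s in STRATA)

scores = {"perception": 0.9, "grounding": 0.4, "logical_reasoning": 0.1}
print(rubric_reward(scores, 0.1))  # early training: perception rubrics dominate
print(rubric_reward(scores, 0.9))  # late training: logical-reasoning rubrics dominate
```

With low competence the perception stratum carries most of the reward mass; as competence approaches 1 the weight shifts to logical reasoning, which is the curriculum effect the abstract describes.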
Related papers
- Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling [90.87033586963828]
Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). We propose Self-Consistency Sampling (SCS) to correct this issue. Based on Qwen2.5-VL-7B-Instruct, SCS improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation.
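The summary does not spell out how SCS operates; one plausible reading, sketched below purely as an assumption, is to scale each rollout's outcome reward by how consistently its final answer recurs across the sampled group, so that lucky but inconsistent traces earn less:

```python
from collections import Counter

# Assumed illustration of a self-consistency check (the paper's exact SCS
# procedure may differ): scale each rollout's outcome reward by the share
# of sampled rollouts whose final answer agrees with it.

def consistency_scaled_rewards(final_answers, outcome_rewards):
    counts = Counter(final_answers)
    n = len(final_answers)
    return [r * counts[a] / n for a, r in zip(final_answers, outcome_rewards)]

answers = ["42", "42", "17", "42"]   # final answers from 4 rollouts
rewards = [1.0, 1.0, 0.0, 1.0]       # verifier's outcome rewards (hypothetical)
print(consistency_scaled_rewards(answers, rewards))  # [0.75, 0.75, 0.0, 0.75]
```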
arXiv Detail & Related papers (2025-11-13T18:59:57Z)
- Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models [61.78513830395669]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). As models train longer and scale larger, more training prompts become residual prompts: those with zero-variance rewards that provide no training signal. We propose the Explore Residual Prompts in Policy Optimization framework, which encourages exploration on residual prompts and reactivates their training signals.
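Why zero-variance prompts provide no training signal is easy to see in a GRPO-style group-normalized advantage (a generic illustration, not necessarily this paper's exact formulation):

```python
# When every rollout in a group receives the same reward, the normalized
# advantage is zero for all of them, so the prompt contributes no gradient.

def group_advantages(rewards, eps=1e-6):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # informative prompt: nonzero signal
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # residual prompt: all advantages are 0
```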
arXiv Detail & Related papers (2025-11-06T20:40:27Z)
- OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [38.1645520104553]
We introduce OpenRubrics, a large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling.
arXiv Detail & Related papers (2025-10-09T03:31:26Z)
- ExGRPO: Learning to Reason from Experience [82.83309610498446]
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. Standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value.
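A hedged sketch of turning those two indicators into a replay priority follows; the scoring function is assumed for illustration, not taken from ExGRPO. Here prompts solved about half the time and rollouts with low average token entropy rank highest:

```python
import math

# Assumed experience-value score: "difficulty fit" peaks at a 50% solve
# rate, and low token entropy (a confident trace) raises the value.

def experience_value(correct_rate: float, avg_token_entropy: float) -> float:
    difficulty_fit = 1.0 - abs(correct_rate - 0.5) * 2.0
    confidence = math.exp(-avg_token_entropy)
    return difficulty_fit * confidence

buffer = [
    {"id": "a", "correct_rate": 0.50, "entropy": 0.3},
    {"id": "b", "correct_rate": 0.95, "entropy": 0.2},
    {"id": "c", "correct_rate": 0.50, "entropy": 1.5},
]
buffer.sort(key=lambda e: experience_value(e["correct_rate"], e["entropy"]), reverse=True)
print([e["id"] for e in buffer])  # ['a', 'c', 'b']
```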
arXiv Detail & Related papers (2025-10-02T17:31:30Z)
- CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
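One way Bayesian posterior estimation can keep curriculum bookkeeping cheap, sketched under assumptions (CurES's actual estimator and selection rule may differ): maintain a Beta posterior over each prompt's pass rate, update it from rollout outcomes, and schedule prompts whose posterior mean success is closest to 0.5:

```python
# Minimal Beta-posterior difficulty tracking for curriculum selection
# (assumed mechanics, for illustration only).

class PromptStats:
    def __init__(self):
        self.alpha, self.beta = 1.0, 1.0  # Beta(1, 1) uniform prior

    def update(self, successes: int, failures: int):
        self.alpha += successes
        self.beta += failures

    def mean_success(self) -> float:
        return self.alpha / (self.alpha + self.beta)

prompts = {"p1": PromptStats(), "p2": PromptStats(), "p3": PromptStats()}
prompts["p1"].update(7, 1)   # easy: mostly solved
prompts["p2"].update(4, 4)   # informative: ~50% solved
prompts["p3"].update(0, 8)   # too hard for now

ranked = sorted(prompts, key=lambda p: abs(prompts[p].mean_success() - 0.5))
print(ranked)  # ['p2', 'p1', 'p3'] (hypothetical schedule order)
```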
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
- Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts [25.205293698698867]
We introduce Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency, measured as tokens/sec, across multiple math reasoning benchmarks and model sizes.
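The unbiasedness claim rests on standard off-policy correction: completions sampled from the cheaper behavior policy are reweighted by the target/behavior likelihood ratio. A generic numeric illustration follows; the clipping and the values are assumptions, not Nested-ReFT's code:

```python
import math

# Importance weight for an off-policy completion: the likelihood ratio
# between target and behavior policies, clipped to control variance.

def importance_weight(logp_target: float, logp_behavior: float, clip: float = 5.0) -> float:
    ratio = math.exp(logp_target - logp_behavior)
    return min(ratio, clip)  # clipping threshold is an assumption

# Sequence log-probs under each policy for three sampled completions.
logp_t = [-12.1, -8.4, -15.0]
logp_b = [-11.8, -9.0, -14.2]
weights = [importance_weight(t, b) for t, b in zip(logp_t, logp_b)]
print([round(w, 3) for w in weights])  # [0.741, 1.822, 0.449]
```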
arXiv Detail & Related papers (2025-08-13T18:37:46Z)
- Libra: Assessing and Improving Reward Model by Learning to Think [37.22776255575947]
We present a reasoning-oriented benchmark (Libra Bench) to address the limitations of existing reward-model benchmarks in reasoning scenarios. We introduce a novel approach for improving the generative reward model via learning-to-think methodologies. We develop the Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks.
arXiv Detail & Related papers (2025-07-29T10:02:43Z)
- VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training [23.391643634478587]
A Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback. A bootstrapping dilemma arises because high-quality training data depends on already strong VL models. We propose an iterative training framework leveraging vision experts, Chain-of-Thought rationales, and Margin-based Rejection Sampling.
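Margin-based Rejection Sampling can be sketched as a filter over candidate preference pairs, keeping only those whose scorer margin clears a threshold; the threshold and data shapes below are hypothetical, and the paper's criteria may be richer:

```python
# Keep (chosen, rejected) pairs only when the reward margin is large
# enough, filtering out ambiguous supervision.

MARGIN = 0.3  # hypothetical threshold

def filter_pairs(candidates):
    kept = []
    for chosen_score, rejected_score, pair in candidates:
        if chosen_score - rejected_score >= MARGIN:
            kept.append(pair)
    return kept

candidates = [
    (0.9, 0.2, ("resp_a", "resp_b")),  # clear margin: kept
    (0.6, 0.5, ("resp_c", "resp_d")),  # ambiguous: dropped
]
print(filter_pairs(candidates))  # [('resp_a', 'resp_b')]
```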
arXiv Detail & Related papers (2025-06-16T18:10:51Z)
- Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start [24.244577648817188]
"aha moment" patterns are often attributed to emergent properties from reinforcement learning (RL)<n>We present a comprehensive study on enhancing multimodal reasoning through a two-stage approach.<n>Our experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods.
arXiv Detail & Related papers (2025-05-28T13:21:38Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs [58.18140409409302]
Large Language Models (LLMs) have made substantial strides in structured tasks through Reinforcement Learning (RL). Applying RL in broader domains like chatbots and content generation presents unique challenges. We show a case study of reproducing existing reward-model-ensemble research using embedding-based reward models.
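The embedding-based setup is what makes this kind of research GPU-free: with responses pre-encoded into frozen embeddings, a reward model reduces to a small head trained on preference pairs. A self-contained Bradley-Terry sketch under those assumptions (toy random vectors stand in for real encoder outputs):

```python
import math, random

# Linear reward head over frozen embeddings, trained with one preference
# pair and a Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected)).

random.seed(0)
DIM = 8
w = [0.0] * DIM

def reward(emb):
    return sum(wi * xi for wi, xi in zip(w, emb))

def bt_step(chosen, rejected, lr=0.1):
    """One SGD step; d(loss)/d(margin) = -1 / (1 + exp(margin))."""
    global w
    margin = reward(chosen) - reward(rejected)
    grad_coeff = -1.0 / (1.0 + math.exp(margin))
    w = [wi - lr * grad_coeff * (c - r) for wi, c, r in zip(w, chosen, rejected)]

# Toy frozen embeddings standing in for an encoder's outputs.
chosen = [random.gauss(0.5, 1) for _ in range(DIM)]
rejected = [random.gauss(-0.5, 1) for _ in range(DIM)]
for _ in range(100):
    bt_step(chosen, rejected)
print(reward(chosen) > reward(rejected))  # True after training
```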
arXiv Detail & Related papers (2025-02-04T19:37:35Z)