Learning Self-Correction in Vision-Language Models via Rollout Augmentation
- URL: http://arxiv.org/abs/2602.08503v1
- Date: Mon, 09 Feb 2026 10:55:13 GMT
- Title: Learning Self-Correction in Vision-Language Models via Rollout Augmentation
- Authors: Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang
- Abstract summary: Self-correction is essential for solving reasoning problems in vision-language models (VLMs). Existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely. We propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples. We introduce Octopus-8B, a reasoning VLM with controllable self-correction capability.
- Score: 25.49118301476432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.
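The abstract describes two mechanisms: synthesizing dense self-correction examples by recombining existing rollouts, and a response-masking strategy that decouples the self-correction signal from direct reasoning. The sketch below is only a rough illustration of that idea under stated assumptions, not the paper's implementation; the names (Rollout, CORRECTION_CUE, the loss-mask field) and the pairing heuristic are assumptions for illustration.

```python
# Illustrative sketch (not the authors' code): recombine existing RL rollouts
# into self-correction training examples, as described in the abstract.
# All names and fields here are assumptions made for illustration.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rollout:
    prompt: str
    response: str
    is_correct: bool  # verifiable reward, e.g. exact match on the final answer


CORRECTION_CUE = "\nWait, let me re-check the previous reasoning.\n"


def synthesize_correction_examples(rollouts: List[Rollout]) -> List[Dict]:
    """Pair an incorrect rollout with a correct rollout for the same prompt,
    so the model sees a wrong attempt followed by a successful correction."""
    examples = []
    wrong = [r for r in rollouts if not r.is_correct]
    right = [r for r in rollouts if r.is_correct]
    for bad in wrong:
        for good in right:
            if bad.prompt != good.prompt:
                continue
            target = bad.response + CORRECTION_CUE + good.response
            # Response mask: supervise only the correction segment so the
            # self-correction signal does not conflict with direct reasoning.
            examples.append({
                "prompt": bad.prompt,
                "target": target,
                "loss_mask_from_char": len(bad.response),
            })
    return examples
```

Pairing a failed attempt with a verified-correct continuation reuses rollouts that would otherwise be discarded, which is one plausible way the "dense" correction supervision and improved sample efficiency described above could arise.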
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards [69.74686029941881]
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. We propose a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
arXiv Detail & Related papers (2026-02-09T10:51:58Z) - Each Prompt Matters: Scaling Reinforcement Learning Without Wasting Rollouts on Hundred-Billion-Scale MoE [16.58714489761542]
We present CompassMax-V3-Thinking, a hundred-billion-scale MoE reasoning model trained with a new RL framework built on one principle: each prompt must matter. To overcome these challenges, we introduce several unified innovations. The resulting model delivers strong performance across both internal and public evaluations.
arXiv Detail & Related papers (2025-12-08T16:57:43Z) - Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling [90.87033586963828]
Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). We propose Self-Consistency Sampling (SCS) to correct this issue. Based on Qwen2.5-VL-7B-Instruct, SCS improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation.
arXiv Detail & Related papers (2025-11-13T18:59:57Z) - Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
arXiv Detail & Related papers (2025-10-30T11:53:08Z) - Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models [29.090093552573766]
We propose an offline RL post-training objective for Vision-Language-Action (VLA) flow models. From it we derive an efficient and feasible offline RL fine-tuning algorithm, Adaptive Reinforced Flow Matching (ARFM). ARFM exhibits excellent generalization, robustness, few-shot learning, and continual learning performance.
arXiv Detail & Related papers (2025-09-04T09:48:43Z) - Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
One approach to improving math reasoning is self-correction. Existing self-correction approaches treat corrections as standalone post-generation refinements. We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z) - Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs). We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% on MATH and HumanEval, respectively.
arXiv Detail & Related papers (2024-09-19T17:16:21Z)