ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
- URL: http://arxiv.org/abs/2602.22623v1
- Date: Thu, 26 Feb 2026 04:55:57 GMT
- Title: ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
- Authors: Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan Fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan
- Score: 64.77036363086519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose ContextRL, a novel framework that leverages context augmentation to overcome two bottlenecks in RLVR-based knowledge discovery: Identifiability and Reachability. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but a low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy where the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
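The abstract names two concrete mechanisms: reference-conditioned process verification and multi-turn resampling driven by mistake reports. The sketch below is a minimal reading of them, not the authors' implementation: the policy and reward model are abstracted as hypothetical text-in/text-out callables, and every prompt template and loop bound is an illustrative assumption.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" is any text-in/text-out callable.
Model = Callable[[str], str]

def process_verified_reward(rm: Model, question: str, response: str,
                            reference_solution: str) -> float:
    """Identifiability: condition the reward model on the full reference
    solution so it can verify the reasoning process step by step and
    filter out false positives (right answer, low-quality reasoning)."""
    verdict = rm(
        f"Question: {question}\n"
        f"Reference solution: {reference_solution}\n"
        f"Candidate response: {response}\n"
        "Is the candidate's reasoning process sound given the reference? "
        "Answer PASS or FAIL."
    )
    return 1.0 if "PASS" in verdict else 0.0

def multi_turn_recovery(policy: Model, rm: Model, question: str,
                        reference_solution: str, group_size: int = 8,
                        max_turns: int = 3) -> List[str]:
    """Reachability: when a sampled group is all-negative, ask the reward
    model for a mistake report on a failed attempt and resample conditioned
    on it, 'recovering' correct responses the group would otherwise miss."""
    prompt = f"Question: {question}"
    for _ in range(max_turns):
        group = [policy(prompt) for _ in range(group_size)]
        rewards = [process_verified_reward(rm, question, g, reference_solution)
                   for g in group]
        if any(r > 0 for r in rewards):
            return group  # at least one verified positive: usable signal
        report = rm(
            f"Question: {question}\n"
            f"Failed attempt: {group[0]}\n"
            f"Reference solution: {reference_solution}\n"
            "Briefly describe the mistakes in this attempt."
        )
        prompt = f"Question: {question}\nKnown mistakes to avoid: {report}"
    return group
```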
Related papers
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards [51.45138356629732]
We introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks.
arXiv Detail & Related papers (2026-03-02T18:07:53Z)
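As a rough illustration of the reward shaping this summary describes, the sketch below adds a dense, verifiable context reward (did the model ground on the right passages?) to the sparse answer reward; the passage-overlap measure and the weight `lam` are assumptions, not the paper's definitions.

```python
def longrlvr_style_reward(answer: str, gold_answer: str,
                          cited_passages: set, gold_passages: set,
                          lam: float = 0.5) -> float:
    # Sparse, verifiable answer reward (exact match as a stand-in).
    answer_reward = 1.0 if answer.strip() == gold_answer.strip() else 0.0
    # Dense context reward: fraction of gold grounding passages the model used.
    context_reward = (len(cited_passages & gold_passages) / len(gold_passages)
                      if gold_passages else 0.0)
    return answer_reward + lam * context_reward
```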
- Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). We analyze the exploration issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths. We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses.
arXiv Detail & Related papers (2026-02-05T04:06:55Z)
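The summary says the standard objective over-reinforces the highest-likelihood paths, so confidence should be equilibrated across correct responses. One way to instantiate that, sketched below under assumptions of our own (the inverse-likelihood weighting and the clipping range are not from the paper), is to up-weight the advantages of low-likelihood correct samples:

```python
import numpy as np

def reweighted_advantages(advantages: np.ndarray, logprobs: np.ndarray,
                          correct: np.ndarray) -> np.ndarray:
    """Per-sample arrays for one sampled group; `correct` is a boolean mask."""
    w = np.ones_like(advantages)
    if correct.any():
        lp = logprobs[correct]
        rel = np.exp(lp - lp.max())          # relative likelihood in (0, 1]
        inv = np.clip(1.0 / rel, 1.0, 10.0)  # cap to avoid extreme weights
        w[correct] = inv / inv.mean()        # mean-1 weights over correct set
    return advantages * w
```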
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
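Reading this summary, the reward decomposes into a content dimension scored against an ordered concept chain from a reference and a separate style dimension. The sketch below is only a crude stand-in: in-order substring matching approximates ordered concept coverage, and the mixing weight `alpha` and the external `style_score` are assumptions.

```python
def rlvrr_style_reward(response: str, reward_chain: list,
                       style_score: float, alpha: float = 0.7) -> float:
    # Content: how far the ordered concept chain is matched, in order,
    # inside the response (substring search as a crude proxy).
    pos, covered = 0, 0
    for concept in reward_chain:
        idx = response.find(concept, pos)
        if idx == -1:
            break
        covered += 1
        pos = idx + len(concept)
    content = covered / len(reward_chain) if reward_chain else 0.0
    return alpha * content + (1 - alpha) * style_score
```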
- Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning [52.144281362465996]
We propose EAPO (Evidence-Augmented Policy Optimization) to apply Reinforcement Learning to long-context scenarios. We first establish the Evidence-Augmented Reasoning paradigm, validating it via Tree-Structured Evidence Sampling. We then introduce a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism.
arXiv Detail & Related papers (2026-01-15T11:40:57Z)
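The Group-Relative Evidence Reward suggests baselining each rollout's evidence score against its group, in the spirit of group-relative advantage estimation; the normalization below is a guess at such a scheme, with the evidence scores themselves assumed to come from the reward model.

```python
import numpy as np

def group_relative_evidence_reward(evidence_scores: np.ndarray) -> np.ndarray:
    """evidence_scores: per-rollout reward-model scores for one sampled group."""
    centered = evidence_scores - evidence_scores.mean()
    return centered / (evidence_scores.std() + 1e-6)  # GRPO-style normalization
```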
- Coupled Variational Reinforcement Learning for Language Model General Reasoning [83.82392089177841]
We propose Coupled Variational Reinforcement Learning (CoVRL) to bridge variational inference and reinforcement learning. CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines.
arXiv Detail & Related papers (2025-12-14T07:03:51Z)
- RLFR: Extending Reinforcement Learning for LLMs with Flow Environment [29.409251059248643]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). We propose RLFR, where flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection sampling data, and the velocity deviations of policy latents within them are quantified to serve as a reward signal. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards.
arXiv Detail & Related papers (2025-10-11T13:00:25Z)
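The flow reward is hard to pin down from the summary alone. Very roughly: a velocity field fitted on high-quality latents defines an expected direction of motion, and the policy latent's deviation from it is turned into a reward. The sketch below is only that rough reading; the field, the latents, and the exponential mapping are all assumptions.

```python
import numpy as np

def flow_reward(policy_velocity: np.ndarray,
                field_velocity: np.ndarray) -> float:
    """Reward is high when the policy latent's velocity agrees with the
    reference flow field's velocity at the same point."""
    deviation = float(np.linalg.norm(policy_velocity - field_velocity))
    return float(np.exp(-deviation))  # map deviation to a reward in (0, 1]
```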
- CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. For the actor, we use the perplexity of its generated response; for the critic, we use the variance of value estimates from a multi-head architecture. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
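This summary states its two bonuses directly: response perplexity for the actor and multi-head value variance for the critic. A minimal rendering follows; the scaling coefficient `beta` and the exact aggregation are assumptions.

```python
import numpy as np

def actor_curiosity_bonus(token_logprobs: np.ndarray, beta: float = 0.1) -> float:
    # Perplexity of the policy's own generated response.
    return beta * float(np.exp(-token_logprobs.mean()))

def critic_curiosity_bonus(head_values: np.ndarray, beta: float = 0.1) -> float:
    # Disagreement (variance) among a multi-head critic's value estimates.
    return beta * float(head_values.var())
```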
- TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning [11.573904453859098]
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs). Yet RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem: false negatives, where verifiers wrongly reject correct model outputs. We propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods.
arXiv Detail & Related papers (2025-05-20T17:16:44Z)
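The summary describes a lightweight LLM verifier augmenting rule-based checks to cut false negatives. A natural (assumed) composition is a cascade: accept on a rule-based pass and consult the LLM verifier only on rejections, as sketched below with hypothetical callables.

```python
from typing import Callable

def verify(answer: str, gold: str,
           rule_check: Callable[[str, str], bool],
           llm_verify: Callable[[str, str], bool]) -> bool:
    if rule_check(answer, gold):
        return True                   # cheap rule-based pass: accept
    return llm_verify(answer, gold)   # second opinion on rule-based rejections
```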
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing the mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains, including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome the limitations posed by binary verification.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
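Finally, the generative scoring technique in the last entry replaces binary verification with a soft, model-based score. A hedged sketch under our own prompt and parsing assumptions, with the judge abstracted as a text-in/text-out callable:

```python
import re
from typing import Callable

def soft_reward(judge: Callable[[str], str], question: str,
                reference: str, response: str) -> float:
    reply = judge(
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model response: {response}\n"
        "Rate the response's correctness from 0.0 to 1.0; reply with a number."
    )
    m = re.search(r"\d*\.?\d+", reply)  # pull the first numeric token
    return min(max(float(m.group()), 0.0), 1.0) if m else 0.0
```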