LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
- URL: http://arxiv.org/abs/2510.06915v2
- Date: Tue, 04 Nov 2025 05:29:20 GMT
- Title: LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
- Authors: Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
- Abstract summary: We introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios. We propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs.
- Score: 45.520815757751194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reward model (RM) plays a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
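To make the two task formats concrete, here is a minimal sketch (not the paper's evaluation harness) of how Pairwise Comparison accuracy and Best-of-N selection are typically scored; `reward_model` is a hypothetical stand-in for any RM that maps a (context, response) pair to a scalar score.

```python
from typing import Callable, Sequence

def pairwise_accuracy(
    reward_model: Callable[[str, str], float],
    examples: Sequence[tuple[str, str, str]],  # (context, chosen, rejected)
) -> float:
    """Fraction of pairs where the RM prefers the human-chosen response."""
    correct = sum(
        reward_model(ctx, chosen) > reward_model(ctx, rejected)
        for ctx, chosen, rejected in examples
    )
    return correct / len(examples)

def best_of_n(
    reward_model: Callable[[str, str], float],
    context: str,
    candidates: Sequence[str],
) -> str:
    """Return the candidate the RM scores highest for the given long context."""
    return max(candidates, key=lambda resp: reward_model(context, resp))
```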
Related papers
- LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards [57.993003392037174]
LongR is a framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism. LongR consistently improves performance across diverse RL algorithms.
arXiv Detail & Related papers (2026-02-05T15:26:47Z)
- R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? [63.51955244144878]
R-HORIZON is a method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs). Based on R-HORIZON, we construct a long-horizon reasoning benchmark comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget appropriately across multiple problems.
arXiv Detail & Related papers (2025-10-09T13:16:22Z)
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning [80.26953590563232]
We formalize the paradigm of long-context reasoning RL and identify key challenges in suboptimal training efficiency and an unstable optimization process. We propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B.
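One common reading of "progressive context scaling" is a curriculum whose maximum context length grows across training stages; the sketch below illustrates that idea only. The stage lengths and the commented-out `train_one_stage` trainer are illustrative assumptions, not QwenLong-L1's actual recipe.

```python
def progressive_context_scaling(model, dataset, stage_lengths=(4_096, 16_384, 60_000)):
    """Train in stages, each admitting longer contexts than the last (sketch)."""
    for stage, max_len in enumerate(stage_lengths):
        # Keep only examples that fit the current context budget.
        subset = [ex for ex in dataset if ex["num_tokens"] <= max_len]
        print(f"stage {stage}: {len(subset)} examples, contexts up to {max_len} tokens")
        # train_one_stage(model, subset, max_context=max_len)  # hypothetical trainer
    return model
```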
arXiv Detail & Related papers (2025-05-23T09:31:55Z)
- Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision [40.63870977649693]
Chain-of-Thought prompting has shown promise for multi-step reasoning, but its effectiveness for long-context scenarios remains underexplored. We propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios.
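A self-sampling bootstrap loop of this kind typically samples several reasoning paths per question and keeps only those that pass a quality filter for reuse as supervision. The sketch below shows that generic pattern; `generate` and `passes_quality_check` are hypothetical stand-ins, not LongRePS's actual protocol.

```python
from typing import Callable

def bootstrap_reasoning_paths(
    questions: list[str],
    generate: Callable[[str], str],                 # samples one reasoning path
    passes_quality_check: Callable[[str, str], bool],
    samples_per_question: int = 8,
) -> list[tuple[str, str]]:
    """Collect (question, reasoning_path) pairs that survive the filter."""
    kept = []
    for q in questions:
        for _ in range(samples_per_question):
            path = generate(q)
            if passes_quality_check(q, path):
                kept.append((q, path))
    return kept
```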
arXiv Detail & Related papers (2025-02-28T07:15:12Z)
- Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling [87.17041933863041]
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs). We introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the model's ability to mitigate length bias and follow length instructions. We also propose the Rc-RM and Rc-DPO algorithms to leverage the Rc-BT model for reward modeling and direct policy optimization.
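For reference, the vanilla Bradley-Terry objective that Rc-BT extends is the standard pairwise reward-modeling loss, shown below in PyTorch; the response-conditioned variant itself is not reproduced here.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(
    chosen_rewards: torch.Tensor,    # r(x, y_chosen), shape (batch,)
    rejected_rewards: torch.Tensor,  # r(x, y_rejected), shape (batch,)
) -> torch.Tensor:
    # Standard pairwise objective: -log sigmoid(r_chosen - r_rejected),
    # averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```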
arXiv Detail & Related papers (2025-02-02T14:50:25Z)
- LongReward: Improving Long-context Large Language Models with AI Feedback [54.3321542678909]
LongReward is a novel method that provides rewards for long-context model responses from four human-valued dimensions.
Our experiments indicate that LongReward not only significantly improves models' long-context performance but also enhances their ability to follow short instructions.
arXiv Detail & Related papers (2024-10-28T17:50:42Z)