Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping
- URL: http://arxiv.org/abs/2601.12465v1
- Date: Sun, 18 Jan 2026 16:10:04 GMT
- Title: Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping
- Authors: Miao Peng, Weizhou Shen, Nuo Chen, Chenliang Li, Ming Yan, Jia Li,
- Abstract summary: Long-context reasoning requires both precise grounding and robust long-range reasoning. We propose DeepReasonQA, a KG-driven framework that constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. We show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters.
- Score: 38.280470586624496
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective at enhancing LLMs' short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of long-context QA data with high reasoning density, which would push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along the Validity and Relevance dimensions, thereby capturing critical learning signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.
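The abstract describes LongPAS only at the level of its ingredients: each reasoning step is scored along Validity and Relevance, and those scores drive fine-grained credit assignment so that "almost-there" trajectories are not uniformly penalized. The sketch below is one way such shaping could look; the score combination, the blending weight `alpha`, and all names (`Step`, `shaped_advantages`) are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of process advantage shaping in the spirit of LongPAS.
# The weighting scheme, score ranges, and function names are assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    text: str
    validity: float   # in [0, 1]: is the step logically sound?
    relevance: float  # in [0, 1]: does it draw on the right evidence?

def shaped_advantages(steps: List[Step],
                      outcome_advantage: float,
                      alpha: float = 0.5) -> List[float]:
    """Blend the trajectory-level (outcome) advantage with a per-step
    process score so a wrong final answer no longer erases the credit
    earned by valid, relevant intermediate steps."""
    advs = []
    for step in steps:
        process_score = step.validity * step.relevance  # assumed combination
        step_adv = (1 - alpha) * outcome_advantage + alpha * (2 * process_score - 1)
        advs.append(step_adv)
    return advs

# An "almost-there" trajectory: sound grounding and chaining, wrong final step.
steps = [Step("ground fact A", 1.0, 1.0),
         Step("chain A -> B", 1.0, 0.9),
         Step("final answer (wrong)", 0.0, 0.5)]
print(shaped_advantages(steps, outcome_advantage=-1.0))  # roughly [0.0, -0.1, -1.0]
```

The property to notice: under pure outcome-based RLVR every step in this failing trajectory would receive the same -1.0 advantage, whereas here the valid, relevant early steps are spared most of the penalty.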
Related papers
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards [51.45138356629732]
We introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks.
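As a rough illustration of the idea in this summary, one could add a dense grounding term to the sparse answer reward, e.g., recall of gold evidence passages among those the model cites. The recall formulation, the weight `beta`, and the passage-ID interface are assumptions; the paper's actual context reward may differ.

```python
# Hedged sketch: sparse answer reward plus a dense, verifiable context
# reward. Scoring grounding via recall over gold evidence IDs is an
# assumption here, not LongRLVR's confirmed design.
def combined_reward(answer_correct: bool,
                    cited_passage_ids: set,
                    gold_passage_ids: set,
                    beta: float = 0.5) -> float:
    answer_reward = 1.0 if answer_correct else 0.0  # sparse outcome signal
    if gold_passage_ids:
        context_reward = len(cited_passage_ids & gold_passage_ids) / len(gold_passage_ids)
    else:
        context_reward = 0.0
    return answer_reward + beta * context_reward     # dense auxiliary signal

print(combined_reward(False, {"p3", "p7"}, {"p3", "p7", "p9"}))  # 0.333...
```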
arXiv Detail & Related papers (2026-03-02T18:07:53Z)
- LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards [57.993003392037174]
LongR is a framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism. LongR consistently improves performance across diverse RL algorithms.
arXiv Detail & Related papers (2026-02-05T15:26:47Z)
- Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
- R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? [63.51955244144878]
R-HORIZON is a method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs). Based on R-HORIZON, we construct a long-horizon reasoning benchmark comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate their thinking budget appropriately across multiple problems.
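The summary suggests tasks are built by chaining interdependent problems so that the reasoning horizon grows with the number of sub-problems. A toy composition in that spirit follows; the arithmetic sub-problems are placeholders, not R-HORIZON's actual task construction.

```python
# Toy sketch of composing interdependent problems into one long-horizon
# task: each sub-problem consumes the previous answer, so the model must
# carry state across the full chain.
def compose_long_horizon(seed: int, n_steps: int = 3) -> tuple[str, int]:
    value, parts = seed, []
    for i in range(n_steps):
        value = value * 2 + 1  # each step's answer feeds the next step
        parts.append(f"Step {i + 1}: double the previous result and add 1.")
    prompt = f"Start with {seed}.\n" + "\n".join(parts) + "\nReport the final value."
    return prompt, value

prompt, gold = compose_long_horizon(3)
print(prompt)
print("gold answer:", gold)  # 3 -> 7 -> 15 -> 31
```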
arXiv Detail & Related papers (2025-10-09T13:16:22Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning [80.26953590563232]
We formalize the paradigm of long-context reasoning RL and identify key challenges in suboptimal training efficiency and an unstable optimization process. We propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B.
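"Progressive context scaling" reads naturally as a staged curriculum over context-length caps. The sketch below assumes exactly that; the stage schedule and the injected training stub are hypothetical, not QwenLong-L1's actual interface.

```python
# Illustrative sketch of progressive context scaling: run successive RL
# stages under a growing context-length cap, so the policy adapts from
# short to long contexts gradually.
from typing import Callable, Dict, List

def progressive_context_rl(policy: object,
                           dataset: List[Dict],
                           train_stage: Callable,
                           stages=(8_192, 32_768, 131_072)) -> object:
    for max_len in stages:
        # Restrict each stage to examples the current cap can hold.
        subset = [ex for ex in dataset if ex["context_tokens"] <= max_len]
        policy = train_stage(policy, subset, max_len)
    return policy

# Dummy usage: the "training" step just records which stage ran.
log = []
policy = progressive_context_rl(
    policy=object(),
    dataset=[{"context_tokens": n} for n in (4_000, 20_000, 100_000)],
    train_stage=lambda p, data, cap: (log.append((cap, len(data))), p)[1],
)
print(log)  # [(8192, 1), (32768, 2), (131072, 3)]
```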
arXiv Detail & Related papers (2025-05-23T09:31:55Z)
- Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance [33.16322104912836]
Large language models' (LLMs) reasoning ability is largely attributable to chain-of-thought (CoT) approaches. LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. Human beings, however, are naturally cognitive misers and tend to prompt language models for rather short responses.
arXiv Detail & Related papers (2025-04-13T14:12:14Z)
- Concise Reasoning via Reinforcement Learning [13.657506042120167]
We revisit the core principles of reinforcement learning (RL) and uncover a natural correlation between conciseness and accuracy that has been largely overlooked. We show that introducing a secondary phase of RL training, using a very small set of problems, can significantly shorten chains of thought.
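One plausible reading of that secondary phase is a short RL run whose reward trades accuracy against response length. The budgeted penalty below is an illustrative guess, not the paper's actual reward design.

```python
# Hedged sketch of a length-aware reward for a conciseness-focused
# secondary RL phase. The budget and penalty values are assumptions.
def concise_phase_reward(correct: bool, n_tokens: int,
                         budget: int = 512, penalty: float = 0.3) -> float:
    base = 1.0 if correct else -1.0
    overage = max(0, n_tokens - budget) / budget
    return base - penalty * min(overage, 1.0)  # only penalize beyond the budget

print(concise_phase_reward(True, 300))   # 1.0 (within budget)
print(concise_phase_reward(True, 1024))  # 0.7 (correct but verbose)
```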
arXiv Detail & Related papers (2025-04-07T15:35:54Z)
- ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering [42.146660039671076]
We develop a retrieve-then-reason framework for large language models (LLMs). We find that modern LLMs struggle to accurately retrieve relevant facts and instead often hallucinate "retrieved facts". We introduce ALR$^2$, a method that augments the long-context reasoning capability of LLMs via an explicit two-stage procedure.
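A minimal sketch of an explicit two-stage retrieve-then-reason loop, assuming a generic `llm` callable; the prompts are placeholders rather than ALR$^2$'s actual templates.

```python
# Stage 1 asks the model to quote relevant facts verbatim (curbing
# hallucinated "retrieved facts"); stage 2 reasons over those facts alone.
from typing import Callable

def retrieve_then_reason(llm: Callable[[str], str], context: str, question: str) -> str:
    # Stage 1: extract verbatim evidence from the long context.
    facts = llm("Quote only the sentences from the document that answer the question.\n"
                f"Document:\n{context}\nQuestion: {question}")
    # Stage 2: answer using the extracted evidence instead of the full context.
    return llm("Using only these facts, answer the question.\n"
               f"Facts:\n{facts}\nQuestion: {question}")

# Dummy usage with a toy "LLM" that just echoes its prompt's last line.
answer = retrieve_then_reason(lambda p: p.splitlines()[-1], "doc text", "Who?")
print(answer)
```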
arXiv Detail & Related papers (2024-10-04T08:29:12Z)