Related papers: Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

URL: http://arxiv.org/abs/2508.03018v1
Date: Tue, 05 Aug 2025 02:56:58 GMT
Title: Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning
Authors: Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, Guillaume Sartoretti
Abstract summary: We propose a three-stage framework to develop robust reasoning models for sparse-reward environments. Our framework bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art performance with significant token efficiency.
Score: 15.103861901247125
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art with significant token efficiency, providing a new recipe for reasoning models in agentic planning.
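The abstract names reward-gated rejection sampling as the selection step of the refinement stage but does not spell it out here. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes each sampled trajectory carries a single sparse terminal reward, and it keeps only trajectories whose reward reaches a gate. The `Trajectory` class, the `reward_gated_filter` function, and the `threshold` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One sampled episode (hypothetical structure, not from the paper)."""
    task_id: str
    steps: List[str]
    reward: float  # sparse terminal reward, e.g. 1.0 on success, 0.0 otherwise


def reward_gated_filter(trajectories: List[Trajectory],
                        threshold: float = 1.0) -> List[Trajectory]:
    """Keep only trajectories whose terminal reward reaches the gate.

    Rejected trajectories are simply dropped, so the next training round
    sees only experiences that passed the gate. The threshold value is an
    assumption made for this sketch.
    """
    return [t for t in trajectories if t.reward >= threshold]


# Example: two sampled attempts at the same task, one successful.
samples = [
    Trajectory("task-1", ["go to desk", "take mug"], reward=1.0),
    Trajectory("task-1", ["go to shelf"], reward=0.0),
]
kept = reward_gated_filter(samples)
```

In a self-improving loop of the kind the abstract describes, the kept trajectories would presumably become the fine-tuning data for the next iteration; how the paper actually sets the gate is not stated in this listing.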

Related papers

Bidirectional Curriculum Generation: A Multi-Agent Framework for Data-Efficient Mathematical Reasoning [16.95900718416944]
We introduce a novel Bidirectional Curriculum Generation framework to maximize the instructional value of every training sample. Unlike rigid trajectories, our multi-agent ecosystem mimics adaptive pedagogy to establish a closed feedback loop. This mechanism ensures that the model consumes only the most effective data at any given stage.
arXiv Detail & Related papers (2026-03-05T12:49:21Z)
DiffuReason: Bridging Latent Reasoning and Generative Refinement for Sequential Recommendation [20.756497463882763]
We propose DiffuReason, a unified "Think-then-Diffuse" framework for sequential recommendation. It integrates multi-step Thinking Tokens for latent reasoning, diffusion-based refinement for denoising intermediate representations, and end-to-end Group Relative Policy Optimization. Experiments on four benchmarks demonstrate that DiffuReason consistently improves diverse backbone architectures.
arXiv Detail & Related papers (2026-02-10T12:55:30Z)
Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search [62.1546099504045]
We propose a dual-phase test-time scaling framework that separates reasoning into planning and execution. Specifically, we decompose reasoning trajectories and develop reward models for each phase, enabling the search to explore and prune plans and executions separately. Experiments on both mathematical reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing redundant computation.
arXiv Detail & Related papers (2025-09-29T19:27:23Z)
Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following [10.119219532863767]
Lazy reasoning during the thinking stage is the primary factor contributing to poor instruction adherence. We propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking. Our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.
arXiv Detail & Related papers (2025-08-05T07:42:00Z)
Reinforced Reasoning for Embodied Planning [18.40186665383579]
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. We introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning.
arXiv Detail & Related papers (2025-05-28T07:21:37Z)
LARES: Latent Reasoning for Sequential Recommendation [96.26996622771593]
We present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation. Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity. Our framework exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.
arXiv Detail & Related papers (2025-05-22T16:22:54Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Guided Exploration for Efficient Relational Model Learning [20.853275384388805]
Efficient exploration is critical for learning relational models in large-scale environments with complex, long-horizon tasks. Goal-literal babbling (GLIB) improves upon random exploration by setting and planning to novel goals. We introduce Baking-Large, a challenging domain with extensive state-action spaces and long-horizon tasks.
arXiv Detail & Related papers (2025-02-10T04:23:01Z)
Coarse-to-Fine Process Reward Modeling for Mathematical Reasoning [11.15613673478208]
The Process Reward Model (PRM) plays a crucial role in mathematical reasoning tasks, requiring high-quality supervised process data. We observe that reasoning steps generated by Large Language Models (LLMs) often fail to exhibit strictly incremental information, leading to redundancy. We propose CFPRM, a simple yet effective coarse-to-fine strategy for detecting redundant steps.
arXiv Detail & Related papers (2025-01-23T12:44:45Z)
Spatial Reasoning and Planning for Deep Embodied Agents [2.7195102129095003]
This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks. It focuses on enhancing learning efficiency, interpretability, and transferability across novel scenarios.
arXiv Detail & Related papers (2024-09-28T23:05:56Z)
Adaptive Planning with Generative Models under Uncertainty [20.922248169620783]
Planning with generative models has emerged as an effective decision-making paradigm across a wide range of domains. While continuous replanning at each timestep might seem intuitive because it allows decisions to be made based on the most recent environmental observations, it results in substantial computational challenges. Our work addresses this challenge by introducing a simple adaptive planning policy that leverages the generative model's ability to predict long-horizon state trajectories.
arXiv Detail & Related papers (2024-08-02T18:07:53Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z)
Guiding Language Model Reasoning with Planning Tokens [122.43639723387516]
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks. We propose a hierarchical generation scheme to encourage a more structural generation of chain-of-thought steps. Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z)
When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.