HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning
- URL: http://arxiv.org/abs/2601.22448v1
- Date: Fri, 30 Jan 2026 01:31:17 GMT
- Title: HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning
- Authors: Weiqi Wang, Xin Liu, Binxuan Huang, Hejie Cui, Rongzhi Zhang, Changlong Yu, Shuowei Jin, Jingfeng Yang, Qingyu Yin, Zhengyang Wang, Zheng Li, Yifan Gao, Priyanka Nigam, Bing Yin, Lihong Li, Yangqiu Song
- Abstract summary: HeaPA consistently improves accuracy and reaches target performance with fewer computations. Our analyses suggest these gains come from frontier-focused sampling and on-policy pool growth.
- Score: 78.12979615107564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RLVR is now a standard way to train LLMs on reasoning tasks with verifiable outcomes, but when rollout generation dominates the cost, efficiency depends heavily on which prompts you sample and when. In practice, prompt pools are often static or only loosely tied to the model's learning progress, so uniform sampling can't keep up with the shifting capability frontier and ends up wasting rollouts on prompts that are already solved or still out of reach. Existing approaches improve efficiency through filtering, curricula, adaptive rollout allocation, or teacher guidance, but they typically assume a fixed pool, which makes it hard to support stable on-policy pool growth, or they add extra teacher cost and latency. We introduce HeaPA (Heap Sampling and On-Policy Query Augmentation), which maintains a bounded, evolving pool, tracks the frontier using heap-based boundary sampling, expands the pool via on-policy augmentation with lightweight asynchronous validation, and stabilizes correlated queries through topology-aware re-estimation of pool statistics and controlled reinsertion. Across two training corpora, two training recipes, and seven benchmarks, HeaPA consistently improves accuracy and reaches target performance with fewer computations while keeping wall-clock time comparable. Our analyses suggest these gains come from frontier-focused sampling and on-policy pool growth, with the benefits becoming larger as model scale increases. Our code is available at https://github.com/horizon-rl/HeaPA.
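The abstract names the ingredients (a bounded pool, heap-based boundary sampling, controlled reinsertion) but not their exact form, so the following is only a minimal sketch of the general idea, not HeaPA's actual algorithm. It assumes the "frontier" means prompts whose estimated pass rate is closest to 0.5; the class `FrontierPool`, the Laplace-smoothed pass-rate estimate, and the eviction rule are all hypothetical choices made for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class PromptStats:
    prompt: str
    successes: int = 0
    attempts: int = 0

    @property
    def pass_rate(self) -> float:
        # Assumption: Laplace smoothing, so an unseen prompt starts at 0.5.
        return (self.successes + 1) / (self.attempts + 2)


def frontier_distance(stats: PromptStats, target: float = 0.5) -> float:
    # Assumption: the frontier is where the pass rate is near `target`.
    return abs(stats.pass_rate - target)


class FrontierPool:
    """Hypothetical bounded prompt pool sampled via a min-heap."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.stats = {}    # prompt -> PromptStats
        self._heap = []    # entries: (distance, counter, prompt)
        self._live = {}    # prompt -> counter of its valid heap entry
        self._counter = 0

    def _push(self, prompt: str) -> None:
        self._counter += 1
        self._live[prompt] = self._counter
        key = frontier_distance(self.stats[prompt])
        heapq.heappush(self._heap, (key, self._counter, prompt))

    def add(self, prompt: str) -> None:
        if prompt in self.stats:
            return
        if len(self.stats) >= self.capacity:
            # Keep the pool bounded: drop the prompt farthest from the frontier.
            victim = max(self.stats.values(), key=frontier_distance)
            del self.stats[victim.prompt]
            self._live.pop(victim.prompt, None)
        self.stats[prompt] = PromptStats(prompt)
        self._push(prompt)

    def sample(self, k: int) -> list:
        # Pop up to k prompts closest to the frontier, skipping stale entries.
        batch = []
        while self._heap and len(batch) < k:
            _, counter, prompt = heapq.heappop(self._heap)
            if self._live.get(prompt) == counter:
                del self._live[prompt]
                batch.append(prompt)
        return batch

    def update(self, prompt: str, successes: int, attempts: int) -> None:
        # Fold rollout outcomes into the estimate, then reinsert the prompt.
        if prompt not in self.stats:
            return  # evicted while its rollouts were in flight
        s = self.stats[prompt]
        s.successes += successes
        s.attempts += attempts
        self._push(prompt)
```

In this sketch a sampled prompt leaves the heap until `update` reinserts it with a refreshed key, and newly added prompts (for example, augmented queries) enter at distance 0 so they are tried early. Stale heap entries are invalidated lazily through the `_live` counter map rather than removed in place, which keeps each operation at heap-push or heap-pop cost.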
Related papers
- SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models [67.41779761651924]
SOUP is a framework that unifies off- and on-policy learning within individual samples at the token level. It consistently outperforms standard on-policy training and existing off-policy extensions.
arXiv Detail & Related papers (2026-01-29T09:56:15Z)
- CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling [11.252930904797]
We propose Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS). CoPRIS mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. Experiments show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems.
arXiv Detail & Related papers (2025-11-05T11:39:32Z)
- Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training [47.26632817047513]
Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates. We propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process.
arXiv Detail & Related papers (2025-10-06T16:34:09Z)
- Single-stream Policy Optimization [21.214853668053234]
We introduce Single-stream Policy Optimization (SPO). SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings.
arXiv Detail & Related papers (2025-09-16T16:39:11Z)
- TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z)
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z)
- Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA).
arXiv Detail & Related papers (2025-03-24T17:51:39Z)