Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
- URL: http://arxiv.org/abs/2509.25849v1
- Date: Tue, 30 Sep 2025 06:41:57 GMT
- Title: Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
- Authors: Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, Zhi-Quan Luo,
- Abstract summary: Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. This exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. We propose an optimal assignment rule that adaptively distributes resources based on the model's current learning status.
- Score: 40.30643197563909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.
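The knapsack framing in the abstract can be made concrete with a minimal, hypothetical sketch (this is an illustration, not the paper's exact assignment rule): treat each extra rollout as a unit-cost item whose value is the increase in the probability that a task produces a non-zero GRPO gradient, i.e. that its rollouts contain at least one success and at least one failure, and allocate a fixed total budget greedily by marginal gain.

```python
# Hypothetical sketch of knapsack-style rollout allocation for GRPO.
# The "value" proxy below (probability of mixed outcomes among n rollouts
# of a task with success probability p) is an assumption for illustration,
# not the rule derived in the paper.

def nonzero_grad_prob(p: float, n: int) -> float:
    """P(at least one success AND one failure) = 1 - p^n - (1-p)^n."""
    return 1.0 - p**n - (1.0 - p) ** n

def allocate_budget(success_probs, total_budget, min_rollouts=1):
    """Greedy marginal-gain allocation (knapsack with unit item cost)."""
    n = {i: min_rollouts for i in range(len(success_probs))}
    remaining = total_budget - min_rollouts * len(success_probs)
    for _ in range(remaining):
        # Give the next rollout to the task with the largest marginal gain.
        best = max(
            range(len(success_probs)),
            key=lambda i: nonzero_grad_prob(success_probs[i], n[i] + 1)
            - nonzero_grad_prob(success_probs[i], n[i]),
        )
        n[best] += 1
    return [n[i] for i in range(len(success_probs))]

# Easy, medium, and hard tasks; the near-saturated tasks (p near 0 or 1)
# end up absorbing most of the budget once the medium task's gain decays.
probs = [0.95, 0.5, 0.05]
alloc = allocate_budget(probs, total_budget=24)
```

Under this proxy, a task the model almost always solves or almost always fails needs many rollouts before a mixed (gradient-bearing) outcome becomes likely, which mirrors the abstract's point that uniform allocation wastes budget on saturated tasks while starving hard ones.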
Related papers
- CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs [31.371566320424552]
CoBA-RL is a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements.
arXiv Detail & Related papers (2026-02-03T03:14:36Z) - Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs [126.45104018441698]
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs). We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. We propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies.
arXiv Detail & Related papers (2026-01-13T17:48:43Z) - Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation [8.50468505606714]
We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). This problem arises in applications such as robotics, control, and preference optimization in language models. We introduce a two-stage procedure -- meta-training followed by fine-tuning -- to address this problem.
arXiv Detail & Related papers (2025-11-16T21:05:21Z) - Train Long, Think Short: Curriculum Learning for Efficient Reasoning [51.506559652495476]
We propose a curriculum learning strategy for length-controlled reasoning. Our method starts with generous token budgets and gradually tightens them over training. Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines.
arXiv Detail & Related papers (2025-08-12T13:48:03Z) - ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning [54.08906841213777]
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. We propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of computation times. We show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times.
arXiv Detail & Related papers (2025-02-02T12:22:26Z) - Sample Efficient Reinforcement Learning by Automatically Learning to
Compose Subtasks [3.1594865504808944]
We propose an RL algorithm that automatically structures the reward function for sample efficiency, given a set of labels that signify subtasks.
We evaluate our algorithm in a variety of sparse-reward environments.
arXiv Detail & Related papers (2024-01-25T15:06:40Z) - A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [67.66904892192794]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for reinforcement learning. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large, and near-optimal policy switching cost. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z) - Explore to Generalize in Zero-Shot RL [38.43215023828472]
We study zero-shot generalization in reinforcement learning-optimizing a policy on a set of training tasks to perform well on a similar but unseen test task.
We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of $83\%$ on the Maze task and $74\%$ on Heist with $200$ training levels.
arXiv Detail & Related papers (2023-06-05T17:49:43Z) - Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning
with Linear Function Approximation [16.871660060209674]
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting.
We propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2 H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z) - MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven
Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.