Related papers: Value-Based Deep RL Scales Predictably

URL: http://arxiv.org/abs/2502.04327v1
Date: Thu, 06 Feb 2025 18:59:47 GMT
Title: Value-Based Deep RL Scales Predictably
Authors: Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar
Abstract summary: We show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym.
Score: 100.21834069400023
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling data and compute is critical to the success of machine learning. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which is used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
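
The extrapolation idea in the abstract (fit a relationship at small scale, then predict the data requirement at a larger UTD ratio) can be sketched roughly as follows. This is only an illustrative sketch: the pure power-law form, the function names `fit_power_law` and `predict_data`, and all numbers are assumptions made for the example, not the functional form or results reported in the paper.

```python
import numpy as np


def fit_power_law(utd, data_required):
    """Fit data_required ~ a * utd**(-b) by least squares in log-log space.

    The pure power-law form is an assumption made for illustration only.
    """
    slope, intercept = np.polyfit(np.log(utd), np.log(data_required), 1)
    return float(np.exp(intercept)), float(-slope)


def predict_data(a, b, utd):
    """Evaluate the fitted power law at one or more UTD ratios."""
    return a * np.asarray(utd, dtype=float) ** (-b)


# Synthetic, made-up "small-scale" measurements (not taken from the paper):
# environment steps needed to reach a fixed performance level at each UTD.
utd_small = np.array([1.0, 2.0, 4.0, 8.0])
data_small = 1.0e6 * utd_small ** (-0.5)

a, b = fit_power_law(utd_small, data_small)

# Extrapolate to a larger UTD ratio that was not part of the fit.
utd_large = 16.0
predicted_data = float(predict_data(a, b, utd_large))

# By definition of the UTD ratio, the number of gradient updates is
# UTD * environment steps; it is used here as a crude compute proxy.
predicted_updates = utd_large * predicted_data

print(f"a={a:.3g}, b={b:.3g}")
print(f"predicted data at UTD={utd_large:g}: {predicted_data:.3g} steps")
print(f"predicted gradient updates: {predicted_updates:.3g}")
```

In this toy setting, raising the UTD ratio lowers the predicted data requirement while the update count rises, which is the kind of data-versus-compute trade-off the abstract describes as a Pareto frontier.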

Related papers

Data-Efficient RLVR via Off-Policy Influence Guidance [84.60336960383867]
This work proposes a theoretically grounded approach using influence functions to estimate the contribution of each data point to the learning objective. We develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy.
arXiv Detail & Related papers (2025-10-30T13:40:52Z)
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models [53.339700196282905]
A key challenge in applying reinforcement learning to diffusion large language models (dLLMs) is the intractability of their likelihood functions. We propose a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
arXiv Detail & Related papers (2025-10-13T17:47:50Z)
Compute-Optimal Scaling for Value-Based Deep RL [99.680827753493]
We investigate compute scaling for online, value-based deep RL. Our analysis reveals a nuanced interplay between model size, batch size, and UTD. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD.
arXiv Detail & Related papers (2025-08-20T17:54:21Z)
Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training [4.90288999217624]
We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Our approach extends the usual point estimate approaches, aka micro-annealing, to estimating scaling laws. We validate our approach through experiments on a pre-trained model with 7 billion parameters.
arXiv Detail & Related papers (2025-07-29T21:56:45Z)
Provably Efficient Online RLHF with One-Pass Reward Modeling [59.30310692855397]
We propose a one-pass reward modeling method that does not require storing the historical data and can be computed in constant time. We provide theoretical guarantees showing that our method improves both statistical and computational efficiency. We conduct experiments using Llama-3-8B-Instruct and Qwen2.5-7B-Instruct models on the Ultrafeedback-binarized and Mixture2 datasets.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU. As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z)
HyperINF: Unleashing the HyperPower of the Schulz's Method for Data Influence Estimation [37.62285675595782]
We propose HyperINF, an efficient and accurate influence function approximation method. We incorporate the generalized Fisher information (GFIM) as a low-rank approximation of the Hessian matrix. On LoRA-tuned models, HyperINF achieves superior downstream performance with minimal memory and computational overhead.
arXiv Detail & Related papers (2024-10-07T14:42:45Z)
A Distribution-Aware Flow-Matching for Generating Unstructured Data for Few-Shot Reinforcement Learning [1.0709300917082865]
We introduce a distribution-aware flow matching approach to generate synthetic unstructured data for few-shot reinforcement learning. Our approach addresses key challenges in traditional model-based RL, such as overfitting and data correlation. Results demonstrate that our method achieves stable convergence in terms of maximum Q-value while enhancing frame rates by 30% in the initial timestamps.
arXiv Detail & Related papers (2024-09-21T15:50:59Z)
Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent [23.669599662214686]
HyperAgent is a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. We demonstrate that HyperAgent offers robust performance in large-scale deep RL benchmarks. It can solve Deep Sea hard exploration problems with episodes that optimally scale with problem size and exhibits significant efficiency gains in the Atari suite.
arXiv Detail & Related papers (2024-02-05T07:07:30Z)
Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels [57.46832672991433]
We propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS). We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noise. We develop an expectation-propagation expectation-maximization algorithm for efficient posterior inference and function estimation.
arXiv Detail & Related papers (2023-10-09T03:55:09Z)
Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets. We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX. MEX integrates estimation and planning components while automatically balancing exploration and exploitation. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance. We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset. We introduce a novel piecewise power law (PPL) that handles the two data regimes differently.
arXiv Detail & Related papers (2023-03-02T21:48:22Z)
Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences. Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.