Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning
- URL: http://arxiv.org/abs/2403.05385v5
- Date: Thu, 1 Aug 2024 16:02:06 GMT
- Title: Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning
- Authors: Alex Ayoub, Kaiwen Wang, Vincent Liu, Samuel Robertson, James McInerney, Dawen Liang, Nathan Kallus, Csaba Szepesvári
- Abstract summary: We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy.
We empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
- Score: 57.154674117714265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
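Below is a minimal sketch of the loss switch the abstract describes: the same fitted Q-iteration regression step, with either the usual squared loss or a log-loss (binary cross-entropy with a soft target). The network architecture, the clamping of targets, the normalization of costs to [0, 1], and all function names are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

def fqi_loss(q_net, target_net, batch, gamma=0.99, loss_type="log"):
    """One regression step of fitted Q-iteration on a batch of logged transitions."""
    s, a, c, s_next = batch  # states, actions, per-step costs in [0, 1], next states
    with torch.no_grad():
        # Bellman target: immediate cost plus discounted minimal future cost,
        # clamped so it remains a valid soft label for the log-loss.
        target = (c + gamma * target_net(s_next).min(dim=1).values).clamp(0.0, 1.0)
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    if loss_type == "log":
        # FQI-log: log-loss (binary cross-entropy with a soft label). Its curvature
        # grows as predictions approach zero, which is what yields the small-cost
        # sample-complexity behaviour described in the abstract.
        pred = pred.clamp(1e-6, 1.0 - 1e-6)
        return -(target * pred.log() + (1.0 - target) * (1.0 - pred).log()).mean()
    # Standard FQI: plain squared loss against the same Bellman target.
    return ((pred - target) ** 2).mean()

if __name__ == "__main__":
    n_states, n_actions, batch_size = 4, 2, 8
    make_q = lambda: nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(),
                                   nn.Linear(32, n_actions), nn.Sigmoid())
    q_net, target_net = make_q(), make_q()
    batch = (torch.rand(batch_size, n_states),
             torch.randint(0, n_actions, (batch_size,)),
             torch.rand(batch_size),
             torch.rand(batch_size, n_states))
    print(fqi_loss(q_net, target_net, batch, loss_type="log"))
```
Note that only the regression loss differs between the two variants; target computation and greedy policy extraction are unchanged, matching the abstract's framing of FQI-log as FQI with the loss switched.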
Related papers
- Simplify RLHF as Reward-Weighted SFT: A Variational Method [34.222095430239555]
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values.
We propose a novel simplification of RLHF from the perspective of variational inference.
We transform the alignment objective into a reward-driven supervised fine-tuning form to obtain noticeable improvement on training stability and effectiveness.
arXiv Detail & Related papers (2025-02-16T07:22:00Z)
- CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [56.98081258047281]
CITER enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy.
We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation.
Our experiments show that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
arXiv Detail & Related papers (2025-02-04T03:36:44Z)
- Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning [11.666700714916065]
Constrained RL is a framework for enforcing safe actions in Reinforcement Learning.
Most recent approaches for solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem.
We present an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories.
arXiv Detail & Related papers (2023-12-16T08:48:46Z)
- Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
- Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation [16.871660060209674]
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting.
We propose a new algorithm that collects at most $\widetilde{O}\left(\frac{d^2 H^5}{\epsilon^2}\right)$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z)
- BCRLSP: An Offline Reinforcement Learning Framework for Sequential Targeted Promotion [8.499811428928071]
We propose the Budget Constrained Reinforcement Learning for Sequential Promotion (BCRLSP) framework to determine the value of cash bonuses to be sent to users.
We show that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines.
arXiv Detail & Related papers (2022-07-16T00:10:12Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization (MUSBO).
Our framework discovers novel and high-quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
- Model-Augmented Q-learning [112.86795579978802]
We propose a model-free RL (MFRL) framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution which is identical to the solution obtained by learning with the true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result shows that sparsity-aware methods can make batch RL more sample-efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)