Switching the Loss Reduces the Cost in Batch Reinforcement Learning
- URL: http://arxiv.org/abs/2403.05385v3
- Date: Tue, 12 Mar 2024 16:01:02 GMT
- Title: Switching the Loss Reduces the Cost in Batch Reinforcement Learning
- Authors: Alex Ayoub, Kaiwen Wang, Vincent Liu, Samuel Robertson, James McInerney, Dawen Liang, Nathan Kallus, and Csaba Szepesvári
- Abstract summary: We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy.
We empirically verify that FQI-LOG uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
- Score: 34.271542267787716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch
reinforcement learning (RL). We show that the number of samples needed to learn
a near-optimal policy with FQI-LOG scales with the accumulated cost of the
optimal policy, which is zero in problems where acting optimally achieves the
goal and incurs no cost. In doing so, we provide a general framework for
proving $\textit{small-cost}$ bounds, i.e. bounds that scale with the optimal
achievable cost, in batch RL. Moreover, we empirically verify that FQI-LOG uses
fewer samples than FQI trained with squared loss on problems where the optimal
policy reliably achieves the goal.
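To make the loss switch concrete, here is a minimal sketch of fitted Q-iteration on a fixed batch, with the squared loss swapped for log-loss (binary cross-entropy). The tabular setting, sigmoid parametrization, and gradient-step update below are illustrative assumptions, not the paper's implementation; costs are assumed normalized so that Q-values lie in $[0, 1]$, as log-loss requires.

```python
import numpy as np

def fqi(batch, n_states, n_actions, gamma=0.99, loss="log",
        iters=100, lr=0.5):
    """Fitted Q-iteration on a fixed batch of (s, a, cost, s') tuples.

    loss="log" scores predictions with binary cross-entropy, as in
    FQI-LOG; loss="sq" is the classical squared-loss variant. Costs
    are assumed normalized so that all Q-values lie in [0, 1].
    """
    # Parametrize Q through a sigmoid so predictions stay in (0, 1).
    logits = np.zeros((n_states, n_actions))
    for _ in range(iters):
        q = 1.0 / (1.0 + np.exp(-logits))          # sigmoid
        grad = np.zeros_like(logits)
        counts = np.zeros_like(logits)
        for s, a, cost, s_next in batch:
            # Bellman target: immediate cost plus discounted best continuation.
            y = np.clip(cost + gamma * q[s_next].min(), 0.0, 1.0)
            if loss == "log":
                # d/dlogit of cross-entropy with soft label y is (q - y).
                grad[s, a] += q[s, a] - y
            else:
                # d/dlogit of 0.5*(q - y)^2 is (q - y) * q * (1 - q).
                grad[s, a] += (q[s, a] - y) * q[s, a] * (1.0 - q[s, a])
            counts[s, a] += 1
        logits -= lr * grad / np.maximum(counts, 1)
    return 1.0 / (1.0 + np.exp(-logits))
```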
Related papers
- Adaptive $Q$-Network: On-the-fly Target Selection for Deep Reinforcement Learning [18.579378919155864]
We propose Adaptive $Q$-Network (AdaQN) as a new approach for automated Reinforcement Learning (AutoRL).
AdaQN takes into account the non-stationarity of the optimization procedure without requiring additional samples.
We demonstrate that AdaQN is theoretically sound and empirically validate it in MuJoCo control problems.
arXiv Detail & Related papers (2024-05-25T11:57:43Z)
- Value Augmented Sampling for Language Model Alignment and Personalization [39.070662999014836]
We present a new framework for reward optimization, Value Augmented Sampling (VAS).
VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function.
Our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one during deployment time.
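A schematic sketch of the decode-time idea: keep the base language model frozen and tilt its next-token distribution by a learned value estimate, so that composing rewards reduces to mixing value functions. The function name, the simple additive form, and the mixing comment below are illustrative assumptions, not the paper's exact parametrization.

```python
import numpy as np

def vas_step(base_logits, q_values, beta=1.0):
    """One decoding step of value-augmented sampling (illustrative).

    base_logits: next-token logits from the frozen base LM.
    q_values:    a learned value estimate per candidate token.
    Sampling from softmax(base_logits + beta * q_values) tilts the
    base policy toward high-value continuations without retraining it.
    """
    adjusted = base_logits + beta * q_values
    probs = np.exp(adjusted - adjusted.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Composing several rewards at deployment time then amounts to mixing
# their value estimates, e.g. beta1 * q_helpful + beta2 * q_concise.
```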
arXiv Detail & Related papers (2024-05-10T17:59:04Z)
- Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning [13.112202426665466]
Constrained RL is a framework for enforcing safe actions in Reinforcement Learning.
Most recent approaches to solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem.
We present an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories.
arXiv Detail & Related papers (2023-12-16T08:48:46Z)
- Towards Understanding and Improving GFlowNet Training [71.85707593318297]
We introduce an efficient evaluation strategy to compare the learned sampling distribution to the target reward distribution.
We propose prioritized replay training of high-reward $x$, relative edge flow policy parametrization, and a novel guided trajectory balance objective.
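For context, the standard (unguided) trajectory balance objective that the proposed guided variant builds on matches a learned log-partition function plus forward log-probabilities against the log-reward plus backward log-probabilities. A minimal sketch, with the batch layout and tensor shapes as assumptions:

```python
import torch

def trajectory_balance_loss(log_z, log_pf, log_pb, log_reward):
    """Standard trajectory balance objective over a batch of trajectories.

    log_z:      learned scalar, log of the partition function Z.
    log_pf:     [B, T] forward log-probabilities along each trajectory.
    log_pb:     [B, T] backward log-probabilities along each trajectory.
    log_reward: [B] log R(x) for each terminal object x.
    """
    lhs = log_z + log_pf.sum(dim=1)
    rhs = log_reward + log_pb.sum(dim=1)
    return ((lhs - rhs) ** 2).mean()
```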
arXiv Detail & Related papers (2023-05-11T22:50:41Z)
- Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
- Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation [16.871660060209674]
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the $\textit{reward-free}$ exploration setting.
We propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2 H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z)
- BCRLSP: An Offline Reinforcement Learning Framework for Sequential Targeted Promotion [8.499811428928071]
We propose the Budget Constrained Reinforcement Learning for Sequential Promotion (BCRLSP) framework to determine the value of cash bonuses to be sent to users.
We show that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines.
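The serving-time budget constraint can be illustrated with a simple greedy allocation over learned Q-values. This sketch, including the zero-cost "no bonus" level and the downgrade heuristic, is an assumption for illustration rather than the framework's actual optimization step.

```python
import numpy as np

def allocate_bonuses(q_values, costs, budget):
    """Greedy budget-constrained bonus allocation (illustrative).

    q_values: [n_users, n_levels] estimated long-term value of sending
              each bonus level to each user.
    costs:    [n_levels] cash cost per level; level 0 is assumed to be
              the zero-cost "no bonus" option.
    Picks the value-maximizing level per user, then downgrades the
    least valuable picks until total cost fits the budget.
    """
    choice = q_values.argmax(axis=1)
    while costs[choice].sum() > budget:
        # Value lost by downgrading each user to the no-bonus level.
        loss = q_values[np.arange(len(choice)), choice] - q_values[:, 0]
        loss[choice == 0] = np.inf        # already downgraded
        if np.isinf(loss).all():          # nothing left to downgrade
            break
        choice[loss.argmin()] = 0         # downgrade cheapest-to-lose user
    return choice
```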
arXiv Detail & Related papers (2022-07-16T00:10:12Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
- Model-Augmented Q-learning [112.86795579978802]
We propose a model-free RL (MFRL) framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution that is identical to the solution obtained by learning with the true reward.
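A minimal sketch of the shared-network idea: one trunk with three heads for the $Q$-values, the reward, and the next state. Layer sizes, head shapes, and names are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SharedQModel(nn.Module):
    """Shared trunk with three heads, sketching MQL's joint estimation
    of Q-values, the reward, and the transition."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.q_head = nn.Linear(hidden, n_actions)           # Q(s, .)
        self.reward_head = nn.Linear(hidden, n_actions)      # r(s, .)
        self.next_state_head = nn.Linear(hidden, n_actions * state_dim)

    def forward(self, state):
        h = self.trunk(state)
        q = self.q_head(h)
        r = self.reward_head(h)
        # One predicted next state per action: [B, n_actions, state_dim].
        s_next = self.next_state_head(h).view(-1, q.shape[1], state.shape[1])
        return q, r, s_next
```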
arXiv Detail & Related papers (2021-02-07T17:56:50Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result shows that sparsity-aware methods can make batch RL more sample efficient.
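As an illustration of the mechanism, here is one fitted Q-iteration step with an $\ell_1$ penalty: the Lasso fit zeroes out irrelevant features, which is what buys sample efficiency when the number of candidate features $d$ is large. The helper names and the use of scikit-learn are assumptions, not the paper's method.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_fqi_step(phi, costs, phi_next_best, w_prev, gamma=0.99, alpha=0.1):
    """One fitted Q-iteration step with l1 regularization (illustrative).

    phi:           [n, d] features of the observed (s, a) pairs.
    costs:         [n] observed immediate costs.
    phi_next_best: [n, d] features of the greedy next action at s'.
    w_prev:        [d] weights from the previous iteration.
    Returns a sparse weight vector for the updated Q estimate.
    """
    # Bellman targets under the previous Q estimate.
    targets = costs + gamma * phi_next_best @ w_prev
    # The l1 penalty drives weights of irrelevant features to exactly zero.
    model = Lasso(alpha=alpha, fit_intercept=False).fit(phi, targets)
    return model.coef_
```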
arXiv Detail & Related papers (2020-11-08T16:48:02Z)