Bootstrapping Expectiles in Reinforcement Learning
- URL: http://arxiv.org/abs/2406.04081v1
- Date: Thu, 6 Jun 2024 13:51:39 GMT
- Title: Bootstrapping Expectiles in Reinforcement Learning
- Authors: Pierre Clavier, Emmanuel Rachelson, Erwan Le Pennec, Matthieu Geist,
- Abstract summary: Many classic Reinforcement Learning (RL) algorithms rely on a Bellman operator, which involves an expectation over the next states.
In practice, this can be very simply done by replacing the $L$ loss with a more general expectile loss for the critic.
For the overestimation problem, we show that the proposed approach, ExpectRL, provides better results than a classic twin-critic.
- Score: 25.793702194455772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many classic Reinforcement Learning (RL) algorithms rely on a Bellman operator, which involves an expectation over the next states, leading to the concept of bootstrapping. To introduce a form of pessimism, we propose to replace this expectation with an expectile. In practice, this can be very simply done by replacing the $L_2$ loss with a more general expectile loss for the critic. Introducing pessimism in RL is desirable for various reasons, such as tackling the overestimation problem (for which classic solutions are double Q-learning or the twin-critic approach of TD3) or robust RL (where transitions are adversarial). We study empirically these two cases. For the overestimation problem, we show that the proposed approach, ExpectRL, provides better results than a classic twin-critic. On robust RL benchmarks, involving changes of the environment, we show that our approach is more robust than classic RL algorithms. We also introduce a variation of ExpectRL combined with domain randomization which is competitive with state-of-the-art robust RL agents. Eventually, we also extend \ExpectRL with a mechanism for choosing automatically the expectile value, that is the degree of pessimism
Related papers
- Learning from Delayed Feedback in Games via Extra Prediction [26.93300099029726]
This study raises and addresses the problem of time-delayed feedback in learning in games.<n>To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL)
arXiv Detail & Related papers (2025-09-26T14:46:21Z) - Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening [36.81125165911328]
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities.<n>We investigate whether current reinforcement learning algorithms merely sharpen the base model's distribution around problems it can already solve.<n>We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings.
arXiv Detail & Related papers (2025-06-03T01:15:15Z) - Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning [55.36978389831446]
We recast reflective exploration within the Bayes-Adaptive RL framework.<n>Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on observed outcomes.
arXiv Detail & Related papers (2025-05-26T22:51:00Z) - Improving LLM General Preference Alignment via Optimistic Online Mirror Descent [57.622821649679786]
Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences.
In this paper, we drop the Bradley-Terry (BT) model assumption and study LLM alignment under general preferences, formulated as a two-player game.
We show that our approach achieves an $O(T-1)$ bound on the duality gap, improving upon the previous $O(T-1/2)$ result.
arXiv Detail & Related papers (2025-02-24T05:24:52Z) - Outlier-Robust Training of Machine Learning Models [21.352210662488112]
We propose an Adaptive Alternation Algorithm for training machine learning models with outliers.
The algorithm iteratively trains the model by using a weighted version of the non-robust loss, while updating the weights at each.
Considering arbitrary outliers (i.e., with no distributional assumption on the outliers), we show that the use of robust loss kernels sigma increases the region of convergence.
arXiv Detail & Related papers (2024-12-31T04:19:53Z) - DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning [6.20048328543366]
This paper introduces a novel theoretically-grounded model with optimism and pessimism, which is derived from control as inference.
Although the model showed poor learning performance, DROP showed excellent one in all tasks with high generality.
arXiv Detail & Related papers (2024-10-22T23:14:09Z) - Reinforcement Learning with Quasi-Hyperbolic Discounting [2.3999111269325266]
Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias.
Our work significantly advances the practical application of reinforcement learning.
arXiv Detail & Related papers (2024-09-16T06:00:42Z) - More Benefits of Being Distributional: Second-Order Bounds for
Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z) - Bandit Social Learning: Exploration under Myopic Behavior [54.767961587919075]
We study social learning dynamics motivated by reviews on online platforms.
Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regards to exploration.
We derive stark learning failures for any such behavior, and provide matching positive results.
arXiv Detail & Related papers (2023-02-15T01:57:57Z) - Learning To Dive In Branch And Bound [95.13209326119153]
We propose L2Dive to learn specific diving structurals with graph neural networks.
We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions.
arXiv Detail & Related papers (2023-01-24T12:01:45Z) - Lookback for Learning to Branch [77.32867454769936]
Bipartite Graph Neural Networks (GNNs) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers.
Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) in branch-and-bound (B&B) solvers.
arXiv Detail & Related papers (2022-06-30T02:33:32Z) - Bellman-consistent Pessimism for Offline Reinforcement Learning [46.97637726255375]
We introduce the notion of Bellman-consistent pessimism for general function approximation.
Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting.
arXiv Detail & Related papers (2021-06-13T05:50:36Z) - A Generalised Inverse Reinforcement Learning Framework [24.316047317028147]
inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP base on observed trajectories.
We introduce an alternative training loss that puts more weights on future states which yields a reformulation of the (maximum entropy) IRL problem.
The algorithms we devised exhibit enhanced performances (and similar tractability) than off-the-shelf ones in multiple OpenAI gym environments.
arXiv Detail & Related papers (2021-05-25T10:30:45Z) - Faster Game Solving via Predictive Blackwell Approachability: Connecting
Regret Matching and Mirror Descent [119.5481797273995]
Follow-the-regularized-leader (FTRL) and online mirror descent (OMD) are the most prevalent regret minimizers in online convex optimization.
We show that RM and RM+ are the algorithms that result from running FTRL and OMD, respectively, to select the halfspace to force at all times in the underlying Blackwell approachability game.
In experiments across 18 common zero-sum extensive-form benchmark games, we show that predictive RM+ coupled with counterfactual regret minimization converges vastly faster than the fastest prior algorithms.
arXiv Detail & Related papers (2020-07-28T16:49:55Z) - Towards Tractable Optimism in Model-Based Reinforcement Learning [37.51073590932658]
To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error)
We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP.
We show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
arXiv Detail & Related papers (2020-06-21T20:53:19Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $tilde O(sqrtS2 A H4 K)$ regret for previous rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.