DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning
- URL: http://arxiv.org/abs/2410.17473v1
- Date: Tue, 22 Oct 2024 23:14:09 GMT
- Title: DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning
- Authors: Taisuke Kobayashi
- Abstract summary: This paper introduces a novel, theoretically grounded model of optimism and pessimism, derived from control as inference.
Although the heuristic model showed poor learning performance, DROP performed excellently in all tasks with high generality.
- Score: 6.20048328543366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning (RL), the temporal difference (TD) error is known to be related to the firing rate of dopamine neurons. It has been observed that dopamine neurons do not behave uniformly; rather, each responds to the TD error in an optimistic or pessimistic manner, which can be interpreted as a kind of distributional RL. To explain this biological data, a heuristic model has been designed with learning rates that are asymmetric for positive and negative TD errors. However, this heuristic model is not theoretically grounded, and it is unknown whether it can work as an RL algorithm. This paper therefore introduces a novel, theoretically grounded model with optimism and pessimism, derived from control as inference. In combination with ensemble learning, a distributional value function serving as a critic is estimated from regularly introduced optimism and pessimism. Based on its central value, the policy of an actor is improved. The proposed algorithm, called DROP (distributional and regular optimism and pessimism), is compared on dynamic tasks. Although the heuristic model showed poor learning performance, DROP showed excellent performance in all tasks with high generality. In other words, these results suggest that DROP is a new model that can elicit the potential contributions of optimism and pessimism.
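The asymmetric-learning-rate idea described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's DROP algorithm: each ensemble member is assumed to update its value estimate with a different learning rate depending on the sign of the TD error, and the actor would then use the central value of the ensemble. All names and rate choices here are illustrative assumptions.

```python
import numpy as np

def asymmetric_td_update(values, reward, value_next, gamma, lr_pos, lr_neg):
    """Update each ensemble member's value estimate for one transition.

    values:  array of per-member estimates V_i(s)
    lr_pos:  per-member learning rate applied when the TD error is positive
    lr_neg:  per-member learning rate applied when the TD error is negative
    """
    td_errors = reward + gamma * value_next - values
    lrs = np.where(td_errors > 0, lr_pos, lr_neg)
    return values + lrs * td_errors

# Five members spanning pessimistic (small lr_pos) to optimistic (large lr_pos),
# with the negative-side rates mirrored so the asymmetry varies regularly.
lr_pos = np.linspace(0.01, 0.2, 5)
lr_neg = lr_pos[::-1]

values = np.zeros(5)
for _ in range(200):
    values = asymmetric_td_update(values, reward=1.0, value_next=0.0,
                                  gamma=0.99, lr_pos=lr_pos, lr_neg=lr_neg)

# The "central value" of the resulting distributional estimate, which
# would drive the policy improvement in the actor.
central_value = np.median(values)
```

Because the members weight positive and negative TD errors differently, their estimates spread out into a distribution over values rather than collapsing to a single point, which is the distributional aspect the abstract refers to.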
Related papers
- Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases [76.9127853906115]
Bridging the gap between diffusion models and human preferences is crucial for their integration into practical generative workflows.
We propose Temporal Diffusion Policy Optimization with critic active neuron Reset (TDPO-R), a policy gradient algorithm that exploits the temporal inductive bias of diffusion models.
Empirical results demonstrate the superior efficacy of our methods in mitigating reward overoptimization.
arXiv Detail & Related papers (2024-02-13T15:55:41Z) - Human Trajectory Forecasting with Explainable Behavioral Uncertainty [63.62824628085961]
Human trajectory forecasting helps to understand and predict human behaviors, enabling applications from social robots to self-driving cars.
Model-free methods offer superior prediction accuracy but lack explainability, while model-based methods provide explainability but cannot predict well.
We show that BNSP-SFM achieves up to a 50% improvement in prediction accuracy, compared with 11 state-of-the-art methods.
arXiv Detail & Related papers (2023-07-04T16:45:21Z) - Bandit Social Learning: Exploration under Myopic Behavior [54.767961587919075]
We study social learning dynamics motivated by reviews on online platforms.
Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard for exploration.
We derive stark learning failures for any such behavior, and provide matching positive results.
arXiv Detail & Related papers (2023-02-15T01:57:57Z) - Exploration in Model-based Reinforcement Learning with Randomized Reward [40.87376174638752]
We show that under the kernelized linear regulator (KNR) model, reward randomization guarantees a partial optimism.
We further extend our theory to generalized function approximation and identify conditions for reward randomization to attain provably efficient exploration.
arXiv Detail & Related papers (2023-01-09T01:50:55Z) - Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach [6.7826352751791985]
We propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting.
We integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism.
We develop the computational algorithm based on variational inference, which is highly efficient and scalable.
arXiv Detail & Related papers (2022-10-26T02:14:10Z) - The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models [5.736353542430439]
We introduce the Boltzmann policy distribution (BPD), which serves as a prior over human policies.
BPD adapts via Bayesian inference to capture systematic deviations by observing human actions during a single episode.
We show that the BPD enables prediction of human behavior and human-AI collaboration as well as imitation learning-based human models do.
arXiv Detail & Related papers (2022-04-22T15:26:25Z) - Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z) - Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
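The penalty-based idea behind a pessimistic value iteration of this kind can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's PEVI algorithm: the Bellman backup is assumed to be penalized by an uncertainty quantifier that grows where the offline dataset has fewer samples, and the penalized values are truncated at zero. All quantities and the count-based penalty are illustrative.

```python
import numpy as np

def pessimistic_value_iteration(P, R, counts, gamma=0.9, beta=1.0, iters=100):
    """P: (S, A, S) transition probabilities; R: (S, A) rewards;
    counts: (S, A) visit counts in the offline dataset."""
    # Count-based uncertainty quantifier: rarely visited pairs get a
    # larger penalty, making the backup pessimistic about them.
    penalty = beta / np.sqrt(np.maximum(counts, 1))
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R - penalty + gamma * P @ V   # penalized Bellman backup
        Q = np.maximum(Q, 0.0)            # truncate below at zero
        V = Q.max(axis=1)
    return Q, V

# Toy MDP: at state 0, both actions give the same reward, but action 1
# is under-sampled, so pessimism drives its value down.
P = np.full((2, 2, 2), 0.5)
R = np.array([[1.0, 1.0], [0.0, 0.0]])
counts = np.array([[100, 1], [100, 100]])
Q, V = pessimistic_value_iteration(P, R, counts)
```

The effect is that the learned policy avoids state-action pairs the offline dataset barely covers, which is the intuition behind using pessimism to cope with distribution shift in offline RL.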
arXiv Detail & Related papers (2020-12-30T09:06:57Z) - Double Robust Representation Learning for Counterfactual Prediction [68.78210173955001]
We propose a novel scalable method to learn double-robust representations for counterfactual predictions.
We make robust and efficient counterfactual predictions for both individual and average treatment effects.
The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.
arXiv Detail & Related papers (2020-10-15T16:39:26Z) - Towards Tractable Optimism in Model-Based Reinforcement Learning [37.51073590932658]
To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism), but not by so much that it becomes inaccurate (estimation error).
We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP.
We show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
arXiv Detail & Related papers (2020-06-21T20:53:19Z) - Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning [93.1435980666675]
We show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms.
Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions.
arXiv Detail & Related papers (2020-06-15T18:37:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.