DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
- URL: http://arxiv.org/abs/2003.07305v1
- Date: Mon, 16 Mar 2020 16:18:52 GMT
- Title: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
- Authors: Aviral Kumar, Abhishek Gupta, Sergey Levine
- Abstract summary: We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
- Score: 96.90215318875859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning can learn effective policies for a wide range of
tasks, but is notoriously difficult to use due to instability and sensitivity
to hyperparameters. The reasons for this remain unclear. When using standard
supervised methods (e.g., for bandits), on-policy data collection provides
"hard negatives" that correct the model in precisely those states and actions
that the policy is likely to visit. We call this phenomenon "corrective
feedback." We show that bootstrapping-based Q-learning algorithms do not
necessarily benefit from this corrective feedback, and training on the
experience collected by the algorithm is not sufficient to correct errors in
the Q-function. In fact, Q-learning and related methods can exhibit
pathological interactions between the distribution of experience collected by
the agent and the policy induced by training on that experience, leading to
potential instability, sub-optimal convergence, and poor results when learning
from noisy, sparse or delayed rewards. We demonstrate the existence of this
problem, both theoretically and empirically. We then show that a specific
correction to the data distribution can mitigate this issue. Based on these
observations, we propose a new algorithm, DisCor, which computes an
approximation to this optimal distribution and uses it to re-weight the
transitions used for training, resulting in substantial improvements in a range
of challenging RL settings, such as multi-task learning and learning from noisy
reward signals. A blog post presenting a summary of this work is available at:
https://bair.berkeley.edu/blog/2020/03/16/discor/.
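The re-weighting scheme can be illustrated with a small tabular sketch. Following the idea described in the abstract, transitions whose bootstrap targets are estimated to be inaccurate are down-weighted, while an auxiliary error estimate Delta is updated alongside Q. The tabular setting, hyperparameter values, and exact update rules below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Tabular sketch of DisCor-style reweighting (illustrative only). Q is trained
# on replayed transitions, but each transition is weighted by how trustworthy
# its bootstrap target is believed to be: Delta tracks an estimate of
# accumulated target error, and transitions whose next-state targets carry a
# large estimated error receive exponentially smaller weight.

n_states, n_actions = 10, 2
gamma, lr, tau = 0.99, 0.1, 1.0      # discount, learning rate, temperature (assumed)

Q = np.zeros((n_states, n_actions))
Delta = np.zeros((n_states, n_actions))   # running estimate of target error

def weighted_update(batch):
    """batch: list of (s, a, r, s2, done) transitions sampled from a replay buffer."""
    for s, a, r, s2, done in batch:
        a2 = int(Q[s2].argmax())
        # Down-weight transitions whose bootstrap target is estimated to be wrong.
        target_err = 0.0 if done else Delta[s2, a2]
        w = np.exp(-gamma * target_err / tau)   # w in (0, 1]; normalized per batch in the paper

        target = r + (0.0 if done else gamma * Q[s2, a2])
        td_error = target - Q[s, a]
        Q[s, a] += lr * w * td_error

        # Delta accumulates the current Bellman error plus the discounted
        # estimated error of the target it bootstrapped from.
        err_target = abs(td_error) + (0.0 if done else gamma * Delta[s2, a2])
        Delta[s, a] += lr * (err_target - Delta[s, a])
```

In the paper, both Q and the error model are neural networks trained with weighted Bellman backups and batch-normalized weights; the tabular version above only conveys the weighting scheme.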
Related papers
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
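The label-smoothing component mentioned in the SEER entry above can be sketched generically as a Bradley-Terry preference loss with smoothed labels. The function below is an assumed, minimal illustration of that idea; the policy-regularization part and SEER's actual hyperparameters are omitted.

```python
import numpy as np

def smoothed_preference_loss(r_sum_1, r_sum_2, pref, eps=0.1):
    """Bradley-Terry preference loss with label smoothing (generic sketch).

    r_sum_1, r_sum_2: predicted returns of segment 1 / segment 2 under the
                      learned reward model, shape [batch].
    pref: 1.0 where segment 1 was preferred, 0.0 where segment 2 was preferred.
    eps:  label-smoothing strength (assumed value).
    """
    # Soften the hard preference labels toward 0.5 so the reward model does
    # not overfit to noisy preference annotations.
    y = pref * (1.0 - eps) + 0.5 * eps
    # P(segment 1 preferred) under the Bradley-Terry model.
    p1 = 1.0 / (1.0 + np.exp(-(r_sum_1 - r_sum_2)))
    # Binary cross-entropy against the smoothed labels.
    return -np.mean(y * np.log(p1 + 1e-8) + (1.0 - y) * np.log(1.0 - p1 + 1e-8))
```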
Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate this, a line of methods proposes replaying data from previously learned tasks when learning new ones.
However, storing real data from earlier tasks is often impractical because of memory constraints or data privacy concerns.
Instead, data-free replay methods synthesize replay samples by inverting the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
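The model-inversion idea behind data-free replay can be sketched as follows: synthetic inputs are optimized so that a frozen classifier assigns them to a chosen old class. This is a bare-bones sketch under assumed shapes and hyperparameters; practical methods add feature-statistics matching and image priors.

```python
import torch
import torch.nn.functional as F

def invert_samples(classifier, target_class, n=16, input_dim=128, steps=200, lr=0.1):
    """Synthesize pseudo-samples of `target_class` from a frozen classifier by
    gradient descent on its class loss (bare-bones data-free replay sketch)."""
    classifier.eval()
    for p in classifier.parameters():
        p.requires_grad_(False)

    # Start from noise and optimize the inputs themselves.
    x = torch.randn(n, input_dim, requires_grad=True)
    labels = torch.full((n,), target_class, dtype=torch.long)
    opt = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Push the classifier toward labeling x as the old class; the small L2
        # term is a crude prior that keeps the synthetic inputs bounded.
        loss = F.cross_entropy(classifier(x), labels) + 1e-3 * x.pow(2).mean()
        loss.backward()
        opt.step()

    return x.detach()   # replayed alongside new-task data during training
```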
Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
Learning Representations on the Unit Sphere: Investigating Angular Gaussian and von Mises-Fisher Distributions for Online Continual Learning [7.145581090959242]
We propose a memory-based representation learning technique equipped with our new loss functions.
We demonstrate that the proposed method outperforms the current state-of-the-art methods on both standard evaluation scenarios and realistic scenarios with blurry task boundaries.
arXiv Detail & Related papers (2023-06-06T02:38:01Z)
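A von Mises-Fisher-style loss on the unit sphere, as in the entry above, can be sketched by L2-normalizing features and class prototypes and scaling their cosine similarity by a concentration parameter. This is the standard vMF/cosine-softmax construction, given here as an assumed illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def vmf_classification_loss(features, prototypes, labels, kappa=16.0):
    """Cross-entropy over vMF-style cosine logits (generic sketch).

    features:   [batch, dim] embeddings from the encoder.
    prototypes: [num_classes, dim] learnable class means.
    labels:     [batch] integer class labels.
    kappa:      concentration parameter (assumed value); larger kappa makes
                the distribution on the sphere more peaked around each mean.
    """
    z = F.normalize(features, dim=-1)    # project embeddings onto the unit sphere
    mu = F.normalize(prototypes, dim=-1)
    logits = kappa * z @ mu.t()          # scaled cosine similarities
    return F.cross_entropy(logits, labels)
```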
BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
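Behavior regularization of the kind used by BRAC-style methods can be sketched as an actor loss that trades off the learned Q-value against divergence from an estimated behavior policy. The closed-form Gaussian KL below is one common choice; the parameterization and penalty weight are assumptions for illustration, not BRAC+'s exact objective.

```python
import torch

def behavior_regularized_actor_loss(q_value, pi_mean, pi_log_std,
                                    beh_mean, beh_log_std, alpha=1.0):
    """Generic BRAC-style actor loss (sketch, not the paper's exact objective).

    q_value:               Q(s, a~pi) for actions sampled from the current policy, shape [batch].
    pi_mean, pi_log_std:   diagonal-Gaussian parameters of the current policy at s.
    beh_mean, beh_log_std: diagonal-Gaussian parameters of a behavior policy fitted
                           to the offline dataset (e.g., by maximum likelihood).
    alpha:                 divergence penalty weight (assumed value).
    """
    # Closed-form KL( pi(.|s) || beta(.|s) ) between two diagonal Gaussians.
    var_pi, var_b = (2 * pi_log_std).exp(), (2 * beh_log_std).exp()
    kl = (beh_log_std - pi_log_std
          + (var_pi + (pi_mean - beh_mean) ** 2) / (2 * var_b)
          - 0.5).sum(dim=-1)
    # Maximize Q while staying close to the data-supported behavior policy.
    return (-q_value + alpha * kl).mean()
```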
Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and lower running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z)
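Optimizing a policy with purely supervised losses, as described in the SSRL entry above, can be illustrated by a self-imitation-style sketch: behavior cloning on the agent's own highest-return trajectories. This is one generic instantiation assumed from the one-line summary, not necessarily SSRL's actual loss.

```python
import torch
import torch.nn.functional as F

def supervised_policy_loss(policy, trajectories, top_frac=0.1):
    """Behavior-cloning loss on the agent's best trajectories (generic sketch
    of optimizing a policy with a purely supervised objective).

    policy:       module mapping states [N, state_dim] -> action logits [N, num_actions].
    trajectories: list of (states, actions, episode_return) tuples, where
                  states is [T, state_dim] and actions is [T] (discrete).
    top_frac:     fraction of highest-return episodes to imitate (assumed value).
    """
    # Keep only the top-returning fraction of episodes.
    ranked = sorted(trajectories, key=lambda t: t[2], reverse=True)
    kept = ranked[:max(1, int(top_frac * len(ranked)))]

    states = torch.cat([s for s, _, _ in kept], dim=0)
    actions = torch.cat([a for _, a, _ in kept], dim=0)

    # Plain cross-entropy: imitate the actions that led to high return.
    return F.cross_entropy(policy(states), actions)
```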
DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
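Propensity-based importance weighting of replay transitions, as in the DDPG++ entry above, can be sketched by comparing the current policy's action likelihood against a behavior (propensity) model fitted to the buffer and clipping the resulting ratio. The Gaussian parameterization, fixed exploration noise, and clipping threshold below are assumptions; this is not the paper's estimator.

```python
import torch

def propensity_weights(actions, pi_mean, beh_mean, beh_log_std,
                       explore_std=0.2, clip=5.0):
    """Clipped importance weights for replay transitions (generic sketch of
    propensity-based reweighting, not the paper's estimator).

    actions:               [N, act_dim] actions stored in the replay buffer.
    pi_mean:               [N, act_dim] current deterministic policy output at the
                           stored states, smoothed into a Gaussian with std `explore_std`.
    beh_mean, beh_log_std: diagonal-Gaussian behavior (propensity) model fitted to the buffer.
    """
    def log_prob(x, mean, log_std):
        var = (2 * log_std).exp()
        return (-0.5 * (x - mean) ** 2 / var - log_std
                - 0.5 * torch.log(torch.tensor(2 * torch.pi))).sum(dim=-1)

    pi_log_std = torch.log(torch.full_like(pi_mean, explore_std))
    log_w = log_prob(actions, pi_mean, pi_log_std) - log_prob(actions, beh_mean, beh_log_std)
    # Clip the ratio to keep the variance of the reweighted update under control.
    return log_w.exp().clamp(max=clip)
```

The resulting weights would multiply the per-transition actor or critic loss terms before averaging.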
Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)
Transfer Reinforcement Learning under Unobserved Contextual Information [16.895704973433382]
We study a transfer reinforcement learning problem where the state transitions and rewards are affected by the environmental context.
We develop a method to obtain causal bounds on the transition and reward functions using the demonstrator's data.
We propose new Q learning and UCB-Q learning algorithms that converge to the true value function without bias.
arXiv Detail & Related papers (2020-03-09T22:00:04Z)
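The UCB-Q component named in the entry above can be sketched in its standard tabular form: Q-learning with a count-based optimism bonus. The paper's variant additionally incorporates causal bounds derived from demonstrator data, which are not modeled here; the learning rate and bonus scale below are simple assumed choices.

```python
import numpy as np

def ucb_q_update(Q, counts, s, a, r, s2, done, gamma=0.99, c=1.0):
    """One tabular UCB-Q-style update: Q-learning with a count-based optimism
    bonus (standard form; the paper's variant additionally constrains targets
    with causal bounds from demonstrator data, which is not modeled here).

    Q:      [S, A] value table.   counts: [S, A] visit counts.
    c:      bonus scale (assumed value).
    """
    counts[s, a] += 1
    n = counts[s, a]
    lr = 1.0 / n                    # simple 1/N(s, a) step size (sketch)
    bonus = c / np.sqrt(n)          # optimism that shrinks with more visits
    target = r + bonus + (0.0 if done else gamma * Q[s2].max())
    Q[s, a] += lr * (target - Q[s, a])
    return Q, counts
```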
This list is automatically generated from the titles and abstracts of the papers in this site.