Regret Minimization Experience Replay
- URL: http://arxiv.org/abs/2105.07253v1
- Date: Sat, 15 May 2021 16:08:45 GMT
- Title: Regret Minimization Experience Replay
- Authors: Zhenghai Xue, Xu-Hui Liu, Jing-Cheng Pang, Shengyi Jiang, Feng Xu,
Yang Yu
- Abstract summary: Prioritized sampling is a promising technique to improve the performance of RL agents.
In this work, we theoretically analyze the optimal prioritization strategy that minimizes the regret of the RL policy.
We propose two practical algorithms, RM-DisCor and RM-TCE.
- Score: 14.233842517210437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Experience replay is widely used in various deep off-policy reinforcement
learning (RL) algorithms. It stores previously collected samples for further
reuse. To better utilize these samples, prioritized sampling is a promising
technique to improve the performance of RL agents. Previous prioritization
methods based on temporal-difference (TD) error are highly heuristic and
divergent from the objective of RL. In this work, we theoretically analyze the
optimal prioritization strategy that minimizes the regret of the RL policy. Our
theory suggests that data with higher TD error, better on-policiness, and more
corrective feedback should be assigned higher weights during sampling. Based on
this theory, we propose two practical algorithms, RM-DisCor and RM-TCE.
RM-DisCor is a general algorithm, and RM-TCE is a more efficient variant that
relies on the temporal ordering of states. Both algorithms improve the
performance of off-policy RL algorithms on challenging RL benchmarks, including
MuJoCo, Atari, and Meta-World.
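To make the suggested weighting concrete, the following is a minimal illustrative sketch of a replay buffer whose sampling probabilities grow with TD error and on-policiness and shrink with an accumulated-error ("corrective feedback") estimate. It is not RM-DisCor or RM-TCE: the class name, the multiplicative combination, and the use of an inverse error-to-go term are assumptions made for illustration only.

```python
# Illustrative sketch only (not RM-DisCor / RM-TCE): prioritized sampling whose
# weights combine the three factors highlighted in the abstract.
import numpy as np

class RegretAwareReplay:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []  # items: (transition, |td_error|, on_policiness, error_to_go)

    def add(self, transition, td_error, on_policiness, error_to_go):
        # on_policiness: e.g. a likelihood ratio pi(a|s) / mu(a|s) (assumed proxy);
        # error_to_go: an estimate of accumulated target error (assumed proxy for
        # corrective feedback, in the spirit of DisCor).
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append((transition, abs(td_error), on_policiness, error_to_go))

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        # Higher TD error, higher on-policiness, lower error-to-go => higher weight.
        weights = np.array([td * onp / (1e-6 + etg)
                            for _, td, onp, etg in self.buffer])
        probs = weights / weights.sum()
        idx = rng.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i][0] for i in idx]
```

A real implementation would maintain these statistics incrementally as the policy and value function change; per the abstract, RM-TCE obtains a cheaper variant of the corrective-feedback term by exploiting the temporal ordering of states.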
Related papers
- RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$ [12.111848705677142]
We propose RL$^3$, a hybrid approach that incorporates action-values, learned per task through traditional RL, in the inputs to meta-RL.
We show that RL$^3$ earns greater cumulative reward in the long term, compared to RL$^2$, while maintaining data-efficiency in the short term, and generalizes better to out-of-distribution tasks.
arXiv Detail & Related papers (2023-06-28T04:16:16Z) - Decoupled Prioritized Resampling for Offline RL [120.49021589395005]
We propose Offline Prioritized Experience Replay (OPER) for offline reinforcement learning.
OPER features a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training.
We show that this class of priority functions induces an improved behavior policy, and when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution.
arXiv Detail & Related papers (2023-06-08T17:56:46Z) - MAC-PO: Multi-Agent Experience Replay via Collective Priority
Optimization [12.473095790918347]
We propose MAC-PO, which formulates optimal prioritized experience replay for multi-agent problems.
By minimizing the resulting policy regret, we can narrow the gap between the current policy and a nominal optimal policy.
arXiv Detail & Related papers (2023-02-21T03:11:21Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Deep Black-Box Reinforcement Learning with Movement Primitives [15.184283143878488]
We present a new algorithm for deep reinforcement learning (RL).
It is based on differentiable trust region layers, a successful on-policy deep RL algorithm.
We compare our ERL algorithm to state-of-the-art step-based algorithms in many complex simulated robotic control tasks.
arXiv Detail & Related papers (2022-10-18T06:34:52Z) - Beyond Tabula Rasa: Reincarnating Reinforcement Learning [37.201451908129386]
Learning tabula rasa, that is, without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research.
We present reincarnating RL as an alternative workflow, where prior computational work is reused or transferred between design iterations of an RL agent.
We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations.
arXiv Detail & Related papers (2022-06-03T15:11:10Z) - Robust Predictable Control [149.71263296079388]
We show that our method achieves much tighter compression than prior methods, achieving up to 5x higher reward than a standard information bottleneck.
We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.
arXiv Detail & Related papers (2021-09-07T17:29:34Z) - OptiDICE: Offline Policy Optimization via Stationary Distribution
Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z) - RL-DARTS: Differentiable Architecture Search for Reinforcement Learning [62.95469460505922]
We introduce RL-DARTS, one of the first applications of Differentiable Architecture Search (DARTS) in reinforcement learning (RL).
By replacing the image encoder with a DARTS supernet, our search method is sample-efficient, requires minimal extra compute resources, and is also compatible with off-policy and on-policy RL algorithms, needing only minor changes in preexisting code.
We show that the supernet gradually learns better cells, leading to alternative architectures that are highly competitive with manually designed policies, and we also verify previous design choices for RL policies.
arXiv Detail & Related papers (2021-06-04T03:08:43Z) - Rewriting History with Inverse RL: Hindsight Inference for Policy
Improvement [137.29281352505245]
We show that hindsight relabeling is inverse RL, an observation that suggests we can use inverse RL in tandem with RL algorithms to efficiently solve many tasks.
Our experiments confirm that relabeling data using inverse RL accelerates learning in general multi-task settings.
arXiv Detail & Related papers (2020-02-25T18:36:31Z)
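As a concrete reading of the entry directly above (hindsight relabeling viewed as inverse RL), here is a small hedged sketch: a trajectory is reassigned to the candidate task whose reward best explains it, via a soft posterior over per-task returns. The function name, the reward-function dictionary, and the softmax temperature are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only: hindsight relabeling as soft task inference.
import numpy as np

def relabel_trajectory(trajectory, reward_fns, temperature=1.0, rng=None):
    """trajectory: list of (state, action) pairs; reward_fns: {task_id: r(s, a)}."""
    rng = rng or np.random.default_rng()
    task_ids = list(reward_fns)
    returns = np.array([sum(reward_fns[t](s, a) for s, a in trajectory)
                        for t in task_ids])
    # MaxEnt-IRL-flavoured posterior: tasks under which the trajectory earns a high
    # return are treated as more likely explanations of the observed behavior.
    logits = (returns - returns.max()) / temperature
    posterior = np.exp(logits)
    posterior /= posterior.sum()
    new_task = task_ids[rng.choice(len(task_ids), p=posterior)]
    return new_task, posterior
```

Relabeled transitions can then be stored in the replay buffer under the inferred task, which is the sense in which relabeling can accelerate multi-task learning.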
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.