Simplifying Deep Reinforcement Learning via Self-Supervision
- URL: http://arxiv.org/abs/2106.05526v1
- Date: Thu, 10 Jun 2021 06:29:59 GMT
- Title: Simplifying Deep Reinforcement Learning via Self-Supervision
- Authors: Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou, Xia Hu
- Abstract summary: Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and lower running time.
- Score: 51.2400839966489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised regression to demonstrations has been shown to be a stable
way to train deep policy networks. We are motivated to study how we can take
full advantage of supervised loss functions for stably training deep
reinforcement learning agents. This is a challenging task because it is unclear
how the training data could be collected to enable policy improvement. In this
work, we propose Self-Supervised Reinforcement Learning (SSRL), a simple
algorithm that optimizes policies with purely supervised losses. We demonstrate
that, without policy gradient or value estimation, an iterative procedure of
"labeling" data and supervised regression is sufficient to drive stable policy
improvement. By selecting and imitating trajectories with high episodic
rewards, SSRL is surprisingly competitive with contemporary algorithms while achieving
more stable performance and lower running time, showing the potential of solving
reinforcement learning with supervised learning techniques. The code is
available at https://github.com/daochenzha/SSRL
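The abstract describes an iterative loop: roll out the current policy, "label" the best data by keeping the trajectories with the highest episodic rewards, and regress the policy onto their state-action pairs with a plain supervised loss. Below is a minimal sketch of such a loop; it is an illustration of the idea rather than the authors' implementation, and the gym-style environment interface, discrete actions, network sizes, and top-fraction selection rule are all assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical CartPole-sized policy network; sizes are illustrative only.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(env, policy):
    # Assumes a classic gym-style env whose step() returns (obs, reward, done, info).
    states, actions, total_reward, done = [], [], 0.0, False
    state = env.reset()
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        states.append(state)
        actions.append(action)
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return states, actions, total_reward

def ssrl_iteration(env, policy, n_episodes=100, keep_frac=0.1, epochs=5):
    # 1) Collect episodes with the current policy.
    episodes = [run_episode(env, policy) for _ in range(n_episodes)]
    # 2) "Label" the data: keep only the highest-return trajectories.
    episodes.sort(key=lambda ep: ep[2], reverse=True)
    elite = episodes[: max(1, int(keep_frac * n_episodes))]
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32)
                          for ep in elite for s in ep[0]])
    actions = torch.as_tensor([a for ep in elite for a in ep[1]])
    # 3) Plain supervised regression onto the selected state-action pairs.
    for _ in range(epochs):
        loss = F.cross_entropy(policy(states), actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```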
Related papers
- Efficient Offline Reinforcement Learning: The Critic is Critical [5.916429671763282]
Off-policy reinforcement learning provides a promising approach for improving performance beyond purely supervised methods.
We propose a best-of-both approach by first learning the behavior policy and critic with supervised learning, before improving with off-policy reinforcement learning.
arXiv Detail & Related papers (2024-06-19T09:16:38Z)
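A rough sketch of the supervised pre-training phase described in the entry above: clone the behavior policy and regress the critic on in-dataset targets before handing both networks to an off-policy learner. The network shapes, the SARSA-style target, and the joint optimizer are illustrative assumptions, not details from the paper.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, gamma = 17, 6, 0.99  # illustrative dimensions
actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def supervised_pretrain_step(batch):
    # Batch of logged transitions; r and done are shaped (batch, 1).
    s, a, r, s2, a2, done = batch
    bc_loss = F.mse_loss(actor(s), a)                 # behavior cloning for the actor
    with torch.no_grad():                             # SARSA-style target uses the logged next action
        target = r + gamma * (1.0 - done) * critic(torch.cat([s2, a2], dim=-1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=-1)), target)
    opt.zero_grad()
    (bc_loss + critic_loss).backward()
    opt.step()

# After pre-training, the actor and critic would be further improved with an
# off-policy RL algorithm, as the abstract describes.
```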
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
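The entry above mentions label smoothing for preference-based RL. A common PbRL setup fits a reward model with a Bradley-Terry likelihood over segment pairs, and label smoothing can be applied to the binary preference targets; the sketch below illustrates that combination, with the reward-model architecture and smoothing rule chosen for illustration rather than taken from the paper.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical per-step reward model; segments are shaped (batch, T, obs_dim).
reward_model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

def smoothed_preference_loss(seg_a, seg_b, label, eps=0.1):
    # label: float tensor of shape (batch,), 1.0 if segment A was preferred, 0.0 otherwise.
    smoothed = label * (1.0 - eps) + 0.5 * eps           # soften hard preference labels
    ra = reward_model(seg_a).sum(dim=(1, 2))             # summed predicted reward of segment A
    rb = reward_model(seg_b).sum(dim=(1, 2))             # summed predicted reward of segment B
    logits = ra - rb                                     # Bradley-Terry: P(A > B) = sigmoid(ra - rb)
    return F.binary_cross_entropy_with_logits(logits, smoothed)
```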
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity [50.35634234137108]
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy.
We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode.
We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation.
arXiv Detail & Related papers (2024-05-23T02:41:36Z)
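As a very rough illustration of the contrastive idea in the entry above, one can encode windows of transitions into a latent code and use an InfoNCE-style loss so that windows from the same episode (which share the same hidden dynamics and reward) map close together, while windows from other episodes act as negatives. The encoder shape and window handling are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    def __init__(self, transition_dim, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(transition_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, window):                 # window: (batch, n_transitions, transition_dim)
        return self.net(window).mean(dim=1)    # permutation-invariant episode code

def infonce_loss(encoder, window_a, window_b, temperature=0.1):
    # window_a[i] and window_b[i] come from the same episode (positive pair);
    # all other rows in the batch serve as negatives.
    za = F.normalize(encoder(window_a), dim=-1)
    zb = F.normalize(encoder(window_b), dim=-1)
    logits = za @ zb.t() / temperature
    labels = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, labels)
```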
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with fewer than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
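A sketch of return-based rebalancing as described in the entry above: resample transitions with probability increasing in the return of the episode they came from, which reshapes the sampling distribution while staying on the dataset's support. The exact weighting below is an assumed choice, not the paper's formula.
```python
import numpy as np

def rebalanced_indices(episode_returns, episode_of_transition, batch_size, rng=None):
    # episode_returns: per-episode return; episode_of_transition: episode index of each transition.
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(episode_returns, dtype=np.float64)
    weights = returns[np.asarray(episode_of_transition)]   # weight each transition by its episode's return
    weights = weights - weights.min() + 1e-6                # shift so every weight is positive
    probs = weights / weights.sum()
    return rng.choice(len(probs), size=batch_size, p=probs)
```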
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta-algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
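A hedged sketch of the two-policy idea in the entry above: a pre-existing guide policy controls the first part of each episode and the learning policy takes over afterwards, so shrinking the switch point over training exposes the learner to progressively harder starting states. The environment interface and the curriculum handling are assumptions.
```python
def jump_start_rollout(env, guide_policy, explore_policy, switch_step):
    # Assumes a gym-style env whose step() returns (obs, reward, done, info)
    # and policies that map an observation to an action.
    state, done, t, transitions = env.reset(), False, 0, []
    while not done:
        acting = guide_policy if t < switch_step else explore_policy
        action = acting(state)
        next_state, reward, done, _ = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state, t = next_state, t + 1
    return transitions

# A curriculum might start with a large switch_step and decrease it as the
# learning policy's returns improve; how the collected transitions are used is
# left to the underlying off-policy learner.
```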
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we trace training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy so as to prevent performance deterioration.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
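The first point in the entry above concerns controlling critic overestimation. One widely used mechanism for this is a clipped double-Q bootstrap target (as popularized by TD3); the sketch below shows that mechanism as an illustration, without claiming it is the exact construction used in DDPG++.
```python
import torch

def clipped_double_q_target(r, s2, done, target_actor, target_q1, target_q2, gamma=0.99):
    # r, done: (batch, 1) tensors; s2: (batch, obs_dim) next states.
    with torch.no_grad():
        a2 = target_actor(s2)                                # deterministic next action
        sa2 = torch.cat([s2, a2], dim=-1)
        q_next = torch.min(target_q1(sa2), target_q2(sa2))   # pessimistic of the two critics
        return r + gamma * (1.0 - done) * q_next
```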
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
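A sketch of the re-weighting idea in the entry above: transitions whose bootstrap targets are believed to be more accurate receive larger weight in the Q-regression. Here the per-transition error estimate is assumed to come from some auxiliary model, and the softmax weighting is an illustrative stand-in for DisCor's actual rule.
```python
import torch

def reweighted_q_loss(q_net, target_q_net, batch, error_estimate, gamma=0.99, temperature=10.0):
    # Discrete-action illustration: q_net(s) returns (batch, n_actions);
    # a is (batch, 1) long; r, done, error_estimate are (batch, 1).
    s, a, r, s2, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s2).max(dim=-1, keepdim=True).values
        # Transitions expected to carry more target error get smaller weight.
        weights = torch.softmax(-error_estimate / temperature, dim=0) * error_estimate.size(0)
    q = q_net(s).gather(1, a)                # Q(s, a) for the taken actions
    return (weights * (q - target) ** 2).mean()
```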