Simplifying Deep Reinforcement Learning via Self-Supervision
- URL: http://arxiv.org/abs/2106.05526v1
- Date: Thu, 10 Jun 2021 06:29:59 GMT
- Title: Simplifying Deep Reinforcement Learning via Self-Supervision
- Authors: Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou, Xia Hu
- Abstract summary: Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and lower running time.
- Score: 51.2400839966489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised regression to demonstrations has been shown to be a stable
way to train deep policy networks. We are motivated to study how we can take
full advantage of supervised loss functions for stably training deep
reinforcement learning agents. This is a challenging task because it is unclear
how the training data could be collected to enable policy improvement. In this
work, we propose Self-Supervised Reinforcement Learning (SSRL), a simple
algorithm that optimizes policies with purely supervised losses. We demonstrate
that, without policy gradient or value estimation, an iterative procedure of
"labeling" data and supervised regression is sufficient to drive stable policy
improvement. By selecting and imitating trajectories with high episodic
rewards, SSRL is surprisingly competitive with contemporary algorithms while achieving
more stable performance and lower running time, showing the potential of solving
reinforcement learning with supervised learning techniques. The code is
available at https://github.com/daochenzha/SSRL
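The abstract describes an iterative loop: roll out the current policy, "label" the best data by keeping the trajectories with the highest episodic rewards, and regress the policy onto their state-action pairs with a plain supervised loss. Below is a minimal sketch of such a loop; it is an illustration of the idea rather than the authors' implementation, and the gym-style environment interface, discrete actions, network sizes, and top-fraction selection rule are all assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical CartPole-sized policy network; sizes are illustrative only.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(env, policy):
    # Assumes a classic gym-style env whose step() returns (obs, reward, done, info).
    states, actions, total_reward, done = [], [], 0.0, False
    state = env.reset()
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        states.append(state)
        actions.append(action)
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return states, actions, total_reward

def ssrl_iteration(env, policy, n_episodes=100, keep_frac=0.1, epochs=5):
    # 1) Collect episodes with the current policy.
    episodes = [run_episode(env, policy) for _ in range(n_episodes)]
    # 2) "Label" the data: keep only the highest-return trajectories.
    episodes.sort(key=lambda ep: ep[2], reverse=True)
    elite = episodes[: max(1, int(keep_frac * n_episodes))]
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32)
                          for ep in elite for s in ep[0]])
    actions = torch.as_tensor([a for ep in elite for a in ep[1]])
    # 3) Plain supervised regression onto the selected state-action pairs.
    for _ in range(epochs):
        loss = F.cross_entropy(policy(states), actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```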
Related papers
- Efficient Offline Reinforcement Learning: The Critic is Critical [5.916429671763282]
Off-policy reinforcement learning provides a promising approach for improving performance beyond purely supervised methods.
We propose a best-of-both approach by first learning the behavior policy and critic with supervised learning, before improving with off-policy reinforcement learning.
arXiv Detail & Related papers (2024-06-19T09:16:38Z)
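A rough sketch of the supervised pre-training phase described in the entry above: clone the behavior policy and regress the critic on in-dataset targets before handing both networks to an off-policy learner. The network shapes, the SARSA-style target, and the joint optimizer are illustrative assumptions, not details from the paper.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, gamma = 17, 6, 0.99  # illustrative dimensions
actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def supervised_pretrain_step(batch):
    # Batch of logged transitions; r and done are shaped (batch, 1).
    s, a, r, s2, a2, done = batch
    bc_loss = F.mse_loss(actor(s), a)                 # behavior cloning for the actor
    with torch.no_grad():                             # SARSA-style target uses the logged next action
        target = r + gamma * (1.0 - done) * critic(torch.cat([s2, a2], dim=-1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=-1)), target)
    opt.zero_grad()
    (bc_loss + critic_loss).backward()
    opt.step()

# After pre-training, the actor and critic would be further improved with an
# off-policy RL algorithm, as the abstract describes.
```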
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
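The entry above mentions label smoothing for preference-based RL. A common PbRL setup fits a reward model with a Bradley-Terry likelihood over segment pairs, and label smoothing can be applied to the binary preference targets; the sketch below illustrates that combination, with the reward-model architecture and smoothing rule chosen for illustration rather than taken from the paper.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical per-step reward model; segments are shaped (batch, T, obs_dim).
reward_model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

def smoothed_preference_loss(seg_a, seg_b, label, eps=0.1):
    # label: float tensor of shape (batch,), 1.0 if segment A was preferred, 0.0 otherwise.
    smoothed = label * (1.0 - eps) + 0.5 * eps           # soften hard preference labels
    ra = reward_model(seg_a).sum(dim=(1, 2))             # summed predicted reward of segment A
    rb = reward_model(seg_b).sum(dim=(1, 2))             # summed predicted reward of segment B
    logits = ra - rb                                     # Bradley-Terry: P(A > B) = sigmoid(ra - rb)
    return F.binary_cross_entropy_with_logits(logits, smoothed)
```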
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity [50.35634234137108]
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy.
We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode.
We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation.
arXiv Detail & Related papers (2024-05-23T02:41:36Z)
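As a very rough illustration of the contrastive idea in the entry above, one can encode windows of transitions into a latent code and use an InfoNCE-style loss so that windows from the same episode (which share the same hidden dynamics and reward) map close together, while windows from other episodes act as negatives. The encoder shape and window handling are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    def __init__(self, transition_dim, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(transition_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, window):                 # window: (batch, n_transitions, transition_dim)
        return self.net(window).mean(dim=1)    # permutation-invariant episode code

def infonce_loss(encoder, window_a, window_b, temperature=0.1):
    # window_a[i] and window_b[i] come from the same episode (positive pair);
    # all other rows in the batch serve as negatives.
    za = F.normalize(encoder(window_a), dim=-1)
    zb = F.normalize(encoder(window_b), dim=-1)
    logits = za @ zb.t() / temperature
    labels = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, labels)
```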
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with fewer than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
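A sketch of return-based rebalancing as described in the entry above: resample transitions with probability increasing in the return of the episode they came from, which reshapes the sampling distribution while staying on the dataset's support. The exact weighting below is an assumed choice, not the paper's formula.
```python
import numpy as np

def rebalanced_indices(episode_returns, episode_of_transition, batch_size, rng=None):
    # episode_returns: per-episode return; episode_of_transition: episode index of each transition.
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(episode_returns, dtype=np.float64)
    weights = returns[np.asarray(episode_of_transition)]   # weight each transition by its episode's return
    weights = weights - weights.min() + 1e-6                # shift so every weight is positive
    probs = weights / weights.sum()
    return rng.choice(len(probs), size=batch_size, p=probs)
```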
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta-algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
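A hedged sketch of the two-policy idea in the entry above: a pre-existing guide policy controls the first part of each episode and the learning policy takes over afterwards, so shrinking the switch point over training exposes the learner to progressively harder starting states. The environment interface and the curriculum handling are assumptions.
```python
def jump_start_rollout(env, guide_policy, explore_policy, switch_step):
    # Assumes a gym-style env whose step() returns (obs, reward, done, info)
    # and policies that map an observation to an action.
    state, done, t, transitions = env.reset(), False, 0, []
    while not done:
        acting = guide_policy if t < switch_step else explore_policy
        action = acting(state)
        next_state, reward, done, _ = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state, t = next_state, t + 1
    return transitions

# A curriculum might start with a large switch_step and decrease it as the
# learning policy's returns improve; how the collected transitions are used is
# left to the underlying off-policy learner.
```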
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we trace training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy so as to prevent performance deterioration.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
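The first point in the entry above concerns controlling critic overestimation. One widely used mechanism for this is a clipped double-Q bootstrap target (as popularized by TD3); the sketch below shows that mechanism as an illustration, without claiming it is the exact construction used in DDPG++.
```python
import torch

def clipped_double_q_target(r, s2, done, target_actor, target_q1, target_q2, gamma=0.99):
    # r, done: (batch, 1) tensors; s2: (batch, obs_dim) next states.
    with torch.no_grad():
        a2 = target_actor(s2)                                # deterministic next action
        sa2 = torch.cat([s2, a2], dim=-1)
        q_next = torch.min(target_q1(sa2), target_q2(sa2))   # pessimistic of the two critics
        return r + gamma * (1.0 - done) * q_next
```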
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
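A sketch of the re-weighting idea in the entry above: transitions whose bootstrap targets are believed to be more accurate receive larger weight in the Q-regression. Here the per-transition error estimate is assumed to come from some auxiliary model, and the softmax weighting is an illustrative stand-in for DisCor's actual rule.
```python
import torch

def reweighted_q_loss(q_net, target_q_net, batch, error_estimate, gamma=0.99, temperature=10.0):
    # Discrete-action illustration: q_net(s) returns (batch, n_actions);
    # a is (batch, 1) long; r, done, error_estimate are (batch, 1).
    s, a, r, s2, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s2).max(dim=-1, keepdim=True).values
        # Transitions expected to carry more target error get smaller weight.
        weights = torch.softmax(-error_estimate / temperature, dim=0) * error_estimate.size(0)
    q = q_net(s).gather(1, a)                # Q(s, a) for the taken actions
    return (weights * (q - target) ** 2).mean()
```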