Survival Instinct in Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2306.03286v2
- Date: Wed, 8 Nov 2023 18:46:06 GMT
- Title: Survival Instinct in Offline Reinforcement Learning
- Authors: Anqi Li, Dipendra Misra, Andrey Kolobov, Ching-An Cheng
- Abstract summary: Offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels.
We demonstrate that this surprising property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices.
Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is nudged to learn a desirable behavior with imperfect reward but purposely biased data coverage.
- Score: 28.319886852612672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel observation about the behavior of offline reinforcement
learning (RL) algorithms: on many benchmark datasets, offline RL can produce
well-performing and safe policies even when trained with "wrong" reward labels,
such as those that are zero everywhere or are negatives of the true rewards.
This phenomenon cannot be easily explained by offline RL's return maximization
objective. Moreover, it gives offline RL a degree of robustness that is
uncharacteristic of its online RL counterparts, which are known to be sensitive
to reward design. We demonstrate that this surprising robustness property is
attributable to an interplay between the notion of pessimism in offline RL
algorithms and certain implicit biases in common data collection practices. As
we prove in this work, pessimism endows the agent with a "survival instinct",
i.e., an incentive to stay within the data support in the long term, while the
limited and biased data coverage further constrains the set of survival
policies. Formally, given a reward class -- which may not even contain the true
reward -- we identify conditions on the training data distribution that enable
offline RL to learn a near-optimal and safe policy from any reward within the
class. We argue that the survival instinct should be taken into account when
interpreting results from existing offline RL benchmarks and when creating
future ones. Our empirical and theoretical results suggest a new paradigm for
RL, whereby an agent is nudged to learn a desirable behavior with imperfect
reward but purposely biased data coverage.
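To make the abstract's argument concrete, the following Python sketch is a minimal tabular illustration of the survival-instinct idea; it is not the paper's formal construction, and every quantity in it (the four-state MDP, the support pattern, the penalty-based instantiation of pessimism) is invented for illustration. It shows that a pessimistic learner which penalizes out-of-support state-action pairs prefers to remain within the data support even when every reward label is zero.

import numpy as np

n_states, n_actions, gamma = 4, 2, 0.95

# Deterministic transitions. Action 0 follows a "safe" loop 0 -> 1 -> 2 -> 0;
# action 1 jumps to state 3, which is absorbing and undesirable.
next_state = np.zeros((n_states, n_actions), dtype=int)
next_state[:, 0] = [1, 2, 0, 3]
next_state[:, 1] = 3

# Implicit bias of data collection: the behavior policy only ever took action 0
# in states 0-2, so those are the only in-support state-action pairs.
in_support = np.zeros((n_states, n_actions), dtype=bool)
in_support[:3, 0] = True

# "Wrong" reward labels: zero everywhere.
reward = np.zeros((n_states, n_actions))

# Pessimistic value iteration: out-of-support pairs receive a large penalty,
# one common way of instantiating pessimism in offline RL (illustrative only).
PENALTY = -100.0
q = np.zeros((n_states, n_actions))
for _ in range(200):
    v = q.max(axis=1)
    q = np.where(in_support, reward + gamma * v[next_state], PENALTY)

print("greedy policy per state:", q.argmax(axis=1))  # action 0 in states 0-2: stay in support
print("pessimistic Q-values:\n", np.round(q, 2))

Running the sketch yields a greedy policy that keeps to the in-support loop in states 0-2: the agent "survives" within the data support despite the all-zero rewards, which is exactly the interplay between pessimism and biased data coverage described above.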
Related papers
- Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z) - Align Your Intents: Offline Imitation Learning via Optimal Transport [3.1728695158666396]
We show that an imitating agent can still learn the desired behavior merely from observing the expert.
In our method, AILOT, we use a special representation of states in the form of intents that incorporate pairwise spatial distances within the data.
We report that AILOT outperforms state-of-the-art offline imitation learning algorithms on D4RL benchmarks.
arXiv Detail & Related papers (2024-02-20T14:24:00Z) - Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
arXiv Detail & Related papers (2023-10-12T17:50:09Z) - Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning [69.19840497497503]
It is argued that the commonly used action matching principle is more like an explanation of deep neural networks (DNNs) than an interpretation of RL agents.
We propose to consider rewards, the essential objective of RL agents, as the basis for interpreting RL agents.
We verify and evaluate our method on the Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment.
arXiv Detail & Related papers (2023-09-04T09:09:54Z) - CLUE: Calibrated Latent Guidance for Offline Reinforcement Learning [31.49713012907863]
We introduce Calibrated Latent Guidance (CLUE), which utilizes a conditional variational auto-encoder to learn a latent space.
We instantiate the expert-driven intrinsic rewards in sparse-reward offline RL tasks, offline imitation learning (IL) tasks, and unsupervised offline RL tasks.
arXiv Detail & Related papers (2023-06-23T09:57:50Z) - Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning [93.99377042564919]
This paper tries to build more flexible constraints for value estimation without impeding the exploration of potential advantages.
The key idea is to leverage off-the-shelf RL simulators, which can be easily interacted with in an online manner, as the "test bed" for offline policies.
We introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces.
arXiv Detail & Related papers (2023-05-24T15:45:35Z) - Benchmarks and Algorithms for Offline Preference-Based Reward Learning [41.676208473752425]
We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning.
Our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps.
arXiv Detail & Related papers (2023-01-03T23:52:16Z) - Offline Meta-Reinforcement Learning with Online Self-Supervision [66.42016534065276]
We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy.
Our method uses the offline data to learn the distribution of reward functions, which is then sampled to self-supervise reward labels for the additional online data.
We find that using additional data and self-generated rewards significantly improves an agent's ability to generalize.
arXiv Detail & Related papers (2021-07-08T17:01:32Z) - Instabilities of Offline RL with Pre-Trained Neural Representation [127.89397629569808]
In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated.
Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold.
This work studies these issues from an empirical perspective to gauge how stable offline RL methods are.
arXiv Detail & Related papers (2021-03-08T18:06:44Z) - Near Real-World Benchmarks for Offline Reinforcement Learning [26.642722521820467]
We present a suite of near real-world benchmarks, NewRL.
NewRL contains datasets from various domains with controlled sizes and extra test datasets for the purpose of policy validation.
We argue that the performance of a policy should also be compared with the deterministic version of the behavior policy, instead of the dataset reward.
arXiv Detail & Related papers (2021-02-01T09:19:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.