Learning One Representation to Optimize All Rewards
- URL: http://arxiv.org/abs/2103.07945v1
- Date: Sun, 14 Mar 2021 15:00:08 GMT
- Title: Learning One Representation to Optimize All Rewards
- Authors: Ahmed Touati and Yann Ollivier
- Abstract summary: We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process.
It provides explicit near-optimal policies for any reward specified a posteriori.
This is a step towards learning controllable agents in arbitrary black-box environments.
- Score: 19.636676744015197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the forward-backward (FB) representation of the dynamics of a
reward-free Markov decision process. It provides explicit near-optimal policies
for any reward specified a posteriori. During an unsupervised phase, we use
reward-free interactions with the environment to learn two representations via
off-the-shelf deep learning methods and temporal difference (TD) learning. In
the test phase, a reward representation is estimated either from observations
or an explicit reward description (e.g., a target state). The optimal policy
for that reward is directly obtained from these representations, with no
planning.
The unsupervised FB loss is well-principled: if training is perfect, the
policies obtained are provably optimal for any reward function. With imperfect
training, the sub-optimality is proportional to the unsupervised approximation
error. The FB representation learns long-range relationships between states and
actions, via a predictive occupancy map, without having to synthesize states as
in model-based approaches.
This is a step towards learning controllable agents in arbitrary black-box
stochastic environments. This approach compares well to goal-oriented RL
algorithms on discrete and continuous mazes, pixel-based MsPacman, and the
FetchReach virtual robot arm. We also illustrate how the agent can immediately
adapt to new tasks beyond goal-oriented RL.
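As a concrete illustration of the test phase described in the abstract, below is a minimal sketch, assuming the two representations F(s, a, z) and B(s) have already been learned during the unsupervised phase. The function names, array shapes, and the discrete-action argmax are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def estimate_reward_embedding(B, states, rewards):
    """Estimate the reward representation z ~ E[r(s) B(s)] from
    reward-labelled observations collected at test time."""
    feats = np.stack([B(s) for s in states])            # (N, d)
    r = np.asarray(rewards, dtype=float)[:, None]       # (N, 1)
    return (r * feats).mean(axis=0)                     # (d,)

def goal_embedding(B, goal_state):
    """For an explicit reward description such as a target state,
    the embedding is simply B(goal_state)."""
    return B(goal_state)

def greedy_policy(F, z, actions):
    """pi_z(s) = argmax_a F(s, a, z)^T z: the policy is read off the
    representations directly, with no planning or model rollout."""
    def act(state):
        q_values = [float(F(state, a, z) @ z) for a in actions]  # Q(s,a) ~ F(s,a,z)^T z
        return actions[int(np.argmax(q_values))]
    return act
```

For example, `policy = greedy_policy(F, estimate_reward_embedding(B, test_states, test_rewards), actions)` would return a greedy controller for whichever reward is observed at test time.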
Related papers
- A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z)
- CostNet: An End-to-End Framework for Goal-Directed Reinforcement Learning [9.432068833600884]
Reinforcement Learning (RL) is a general framework concerned with an agent that seeks to maximize rewards in an environment.
Two approaches, model-based and model-free reinforcement learning, have shown concrete results in several disciplines.
This paper introduces a novel reinforcement learning algorithm for predicting the distance between two states in a Markov Decision Process.
arXiv Detail & Related papers (2022-10-03T21:16:14Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards (see the sketch after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We study episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
An $\epsilon$-optimal policy is learned after exploring $\tilde{O}(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the numbers of states and actions respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z)
- Reward prediction for representation learning and reward shaping [0.8883733362171032]
We propose learning a state representation in a self-supervised manner for reward prediction.
We augment the training of out-of-the-box RL agents by shaping the reward using our reward predictor during policy learning.
arXiv Detail & Related papers (2021-05-07T11:29:32Z)
- PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning [102.36450942613091]
We propose an inverse reinforcement learning algorithm, called inverse temporal difference learning (ITD).
We show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning.
arXiv Detail & Related papers (2021-02-24T21:12:09Z)
- Inverse Reinforcement Learning via Matching of Optimality Profiles [2.561053769852449]
We propose an algorithm that learns a reward function from demonstrations of suboptimal or heterogeneous performance.
We show that our method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.
arXiv Detail & Related papers (2020-11-18T13:23:43Z)
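The least-squares reward redistribution mentioned in the PARTED summary above can be illustrated with a generic linear instantiation. This is only a hedged sketch under assumed conventions (linear proxy rewards phi(s, a)^T theta, a small ridge term, and trajectories given as lists of state-action pairs); it is not the construction from the PARTED paper.

```python
import numpy as np

def redistribute_returns(trajectories, returns, phi, ridge=1e-6):
    """Fit theta by least squares so that the per-step proxy rewards
    phi(s, a)^T theta approximately sum to each trajectory's return."""
    # One row per trajectory: features summed over that trajectory's steps.
    Phi = np.stack([
        np.sum([phi(s, a) for (s, a) in traj], axis=0)
        for traj in trajectories
    ])                                                   # (num_traj, d)
    R = np.asarray(returns, dtype=float)                 # (num_traj,)
    d = Phi.shape[1]
    theta = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(d), Phi.T @ R)
    # The learned per-step proxy reward for any state-action pair.
    return lambda s, a: float(phi(s, a) @ theta)
```

The resulting proxy reward can then be plugged into a per-step (here, pessimistic) value-iteration routine in place of the unobserved step-wise reward.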