Distributional Successor Features Enable Zero-Shot Policy Optimization
- URL: http://arxiv.org/abs/2403.06328v3
- Date: Thu, 31 Oct 2024 07:46:00 GMT
- Title: Distributional Successor Features Enable Zero-Shot Policy Optimization
- Authors: Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta
- Abstract summary: This work proposes a novel class of models, Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs).
DiSPOs learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset.
By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions.
- Score: 36.53356539916603
- Abstract: Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code available at https://weirdlabuw.github.io/dispo/.
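To make the transfer recipe concrete: under a linear reward model r(s) ≈ w·φ(s), the weights w are fit by regression on the offline data, candidate successor features ψ are sampled from the learned distribution (a diffusion model in the paper's instantiation), and the outcome-conditioned policy is asked to realize the highest-value sample. A minimal sketch, with `sample_psi` and `outcome_policy` as hypothetical stand-ins for the learned components:
```python
import numpy as np

def fit_reward_weights(features, rewards, reg=1e-6):
    """Policy evaluation for a new task reduces to linear regression:
    fit w so that r(s) ~ w . phi(s) on the offline dataset.
    features: (N, d) array of state features phi(s); rewards: (N,)."""
    d = features.shape[1]
    A = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(A, features.T @ rewards)

def act_zero_shot(state, w, sample_psi, outcome_policy, n_candidates=64):
    """Zero-shot action selection: score sampled outcomes, realize the best.
    sample_psi(state, n) draws n successor-feature samples from the learned
    distribution p(psi | s); outcome_policy(state, psi) returns an action
    aimed at realizing psi. Both are hypothetical stand-ins for the learned
    DiSPO components."""
    candidates = sample_psi(state, n_candidates)   # (n, d) outcome samples
    returns = candidates @ w                       # value of each outcome: w . psi
    best = candidates[np.argmax(returns)]          # best achievable outcome
    return outcome_policy(state, best)
```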
Related papers
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, a (fixed) reward model may become inaccurate off-distribution, since policy optimization continually shifts the language model's data distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z) - Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z) - Self-Supervised Reinforcement Learning that Transfers using Random
Features [41.00256493388967]
We propose a self-supervised reinforcement learning method that enables the transfer of behaviors across tasks with different rewards.
Our method is self-supervised in that it can be trained on offline datasets without reward labels, but can then be quickly deployed on new tasks.
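Read literally, the mechanism is successor-feature-like: value estimates are pretrained offline for a bank of random reward features, and a new task's reward, and hence its value under a fixed policy, is approximated as a linear combination of them. A rough sketch under that reading, with all names hypothetical:
```python
import numpy as np

def make_random_features(state_dim, n_features=128, scale=1.0, seed=0):
    """Random Fourier-style features phi(s) = cos(s @ W + b), drawn once and
    then fixed; each feature serves as a surrogate reward during offline
    pretraining."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=scale, size=(state_dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return lambda states: np.cos(states @ W + b)

def transfer_values(q_values, phi, new_rewards):
    """Zero-shot evaluation on a new task: fit new_rewards ~ phi @ w by least
    squares, then combine the per-feature value estimates with the same
    weights. q_values is (N, n_features): offline-learned values of each
    random-feature 'reward' at N query points. Linearity in the reward holds
    for evaluation under a fixed policy, as with successor features."""
    w, *_ = np.linalg.lstsq(phi, new_rewards, rcond=None)
    return q_values @ w
```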
arXiv Detail & Related papers (2023-05-26T20:37:06Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme that provides a non-decreasing performance guarantee for model-based RL (MBRL).
The bounds we then derive reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
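A sketch of the two-term objective this entry describes: a denoising behavior-cloning loss that anchors the policy to the dataset, plus a Q-maximization term on reconstructed actions. The interfaces (`eps_model`, `q_net`, the schedule layout) are illustrative rather than the authors' API:
```python
import torch
import torch.nn.functional as F

def diffusion_ql_loss(eps_model, q_net, states, actions, alphas_bar, eta=1.0):
    """Diffusion-QL-style objective sketch. eps_model(a_t, t, s) predicts the
    injected noise, q_net(s, a) is a learned critic, and alphas_bar is the
    cumulative noise schedule of the diffusion process."""
    B, T = states.shape[0], alphas_bar.shape[0]
    t = torch.randint(0, T, (B,), device=states.device)
    noise = torch.randn_like(actions)
    ab = alphas_bar[t].unsqueeze(-1)
    a_t = ab.sqrt() * actions + (1.0 - ab).sqrt() * noise      # forward process q(a_t | a_0)

    eps_hat = eps_model(a_t, t, states)
    bc_loss = F.mse_loss(eps_hat, noise)                        # diffusion behavior cloning

    a0_hat = (a_t - (1.0 - ab).sqrt() * eps_hat) / ab.sqrt()    # one-step estimate of a_0
    q_loss = -q_net(states, a0_hat.clamp(-1.0, 1.0)).mean()     # Q-learning guidance
    return bc_loss + eta * q_loss
```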
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Fully Decentralized Model-based Policy Optimization for Networked
Systems [23.46407780093797]
This work aims to improve data efficiency of multi-agent control by model-based learning.
We consider networked systems where agents are cooperative and communicate only locally with their neighbors.
In our method, each agent learns a dynamics model to predict future states and broadcasts its predictions to its neighbors; the policies are then trained on model rollouts.
arXiv Detail & Related papers (2022-07-13T23:52:14Z) - Optimistic Linear Support and Successor Features as a Basis for Optimal
Policy Transfer [7.970144204429356]
We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set.
We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks.
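The generalized policy improvement (GPI) step has a compact form: for a new task with reward r = φ·w, each stored policy's action value is Q^{π_i}(s, a) = w·ψ^{π_i}(s, a), and GPI acts greedily with respect to their pointwise maximum. A discrete-action sketch with illustrative shapes:
```python
import numpy as np

def gpi_action(psi_s, w):
    """Generalized policy improvement over a set of policies whose successor
    features form a convex coverage set. psi_s has shape
    (n_policies, n_actions, d): successor features psi(s, a) of each stored
    policy at the current state; w is the (d,) weight vector of a new
    linearly expressible task, r = phi . w."""
    q = psi_s @ w                       # (n_policies, n_actions): Q_i(s, a) = psi . w
    return int(q.max(axis=0).argmax())  # greedy over the per-action best policy value
```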
arXiv Detail & Related papers (2022-06-22T19:00:08Z) - Evaluating model-based planning and planner amortization for continuous
control [79.49319308600228]
We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning.
We find that well-tuned model-free agents are strong baselines even for high DoF control problems.
We show that it is possible to distil a model-based planner into a policy that amortizes the planning without any loss of performance.
arXiv Detail & Related papers (2021-10-07T12:00:40Z) - Learning Off-Policy with Online Planning [18.63424441772675]
We investigate a novel instantiation of H-step lookahead with a learned model and a terminal value function.
We show the flexibility of LOOP (Learning Off-Policy with Online Planning) to incorporate safety constraints during deployment on a set of navigation environments.
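A minimal random-shooting sketch of H-step lookahead with a learned model and a terminal value function, in the spirit of LOOP; `model`, `reward_fn`, `terminal_value`, and `sample_plans` are hypothetical interfaces:
```python
import numpy as np

def h_step_lookahead(s0, model, reward_fn, terminal_value, sample_plans,
                     H=5, n_plans=256, gamma=0.99):
    """Score candidate action sequences under the learned one-step model
    model(s, a), then bootstrap with terminal_value(s) beyond the horizon.
    sample_plans(n, H) draws candidate plans (e.g., from an off-policy actor)."""
    best_ret, best_first = -np.inf, None
    for plan in sample_plans(n_plans, H):
        s, ret = s0, 0.0
        for t, a in enumerate(plan):
            ret += (gamma ** t) * reward_fn(s, a)   # model-based H-step return
            s = model(s, a)
        ret += (gamma ** H) * terminal_value(s)     # bootstrap past the horizon
        if ret > best_ret:
            best_ret, best_first = ret, plan[0]
    return best_first  # MPC-style: execute only the first action, then replan
```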
arXiv Detail & Related papers (2020-08-23T16:18:44Z) - Variational Model-based Policy Optimization [34.80171122943031]
Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with those collected from interaction with the real system in order to alleviate the data efficiency problem in RL.
We propose an objective function that is a variational lower bound of the log-likelihood, used to jointly learn and improve the model and policy.
Our experiments on a number of continuous control tasks show that, despite being more complex, our model-based (E-step) algorithm, called variational model-based policy optimization (VMBPO), is more sample-efficient than its model-free counterpart.
arXiv Detail & Related papers (2020-06-09T18:30:15Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
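The proposed modification is essentially one line once a model-uncertainty estimate is available; a sketch using ensemble disagreement as the uncertainty proxy (one common instantiation, not necessarily the paper's exact penalty):
```python
import numpy as np

def ensemble_uncertainty(next_state_means):
    """One common heuristic u(s, a): disagreement among an ensemble of learned
    dynamics models, here the norm of the std of predicted next-state means.
    next_state_means has shape (ensemble_size, batch, state_dim)."""
    return np.linalg.norm(next_state_means.std(axis=0), axis=-1)

def penalized_reward(reward, uncertainty, lam=1.0):
    """Train any off-the-shelf RL algorithm on model rollouts with the
    uncertainty-penalized reward r_tilde(s, a) = r(s, a) - lam * u(s, a),
    so the agent is conservative exactly where the model is unreliable."""
    return reward - lam * uncertainty
```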
arXiv Detail & Related papers (2020-05-27T08:46:41Z)