IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive
Control
- URL: http://arxiv.org/abs/2306.00867v1
- Date: Thu, 1 Jun 2023 16:24:40 GMT
- Title: IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive
Control
- Authors: Rohan Chitnis, Yingchen Xu, Bobak Hashemi, Lucas Lehnert, Urun Dogan,
Zheqing Zhu, Olivier Delalleau
- Abstract summary: We introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL).
More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning.
We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves the performance of off-the-shelf offline RL agents.
- Score: 8.374040635931298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model-based reinforcement learning (RL) has shown great promise due to its
sample efficiency, but still struggles with long-horizon sparse-reward tasks,
especially in offline settings where the agent learns from a fixed dataset. We
hypothesize that model-based RL agents struggle in these environments due to a
lack of long-term planning capabilities, and that planning in a temporally
abstract model of the environment can alleviate this issue. In this paper, we
make two key contributions: 1) we introduce an offline model-based RL
algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference
Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL);
2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any
off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train
a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which
roughly correspond to subgoals, via planning. We empirically show that
augmenting state representations with intent embeddings generated by an
IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents'
performance on some of the most challenging D4RL benchmark tasks. For instance,
the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero
normalized evaluation scores on the medium and large antmaze tasks, while our
modification gives an average score over 40.
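To make the Manager/Worker split concrete, here is a minimal sketch of the state-augmentation step the abstract describes. The names (IntentManager, plan_intent, augment_state) and the dummy planning step are illustrative assumptions, not the authors' code; the sketch only shows how a planner-produced intent embedding could be concatenated to the Worker's observation.

```python
# Minimal sketch (assumed names, not the paper's implementation):
# a pre-trained, temporally abstract Manager produces an "intent embedding"
# via planning, and the Worker's observation is augmented with it.
import numpy as np

class IntentManager:
    """Placeholder for a pre-trained IQL-TD-MPC Manager."""
    def __init__(self, intent_dim: int = 16):
        self.intent_dim = intent_dim

    def plan_intent(self, state: np.ndarray) -> np.ndarray:
        # The actual method would run latent-space planning (TD-MPC-style)
        # regularized with IQL; here we just return a dummy vector.
        return np.zeros(self.intent_dim, dtype=np.float32)

def augment_state(state: np.ndarray, manager: IntentManager) -> np.ndarray:
    """Concatenate the Manager's intent embedding onto the raw state,
    producing the input seen by the off-the-shelf offline RL Worker."""
    intent = manager.plan_intent(state)
    return np.concatenate([state, intent], axis=-1)

# Usage: the Worker (e.g., AWAC, TD3-BC, DT, or CQL) is trained on
# augment_state(s, manager) instead of s, with no other changes.
```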
Related papers
- PlanDQ: Hierarchical Plan Orchestration via D-Conductor and Q-Performer [47.924941959320996]
We propose a hierarchical planner designed for offline RL called PlanDQ.
PlanDQ incorporates a diffusion-based planner at the high level, named D-Conductor, which guides the low-level policy through sub-goals.
At the low level, we use a Q-learning-based approach called the Q-Performer to accomplish these sub-goals.
arXiv Detail & Related papers (2024-06-10T20:59:53Z)
- ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models.
Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel.
Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
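As a rough illustration of the discretize-then-run-discrete-RL idea: the paper proposes an adaptive, learned quantization scheme, while the plain k-means codebook below is only a stand-in assumption.

```python
# Illustrative sketch only: cluster the dataset's continuous actions into a
# small discrete codebook, then map actions to their nearest prototype.
import numpy as np

def fit_action_codebook(actions: np.ndarray, n_bins: int = 16,
                        iters: int = 50, seed: int = 0) -> np.ndarray:
    """Simple k-means over offline-dataset actions (shape (N, action_dim)),
    standing in for a learned, adaptive quantization scheme."""
    rng = np.random.default_rng(seed)
    centers = actions[rng.choice(len(actions), n_bins, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(actions[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(n_bins):
            members = actions[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers

def quantize(action: np.ndarray, centers: np.ndarray) -> int:
    """Index of the nearest prototype, so discrete-action offline RL
    machinery can be applied on top of a continuous control dataset."""
    return int(np.linalg.norm(centers - action, axis=-1).argmin())
```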
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- When should we prefer Decision Transformers for Offline Reinforcement Learning? [29.107029606830015]
Three popular algorithms for offline RL are Conservative Q-Learning (CQL), Behavior Cloning (BC), and Decision Transformer (DT).
We study this question empirically by exploring the performance of these algorithms across the commonly used D4RL and Robomimic benchmarks.
We find that scaling the amount of data for DT by 5x gives a 2.5x average score improvement on Atari.
arXiv Detail & Related papers (2023-05-23T22:19:14Z)
- Extreme Q-Learning: MaxEnt RL without Entropy [88.97516083146371]
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains.
We introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT).
Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms.
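For intuition, below is a hedged sketch of a Gumbel-regression ("LINEX"-style) value loss of the kind Extreme Q-Learning builds on to estimate a soft maximum of Q without sampling actions; the temperature placement and clipping are assumptions, not the paper's exact objective.

```python
# Hedged sketch of a Gumbel-regression style loss: exp(z) - z - 1 with
# z = (Q - V)/beta. Minimizing over V pushes it toward a soft maximum
# (log-sum-exp) of Q under the data distribution.
import torch

def gumbel_regression_loss(q_values: torch.Tensor,
                           v_values: torch.Tensor,
                           beta: float = 1.0,
                           clip: float = 7.0) -> torch.Tensor:
    z = (q_values - v_values) / beta
    z = torch.clamp(z, max=clip)  # numerical stability (assumed safeguard)
    return (torch.exp(z) - z - 1.0).mean()
```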
arXiv Detail & Related papers (2023-01-05T23:14:38Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
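As a reference point, here is a minimal sketch of the conservative penalty CQL adds to the usual Bellman error; it is simplified, and the q_net handle and the distribution used to sample out-of-distribution actions are assumptions.

```python
# Simplified conservative regularizer: push Q down on actions sampled from
# a proposal distribution and up on actions from the offline dataset, so the
# learned Q tends to lower-bound the policy's value.
import torch

def cql_penalty(q_net, states: torch.Tensor,
                dataset_actions: torch.Tensor,
                sampled_actions: torch.Tensor,
                alpha: float = 1.0) -> torch.Tensor:
    q_ood = q_net(states, sampled_actions)    # out-of-distribution actions
    q_data = q_net(states, dataset_actions)   # actions from the offline data
    return alpha * (q_ood.mean() - q_data.mean())
```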
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
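A minimal sketch of that uncertainty-penalized reward, assuming an ensemble of learned dynamics models and using ensemble disagreement as a stand-in uncertainty estimate (not necessarily the paper's exact estimator or coefficient):

```python
# Model rollouts are rewarded with r_hat - lam * u(s, a), where u estimates
# how uncertain the learned dynamics are at (s, a).
import numpy as np

def penalized_reward(pred_rewards: np.ndarray,
                     ensemble_next_states: np.ndarray,
                     lam: float = 1.0) -> np.ndarray:
    """pred_rewards: (batch,) mean predicted reward.
    ensemble_next_states: (n_models, batch, state_dim) next-state predictions.
    Returns the reward minus lam times a disagreement-based uncertainty."""
    uncertainty = ensemble_next_states.std(axis=0).max(axis=-1)  # (batch,)
    return pred_rewards - lam * uncertainty
```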
arXiv Detail & Related papers (2020-05-27T08:46:41Z)