Free from Bellman Completeness: Trajectory Stitching via Model-based
Return-conditioned Supervised Learning
- URL: http://arxiv.org/abs/2310.19308v2
- Date: Sat, 2 Dec 2023 11:27:53 GMT
- Title: Free from Bellman Completeness: Trajectory Stitching via Model-based
Return-conditioned Supervised Learning
- Authors: Zhaoyi Zhou, Chuning Zhu, Runlong Zhou, Qiwen Cui, Abhishek Gupta,
Simon Shaolei Du
- Abstract summary: We show how off-policy learning techniques based on return-conditioned supervised learning (RCSL) are able to circumvent challenges of Bellman completeness.
We propose a simple framework called MBRCSL, granting RCSL methods the ability of dynamic programming to stitch together segments from distinct trajectories.
- Score: 22.287106840756483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy dynamic programming (DP) techniques such as $Q$-learning have
proven to be important in sequential decision-making problems. In the presence
of function approximation, however, these techniques often diverge due to the
absence of Bellman completeness in the function classes considered, a crucial
condition for the success of DP-based methods. In this paper, we show how
off-policy learning techniques based on return-conditioned supervised learning
(RCSL) are able to circumvent these challenges of Bellman completeness,
converging under significantly more relaxed assumptions inherited from
supervised learning. We prove there exists a natural environment in which, if
one uses a two-layer multilayer perceptron as the function approximator, the
layer width must grow linearly with the state space size to satisfy Bellman
completeness, while a constant layer width suffices for RCSL. These findings
take a step towards explaining the superior empirical performance of RCSL
methods compared to DP-based methods in environments with near-optimal
datasets. Furthermore, in order to learn from sub-optimal datasets, we propose
a simple framework called MBRCSL, granting RCSL methods the ability of dynamic
programming to stitch together segments from distinct trajectories. MBRCSL
leverages learned dynamics models and forward sampling to accomplish trajectory
stitching while avoiding the need for Bellman completeness that plagues all
dynamic programming algorithms. We provide both theoretical analysis and
experimental evaluation to back these claims, with MBRCSL outperforming
state-of-the-art model-free and model-based offline RL algorithms across
several simulated robotics problems.
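The abstract describes the MBRCSL recipe only at a high level, so the following is a minimal, numpy-only sketch of that recipe under the simplifying assumption of discrete states and actions; it is not the authors' implementation, and names such as `fit_dynamics`, `rollout`, and `train_rcsl` are illustrative. For context, Bellman completeness requires the function class to be closed under the Bellman backup (applying the backup to any function in the class must yield another function in the class); the RCSL step below sidesteps that requirement because it is plain maximum-likelihood supervised learning.

```python
# Illustrative sketch (not the paper's code) of the MBRCSL recipe:
# (1) fit a dynamics/reward model on offline data, (2) forward-sample rollouts
# from the model so behaviour can be stitched from segments of distinct logged
# trajectories, (3) train a return-conditioned policy by supervised learning.
import numpy as np


def fit_dynamics(trajectories, n_states, n_actions):
    """Tabular maximum-likelihood transition and reward model."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    for traj in trajectories:
        for s, a, r, s_next in traj:
            counts[s, a, s_next] += 1
            reward_sums[s, a] += r
    visits = np.maximum(counts.sum(axis=-1), 1)
    p_hat = counts / visits[..., None]
    r_hat = reward_sums / visits
    return p_hat, r_hat


def rollout(p_hat, r_hat, behavior, s0, horizon, rng):
    """Forward-sample one trajectory from the learned model; stitching occurs
    here because the model composes transitions across source trajectories."""
    traj, s = [], s0
    for _ in range(horizon):
        a = behavior(s, rng)
        probs = p_hat[s, a]
        if probs.sum() == 0:          # unseen (s, a): stay put in this toy model
            s_next = s
        else:
            s_next = int(rng.choice(len(probs), p=probs))
        traj.append((s, a, r_hat[s, a], s_next))
        s = s_next
    return traj


def train_rcsl(trajectories, n_states, n_actions, n_bins=10):
    """Return-conditioned supervised learning: a tabular policy indexed by
    (state, discretised return-to-go), fit by counting action frequencies.
    No Bellman backups, hence no Bellman-completeness requirement."""
    policy_counts = np.ones((n_states, n_bins, n_actions))   # Laplace smoothing
    returns = [sum(step[2] for step in traj) for traj in trajectories]
    scale = max(max(returns), 1e-8)
    for traj in trajectories:
        rtg = sum(step[2] for step in traj)
        for s, a, r, _ in traj:
            g = min(int(n_bins * max(rtg, 0.0) / scale), n_bins - 1)
            policy_counts[s, g, a] += 1
            rtg -= r
    policy = policy_counts / policy_counts.sum(axis=-1, keepdims=True)
    return policy, scale
```

In this toy version, trajectory stitching emerges because model rollouts can traverse state-action sequences that never appear in any single logged trajectory; conditioning the RCSL policy on a high discretised return-to-go then amounts to imitating the best stitched rollouts.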
Related papers
- Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning [62.984693936073974]
Value-based reinforcement learning can learn effective policies for a wide range of multi-turn problems.
Current value-based RL methods have proven particularly challenging to scale to the setting of large language models.
We propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning problem.
arXiv Detail & Related papers (2024-11-07T21:36:52Z)
- Causal prompting model-based offline reinforcement learning [16.95292725275873]
Model-based offline RL allows agents to fully utilise pre-collected datasets without requiring additional or unethical exploration.
Applying model-based offline RL to online systems presents challenges due to the highly suboptimal (noise-filled) and diverse nature of datasets generated by online systems.
We introduce the Causal Prompting Reinforcement Learning framework, designed for highly suboptimal and resource-constrained online scenarios.
arXiv Detail & Related papers (2024-06-03T07:28:57Z)
- Distributionally Robust Model-based Reinforcement Learning with Large State Spaces [55.14361269378122]
Three major challenges in reinforcement learning are complex dynamical systems with large state spaces, costly data acquisition processes, and the deviation of real-world dynamics from the training environment at deployment.
We study distributionally robust Markov decision processes with continuous state spaces under the widely used Kullback-Leibler, chi-square, and total variation uncertainty sets.
We propose a model-based approach that utilizes Gaussian Processes and the maximum variance reduction algorithm to efficiently learn multi-output nominal transition dynamics.
arXiv Detail & Related papers (2023-09-05T13:42:11Z)
- A Neuromorphic Architecture for Reinforcement Learning from Real-Valued Observations [0.34410212782758043]
Reinforcement Learning (RL) provides a powerful framework for decision-making in complex environments.
This paper presents a novel Spiking Neural Network (SNN) architecture for solving RL problems with real-valued observations.
arXiv Detail & Related papers (2023-07-06T12:33:34Z)
- MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks [63.24965775030673]
We propose a novel Reinforcement Learning (RL) approach to design generic Congestion Control (CC) algorithms.
Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return.
We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch.
arXiv Detail & Related papers (2023-02-02T18:27:20Z)
- Single-Trajectory Distributionally Robust Reinforcement Learning [21.955807398493334]
We propose Distributionally Robust RL (DRRL) to enhance performance across a range of environments.
Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory.
We design the first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ).
arXiv Detail & Related papers (2023-01-27T14:08:09Z)
- GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z)
- When does return-conditioned supervised learning work for offline reinforcement learning? [51.899892382786526]
We study the capabilities and limitations of return-conditioned supervised learning.
We find that RCSL returns the optimal policy under a set of assumptions stronger than those needed for the more traditional dynamic programming-based algorithms.
arXiv Detail & Related papers (2022-06-02T15:05:42Z)
- Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z)
- A Unifying Multi-sampling-ratio CS-MRI Framework With Two-grid-cycle Correction and Geometric Prior Distillation [7.643154460109723]
We propose a unifying deep unfolding multi-sampling-ratio CS-MRI framework, by merging advantages of model-based and deep learning-based methods.
Inspired by the multigrid algorithm, we first embed the CS-MRI-based optimization algorithm into a correction-distillation scheme.
We employ a condition module to adaptively learn the step length and noise level from the compressive sampling ratio at every stage.
arXiv Detail & Related papers (2022-05-14T13:36:27Z)