Environment Transformer and Policy Optimization for Model-Based Offline
Reinforcement Learning
- URL: http://arxiv.org/abs/2303.03811v2
- Date: Mon, 16 Oct 2023 12:32:46 GMT
- Title: Environment Transformer and Policy Optimization for Model-Based Offline
Reinforcement Learning
- Authors: Pengqin Wang, Meixin Zhu, Shaojie Shen
- Abstract summary: We propose an uncertainty-aware sequence modeling architecture called Environment Transformer.
Benefiting from the accurate modeling of the transition dynamics and reward function, Environment Transformer can be combined with arbitrary planning, dynamic programming, or policy optimization algorithms for offline RL.
- Score: 25.684201757101267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interacting with the actual environment to acquire data is often costly and
time-consuming in robotic tasks. Model-based offline reinforcement learning
(RL) provides a feasible solution. On the one hand, it eliminates the
requirement of interacting with the actual environment. On the other hand, it
learns the transition dynamics and reward function from the offline datasets
and generates simulated rollouts to accelerate training. Previous model-based
offline RL methods adopt probabilistic ensemble neural networks (NNs) to model
aleatoric uncertainty and epistemic uncertainty. However, this results in an
exponential increase in training time and computing resource requirements.
Furthermore, these methods are easily disturbed by the accumulated errors of
the environment dynamics models when simulating long-term rollouts. To solve
the above problems, we propose an uncertainty-aware sequence modeling
architecture called Environment Transformer. It models the probability
distribution of the environment dynamics and reward function to capture
aleatoric uncertainty and treats epistemic uncertainty as a learnable noise
parameter. Benefiting from the accurate modeling of the transition dynamics and
reward function, Environment Transformer can be combined with arbitrary
planning, dynamic programming, or policy optimization algorithms for offline
RL. In this case, we perform Conservative Q-Learning (CQL) to learn a
conservative Q-function. Through simulation experiments, we demonstrate that
our method achieves or exceeds state-of-the-art performance in widely studied
offline RL benchmarks. Moreover, we show that Environment Transformer's
simulated rollout quality, sample efficiency, and long-term rollout simulation
capability are superior to those of previous model-based offline RL methods.
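To make the uncertainty decomposition above concrete, here is a minimal sketch of how a transformer dynamics model might separate the two uncertainty types; the module layout, hyperparameters, and names are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class EnvTransformerSketch(nn.Module):
    """Illustrative uncertainty-aware transformer dynamics model.

    Predicts a Gaussian over (next state, reward) from a history of
    state-action pairs: the predicted variance captures aleatoric
    uncertainty, while a learnable noise parameter stands in for
    epistemic uncertainty, avoiding an NN ensemble.
    """

    def __init__(self, state_dim, action_dim, d_model=128, n_heads=4, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mean_head = nn.Linear(d_model, state_dim + 1)    # next state + reward
        self.logvar_head = nn.Linear(d_model, state_dim + 1)  # aleatoric variance
        self.epistemic_lognoise = nn.Parameter(torch.zeros(state_dim + 1))

    def forward(self, states, actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        h = self.encoder(self.embed(torch.cat([states, actions], dim=-1)))
        h = h[:, -1]  # representation of the latest step in the context
        mean = self.mean_head(h)
        # Total predictive variance = aleatoric + learned epistemic noise.
        var = self.logvar_head(h).exp() + self.epistemic_lognoise.exp()
        return mean, var

def gaussian_nll(mean, var, target):
    """Negative log-likelihood for fitting the model to offline transitions."""
    return 0.5 * (((target - mean) ** 2) / var + var.log()).sum(-1).mean()
```

Rollouts sampled from such a model would then supply the synthetic transitions on which a conservative learner such as CQL trains its Q-function.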
Related papers
- Traffic expertise meets residual RL: Knowledge-informed model-based residual reinforcement learning for CAV trajectory control [1.5361702135159845]
This paper introduces a knowledge-informed model-based residual reinforcement learning framework.
It integrates traffic expert knowledge into a virtual environment model, employing the Intelligent Driver Model (IDM) for basic dynamics and neural networks for residual dynamics.
We propose a novel strategy that combines traditional control methods with residual RL, facilitating efficient learning and policy optimization without the need to learn from scratch.
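A minimal sketch of the knowledge-informed residual idea, assuming a simple car-following observation of ego speed, gap, and speed difference (the parameter values and observation layout below are illustrative, not the paper's):

```python
import math
import torch
import torch.nn as nn

def idm_accel(v, gap, dv, v0=30.0, T=1.5, a_max=1.5, b=2.0, s0=2.0):
    """Intelligent Driver Model: nominal car-following acceleration.
    v: ego speed, gap: distance to leader, dv: closing speed."""
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** 4 - (s_star / gap) ** 2)

class ResidualDynamics(nn.Module):
    """IDM supplies the physics prior; a small NN learns only the residual."""
    def __init__(self, obs_dim=3, hidden=64):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs):
        # obs columns assumed to be [v, gap, dv]
        nominal = idm_accel(obs[:, 0], obs[:, 1], obs[:, 2]).unsqueeze(-1)
        return nominal + self.residual(obs)
```

Because the prior already explains most of the dynamics, the residual network starts from a sensible baseline rather than learning from scratch.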
arXiv Detail & Related papers (2024-08-30T16:16:57Z)
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state of the art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
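In spirit (this simplified form is an assumption, not the paper's exact objective), the QT loss couples behavior cloning with Q-value maximization:

```python
def qt_loss(pred_actions, data_actions, q_net, states, alpha=0.5):
    """Hypothetical QT-style objective: the Conditional Sequence Modeling
    (behavior-cloning) term is regularized by a term that pushes the
    predicted actions toward high values under a learned Q-function."""
    bc_loss = ((pred_actions - data_actions) ** 2).mean()  # CSM term
    q_term = q_net(states, pred_actions).mean()            # action-value term
    return bc_loss - alpha * q_term                        # alpha trades off the two
```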
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
- MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning [52.101643259906915]
We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations.
Existing model-based offline RL methods are not suitable for offline-to-online fine-tuning in high-dimensional domains.
We propose an on-policy model-based method that can efficiently reuse prior data through model-based value expansion and policy regularization.
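One building block named in the summary, model-based value expansion, can be sketched as follows (the function names and fixed horizon are illustrative placeholders):

```python
def mve_target(model, policy, q_fn, state, reward, gamma=0.99, horizon=3):
    """k-step model-based value expansion: unroll the learned model for a few
    steps, sum the imagined rewards, and bootstrap with the Q-function."""
    ret, disc = reward, gamma
    for _ in range(horizon):
        action = policy(state)
        state, reward = model(state, action)  # imagined transition
        ret = ret + disc * reward
        disc *= gamma
    return ret + disc * q_fn(state, policy(state))  # bootstrapped tail value
```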
arXiv Detail & Related papers (2024-01-06T21:04:31Z)
- Recurrent neural networks and transfer learning for elasto-plasticity in woven composites [0.0]
This article presents Recurrent Neural Network (RNN) models as a surrogate for computationally intensive meso-scale simulation of woven composites.
A mean-field model generates a comprehensive data set representing elasto-plastic behavior.
In simulations, arbitrary six-dimensional strain histories are used to predict stresses, with random-walk loading as the source task and cyclic loading conditions as the target task.
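A bare-bones version of such an RNN surrogate, with the transfer step shown as a comment (the sizes and the frozen/fine-tuned split are assumptions):

```python
import torch.nn as nn

class StressRNN(nn.Module):
    """GRU surrogate mapping a six-component strain history to stresses."""
    def __init__(self, dim=6, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, strain_seq):        # (B, T, 6) strain history
        h, _ = self.rnn(strain_seq)
        return self.head(h)               # (B, T, 6) stress components

# Transfer learning: pre-train on random-walk histories (source task), then
# fine-tune on cyclic loading (target task), e.g. freezing the recurrent core:
#   for p in model.rnn.parameters():
#       p.requires_grad = False
```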
arXiv Detail & Related papers (2023-11-22T14:47:54Z)
- Physics-Informed Model-Based Reinforcement Learning [19.01626581411011]
One of the drawbacks of traditional reinforcement learning algorithms is their poor sample efficiency.
We learn a model of the environment, essentially its transition dynamics and reward function, use it to generate imaginary trajectories and backpropagate through them to update the policy.
We show that, in model-based RL, model accuracy mainly matters in environments that are sensitive to initial conditions.
We also show that, in challenging environments, physics-informed model-based RL achieves better average-return than state-of-the-art model-free RL algorithms.
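The "backpropagate through imaginary trajectories" step can be sketched like this, assuming a differentiable model and reward function (all names here are placeholders):

```python
def imagined_policy_loss(model, reward_fn, policy, s0, horizon=15, gamma=0.99):
    """Unroll the learned model from real start states and differentiate the
    imagined return with respect to the policy parameters."""
    s, ret, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                 # deterministic or reparameterized action
        ret = ret + disc * reward_fn(s, a)
        s = model(s, a)               # imagined next state (keeps the graph)
        disc *= gamma
    return -ret.mean()                # gradient ascent on the imagined return
```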
arXiv Detail & Related papers (2022-12-05T11:26:10Z)
- Stabilizing Machine Learning Prediction of Dynamics: Noise and Noise-inspired Regularization [58.720142291102135]
Recent work has shown that machine learning (ML) models can be trained to accurately forecast the dynamics of chaotic dynamical systems.
In the absence of mitigating techniques, however, this approach can result in artificially rapid error growth, leading to inaccurate predictions and/or climate instability.
We introduce Linearized Multi-Noise Training (LMNT), a regularization technique that deterministically approximates the effect of many small, independent noise realizations added to the model input during training.
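For orientation, the stochastic baseline that LMNT deterministically approximates looks like the following; sigma and the setup are illustrative:

```python
import torch

def noisy_input_step(model, loss_fn, x, y, sigma=1e-3):
    """Noise regularization: perturb the inputs with small Gaussian noise so
    the learned map is robust to the errors it will feed back on itself when
    forecasting. LMNT replaces the random draw with a deterministic,
    linearized approximation of many independent draws."""
    x_noisy = x + sigma * torch.randn_like(x)
    return loss_fn(model(x_noisy), y)
```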
arXiv Detail & Related papers (2022-11-09T23:40:52Z)
- RLFlow: Optimising Neural Network Subgraph Transformation with World Models [0.0]
We propose a model-based agent which learns to optimise the architecture of neural networks by performing a sequence of subgraph transformations to reduce model runtime.
We show our approach can match the performance of the state of the art on common convolutional networks and outperform it by up to 5% on transformer-style architectures.
arXiv Detail & Related papers (2022-05-03T11:52:54Z)
- Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
- Learning to Continuously Optimize Wireless Resource In Episodically Dynamic Environment [55.91291559442884]
This work develops a methodology that enables data-driven methods to continuously learn and optimize in a dynamic environment.
We propose to build the notion of continual learning into the modeling process of learning wireless systems.
Our design is based on a novel min-max formulation which ensures certain "fairness" across different data samples.
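A soft version of such a min-max fairness objective might look like the following sketch (the softmax weighting and temperature are assumptions standing in for the paper's exact formulation):

```python
import torch

def fair_minmax_loss(per_sample_losses, tau=1.0):
    """Softened min-max objective: upweight the currently worst-served
    samples so no episode's data is sacrificed for average performance."""
    weights = torch.softmax(per_sample_losses.detach() / tau, dim=0)
    return (weights * per_sample_losses).sum()
```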
arXiv Detail & Related papers (2020-11-16T08:24:34Z)
- Information Theoretic Model Predictive Q-Learning [64.74041985237105]
We present a novel theoretical connection between information theoretic MPC and entropy regularized RL.
We develop a Q-learning algorithm that can leverage biased models.
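The information-theoretic MPC side of that connection is typically an MPPI-style update; a compact sketch, with one-dimensional actions and placeholder names, is:

```python
import torch

def mppi_action(dynamics, cost_fn, s0, horizon=20, n_samples=256, sigma=0.5, lam=1.0):
    """MPPI-style information-theoretic MPC: sample action sequences, score
    them with the (possibly biased) model, and softmax-weight them by cost.
    The temperature lam is what links the update to entropy-regularized RL."""
    eps = sigma * torch.randn(n_samples, horizon, 1)   # sampled action sequences
    costs = torch.zeros(n_samples)
    s = s0.expand(n_samples, -1).clone()
    for t in range(horizon):
        a = eps[:, t]
        costs = costs + cost_fn(s, a)
        s = dynamics(s, a)                             # model rollout
    w = torch.softmax(-costs / lam, dim=0)             # entropy-regularized weights
    return (w[:, None] * eps[:, 0]).sum(0)             # weighted first action
```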
arXiv Detail & Related papers (2019-12-31T00:29:22Z)