Guided Online Distillation: Promoting Safe Reinforcement Learning by
Offline Demonstration
- URL: http://arxiv.org/abs/2309.09408v2
- Date: Thu, 12 Oct 2023 23:55:38 GMT
- Title: Guided Online Distillation: Promoting Safe Reinforcement Learning by
Offline Demonstration
- Authors: Jinning Li, Xinyi Liu, Banghua Zhu, Jiantao Jiao, Masayoshi Tomizuka,
Chen Tang, Wei Zhan
- Abstract summary: We argue that extracting an expert policy from offline data to guide online exploration is a promising solution to mitigate the conservativeness issue.
We propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework.
GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, which outperforms both the offline DT policy and online safe RL algorithms.
- Score: 75.51109230296568
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Safe Reinforcement Learning (RL) aims to find a policy that achieves high
rewards while satisfying cost constraints. When learning from scratch, safe RL
agents tend to be overly conservative, which impedes exploration and restrains
the overall performance. In many realistic tasks, e.g. autonomous driving,
large-scale expert demonstration data are available. We argue that extracting
an expert policy from offline data to guide online exploration is a promising
solution to mitigate the conservativeness issue. Large-capacity models, e.g.
decision transformers (DT), have been proven to be competent in offline policy
learning. However, data collected in real-world scenarios rarely contain
dangerous cases (e.g., collisions), which makes it difficult for offline-trained
policies to learn safety concepts. Besides, such large policy networks cannot
meet the computation speed requirements at inference time on real-world tasks such as
autonomous driving. To this end, we propose Guided Online Distillation (GOLD),
an offline-to-online safe RL framework. GOLD distills an offline DT policy into
a lightweight policy network through guided online safe RL training, which
outperforms both the offline DT policy and online safe RL algorithms.
Experiments in both benchmark safe RL tasks and real-world driving tasks based
on the Waymo Open Motion Dataset (WOMD) demonstrate that GOLD can successfully
distill lightweight policies and solve decision-making problems in challenging
safety-critical scenarios.
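The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what guided online distillation could look like in code: a lightweight student policy trained with a policy-gradient term on the reward advantage, a Lagrangian penalty on the cost advantage, and a distillation term that keeps the student close to a frozen offline DT teacher. The network sizes, the teacher_mu input (the DT's predicted action for each state), and the coefficients lam and beta are assumptions for illustration, not details taken from the paper.
```python
# Hypothetical sketch of guided online distillation (not the authors' exact
# objective): a small student policy is updated with a safe-RL loss plus a
# distillation term toward a frozen offline Decision Transformer teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentPolicy(nn.Module):
    """Lightweight Gaussian policy meant to be cheap at inference time."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        h = self.net(obs)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

def guided_loss(student, teacher_mu, obs, act, adv, cost_adv, lam=1.0, beta=0.1):
    """One assumed form of the guided objective: maximize reward advantage,
    penalize cost advantage (Lagrangian weight lam), and stay close to the
    teacher's predicted actions (distillation weight beta)."""
    pi = student.dist(obs)
    logp = pi.log_prob(act).sum(-1)
    pg_loss = -(logp * adv).mean()              # reward-seeking policy gradient
    cost_loss = lam * (logp * cost_adv).mean()  # discourage high-cost actions
    distill = F.mse_loss(pi.mean, teacher_mu)   # imitate the frozen DT teacher
    return pg_loss + cost_loss + beta * distill
```
The distillation term is what lets the small network inherit the DT's offline behavior, while the online safe-RL terms correct it on safety cases the offline data never covered.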
Related papers
- FOSP: Fine-tuning Offline Safe Policy through World Models [3.7971075341023526]
Model-based Reinforcement Learning (RL) has shown high training efficiency and the capability to handle high-dimensional tasks.
However, prior works still face safety challenges due to online exploration during real-world deployment.
In this paper, we aim to further enhance safety during the deployment stage for vision-based robotic tasks by fine-tuning an offline-trained policy.
arXiv Detail & Related papers (2024-07-06T03:22:57Z)
- DRNet: A Decision-Making Method for Autonomous Lane Changing with Deep Reinforcement Learning [7.2282857478457805]
"DRNet" is a novel DRL-based framework that enables a DRL agent to learn to drive by executing reasonable lane changing on simulated highways.
Our DRL agent has the ability to learn the desired task without causing collisions and outperforms DDQN and other baseline models.
arXiv Detail & Related papers (2023-11-02T21:17:52Z)
- Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
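As a rough illustration of this decoupling (my own reading, not necessarily the paper's construction), the behavior policy can score actions optimistically over a Q-ensemble while the evaluation policy is extracted from a pessimistic lower-confidence score on the same data; the critic ensemble interface and the bonus weight k below are assumptions.
```python
# Sketch of decoupled optimism/pessimism over a critic ensemble (assumed form).
import torch

def ensemble_values(q_ensemble, obs, act):
    """Stack Q-estimates from a list of critics: shape (n_critics, batch)."""
    x = torch.cat([obs, act], dim=-1)
    return torch.stack([q(x).squeeze(-1) for q in q_ensemble])

def optimistic_score(q_ensemble, obs, act, k=1.0):
    qs = ensemble_values(q_ensemble, obs, act)
    return qs.mean(0) + k * qs.std(0)   # exploration: prefer disagreement

def pessimistic_score(q_ensemble, obs, act, k=1.0):
    qs = ensemble_values(q_ensemble, obs, act)
    return qs.mean(0) - k * qs.std(0)   # exploitation: lower confidence bound
```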
arXiv Detail & Related papers (2023-10-12T17:50:09Z)
- Constrained Decision Transformer for Offline Safe Reinforcement Learning [16.485325576173427]
We study the offline safe RL problem from a novel multi-objective optimization perspective.
We propose the constrained decision transformer (CDT) approach, which can dynamically adjust the trade-offs during deployment.
arXiv Detail & Related papers (2023-02-14T21:27:10Z)
- Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z)
- Safe Reinforcement Learning using Data-Driven Predictive Control [0.5459797813771499]
We propose a data-driven safety layer that acts as a filter for unsafe actions.
The safety layer penalizes the RL agent if the proposed action is unsafe and replaces it with the closest safe one.
In simulation, we show that our method outperforms state-of-the-art safe RL methods on a robotics navigation problem.
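The paper builds its filter from data-driven predictive control; the sketch below swaps that for a simple gradient-based projection onto a learned constraint model g(s, a) <= 0, purely to illustrate "replace the unsafe action with the closest safe one and penalize the agent". The constraint_model interface, step counts, and penalty form are assumptions.
```python
# Illustrative safety filter (gradient projection stands in for the paper's
# data-driven predictive control). constraint_model(obs, a) > 0 means unsafe.
import torch

def safety_filter(constraint_model, obs, action, penalty_coef=1.0,
                  steps=20, lr=0.1):
    """Return (filtered_action, penalty) for a single state-action pair."""
    action = action.detach()
    violation = constraint_model(obs, action)
    if violation.item() <= 0:
        return action, 0.0                      # already safe: pass through
    a = action.clone().requires_grad_(True)
    opt = torch.optim.SGD([a], lr=lr)
    for _ in range(steps):                      # descend toward the safe set,
        opt.zero_grad()                         # staying close to the proposal
        loss = torch.relu(constraint_model(obs, a)) + ((a - action) ** 2).sum()
        loss.backward()
        opt.step()
    penalty = penalty_coef * violation.item()   # teach the agent to avoid this
    return a.detach(), penalty
```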
arXiv Detail & Related papers (2022-11-20T17:10:40Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition [59.94644674087599]
We propose SAFEty skill pRiors (SAFER), an algorithm that accelerates policy learning on complex control tasks under safety constraints.
Through principled training on an offline dataset, SAFER learns to extract safe primitive skills.
In the inference stage, policies trained with SAFER compose these safe skills into successful policies.
arXiv Detail & Related papers (2022-02-10T05:43:41Z)
- Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones [81.49106778460238]
Recovery RL uses offline data to learn about constraint violating zones before policy learning.
We evaluate Recovery RL on 6 simulation domains, including two contact-rich manipulation tasks and an image-based navigation task.
Results suggest that Recovery RL trades off constraint violations and task successes 2-20 times more efficiently in simulation domains and 3 times more efficiently in physical experiments.
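The summary only says that violation zones are learned offline; the switching rule below is my assumption about how that knowledge is used at execution time, with q_risk, eps, and recovery_policy as hypothetical names.
```python
# Assumed runtime rule: fall back to a recovery policy when the learned risk
# critic predicts the task action is likely to violate a constraint.
def select_action(task_policy, recovery_policy, q_risk, obs, eps=0.1):
    a_task = task_policy(obs)
    risk = q_risk(obs, a_task)        # estimated chance of a future violation
    if float(risk) > eps:             # too risky: hand control to recovery
        return recovery_policy(obs)
    return a_task
```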
arXiv Detail & Related papers (2020-10-29T20:10:02Z)