Guided Online Distillation: Promoting Safe Reinforcement Learning by
Offline Demonstration
- URL: http://arxiv.org/abs/2309.09408v2
- Date: Thu, 12 Oct 2023 23:55:38 GMT
- Title: Guided Online Distillation: Promoting Safe Reinforcement Learning by
Offline Demonstration
- Authors: Jinning Li, Xinyi Liu, Banghua Zhu, Jiantao Jiao, Masayoshi Tomizuka,
Chen Tang, Wei Zhan
- Abstract summary: We argue that extracting an expert policy from offline data to guide online exploration is a promising solution to mitigating the conservativeness issue.
We propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework.
GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, which outperforms both the offline DT policy and online safe RL algorithms.
- Score: 75.51109230296568
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Safe Reinforcement Learning (RL) aims to find a policy that achieves high
rewards while satisfying cost constraints. When learning from scratch, safe RL
agents tend to be overly conservative, which impedes exploration and limits
overall performance. In many realistic tasks, e.g., autonomous driving,
large-scale expert demonstration data are available. We argue that extracting an
expert policy from offline data to guide online exploration is a promising
solution to mitigating the conservativeness issue. Large-capacity models, e.g.,
decision transformers (DT), have been proven to be competent in offline policy
learning. However, data collected in real-world scenarios rarely contain
dangerous cases (e.g., collisions), which makes it difficult for such policies
to learn safety concepts. Moreover, these bulky policy networks cannot meet the
inference-speed requirements of real-world tasks such as
autonomous driving. To this end, we propose Guided Online Distillation (GOLD),
an offline-to-online safe RL framework. GOLD distills an offline DT policy into
a lightweight policy network through guided online safe RL training, which
outperforms both the offline DT policy and online safe RL algorithms.
Experiments in both benchmark safe RL tasks and real-world driving tasks based
on the Waymo Open Motion Dataset (WOMD) demonstrate that GOLD can successfully
distill lightweight policies and solve decision-making problems in challenging
safety-critical scenarios.
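For reference, the safe RL setting described in the abstract is commonly formalized as a constrained Markov decision process; a standard, generic statement of the objective (notation here is not taken from the paper), with reward r, cost c, discount factor gamma, and cost budget d, is:
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right] \le d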
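The distillation idea in the abstract, a large offline-trained teacher guiding a lightweight online student, can be sketched as follows. This is a minimal illustration only, assuming a PyTorch setup; the class and variable names (StudentPolicy, teacher_dist, guide_coef, cost_coef) are hypothetical and do not reflect the paper's actual implementation.

import torch
import torch.nn as nn


class StudentPolicy(nn.Module):
    """Lightweight Gaussian policy network, small enough for fast online inference."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())


def guided_update(student, teacher_dist, obs, act, reward_adv, cost_adv,
                  optimizer, cost_coef=1.0, guide_coef=0.1):
    """One guided safe-RL policy-gradient step (sketch): maximize reward advantage,
    penalize cost advantage (Lagrangian-style), and keep the student close to the
    offline-trained teacher on visited states."""
    pi = student.dist(obs)
    logp = pi.log_prob(act).sum(-1)

    # Surrogate safe-RL objective: reward term minus a weighted cost penalty.
    pg_loss = -(logp * (reward_adv - cost_coef * cost_adv)).mean()

    # Guidance term: KL(student || teacher) anchors online exploration to the
    # expert extracted from offline demonstrations, mitigating over-conservativeness.
    kl = torch.distributions.kl_divergence(pi, teacher_dist).sum(-1).mean()

    loss = pg_loss + guide_coef * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: teacher_dist would be a Normal built from the frozen offline
# teacher's predicted actions on the same batch of observations, e.g.
# torch.distributions.Normal(teacher_actions, teacher_std).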
Related papers
- Reward-Safety Balance in Offline Safe RL via Diffusion Regularization [16.5825143820431]
Constrained reinforcement learning (RL) seeks high-performance policies under safety constraints.
We propose Diffusion-Regularized Constrained Offline Reinforcement Learning (DRCORL).
DRCORL first uses a diffusion model to capture the behavioral policy from offline data and then extracts a simplified policy to enable efficient inference.
arXiv Detail & Related papers (2025-02-18T00:00:03Z)
- Safe Reinforcement Learning with Minimal Supervision [45.44831696628473]
Reinforcement learning (RL) in the real world requires procedures that enable agents to explore without causing harm to themselves or others.
The most successful solutions to the problem of safe RL leverage offline data to learn a safe-set, enabling safe online exploration.
This paper investigates how the quantity and quality of the data used to learn the initial safe set offline affect the ability to learn safe RL policies online.
arXiv Detail & Related papers (2025-01-08T13:04:08Z)
- Marvel: Accelerating Safe Online Reinforcement Learning with Finetuned Offline Policy [12.589890916332196]
Offline-to-online (O2O) RL can be leveraged to facilitate faster and safer online policy learning.
We introduce Marvel, a novel framework for O2O safe RL, comprising two key components that work in concert.
Our work has great potential to advance the field towards more efficient and practical safe RL solutions.
arXiv Detail & Related papers (2024-12-05T18:51:18Z)
- Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
arXiv Detail & Related papers (2023-10-12T17:50:09Z)
- Constrained Decision Transformer for Offline Safe Reinforcement Learning [16.485325576173427]
We study the offline safe RL problem from a novel multi-objective optimization perspective.
We propose the constrained decision transformer (CDT) approach, which can dynamically adjust the trade-offs during deployment.
arXiv Detail & Related papers (2023-02-14T21:27:10Z)
- Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z)
- SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition [59.94644674087599]
We propose SAFEty skill pRiors (SAFER), an algorithm that accelerates policy learning on complex control tasks under safety constraints.
Through principled training on an offline dataset, SAFER learns to extract safe primitive skills.
In the inference stage, policies trained with SAFER learn to compose safe skills into successful policies.
arXiv Detail & Related papers (2022-02-10T05:43:41Z)
- Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones [81.49106778460238]
Recovery RL uses offline data to learn about constraint-violating zones before policy learning.
We evaluate Recovery RL on 6 simulation domains, including two contact-rich manipulation tasks and an image-based navigation task.
Results suggest that Recovery RL trades off constraint violations and task successes 2-20 times more efficiently in simulation domains and 3 times more efficiently in physical experiments.
arXiv Detail & Related papers (2020-10-29T20:10:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.