Related papers: SAD: State-Action Distillation for In-Context Reinforcement Learning under Random Policies

SAD: State-Action Distillation for In-Context Reinforcement Learning under Random Policies

URL: http://arxiv.org/abs/2410.19982v1
Date: Fri, 25 Oct 2024 21:46:25 GMT
Title: SAD: State-Action Distillation for In-Context Reinforcement Learning under Random Policies
Authors: Weiqin Chen, Santiago Paternain,
Abstract summary: State-Action Distillation (SAD) generates a remarkable pretraining dataset guided solely by random policies. SAD outperforms the best baseline by 180.86% in the offline evaluation and by 172.8% in the online evaluation.
Score: 2.52299400625445
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pretrained foundation models have exhibited extraordinary in-context learning performance, allowing zero-shot generalization to new tasks not encountered during the pretraining. In the case of RL, in-context RL (ICRL) emerges when pretraining FMs on decision-making problems in an autoregressive-supervised manner. Nevertheless, current state-of-the-art ICRL algorithms, such as AD, DPT and DIT, impose stringent requirements on generating the pretraining dataset concerning the behavior (source) policies, context information, and action labels, etc. Notably, these algorithms either demand optimal policies or require varying degrees of well-trained behavior policies for all environments during the generation of the pretraining dataset. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial volume of real-world training environments can be both prohibitively intractable and expensive. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that allows to generate a remarkable pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling the outstanding state-action pairs from the entire state and action spaces by using random policies within a trust horizon, and then inherits the classical autoregressive-supervised mechanism during the pretraining. To the best of our knowledge, this is the first work that enables promising ICRL under (e.g., uniform) random policies and random contexts. We establish theoretical analyses regarding the performance guarantees of SAD. Moreover, our empirical results across multiple ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 180.86% in the offline evaluation and by 172.8% in the online evaluation.

Related papers

Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
Constrained Reinforcement Learning Under Model Mismatch [18.05296241839688]
Existing studies on constrained reinforcement learning (RL) may obtain a well-performing policy in the training environment. However, when deployed in a real environment, it may easily violate constraints that were originally satisfied during training because there might be model mismatch between the training and real environments. We develop a Robust Constrained Policy Optimization (RCPO) algorithm, which is the first algorithm that applies to large/continuous state space and has theoretical guarantees on worst-case reward improvement and constraint violation at each iteration during the training.
arXiv Detail & Related papers (2024-05-02T14:31:52Z)
Foundation Policies with Hilbert Representations [54.44869979017766]
We propose an unsupervised framework to pre-train generalist policies from unlabeled offline data. Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment. Our experiments show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion.
arXiv Detail & Related papers (2024-02-23T19:09:10Z)
Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm [4.128216503196621]
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner. We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z)
A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting. We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations. Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z)
Robustness and risk management via distributional dynamic programming [13.173307471333619]
We introduce a new class of distributional operators, together with a practical DP algorithm for policy evaluation. Our approach reformulates through an augmented state space where each state is split into a worst-case substate and a best-case substate. We derive distributional operators and DP algorithms solving a new control task.
arXiv Detail & Related papers (2021-12-28T12:12:57Z)
Curriculum Offline Imitation Learning [72.1015201041391]
offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment. We propose textitCurriculum Offline Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return. On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy. We propose an offline RL method that never needs to evaluate actions outside of the dataset. This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration. Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost ineffective or impractical. We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization. Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
Risk-Averse Offline Reinforcement Learning [46.383648750385575]
Training Reinforcement Learning (RL) agents in high-stakes applications might be too prohibitive due to the risk associated to exploration. We present the Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting.
arXiv Detail & Related papers (2021-02-10T10:27:49Z)
Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation [21.703965401500913]
We propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning. In particular, we have three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk averse implementations tailored to the application context, and 3) we propose a way to interpret ESRL's policy at every state through posterior distributions.
arXiv Detail & Related papers (2020-06-23T17:43:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.