State Regularized Policy Optimization on Data with Dynamics Shift
- URL: http://arxiv.org/abs/2306.03552v4
- Date: Thu, 22 Feb 2024 03:24:16 GMT
- Title: State Regularized Policy Optimization on Data with Dynamics Shift
- Authors: Zhenghai Xue, Qingpeng Cai, Shuchang Liu, Dong Zheng, Peng Jiang, Kun
Gai, Bo An
- Abstract summary: In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics.
In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions.
Such distribution is used to regularize the policy trained in a new environment, leading to the SRPO (State Regularized Policy Optimization) algorithm.
- Score: 25.412472472457324
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In many real-world scenarios, Reinforcement Learning (RL) algorithms are
trained on data with dynamics shift, i.e., with different underlying
environment dynamics. A majority of current methods address this issue by
training context encoders to identify environment parameters. Data with
dynamics shift are separated according to their environment parameters to train
the corresponding policy. However, these methods can be sample inefficient as
data are used \textit{ad hoc}, and policies trained for one dynamics cannot
benefit from data collected in all other environments with different dynamics.
In this paper, we find that in many environments with similar structures and
different dynamics, optimal policies have similar stationary state
distributions. We exploit such property and learn the stationary state
distribution from data with dynamics shift for efficient data reuse. Such
distribution is used to regularize the policy trained in a new environment,
leading to the SRPO (\textbf{S}tate \textbf{R}egularized \textbf{P}olicy
\textbf{O}ptimization) algorithm. To conduct theoretical analyses, we
characterize the intuition of similar environment structures by the notion of
homomorphous MDPs. We then demonstrate a lower-bound performance guarantee on
policies regularized by the stationary state distribution. In practice, SRPO
can be an add-on module to context-based algorithms in both online and offline
RL settings. Experimental results show that SRPO can make several context-based
algorithms far more data efficient and significantly improve their overall
performance.
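The abstract describes learning the shared stationary state distribution from multi-dynamics data and using it to regularize the policy trained in a new environment. Below is a minimal sketch of one way such a regularizer could be realized, assuming a GAN-style density-ratio estimate; the class and function names, network sizes, and the coefficient `alpha` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of state-distribution regularization: a discriminator
# estimates how much a visited state resembles the stationary distribution
# learned from data gathered under other dynamics, and its log-density ratio
# is added to the reward. All names, sizes and `alpha` are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateDiscriminator(nn.Module):
    """Classifies reference states (label 1) vs. states visited by the
    policy currently being trained (label 0)."""
    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)  # logits

def discriminator_loss(disc, ref_states, policy_states):
    # Binary cross-entropy: reference states are positives, policy states negatives.
    bce = nn.BCEWithLogitsLoss()
    return (bce(disc(ref_states), torch.ones(len(ref_states), 1)) +
            bce(disc(policy_states), torch.zeros(len(policy_states), 1)))

def regularized_reward(disc, states, rewards, alpha: float = 0.1):
    # Bonus log D(s) - log(1 - D(s)) approximates the log-ratio between the
    # reference stationary distribution and the current policy's distribution.
    with torch.no_grad():
        logits = disc(states).squeeze(-1)
        bonus = F.logsigmoid(logits) - F.logsigmoid(-logits)
    return rewards + alpha * bonus
```

Any off-the-shelf RL algorithm could then be trained on the `regularized_reward` outputs in the new environment; the abstract describes SRPO as an add-on to context-based methods, which this sketch does not model.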
Related papers
- OMPO: A Unified Framework for RL under Policy and Dynamics Shifts [42.57662196581823]
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
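The summary only names the strategy, so the following is a generic sketch of discriminator-based transition occupancy matching, analogous to the state discriminator above but over (s, a, s') tuples; the reward correction and all names are assumptions about the general form, not the OMPO algorithm itself.

```python
# Generic sketch of transition occupancy matching via an (s, a, s') classifier.
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next):
        # Logit > 0: the transition looks like it came from the target occupancy.
        return self.net(torch.cat([s, a, s_next], dim=-1))

def corrected_reward(clf, r, s, a, s_next, eta: float = 0.1):
    # Shape the reward with the estimated log-ratio of transition occupancies
    # (the classifier logit, a standard density-ratio trick).
    with torch.no_grad():
        return r + eta * clf(s, a, s_next).squeeze(-1)
```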
arXiv Detail & Related papers (2024-05-29T13:36:36Z)
- Performative Reinforcement Learning in Gradually Shifting Environments [13.524274041966539]
When Reinforcement Learning (RL) agents are deployed in practice, they might impact their environment and change its dynamics.
We propose a new framework to model this phenomenon, where the current environment depends on the deployed policy as well as its previous dynamics.
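As a toy illustration of that modeling statement (not the paper's formal framework), one can picture the transition kernel drifting partway toward the dynamics induced by the deployed policy after every deployment; the mixing rate `lam` below is a hypothetical parameter.

```python
# Toy illustration only: the environment responds gradually to deployments.
import numpy as np

def gradual_shift(P_prev: np.ndarray, P_induced: np.ndarray, lam: float = 0.2):
    """P_prev, P_induced: (S, A, S) transition tensors; lam in [0, 1] controls
    how quickly the environment moves toward the policy-induced dynamics."""
    return (1.0 - lam) * P_prev + lam * P_induced
```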
arXiv Detail & Related papers (2024-02-15T10:00:13Z)
- $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
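A hedged sketch of how such a resampling estimator might look: synthetic episodes are rolled out by repeatedly retrieving a logged transition whose (state, action) pair is nearest to the one the target policy would generate. The function name, distance metric, and hyperparameters are assumptions rather than the paper's exact procedure.

```python
# Sketch of K-nearest-neighbor resampling for off-policy evaluation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_ope(states, actions, rewards, next_states, policy, start_states,
            k=5, horizon=100, gamma=0.99, n_episodes=50, rng=None):
    """policy(s) should return an action vector; arrays hold logged transitions."""
    rng = rng or np.random.default_rng(0)
    sa = np.concatenate([states, actions], axis=1)
    index = NearestNeighbors(n_neighbors=k).fit(sa)    # tree-based neighbor search
    returns = []
    for s0 in start_states[:n_episodes]:
        s, ret, disc = np.asarray(s0, dtype=float), 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                              # target policy's action
            query = np.concatenate([s, a])[None, :]
            _, idx = index.kneighbors(query)
            j = rng.choice(idx[0])                     # resample a nearby logged step
            ret += disc * rewards[j]
            disc *= gamma
            s = next_states[j]
        returns.append(ret)
    return float(np.mean(returns))                     # estimated policy value
```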
arXiv Detail & Related papers (2023-06-07T23:55:12Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
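As a rough sketch of the general recipe (a latent-conditioned policy trained with an advantage-weighted likelihood), not the paper's architecture; the encoder, the Gaussian policy head, and the temperature `beta` are assumptions.

```python
# Advantage-weighted update for a latent-conditioned policy pi(a | s, z).
import torch
import torch.nn as nn

class LatentPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(          # infers z from (s, a) pairs
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.policy = nn.Sequential(           # mean of a Gaussian pi(a | s, z)
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def log_prob(self, s, a, z, std=0.1):
        mean = self.policy(torch.cat([s, z], dim=-1))
        return torch.distributions.Normal(mean, std).log_prob(a).sum(-1)

def awr_loss(model, s, a, advantage, beta=1.0):
    # Weight the log-likelihood of dataset actions by exponentiated advantage.
    z = model.encoder(torch.cat([s, a], dim=-1))
    weights = torch.exp(advantage / beta).clamp(max=20.0).detach()
    return -(weights * model.log_prob(s, a, z)).mean()
```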
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
- Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty.
We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives.
These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
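Assuming the metric is a mutual-information style quantity between sampled policy parameters and the returns they achieve, a plug-in estimate could look like the sketch below; the estimator, binning, and sample sizes are all assumptions, not the paper's exact definition.

```python
# Plug-in sketch of a difficulty metric ~ I(policy parameters; episodic return).
import numpy as np

def entropy_from_counts(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def policy_information_capacity(sample_policy, evaluate, n_policies=64,
                                n_episodes=16, n_bins=10):
    policies = [sample_policy() for _ in range(n_policies)]
    # returns[i, j]: return of episode j under sampled policy i
    returns = np.array([[evaluate(p) for _ in range(n_episodes)] for p in policies])
    edges = np.quantile(returns, np.linspace(0, 1, n_bins + 1)[1:-1])
    digitized = np.digitize(returns, edges)            # bin indices 0..n_bins-1
    h_r = entropy_from_counts(np.bincount(digitized.ravel(), minlength=n_bins))
    h_r_given_theta = np.mean([entropy_from_counts(np.bincount(row, minlength=n_bins))
                               for row in digitized])
    return h_r - h_r_given_theta                       # H(R) - H(R | policy)
```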
arXiv Detail & Related papers (2021-03-23T17:49:50Z)
- Learning to Continuously Optimize Wireless Resource In Episodically Dynamic Environment [55.91291559442884]
This work develops a methodology that enables data-driven methods to continuously learn and optimize in a dynamic environment.
We propose to build the notion of continual learning into the modeling process of learning wireless systems.
Our design is based on a novel min-max formulation which ensures certain "fairness" across different data samples.
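A rough sketch of what a min-max objective over data samples can look like in code, assuming a log-sum-exp relaxation of the inner maximum; this illustrates the general idea only, not the paper's formulation.

```python
# Soft min-max "fairness": a temperature-scaled log-sum-exp upper-bounds the
# worst-case per-sample loss, so minimizing it pushes down the largest losses.
import torch

def minmax_fair_loss(per_sample_losses: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # tau -> 0 recovers the hard maximum over samples; larger tau averages more.
    return tau * torch.logsumexp(per_sample_losses / tau, dim=0)

# Usage: loss = minmax_fair_loss(torch.stack([loss_fn(batch) for batch in episodes]))
```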
arXiv Detail & Related papers (2020-11-16T08:24:34Z)
- Fast Adaptation via Policy-Dynamics Value Functions [41.738462615120326]
We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training.
PD-VF explicitly estimates the cumulative reward in a space of policies and environments.
We show that our method can rapidly adapt to new dynamics on a set of MuJoCo domains.
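A minimal sketch of a value function defined jointly over policy and environment embeddings, with adaptation done by searching over candidate policy embeddings; the embedding dimensions and the search step are illustrative assumptions, not the PD-VF architecture.

```python
# Value function over (policy embedding, environment embedding) pairs.
import torch
import torch.nn as nn

class PolicyDynamicsValue(nn.Module):
    def __init__(self, policy_dim=16, env_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(policy_dim + env_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_policy, z_env):
        return self.net(torch.cat([z_policy, z_env], dim=-1))

def adapt(value_fn, candidate_policy_embs, z_env):
    # Pick the policy embedding with the highest predicted return in the
    # (inferred) new environment.
    values = torch.stack([value_fn(z_p, z_env) for z_p in candidate_policy_embs])
    return candidate_policy_embs[int(values.argmax())]
```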
arXiv Detail & Related papers (2020-07-06T16:47:56Z)
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
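The online phase described above resembles a bandit over the learned sub-policies. The sketch below uses a simple UCB switching rule as a stand-in; the paper's actual switching mechanism may differ.

```python
# Switch between learned sub-policies based on observed reward (UCB rule).
import math

class SubPolicySwitcher:
    def __init__(self, n_subpolicies: int):
        self.counts = [0] * n_subpolicies
        self.means = [0.0] * n_subpolicies
        self.t = 0

    def select(self) -> int:
        self.t += 1
        for i, c in enumerate(self.counts):
            if c == 0:
                return i                      # try every sub-policy once
        ucb = [m + math.sqrt(2 * math.log(self.t) / c)
               for m, c in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, i: int, reward: float):
        # Running mean of the reward obtained when sub-policy i was deployed.
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]
```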
arXiv Detail & Related papers (2020-06-15T09:16:09Z)
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a smooth policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
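A minimal sketch of smoothness-inducing regularization, assuming random (rather than adversarial) state perturbations and a deterministic policy head; the penalty weight and noise scale are hypothetical.

```python
# Penalize how much the policy's action changes under small state perturbations.
import torch

def smoothness_penalty(policy_net, states, epsilon=0.01, n_samples=4):
    """policy_net maps states to deterministic actions (or action means)."""
    base = policy_net(states)
    penalty = 0.0
    for _ in range(n_samples):
        noise = epsilon * torch.randn_like(states)
        penalty = penalty + ((policy_net(states + noise) - base) ** 2).sum(-1).mean()
    return penalty / n_samples

# Total loss: standard policy loss plus lambda_s * smoothness_penalty(...).
```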
arXiv Detail & Related papers (2020-03-21T00:10:29Z)