Pausing Policy Learning in Non-stationary Reinforcement Learning
- URL: http://arxiv.org/abs/2405.16053v1
- Date: Sat, 25 May 2024 04:38:09 GMT
- Title: Pausing Policy Learning in Non-stationary Reinforcement Learning
- Authors: Hyunin Lee, Ming Jin, Javad Lavaei, Somayeh Sojoudi
- Abstract summary: We tackle a common belief that continually updating the decision is optimal to minimize the temporal gap.
We propose a forecasting online reinforcement learning framework and show that strategically pausing decision updates yields better overall performance.
- Score: 23.147618992106867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time inference is a challenge of real-world reinforcement learning due to temporal differences in time-varying environments: the system collects data from the past, updates the decision model in the present, and deploys it in the future. We tackle a common belief that continually updating the decision is optimal to minimize the temporal gap. We propose a forecasting online reinforcement learning framework and show that strategically pausing decision updates yields better overall performance by effectively managing aleatoric uncertainty. Theoretically, we compute an optimal ratio between policy update and hold duration, and show that a non-zero policy hold duration provides a sharper upper bound on the dynamic regret. Our experimental evaluations on three different environments also reveal that a non-zero policy hold duration yields higher rewards compared to continuous decision updates.
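To make the update-versus-hold schedule concrete, here is a minimal Python sketch of an online RL loop that alternates a fixed-length update phase with a fixed-length hold phase. It is only an illustration of the idea under stated assumptions: the names `env`, `policy`, `fit_policy`, `update_len`, and `hold_len` are hypothetical, the environment is assumed to follow a Gymnasium-style step API, and the update-to-hold ratio is left as a plain hyperparameter rather than the theoretically optimal ratio derived in the paper.

```python
def update_hold_loop(env, policy, fit_policy, total_steps=10_000,
                     update_len=200, hold_len=100):
    """Illustrative sketch (not the paper's algorithm): alternate an 'update'
    phase, in which the policy is refit from the collected data after every
    environment step, with a 'hold' phase, in which the same policy keeps
    acting but is deliberately left frozen.

    Assumes a Gymnasium-style `env`; `policy` maps an observation to an
    action, and `fit_policy(policy, buffer)` returns an updated policy
    (both are placeholders supplied by the caller).
    """
    buffer = []                        # transitions collected online
    phase_len = update_len + hold_len  # one full update+hold cycle
    obs, _ = env.reset()

    for t in range(total_steps):
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        buffer.append((obs, action, reward, next_obs))
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()

        # Update only during the first `update_len` steps of each cycle;
        # for the remaining `hold_len` steps the policy update is paused.
        if (t % phase_len) < update_len:
            policy = fit_policy(policy, buffer)

    return policy
```

Setting `hold_len = 0` recovers the continual-update baseline the abstract argues against, while sweeping a non-zero hold duration is one way to probe the effect the paper reports.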
Related papers
- Time-Constrained Robust MDPs [28.641743425443]
We introduce a new time-constrained robust MDP (TC-RMDP) formulation that considers multifactorial, correlated, and time-dependent disturbances.
This study revisits the prevailing assumptions in robust RL and opens new avenues for developing more practical and realistic RL applications.
arXiv Detail & Related papers (2024-06-12T16:45:09Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z)
- Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms [10.356356383401566]
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states.
We discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target.
We introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions.
arXiv Detail & Related papers (2022-02-15T15:04:10Z)
- Lifelong Hyper-Policy Optimization with Multiple Importance Sampling Regularization [40.17392342387002]
We propose an approach which learns a hyper-policy, whose input is time, that outputs the parameters of the policy to be queried at that time.
This hyper-policy is trained to maximize the estimated future performance, efficiently reusing past data by means of importance sampling.
We empirically validate our approach, in comparison with state-of-the-art algorithms, on realistic environments.
arXiv Detail & Related papers (2021-12-13T13:09:49Z)
- Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning [8.736154600219685]
Policy evaluation in online learning attracts increasing attention.
Yet, such a problem is particularly challenging due to the dependent data generated in the online environment.
We develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning.
arXiv Detail & Related papers (2021-10-29T02:38:54Z)
- Pre-emptive learning-to-defer for sequential medical decision-making under uncertainty [35.077494648756876]
We propose SLTD (Sequential Learning-to-Defer) as a framework for learning to defer pre-emptively to an expert in sequential decision-making settings.
SLTD measures the likelihood of improving the value by deferring now versus later, based on the underlying uncertainty in the dynamics.
arXiv Detail & Related papers (2021-09-13T20:43:10Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
arXiv Detail & Related papers (2020-06-18T17:34:50Z)
- Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.