Optimizing for the Future in Non-Stationary MDPs
- URL: http://arxiv.org/abs/2005.08158v4
- Date: Mon, 21 Sep 2020 23:28:01 GMT
- Title: Optimizing for the Future in Non-Stationary MDPs
- Authors: Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White,
Sridhar Mahadevan, Philip S. Thomas
- Abstract summary: We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
- Score: 52.373873622008944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most reinforcement learning methods are based upon the key assumption that
the transition dynamics and reward functions are fixed, that is, the underlying
Markov decision process is stationary. However, in many real-world
applications, this assumption is violated, and using existing algorithms may
result in a performance lag. To proactively search for a good future policy, we
present a policy gradient algorithm that maximizes a forecast of future
performance. This forecast is obtained by fitting a curve to the
counterfactual estimates of policy performance over time, without explicitly
modeling the underlying non-stationarity. The resulting algorithm amounts to a
non-uniform reweighting of past data, and we observe that minimizing
performance over some of the data from past episodes can be beneficial when
searching for a policy that maximizes future performance. We show that our
algorithm, called Prognosticator, is more robust to non-stationarity than two
online adaptation techniques, on three simulated problems motivated by
real-world applications.
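To make the forecasting idea concrete, here is a minimal Python sketch, assuming per-episode counterfactual performance estimates obtained via ordinary importance sampling and a least-squares polynomial fit over time. The function names and the `pi_new`/`pi_old` probability callables are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def counterfactual_estimates(episodes, pi_new, pi_old):
    # Per-episode importance-sampled estimates of how the candidate policy
    # would have performed on each past episode (a counterfactual estimate).
    # pi_new(s, a) and pi_old(s, a) are hypothetical action-probability callables.
    estimates = []
    for states, actions, rewards in episodes:
        rho = np.prod([pi_new(s, a) / pi_old(s, a)
                       for s, a in zip(states, actions)])
        estimates.append(rho * np.sum(rewards))
    return np.array(estimates)

def forecast_future_performance(estimates, horizon=1, degree=2):
    # Fit a curve to performance-over-time and extrapolate ahead,
    # without explicitly modeling the underlying non-stationarity.
    t = np.arange(len(estimates))
    coeffs = np.polyfit(t, estimates, deg=degree)   # ordinary least squares
    return np.polyval(coeffs, len(estimates) - 1 + horizon)
```

Because a least-squares forecast is linear in the per-episode estimates, maximizing it amounts to a non-uniform reweighting of past episodes; some of those weights can be negative, which is consistent with the abstract's observation that minimizing performance on some past data can help when maximizing forecasted future performance.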
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
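As a rough tabular illustration of this idea (a hedged sketch, not the paper's algorithm): each update maximizes the current value estimates under a KL penalty toward a reference policy, and the reference is then refreshed with the new policy. The softmax form and the random stand-in for offline policy evaluation are assumptions.

```python
import numpy as np

def conservative_update(Q, ref_policy, alpha=1.0):
    # argmax_pi E_pi[Q] - alpha * KL(pi || pi_ref) has the closed form
    # pi(a|s) proportional to pi_ref(a|s) * exp(Q(s, a) / alpha).
    logits = np.log(ref_policy + 1e-12) + Q / alpha
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

n_states, n_actions = 5, 3
ref = np.full((n_states, n_actions), 1.0 / n_actions)  # e.g., the behavior policy
policy = ref
for k in range(10):
    Q = np.random.randn(n_states, n_actions)  # stand-in for offline policy evaluation
    policy = conservative_update(Q, ref)
    ref = policy                               # iteratively refine the reference
```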
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Offline Policy Optimization with Eligible Actions [34.4530766779594]
Offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation, but optimizing against such estimators can overfit the logged data.
We propose an algorithm that avoids this overfitting through a new per-state-neighborhood normalization constraint.
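The paper's exact per-state-neighborhood constraint is not reproduced here; as a hedged illustration of the general idea, the sketch below self-normalizes importance weights within groups of similar states, with the `neighborhood` binning function being purely hypothetical.

```python
import numpy as np

def per_neighborhood_normalized_is(states, ratios, returns, neighborhood_of):
    # Self-normalize importance weights within each state neighborhood,
    # so that no single neighborhood's weights dominate the estimate.
    values = {}
    for nb in sorted(set(neighborhood_of(s) for s in states)):
        idx = [i for i, s in enumerate(states) if neighborhood_of(s) == nb]
        w = ratios[idx]
        values[nb] = float(np.sum(w * returns[idx]) / (np.sum(w) + 1e-12))
    return values

states  = np.array([0.11, 0.12, 0.53, 0.55])
ratios  = np.array([2.0, 0.5, 4.0, 0.25])     # importance ratios (toy values)
returns = np.array([1.0, 0.0, 1.0, 1.0])
neighborhood = lambda s: round(float(s), 1)   # hypothetical binning of states
print(per_neighborhood_normalized_is(states, ratios, returns, neighborhood))
```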
arXiv Detail & Related papers (2022-07-01T19:18:15Z)
- Reinforcement Learning in the Wild: Scalable RL Dispatching Algorithm Deployed in Ridehailing Marketplace [12.298997392937876]
This study proposes a real-time dispatching algorithm based on reinforcement learning.
It is deployed online in multiple cities under DiDi's operation for A/B testing and is launched in one of the major international markets.
In A/B testing, the deployed algorithm shows an improvement of over 1.3% in total driver income.
arXiv Detail & Related papers (2022-02-10T16:07:17Z)
- Lifelong Hyper-Policy Optimization with Multiple Importance Sampling Regularization [40.17392342387002]
We propose an approach that learns a hyper-policy which takes time as input and outputs the parameters of the policy to be queried at that time.
This hyper-policy is trained to maximize the estimated future performance, efficiently reusing past data by means of importance sampling.
We empirically validate our approach, in comparison with state-of-the-art algorithms, on realistic environments.
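A minimal sketch of the hyper-policy idea, assuming a linear map from time to the mean of a Gaussian action policy and a plain importance-sampling estimate of future performance; the variance regularization from the title is omitted, and all names here are illustrative.

```python
import numpy as np

def hyper_policy(nu, t):
    # theta(t) = w * t + b: map time to the mean of the Gaussian action policy
    w, b = nu
    return w * t + b

def gauss_logp(mean, a, sigma=1.0):
    # log-density of a Gaussian action policy
    return -0.5 * ((a - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def estimated_future_performance(nu, episodes, t_future):
    # Importance-sampling estimate of the return the policy output by the
    # hyper-policy at t_future would have collected on past episodes
    # (one action per episode here, purely for brevity).
    theta = hyper_policy(nu, t_future)
    weighted = [np.exp(gauss_logp(theta, a) - logp_b) * ret
                for (a, ret, logp_b) in episodes]
    return float(np.mean(weighted))

# one past episode = (action, return, behavior log-prob); toy hypothetical data
episodes = [(0.3, 1.0, gauss_logp(0.0, 0.3)), (-0.2, 0.1, gauss_logp(0.0, -0.2))]
print(estimated_future_performance(np.array([0.05, 0.0]), episodes, t_future=10))
```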
arXiv Detail & Related papers (2021-12-13T13:09:49Z)
- Minimum-Delay Adaptation in Non-Stationary Reinforcement Learning via Online High-Confidence Change-Point Detection [7.685002911021767]
We introduce an algorithm that efficiently learns policies in non-stationary environments.
It analyzes a possibly infinite stream of data and computes, in real-time, high-confidence change-point detection statistics.
We show that this algorithm minimizes the delay until unforeseen changes to a context are detected, thereby allowing for rapid responses.
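The paper's specific high-confidence statistics are not reproduced here; as a generic stand-in, the sketch below runs a standard two-sided CUSUM detector over a scalar stream such as reward-prediction residuals.

```python
import numpy as np

def cusum_change_detector(stream, drift=0.0, threshold=5.0):
    # Two-sided CUSUM: accumulate evidence of an upward or downward shift
    # and flag the first step a statistic exceeds the confidence threshold.
    g_pos = g_neg = 0.0
    for t, x in enumerate(stream):
        g_pos = max(0.0, g_pos + x - drift)
        g_neg = max(0.0, g_neg - x - drift)
        if g_pos > threshold or g_neg > threshold:
            return t   # change detected; trigger a policy switch / relearning
    return None

# residuals whose mean shifts at step 100 (synthetic illustration)
residuals = np.concatenate([np.random.randn(100), np.random.randn(50) + 2.0])
print(cusum_change_detector(residuals, drift=0.5))
```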
arXiv Detail & Related papers (2021-05-20T01:57:52Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
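The exact loss is derived in the paper from the regularized linear program; as a hedged illustration of a convex alternative to the squared Bellman error, the sketch below uses a temperature-scaled log-sum-exp of TD errors, which is convex in the parameters for a linear Q.

```python
import numpy as np

def logsumexp_bellman_loss(q_params, features, rewards, next_values,
                           eta=1.0, gamma=0.99):
    # TD errors are affine in q_params for a linear Q, and log-sum-exp is
    # convex and nondecreasing, so the composed loss is convex in q_params.
    q = features @ q_params                   # linear Q(s, a)
    td = rewards + gamma * next_values - q    # one TD error per transition
    m = np.max(eta * td)                      # stabilized log-sum-exp
    return (m + np.log(np.mean(np.exp(eta * td - m)))) / eta
```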
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
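For continuing tasks, one standard way to avoid episodic resets is the average-reward formulation; the sketch below is a generic online average-reward actor-critic, not the paper's unbiased gradient construction, and `env_step` is a hypothetical callback returning (next_state, reward).

```python
import numpy as np

def continuing_actor_critic(env_step, n_states, n_actions, steps=10_000,
                            a_v=0.05, a_pi=0.01, a_r=0.01):
    # Online actor-critic for continuing (non-episodic) tasks using the
    # differential TD error, so no episode boundaries are ever needed.
    V = np.zeros(n_states)
    logits = np.zeros((n_states, n_actions))
    avg_r, s = 0.0, 0
    for _ in range(steps):
        p = np.exp(logits[s] - logits[s].max()); p /= p.sum()
        a = np.random.choice(n_actions, p=p)
        s2, r = env_step(s, a)                 # hypothetical environment callback
        delta = r - avg_r + V[s2] - V[s]       # differential TD error
        avg_r += a_r * delta                   # track the average reward
        V[s] += a_v * delta
        grad = -p; grad[a] += 1.0              # grad of log softmax at action a
        logits[s] += a_pi * delta * grad
        s = s2
    return logits
```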
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from samples collected under other policies.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs.
This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)