Variance Penalized On-Policy and Off-Policy Actor-Critic
- URL: http://arxiv.org/abs/2102.01985v1
- Date: Wed, 3 Feb 2021 10:06:16 GMT
- Title: Variance Penalized On-Policy and Off-Policy Actor-Critic
- Authors: Arushi Jain, Gandharv Patil, Ayush Jain, Khimya Khetarpal, Doina
Precup
- Abstract summary: We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
- Score: 60.06593931848165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning algorithms are typically geared towards optimizing the
expected return of an agent. However, in many practical applications, low
variance in the return is desired to ensure the reliability of an algorithm. In
this paper, we propose on-policy and off-policy actor-critic algorithms that
optimize a performance criterion involving both mean and variance in the
return. Previous work uses the second moment of return to estimate the variance
indirectly. Instead, we use a much simpler recently proposed direct variance
estimator which updates the estimates incrementally using temporal difference
methods. Using the variance-penalized criterion, we guarantee the convergence
of our algorithm to locally optimal policies for finite state-action Markov
decision processes. We demonstrate the utility of our algorithm in tabular and
continuous MuJoCo domains. Our approach not only performs on par with
actor-critic and prior variance-penalization baselines in terms of expected
return, but also generates trajectories which have lower variance in the
return.
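To make the idea concrete, the sketch below is a minimal tabular actor-critic for a variance-penalized objective J(θ) = E[G] − λ·Var(G), with a second critic that estimates the variance of the return directly through TD-style updates (squared TD errors act as the per-step target, in the spirit of the direct variance estimator the abstract mentions). The toy MDP, step sizes, and the penalized actor signal are illustrative assumptions, not the paper's algorithm or code.
```python
# Minimal sketch (not the authors' code): tabular variance-penalized actor-critic.
# A second critic tracks the variance of the return with TD-style updates,
# using squared TD errors as the "reward" of a variance Bellman equation.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lam_penalty = 5, 2, 0.95, 0.1
alpha_v, alpha_w, alpha_pi = 0.1, 0.1, 0.01

# Small random MDP used only for illustration.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probs
R = rng.normal(size=(n_states, n_actions))                        # mean rewards

theta = np.zeros((n_states, n_actions))   # softmax policy parameters
V = np.zeros(n_states)                    # value critic
W = np.zeros(n_states)                    # direct variance critic

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = 0
for t in range(20000):
    pi = softmax(theta[s])
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a] + rng.normal(scale=0.5)    # noisy reward

    # Value critic: standard TD(0).
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta

    # Variance critic: TD(0) on a variance Bellman equation, with the squared
    # TD error as the per-step "reward" and gamma^2 as the discount.
    delta_var = delta**2 + gamma**2 * W[s_next] - W[s]
    W[s] += alpha_w * delta_var

    # Actor: penalize the TD signal by the variance TD signal.
    # This is a stand-in for the paper's exact variance-penalized gradient.
    score = -pi.copy()
    score[a] += 1.0                        # grad of log pi(a|s) wrt theta[s]
    theta[s] += alpha_pi * (delta - lam_penalty * delta_var) * score

    s = s_next
```
Tuning λ trades expected return against variance of the return; λ = 0 recovers the standard actor-critic update.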
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
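To make the re-weighting above concrete, here is a minimal sketch of a trajectory-level importance-sampled policy gradient estimate: returns collected under a behavior policy are re-weighted by the likelihood ratio between the target and behavior policies. The function names and trajectory format are assumptions for illustration, not this paper's estimator.
```python
# Minimal sketch (illustrative only): REINFORCE-style gradient estimate with
# importance sampling, re-weighting trajectories collected under a behavior policy.
import numpy as np

def is_policy_gradient(trajs, log_pi_target, log_pi_behavior, grad_log_pi_target):
    """trajs: list of trajectories, each a list of (state, action, return_to_go)."""
    grads = []
    for traj in trajs:
        # Trajectory-level importance weight: product of per-step likelihood ratios.
        log_w = sum(log_pi_target(s, a) - log_pi_behavior(s, a) for s, a, _ in traj)
        w = np.exp(log_w)
        # Weighted sum of score-function terms along the trajectory.
        g = sum(w * G * grad_log_pi_target(s, a) for s, a, G in traj)
        grads.append(g)
    return np.mean(grads, axis=0)
```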
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
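For context on the nested structure: maximum-likelihood IRL is commonly posed as a bilevel problem of roughly the form below, where the inner (usually entropy-regularized) policy optimization must be re-solved whenever the reward parameters θ change; single-loop methods interleave the two updates instead. The notation is a generic formulation assumed here, not necessarily the paper's exact objective.
```latex
\max_{\theta}\; \mathbb{E}_{\tau \sim \mathcal{D}}\Big[\sum_{t} \log \pi^{*}_{r_\theta}(a_t \mid s_t)\Big]
\quad \text{s.t.} \quad
\pi^{*}_{r_\theta} \in \arg\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\big(r_\theta(s_t, a_t) + \mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]
```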
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
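A minimal tabular sketch of the one-step recipe described above, under assumed names and a simple exponentiated-advantage improvement rule (one common "one-step" choice, not necessarily the paper's): estimate the behavior policy from counts, evaluate it on-policy from the dataset, then apply a single regularized improvement step.
```python
# Minimal tabular sketch (illustrative, not the paper's code): one step of
# regularized policy improvement on top of an on-policy Q estimate of the
# behavior policy. Terminal handling is omitted for brevity.
import numpy as np

def one_step_policy(dataset, n_states, n_actions, gamma=0.99, temperature=1.0,
                    n_eval_sweeps=200, lr=0.5):
    """dataset: list of (s, a, r, s_next, a_next) tuples."""
    # 1) Behavior policy from empirical counts (tabular behavior cloning).
    counts = np.zeros((n_states, n_actions))
    for s, a, r, s2, a2 in dataset:
        counts[s, a] += 1
    beta = (counts + 1e-8) / (counts + 1e-8).sum(axis=1, keepdims=True)

    # 2) On-policy evaluation of the behavior policy (SARSA-style, dataset only).
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_eval_sweeps):
        for s, a, r, s2, a2 in dataset:
            Q[s, a] += lr * (r + gamma * Q[s2, a2] - Q[s, a])

    # 3) A single regularized improvement step: re-weight the behavior policy
    #    by exponentiated advantages, then renormalize (no further iteration).
    V = (beta * Q).sum(axis=1, keepdims=True)
    pi = beta * np.exp((Q - V) / temperature)
    return pi / pi.sum(axis=1, keepdims=True)
```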
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
The proposed algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Risk-Sensitive Markov Decision Processes with Combined Metrics of Mean and Variance [3.062772835338966]
This paper investigates the optimization problem of an infinite-stage, discrete-time Markov decision process (MDP) with a long-run average metric.
A performance difference formula is derived that quantifies the difference in the mean-variance combined metric between any two policies.
A necessary condition for the optimal policy and the optimality of deterministic policies are derived.
arXiv Detail & Related papers (2020-08-09T10:35:35Z)
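For reference, a mean-variance combined metric in the long-run average setting is commonly written as below; the risk weight β and the steady-state variance notation are generic conventions assumed for illustration, not necessarily the paper's exact definitions.
```latex
\eta_{\beta}^{\pi} = \mu^{\pi} - \beta\,\sigma^{2,\pi},
\qquad
\mu^{\pi} = \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1} r(s_t, a_t)\Big],
\qquad
\sigma^{2,\pi} = \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1} \big(r(s_t, a_t) - \mu^{\pi}\big)^{2}\Big]
```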
- Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)