Mean-Variance Efficient Reinforcement Learning by Expected Quadratic
Utility Maximization
- URL: http://arxiv.org/abs/2010.01404v3
- Date: Sun, 5 Sep 2021 10:28:58 GMT
- Title: Mean-Variance Efficient Reinforcement Learning by Expected Quadratic
Utility Maximization
- Authors: Masahiro Kato and Kei Nakagawa and Kenshi Abe and Tetsuro Morimura
- Abstract summary: In this paper, we consider learning MV efficient policies, i.e., policies that achieve Pareto efficiency with respect to the mean-variance (MV) trade-off.
To this end, we train an agent to maximize the expected quadratic utility function.
- Score: 9.902494567482597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Risk management is critical in decision making, and the mean-variance (MV) trade-off is one of the most common criteria. However, in reinforcement learning (RL) for sequential decision making under uncertainty, most existing methods for MV control suffer from computational difficulties caused by the double sampling problem. In this paper, in contrast to strict MV control, we consider learning MV efficient policies that achieve Pareto efficiency with respect to the MV trade-off. To this end, we train an agent to maximize the expected quadratic utility function, a common objective of risk management in finance and economics. We call our approach direct expected quadratic utility maximization (EQUM). EQUM does not suffer from the double sampling issue because it does not require gradient estimation of the variance. We confirm that, under a certain condition, the maximizer of the EQUM objective directly corresponds to an MV efficient policy. We conduct experiments in benchmark settings to demonstrate the effectiveness of EQUM.
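A brief sketch of the underlying identity (notation assumed here rather than quoted from the paper): for a return $R$ earned under policy $\pi$ and a risk-aversion parameter $\lambda > 0$, the expected quadratic utility expands as
  $\mathbb{E}_\pi\!\left[ R - \tfrac{\lambda}{2} R^2 \right] = \mathbb{E}_\pi[R] - \tfrac{\lambda}{2}\left( \mathrm{Var}_\pi(R) + \mathbb{E}_\pi[R]^2 \right)$,
so the variance is penalized through a single expectation of a per-trajectory quantity. By contrast, differentiating $\mathrm{Var}_\pi(R) = \mathbb{E}_\pi[R^2] - \mathbb{E}_\pi[R]^2$ directly produces the term $\nabla \mathbb{E}_\pi[R]^2 = 2\,\mathbb{E}_\pi[R]\,\nabla \mathbb{E}_\pi[R]$, whose unbiased estimation requires two independent trajectory samples; this is the double sampling problem referred to in the abstract.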
Related papers
- Optimal Policy Adaptation under Covariate Shift [15.703626346971182]
We propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets.
We derive the identifiability assumptions for the reward induced by a given policy.
We then learn the optimal policy by optimizing the estimated reward.
arXiv Detail & Related papers (2025-01-14T12:33:02Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
- Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning [12.022303947412917]
This paper aims at optimizing the mean-semivariance criterion in reinforcement learning w.r.t. steady rewards.
We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function.
We propose two on-policy algorithms based on the policy gradient theory and the trust region method.
arXiv Detail & Related papers (2022-06-15T08:32:53Z)
- Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average-reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)