Forward-PECVaR Algorithm: Exact Evaluation for CVaR SSPs
- URL: http://arxiv.org/abs/2303.00672v1
- Date: Wed, 1 Mar 2023 17:10:22 GMT
- Title: Forward-PECVaR Algorithm: Exact Evaluation for CVaR SSPs
- Authors: Willy Arthur Silva Reis, Denis Benevolo Pais, Valdinei Freire, Karina
Valdivia Delgado
- Abstract summary: Conditional Value at Risk (CVaR) is a criterion that allows modeling an arbitrary level of risk.
We propose a new algorithm, Forward-PECVaR, that exactly evaluates stationary policies of CVaR-SSPs with non-uniform costs.
- Score: 1.347733333991357
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Stochastic Shortest Path (SSP) problem models probabilistic
sequential-decision problems where an agent must pursue a goal while minimizing
a cost function. Because of the probabilistic dynamics, it is desirable to have a
cost function that takes risk into account. Conditional Value at Risk (CVaR) is a
criterion that allows modeling an arbitrary level of risk by considering the
expectation of the worst $\alpha$-fraction of trajectories. Although an optimal
policy is non-Markovian, solutions of CVaR-SSPs can be found approximately with
Value Iteration based algorithms such as CVaR Value Iteration with Linear
Interpolation (CVaRVILI) and CVaR Value Iteration via Quantile Representation
(CVaRVIQ). These solutions depend on the algorithm's parameters, such as the
number of atoms and $\alpha_0$ (the minimum $\alpha$). To compare the
policies returned by these algorithms, we need a way to exactly evaluate
stationary policies of CVaR-SSPs. Although there is an algorithm that evaluates
these policies, it only works on problems with uniform costs. In this paper,
we propose a new algorithm, Forward-PECVaR (ForPECVaR), that exactly evaluates
stationary policies of CVaR-SSPs with non-uniform costs. We empirically
evaluate the approximate CVaR Value Iteration algorithms, assessing the quality
of their solutions relative to the exact solution and the influence of the
algorithm parameters on the quality and scalability of the solutions.
Experiments in two domains show that it is important to use an $\alpha_0$
smaller than the $\alpha$ target and an adequate number of atoms to obtain a
good approximation.
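As a concrete illustration of the criterion, the sketch below computes $\mathrm{CVaR}_\alpha$ of a finite trajectory-cost distribution, i.e., the expected cost of the worst $\alpha$-fraction of trajectories. This is a minimal sketch of the criterion itself, not of ForPECVaR or of the Value Iteration algorithms; the cost and probability values are hypothetical.

```python
import numpy as np

def cvar(costs, probs, alpha):
    """Expected cost of the worst (highest-cost) alpha-fraction of outcomes.

    costs, probs: a discrete trajectory-cost distribution (probs sums to 1).
    alpha: fraction of worst trajectories to average over (0 < alpha <= 1).
    """
    costs = np.asarray(costs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(costs)[::-1]        # sort from worst (highest cost) to best
    costs, probs = costs[order], probs[order]
    cum_before = np.cumsum(probs) - probs  # probability mass strictly worse than each atom
    # each atom contributes its full mass if it lies inside the alpha-tail, and
    # only the fraction needed to reach total mass alpha at the boundary
    weights = np.clip(alpha - cum_before, 0.0, probs)
    return float(np.dot(weights, costs) / alpha)

# Hypothetical trajectory-cost distribution induced by some stationary policy:
costs = [2.0, 5.0, 9.0, 20.0]
probs = [0.60, 0.25, 0.10, 0.05]
print(cvar(costs, probs, alpha=1.0))   # alpha = 1 recovers the expected cost (4.35)
print(cvar(costs, probs, alpha=0.10))  # mean cost of the worst 10% of trajectories (14.5)
```

Setting alpha = 1 recovers the risk-neutral expected cost, while smaller alpha weights only the most costly trajectories.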
Related papers
- Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove a sample complexity bound for learning an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Risk-Sensitive Markov Decision Processes with Long-Run CVaR Criterion [9.792654108885676]
We study the infinite-horizon discrete-time Markov decision process (MDP) with a long-run CVaR criterion.
We derive a pseudo CVaR difference formula which quantifies the difference of long-run CVaR under any two policies.
We develop a policy iteration type algorithm to efficiently optimize CVaR, which is shown to converge to local optima in the mixed policy space.
arXiv Detail & Related papers (2022-10-17T04:29:14Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Risk-Averse No-Regret Learning in Online Convex Games [19.4481913405231]
We consider an online game with risk-averse agents whose goal is to learn optimal decisions that minimize the risk of incurring significantly high costs.
Since the distributions of the cost functions depend on the actions of all agents, which are generally unobservable, the Conditional Value at Risk (CVaR) values of the costs are difficult to compute.
We propose a new online risk-averse learning algorithm that relies on one-point zeroth-order estimation of the CVaR gradients computed using CVaR values.
arXiv Detail & Related papers (2022-03-16T21:36:47Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and the number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Robust Reinforcement Learning using Least Squares Policy Iteration with
Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Processes (RMDPs) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z) - Statistical Learning with Conditional Value at Risk [35.4968603057034]
We propose a risk-averse statistical learning framework wherein the performance of a learning algorithm is evaluated by the conditional value-at-risk (CVaR) of losses rather than the expected loss.
arXiv Detail & Related papers (2020-02-14T00:58:34Z)
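The last entry above evaluates learning by the CVaR of the losses rather than their expectation. A standard route to making such an objective tractable is the Rockafellar-Uryasev variational form $\mathrm{CVaR}_\alpha(L) = \min_t \{\, t + \tfrac{1}{\alpha}\,\mathbb{E}[(L-t)_+] \,\}$, with $\alpha$ again the fraction of worst outcomes. The sketch below checks on samples that this form agrees with the direct tail average; it illustrates the general identity only and is not taken from any of the listed papers.

```python
import numpy as np

def cvar_tail_average(losses, alpha):
    """Direct estimate: mean of the worst alpha-fraction of sampled losses."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]
    k = max(1, int(np.ceil(alpha * len(losses))))
    return float(losses[:k].mean())

def cvar_rockafellar_uryasev(losses, alpha):
    """Variational form: CVaR_alpha(L) = min_t { t + E[(L - t)_+] / alpha }.

    The objective is convex and piecewise linear in t with kinks at the sample
    values, so scanning candidate t over the sample values finds the minimum.
    """
    losses = np.asarray(losses, dtype=float)
    candidates = np.unique(losses)
    values = [t + np.mean(np.maximum(losses - t, 0.0)) / alpha for t in candidates]
    return float(np.min(values))

rng = np.random.default_rng(0)
losses = rng.normal(loc=1.0, scale=2.0, size=2_000)  # hypothetical loss samples
alpha = 0.05
print(cvar_tail_average(losses, alpha))
print(cvar_rockafellar_uryasev(losses, alpha))  # agrees up to sample discreteness
```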