Forward-PECVaR Algorithm: Exact Evaluation for CVaR SSPs
- URL: http://arxiv.org/abs/2303.00672v1
- Date: Wed, 1 Mar 2023 17:10:22 GMT
- Title: Forward-PECVaR Algorithm: Exact Evaluation for CVaR SSPs
- Authors: Willy Arthur Silva Reis, Denis Benevolo Pais, Valdinei Freire, Karina
Valdivia Delgado
- Abstract summary: Conditional Value at Risk (CVaR) is a criterion that allows modeling an arbitrary level of risk.
We propose a new algorithm, Forward-PECVaR, that exactly evaluates stationary policies of CVaR-SSPs with non-uniform costs.
- Score: 1.347733333991357
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Stochastic Shortest Path (SSP) problem models probabilistic
sequential-decision problems where an agent must pursue a goal while minimizing
a cost function. Because of the probabilistic dynamics, it is desirable to have a
cost function that takes risk into account. Conditional Value at Risk (CVaR) is a
criterion that allows modeling an arbitrary level of risk by considering the
expectation of the worst $\alpha$-fraction of trajectories. Although an optimal
policy is non-Markovian, solutions of CVaR-SSPs can be found approximately with
Value Iteration based algorithms such as CVaR Value Iteration with Linear
Interpolation (CVaRVILI) and CVaR Value Iteration via Quantile Representation
(CVaRVIQ). These solutions depend on the algorithm's parameters, such as the
number of atoms and $\alpha_0$ (the minimum $\alpha$). To compare the
policies returned by these algorithms, we need a way to exactly evaluate
stationary policies of CVaR-SSPs. Although there is an algorithm that evaluates
these policies, it only works on problems with uniform costs. In this paper,
we propose a new algorithm, Forward-PECVaR (ForPECVaR), that exactly evaluates
stationary policies of CVaR-SSPs with non-uniform costs. We empirically
evaluate the approximate CVaR Value Iteration algorithms, assessing the quality
of their solutions relative to the exact solution and the influence of the
algorithm parameters on the quality and scalability of the solutions.
Experiments in two domains show that it is important to use an $\alpha_0$
smaller than the $\alpha$ target and an adequate number of atoms to obtain a
good approximation.
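As a concrete illustration of the criterion, the sketch below computes $\mathrm{CVaR}_\alpha$ of a finite trajectory-cost distribution, i.e., the expected cost of the worst $\alpha$-fraction of trajectories. This is a minimal sketch of the criterion itself, not of ForPECVaR or of the Value Iteration algorithms; the cost and probability values are hypothetical.

```python
import numpy as np

def cvar(costs, probs, alpha):
    """Expected cost of the worst (highest-cost) alpha-fraction of outcomes.

    costs, probs: a discrete trajectory-cost distribution (probs sums to 1).
    alpha: fraction of worst trajectories to average over (0 < alpha <= 1).
    """
    costs = np.asarray(costs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(costs)[::-1]        # sort from worst (highest cost) to best
    costs, probs = costs[order], probs[order]
    cum_before = np.cumsum(probs) - probs  # probability mass strictly worse than each atom
    # each atom contributes its full mass if it lies inside the alpha-tail, and
    # only the fraction needed to reach total mass alpha at the boundary
    weights = np.clip(alpha - cum_before, 0.0, probs)
    return float(np.dot(weights, costs) / alpha)

# Hypothetical trajectory-cost distribution induced by some stationary policy:
costs = [2.0, 5.0, 9.0, 20.0]
probs = [0.60, 0.25, 0.10, 0.05]
print(cvar(costs, probs, alpha=1.0))   # alpha = 1 recovers the expected cost (4.35)
print(cvar(costs, probs, alpha=0.10))  # mean cost of the worst 10% of trajectories (14.5)
```

Setting alpha = 1 recovers the risk-neutral expected cost, while smaller alpha weights only the most costly trajectories.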
Related papers
- Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove a sample complexity bound for learning an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Risk-Sensitive Markov Decision Processes with Long-Run CVaR Criterion [9.792654108885676]
We study the infinite-horizon discrete-time Markov decision process (MDP) with a long-run CVaR criterion.
We derive a pseudo CVaR difference formula which quantifies the difference of long-run CVaR under any two policies.
We develop a policy iteration type algorithm to efficiently optimize CVaR, which is shown to converge to local optima in the mixed policy space.
arXiv Detail & Related papers (2022-10-17T04:29:14Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Risk-Averse No-Regret Learning in Online Convex Games [19.4481913405231]
We consider an online game with risk-averse agents whose goal is to learn optimal decisions that minimize the risk of incurring significantly high costs.
Since the distributions of the cost functions depend on the actions of all agents, which are generally unobservable, the Conditional Value at Risk (CVaR) values of the costs are difficult to compute.
We propose a new online risk-averse learning algorithm that relies on one-point zeroth-order estimation of the CVaR gradients computed using CVaR values.
arXiv Detail & Related papers (2022-03-16T21:36:47Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and the number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Robust Reinforcement Learning using Least Squares Policy Iteration with
Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Processes (RMDPs) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z) - Statistical Learning with Conditional Value at Risk [35.4968603057034]
We propose a risk-averse statistical learning framework wherein the performance of a learning algorithm is evaluated by the conditional value-at-risk (CVaR) of losses rather than the expected loss.
arXiv Detail & Related papers (2020-02-14T00:58:34Z)
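The last entry above evaluates learning by the CVaR of the losses rather than their expectation. A standard route to making such an objective tractable is the Rockafellar-Uryasev variational form $\mathrm{CVaR}_\alpha(L) = \min_t \{\, t + \tfrac{1}{\alpha}\,\mathbb{E}[(L-t)_+] \,\}$, with $\alpha$ again the fraction of worst outcomes. The sketch below checks on samples that this form agrees with the direct tail average; it illustrates the general identity only and is not taken from any of the listed papers.

```python
import numpy as np

def cvar_tail_average(losses, alpha):
    """Direct estimate: mean of the worst alpha-fraction of sampled losses."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]
    k = max(1, int(np.ceil(alpha * len(losses))))
    return float(losses[:k].mean())

def cvar_rockafellar_uryasev(losses, alpha):
    """Variational form: CVaR_alpha(L) = min_t { t + E[(L - t)_+] / alpha }.

    The objective is convex and piecewise linear in t with kinks at the sample
    values, so scanning candidate t over the sample values finds the minimum.
    """
    losses = np.asarray(losses, dtype=float)
    candidates = np.unique(losses)
    values = [t + np.mean(np.maximum(losses - t, 0.0)) / alpha for t in candidates]
    return float(np.min(values))

rng = np.random.default_rng(0)
losses = rng.normal(loc=1.0, scale=2.0, size=2_000)  # hypothetical loss samples
alpha = 0.05
print(cvar_tail_average(losses, alpha))
print(cvar_rockafellar_uryasev(losses, alpha))  # agrees up to sample discreteness
```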